Last week I introduced Composite Domains in Data Quality Services and showed that you must first create individual domains on the columns that you want to include in the composite. When you use the composite many people may at first assume that somehow DQS stores value pairs from each domain from each row of the source table to create the composite domain. They then think that they can somehow work with those pairs to cleanse other data sources with the same fields (domains).
After some testing, what I found is that each of the fields in the composite domain is treated as the separate domain that was used to define the composite. When the composite domain is used to cleanse data in a new table, it is not the combination of the fields that DQS uses to cleanse the data, but the individual domains. Let me use an example to make that clear.
Suppose you have a table of addresses that you want to use with knowledge discovery to create domains for the fields CITY, STATE, and ZIP. Each record in the source data represents a specific place that can be identified by the combination of these three fields. Furthermore, your address data set may have thousands of records from the same state. However, that state value is stored in the knowledge base only a single time. Similarly, there may be hundreds of records with the same zip code and that zip code is stored in the zip domain one once for each unique value. Finally, each unique city name is stored once in the city domain even though that city may appear in dozens of states. For example, I live in Orlando, but which one? Most people would immediately assume Orlando FL, home of Disney World. However did you know that all of the following cities also exist.
- Orlando, Kentucky
- Orlando, Oklahoma
- Orlando, West Virginia
- Orlando, Arkansas
- Orlando, North Carolina
- Orlando, New York
- Orlando, Virginia
…and I may have missed a few.
Furthermore, some small towns near a big city have their own zip code, but you can also use the same zip code with the name of the larger city and mail will still be routed correctly. (Because mail is routed by zip code, not by the city name on the envelope.) For example, you could say Belle Isle, FL 32809 or Orlando, FL 32809. However, you cannot say Belle Isle, FL 32751 because that zip code belongs to Maitland, FL.
So how do you cleanse address data? Well actually, the best way is to use a service such as one the address databases found in the Azure Marketplace. However, if you do not have access to the Azure Marketplace or you do not have the money to spend on some of these services, can you roll your own solution?
The compound domain containing city, state, and zip will not satisfy your need. The problem is that Belle Isle is a valid city in the cities domain and 32751 is a valid zip code in the zip code domain. Therefore, the composite domain will not by itself identify this combination as invalid. However, you can start to build rules within the composite domain that can identify combinations of values from different domains and ‘correct’ one or both values. Let’s see how to do that.
In the following figure, I am in the Domain Management section of an existing knowledge base that already had domains for city and zip that I previously combined into the composite domain CityZip. If I click on the CD Rules tab, the grid that appears is initially empty. However, I can click on the Add a new Domain Rule icon found in the upper right corner above the grid.
In the top grid, I can enter a name for the rule along with a description. It is always a good practice to keep the name short and the description more details. Also I want to make sure to check the Active checkbox if I want the rule to be used the next time I use the domain to cleanse a data file.
In the lower grid, I visually build the rule definition. The left side of the rule definition can be thought of as the IF portion of the rule and the right side is the THEN portion. A rule must begin with a reference to one of the domains contained in the composite domain. Then I select what I want to do with the domain value. You can see from the dropdown list that the number of actions available is large. I can compare domain values to other values, I can check the length of a domain value, or I can even check of the domain value exists among many other actions.
In this case, I want to look for records that have the city name Taft and the zip code 32824 and change the name of the city from Taft to Orlando. Therefore, I select the connector: Value is equal to and supply the value: Taft.
With the current clause still selected, I can click the Adds a new condition to the selected clause icon in the upper right corner of the left panel. I can define multiple clauses for each domain linking the clauses with either an AND or an OR connector.
On the other hand, I may want to add one of the other domains to the rule. Remember that each domain can appear once and only once in the left panel of the rule. To add another domain to the rule, you must make sure that none of the current domains are selected. A clause is selected within a domain when you see a blue box around the clause expression. You can tell if a domain is currently selected by the vertical line that drops down from the domain line on the left side of the clauses. If this line is dark blue, the domain is selected. If this line is light blue-grey, it is not selected. One way to make sure that no domain or domain clause is selected is to click in the blank area on the right side of the left panel. You should then be able to click the Adds a new condition to the selected clause which adds a new clause to the rule.
If I want to assume that there may be some zip codes for the city of Taft that are valid and that I do not want to change them, I need to create a compound condition to find only the Taft records with a zip code of 32824. I can do this by right clicking outside of the first rule in the empty are and add a new condition. In this second condition, I select the Local_Zip domain and compare it to the value of 32724.
Finally, I can define the action that I want to take when a record is found with both the selected city and zip code. On the right side of the lower grid, I can select the domain Local_City again set the value equal to Orlando. This rule when applied during the cleansing process of a source table will look at the field associated with domain Local_City and if the city is Taft and if the field associated with the domain Local_Zip is 32824, it will change the value of the city field in the cleansed data to Orlando.
I can now click the Finish button for this rule. Similarly I can create similar rules and can even create more complex rules to handle other special situations. I can even create more complex rules to check a single domain than a simple replacement of a synonym value would allow. Just look back at the different connectors that are available to see how unique you can make your matching criteria before you assign a new value to a field.
For the sake of this example, I will not add any more rules, but will instead use my updated domain to cleanse a file that includes address information using a project as shown in the following figure (I previously talked about how to create a new project and use an existing knowledge base with its domains to cleanse the data and will not repeat those steps here. For help, refer back to March 30th blog entry.
The following figure shows the first step of the Cleanse process for my project. You can see that the source table has 282 records in which it corrected one record.
When I click the Next button at the bottom of the Cleanse step, I can see a tabbed form in which each tab shows me records that DQS made suggested changes (usually as a result of the spell checker), new records that it never saw before, invalid records which have no match and no rule to fix the values, corrected records which were updated either with synonyms or rules like the one we created here, and correct values.
By clicking on the Corrected tab, I can see exactly what DQS has attempted to do. Most importantly, I can choose to accept or reject the correction. The radio buttons for these options appear to the far right in the top grid (not visible in this picture).
If I accept the corrections, I can then in the next step output the corrected data to an existing or a new SQL table or Excel file as shown in the following figure. As shown before, I can save just the updated data or the updated data with additional information about the cleansing of any updated record.
So the value of a compound domain is to create rules that use two or more domains to make a decision about a change or may make a change to two or more domains based on a condition in one or more source fields. It should be clear that defining compound domain rules is a complex task, more complex than merely defining a set of synonyms and a leading value for similar column values. Furthermore, I would not recommend executing rules for compound domains until I have first cleaned up individual domain values as described in earlier weeks.
Cleansing data should be thought of as an iterative process. Begin with the easiest data problems to identify and fix synonyms and work your way through more complex updates such as conditional replacements in subsequent passes. Next time I will take a look at matching records using DQS. Matching helps you remove duplicate records from a table. While DQS will not force you to perform cleansing tasks before performing a de-duplication, it is highly recommended and I will try to show you why.
Oh, one final thing. While I showed the cleansing process last week and this week with simple address data so that you can easily follow the techniques, remember that you may save a substantial amount of time validating addresses against national address databases that may be available on the Internet and the Azure Marketplace.
C’ya next time.