In the last several installments, I have looked at using Data Quality Services to clean individual data columns. However, sometimes you can have correct data for two or more columns individually but the combination is still wrong. What do I mean by that? Let me keep it simply by just referring to two common pieces of information in an address: City and Zip Code. It is possible to have valid cities in the the City column along with valid zip codes in the ZipCode column, but the zip code and city values on any one record within the table may be correct. So what can do to check for this problem? Let’s begin by creating a few domains, the first being for zip code
In this example, I began by creating a new Knowledge Base. Using the Knowledge Discovery option along with an existing table of data, I started by selecting the source column: ZIP from the ZIPCITY table.
Since this is a new Knowledge Base, I have no existing domain to select from the dropdown in the Domain column of the Mappings grid. Instead, I clicked the Create a Domain icon immediately above the right side of the mappings grid. I want to create a new domain called Local_Zip as shown in the above figure. I did not change any of the other default values other than to turned off the Enable Speller option since that functionality does not apply to a zip code.
I then repeated this process to create a domain called Local_City to store the valid city names from my discovery database. Since we have seen these steps in previous weeks, I will skip the details here. If you need help defining new domains, refer back to the blogs from March 23, 2013 and March 30, 2013.
Next, I click on the Create a composite domain icon to the immediate right of the Create Domain icon as shown in the following figure.
I need to create a composite domain that consists of both the city and zip code fields. In the resulting dialog, shown below, I named the composite domain: ZipCity.
Then from the Domains List that shows the existing domains defined in the current knowledge base, I select the domains I want to work with and move them over to the Domains in Composite Domain list. The order of the domains here does not matter.
After clicking OK in this dialog, I now see a new domain in the Knowledge Base details panel. You see the first two individual field domains: Local_City and Local_Zip followed by the ZipCity domain. You may also notice that the icon to the left of the domain name is also different from the other domains indicating that this domain is a compound domain. If the domain is not expanded to show the individual domains that compose it, you can click the small blue box to the left of the domain name with the plus sign to open the domain.
I could click the Next button at the bottom right of the screen to proceed to the Discover step. On this page, I can click the Start button in the upper left to have DQS begin to examine my data source to populate my domains.
After a few seconds (or minutes depending on the table size), the grid at the bottom of the page refreshes to show me the results of the data imported from my data source. As we saw in previous weeks, I can see the number of new values as well as the number of unique values for each of my domains. In the figure below, you can see that my table of 49 records had 44 unique zip code values and only 12 unique city values.
When I proceed to the next page (Manage Domain Values), I can review the domain values discovered for each of the single column domains. I can make corrections at this point to any misspellings or incorrect values. On the left side of the screen, you see the list of available domains and the number of new values added to the domain. You may have noticed that the composite domain, ZipCity, does not have a count of new values. That is because the composite domain is not stored as a combined field. Rather, it merely references the two domains: Local_City and Local_Zip. It is not a new combined field. That distinction will be important in understanding the next step.
The following figure shows the correction of a zip code that had an extra digit. While simple in of itself, it shows how easily I can correct values beging added to the domain. Alternately, I could have simply flagged this record as bad data, but since I am building a domain, I chose to correct the value instead.
Next, I open a data cleansing project to clean up my table of city and zip code values. Again I will skip some of the steps here for brevity since they have been covered in previous Saturday blogs on this site. I begin by mapping the fields from the table I want to clean to my available domains. However, I cannot map multiple source columns to a composite domain. Instead, click the button: View/Select Composite Domains from the bottom of the screen that displays a dialog displaying the composite domains. In my case, there is only one composite domain. I make sure that I have selected the checkbox to the left of this domain because I want to ‘fix’ any problems with my cities and zip codes.
Clicking first OK to close the Composite Domain dialog and then the Next button to go to the Cleanse step, I click Start to run the cleansing process. Depending on the amount of data in the table I am trying to clean, this could take from a few seconds to enough time to duck out to the kitchen to get a cup of coffee.
After the cleansing process finishes, I see only one record in my grid at the bottom of the Cleanse screen that lists the field City,Zip and shows the number of corrected and suggested values and I see that both have values of zero. Does that mean that all of the data in the cleanse table was correct to begin with and where are the other domains?
Let me answer the second question first. When I clicked the checkbox two figures ago to use the composite domains, this turns off the use of the individual domains. Similarly, if I uncheck this box to not use the composite domains, the individual domains are used instead.
As to the first question, the absence of corrected and suggested values does not actually mean that all the combinations of cities and zip codes are correct. At this point, all that DQS has done has to verify that the cities in the source table exist in the Local_Cities domain and that the zip codes found in the source table exist in the Local_Zip domain. It does not have a way to associate a specific city with a specific zip code since city and zipcode are not really defined as a single field, but rather as a composite domain consisting of two independent domains. So I may have a valid city name in my source table with a valid zipcode even though that zipcode is not valid for that city. This is a problem known as cross-domain validation. By itself, a composite domain does not solve cross-domain validation issues. However, as we will see in part 2 next week, I can use the composite domain together with some addition rules to solve selected cross-domain validation issues. When I first encountered this issue, I was surprised that composite domains did not solve this problem, but the more I thought about it and thought about the way the composite domain is defined, the more sense this restriction made.Then I started looking for alternate solutions.
Before I end this week, note that the completeness cell indicates that only 91% of the records are complete and 9% are missing. In my case, this meant that some of the records in my source table did not have cities or zip codes or possibly both. Completeness is another component of data quality and having incomplete records distorts your analysis. Therefore, I will want to either ‘fix’ these records or remove them from the final analysis table.
Next time in part 2 of this this topic, I look at some ways to address the issue of cross domain validation through the use of rules.