No, this is not an on-line dating service. Rather, I am trying to analyze a table with duplicate records. Duplicate records can be just as bad for my pivot table analysis as bad values. Earlier this year I showed the basics of how to use Data Quality Services in SQL Server 2012 to create knowledge bases that could then be used to clean other data tables. However, cleaning bad data out of my analysis tables does not by itself guarantee good results. Another potential problem is duplicate data.
Duplicate data can come about for a variety of reasons. Someone may load the same table more than once into a master table. Depending on how the indexes on the master table are defined, this may not even throw an error. How can that happen if I have a primary key? Perhaps my master table generates a unique surrogate key for each record added to the table rather than using the existing primary keys. This technique is often used when combining data from multiple sources whose primary keys, while unique within each source, may not be unique across the combined data. Therefore, I cannot use the primary keys from the individual data sources in the master table and instead generate a unique surrogate key.
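To make this concrete, here is a minimal Python sketch (the table layout and column names are hypothetical) of how a surrogate key can mask duplicates that a comparison of the business columns would catch:

```python
from collections import Counter

# Hypothetical master table: each load assigns a fresh surrogate key,
# so reloading the same source file does not violate the primary key.
master = [
    {"surrogate_id": 1, "source_id": "A-100", "name": "Acme Corp"},
    {"surrogate_id": 2, "source_id": "B-200", "name": "Binford Tools"},
    {"surrogate_id": 3, "source_id": "A-100", "name": "Acme Corp"},  # same record, second load
]

# Count occurrences of the business columns while ignoring the surrogate key.
counts = Counter((row["source_id"], row["name"]) for row in master)
duplicates = [key for key, n in counts.items() if n > 1]
print(duplicates)  # [('A-100', 'Acme Corp')]
```

The surrogate keys are all unique, so the primary key is satisfied, yet the same business record appears twice.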
Another reason might be that I am pulling data from a legacy sales system that recorded the customer’s address with every purchase. If any one customer made two or more purchases, their address data would appear multiple times. In the new system, I want to maintain a central customer table to store the customer’s information once. Therefore, I have to identify the duplicate customer addresses and save only one address per customer.
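As a rough illustration (the column names are made up), keeping one address per customer from such a legacy sales table might look like this:

```python
# Legacy sales rows repeat the customer's address with every purchase.
sales = [
    {"customer": "C001", "address": "12 Elm St"},
    {"customer": "C001", "address": "12 Elm St"},
    {"customer": "C002", "address": "9 Oak Ave"},
]

# Keep only the first address seen for each customer.
customers = {}
for row in sales:
    customers.setdefault(row["customer"], row["address"])
print(customers)  # {'C001': '12 Elm St', 'C002': '9 Oak Ave'}
```

Of course, this simple approach only works once the addresses are standardized; handling spelling variations between near-duplicate records is exactly what a DQS matching policy is for.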
I might also have a table of businesses that have contributed to my charity. Over the years, I may have received donations from the same business multiple times. Therefore, there are multiple records in the contributions table, one for each year. Now I want to consolidate the businesses I have received donations from so that each appears only once in the final table.
Let’s assume that I previously loaded a master business table and I have already gone through the DQS process to create a clean (or as clean as reasonably possible) knowledge base. The next step is to open the DQS Client and click on the arrow to the right of my Knowledge Base (Business Master in this case) to open the popup menu shown in the next figure.
The next step is to select a table to use to define and test a matching policy. This table can be the original table used in the knowledge discovery step to create the knowledge base. Alternatively, it can be the table in which I want to search for duplicates. Even if I opt to use the table that contains the duplicates I want to fix, I often create a representative subset of the full table as a test table so that defining and testing the matching policy executes faster. Just be careful when selecting a representative subset not to skew the analysis by accidentally selecting data that came from only a single data source and thus may not show the duplicates. For this reason, I would probably select all of the businesses whose names begin with a specific letter or two. I could also use business type (if that were a field in my data) to select only businesses of a specific type. Any way to narrow down the number of records helps the matching policy tests execute faster.
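A quick sketch of pulling such a test subset, using Python's built-in sqlite3 purely for illustration (the table and column names are hypothetical stand-ins for the real business table):

```python
import sqlite3

# Toy in-memory stand-in for the master business table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Business (Name TEXT, BusinessType TEXT)")
conn.executemany("INSERT INTO Business VALUES (?, ?)", [
    ("Acme Corp", "Retail"),
    ("Apex Plumbing", "Service"),
    ("Binford Tools", "Retail"),
])

# Businesses whose name begins with 'A'; a filter on BusinessType would work too.
subset = conn.execute(
    "SELECT Name FROM Business WHERE Name LIKE 'A%' ORDER BY Name").fetchall()
print(subset)  # [('Acme Corp',), ('Apex Plumbing',)]
```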
In the following screen, you can see that I have selected the Business table from the Customer database in SQL Server. Note that DQS can also use data from an Excel spreadsheet as the data source.
Next, in the Mappings section of the screen, I need to select all the source columns that I want to use in my matching policy. I must use at least one column, and I could use every column in the source table. Realistically, though, a matching policy usually uses a smaller subset of columns than full data cleansing does.
For each source column selected, I must choose a domain from my knowledge base. DQS cleans the selected column's data against that domain before attempting to match records. By cleaning the data first, the matching process becomes more accurate.
After selecting all the columns I want to use in the matching policy, I click Next in the lower right to go to the Matching Policy definition page. Of course, no matching policy exists yet, so I must click the Create a matching policy icon on the left side, shown in the next figure, to create a matching policy rule. Note that there is only a single rule per matching policy. After clicking the option to create a matching rule, the icon to add a matching rule becomes disabled.
By default, the Matching Rule gets the name Matching Rule 1. However, you can change this name in the Rule Details section as shown below. You can also provide a description for the matching rule. While I usually add a description that defines my ‘strategy’ for the matching in this area, I will leave it blank for this illustration. You can also select the minimum matching score for two records to be considered a match. The default value of 80 is the minimum score that DQS allows. However, you can set the score to a higher value.
On the right side of the figure, you can see that the Rule Editor is looking for at least one domain element for the rule. You can add your first domain element by clicking the Add New Domain Element icon to the top right of this area.
For this example, I chose to add Business State as my first domain element. When defining matching rules, a domain can require either an exact match or a similar match. An exact match, as the name implies, means the value must be spelled exactly the same in both records. A similar match, on the other hand, does not have to be spelled the same. Microsoft uses an internal algorithm to assign a value from 0 to 100 to each pair of compared values based on their similarity, and DQS uses this value to calculate the overall matching score.
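Microsoft does not document that similarity algorithm, but you can get a feel for the idea with Python's difflib as a stand-in scorer. This is only an illustration of a 0-100 similarity score, not what DQS actually computes:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> int:
    """Score two values from 0 to 100; difflib's ratio stands in
    for DQS's internal (undocumented) similarity algorithm."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

print(similarity("Acme Corporation", "Acme Corporation"))  # 100
print(similarity("Acme Corporation", "Acme Corp."))        # high, but below 100
```

An exact-match domain would behave like the first call only: identical values score 100, anything else scores 0.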
One twist to the exact match is that it can be used in two different ways. You could say that an exact match is a prerequisite for the two records to be considered a match. When Exact is used as a prerequisite and the two values are not identical, DQS does not even evaluate the other domains in the rule to see if they match because, quite frankly, it does not matter. On the other hand, if Exact is not used as a prerequisite, the value for this domain comparison will be either 0 or 100, but DQS will continue to evaluate the other domains in the rule. After combining all the domain match values, if the weighted average still reaches the minimum matching score, the records can be considered a match, even though the exact-match domain scored 0.
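Here is a hedged sketch of those two behaviors. The weights, field names, and the use of simple equality for each exact-match domain are illustrative assumptions, not DQS internals:

```python
def rule_score(domains, rec_a, rec_b):
    """domains: list of (field, weight, is_prerequisite) tuples for
    exact-match domains; each scores 0 or 100, and weights sum to 100."""
    total = 0.0
    for field, weight, is_prerequisite in domains:
        exact = rec_a[field] == rec_b[field]
        if is_prerequisite and not exact:
            return 0.0  # prerequisite failed: remaining domains are never evaluated
        total += (100.0 if exact else 0.0) * weight / 100.0
    return total

a = {"state": "MI", "name": "Acme Corp"}
b = {"state": "OH", "name": "Acme Corp"}

# Exact state as a prerequisite: the state mismatch ends the comparison.
print(rule_score([("state", 20, True), ("name", 80, False)], a, b))   # 0.0

# Exact state, but not a prerequisite: it scores 0, yet the name match
# alone reaches the default minimum score of 80, so the records match.
print(rule_score([("state", 20, False), ("name", 80, False)], a, b))  # 80.0
```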
Ok, I’m going to stop here for this week because that last paragraph may sound a bit confusing and I want to give you a few days to think about it so that it begins to make sense. I’ll mention it again next week as we continue defining the matching policy and show how Exact can work two different ways.
C’ya next time.