Last week I left you with the thought that an exact match can be either a prerequisite or just one of several domains used to match records. The purpose of using an exact match as a prerequisite is that the domain is so crucial to determining whether two or more records match that if the value of this domain is different between two records, there is no possibility for the records to match. Therefore, there is no need to check any of the other domains when this prerequisite domain is different. This may seem counter-intuitive at first, but we must remember that we are not looking for unique records here, but matching records and eliminating records by using a prerequisite will speed up the matching process.
While you can have more than one prerequisite domain, not every domain that you are trying to match on should be a prerequisite field. In fact, DQS requires that you have at least one domain which is not a prerequisite. Fortunately, fields like business name and business address should not be prerequisites because there is a high probability that these could be misspelled or formatted differently. For example, Adani & Baez Computers Inc. is probably the same business as Adani and Baez Computers Inc. and the same as Adani & Baez Inc. even though they are not exact matches. If we used the Business Name field as a prerequisite exact match, these three names would not be considered a match. A similar argument could be made for Business Address and even Business City.
Therefore, we would add these additional domain elements in our matching rule. Each domain element other than any prerequisite domains must have a weight factor. The weight factor determines how important matching domain values are in determining whether two records are a match. Initially, you may have no real preference for how to weight the additional domain values. A good place to start however is to make all the domain weights equal. Therefore, if you have five additional domains, make each one 20%. Note the total of the weighting factors must add to 100%.
After evaluating all the domains for each set of records, DQS calculates a number between 0 and 100% to represent the chance that the two records are a match. Of course, a score of 100 would indicate an exact match across all the domains used in the rule. Of course, most pairs of records will probably have a value less than 100. But how much less than 100 still indicates a match? DQS uses a Minimum Matching Score to filter the potential matches. For reasons beyond the scope of this blog, DQS places a minimum score limit of 80% on matching records. While I can change the Minimum Matching Score to any value between 80 and 100, I cannot set it any lower than 80. This value can be found below the Rule Name and Description as seen in the following figure.
Another option you must set is whether a record can appear in only a single matching set of records or whether it can belong to more than one set of matching records. The first option is called non overlapping clusters and the second option is called overlapping clusters. With non overlapping clusters, each record appears in only one cluster or is considered a unique record. This is the option I would usually select if my goal was to eliminate matching records. However, using overlapping clusters might be a good way to analyze for other relationships between records. For my analysis here, I will use non overlapping records.
Next, I click the Start button shown in the previous figure to begin the analysis. This process can take a substantial amount of time if I am working with a large database table. Therefore I recommend creating a smaller representative table to test out my matching rule definition before running the final rule against my full table. After the matching process completes, a grid appears at the bottom of the screen. In this grid, which is filtered on the matching records, you can see how DQS has grouped similar records based on the matching rule. Each matching groups is called a cluster and is given a cluster id that has the same value as the first record id added to the cluster.
Clicking on the Profiler tab provides some interesting results. In the figure below, I can see that my test table had nearly 8000 records. In the grid to the right, I can see the number of new and unique values for each of my domains. New values are domain values that I may want to go back to add into the knowledge databases for the domains if they truly are unique and not just bad data values. The Unique column tells me how many of the 8000 records had unique domain values. I can see that there are only 5129 unique business names. That means that I should have nearly 3000 potential duplicate records based only on the Business Name. Looking at Business Address, there are 6683 unique addresses. This tells me that there are at least some business records with the same business name, but different addresses. This is entirely possible if the business has more than one location and may not be an error. On the other hand, of 8000 records, there are only 224 unique city values, 32 unique states, and 465 unique zip codes.
The low rate of unique values for these three fields tells me that their contribution in determining matching records is low. At the same time, it makes each of these domains candidates for an exact match or even a prerequisite match. In fact, that is why I selected Business States to be an exact prerequisite match.
If I previously cleaned my data elements to remove inconsistent abbreviations and spelling problems, I would want to look at the fields with the most unique values to be potentially more important when determining matching records. However, just because a domain has a high uniqueness value, does not make it a good matching value. Take Business Phone as an example of a poor domain to determine matching records. Because a business could have more than one phone number at a given address, different phone numbers do not necessarily mean that the records are not a match. Even a different address may not be as important as having the same business name. Therefore, I use the uniqueness values as a guide, not an absolute to determine how I might want to reassign weights to each domain in the rule instead of weighing them all equally.
Another thing to consider is to examine each cluster and determine why some of the matching scores are less than 100%. In the figure below, I have highlighted a record with only a 96% match score. To find out why this score is not 100% I can right click on the row to display the details.
The Matching Score Details dialog shows the two records that are being compared at the top with the overall matching score. In the second grid, I see a grid of the other domain with their contributing score. In the figure below, I can see that phone number score was only 79%. However, looking at the two phone number I can see that difference was most likely a key entry mistake. In any case, the 79% score on this domain is only worth 20% of the total matching score. Since the other domains are exact matches, I have:
0.2 * 100 + 0.2 * 100 + 0.2 * 100 + 0.2 * 100 + 0.2 * 79 = 96
But perhaps more importantly, this tells me that the phone number domain may not have been cleaned. If I do not have a reliable way to do this without spending a lot of time or money, I may simply want to deemphasize the importance of the Business Phone domain and rerun the matching rule.
Finally when I am done, I can proceed to the next page of the defining the matching policy where I will run the rule one last time.
Clicking Finish will allow me to publish the matching rule back to the Knowledge Base as shown in the following dialog.
After the publishing process completes, the final dialog informs me of its success as shown below.
I now have a matching policy saved with the business address domains that I can use in a project with the full database table.
C’ya next time.