Manually Building a Domain for DQS
This week I return to my series on using DQS, Data Quality Services, which is a few functionality found in SQL Server 2012. I previously provided an introduction to DQS on 01/26/2013 through 02/23/2013. So if you did not read those blog entries, you may want to start there. Anyway, for the rest of you, I want to emphasize that learning to use DQS is essential if you are serious about doing BI Analytics. In fact, with modified sample data from AdventureWorks on which this discussion is based, I will show you how ‘bad’ data can make interpreting your pivot tables and pivot charts more difficult, and how data cleaned with DQS simplifies and clarifies your results. So let’s get started.
Let’s say I create a modified version of the AdventureWorks HumanResources.Department table called simply Dept which has the following set of records:
Notice that the sixteen unique departments are actually grouped together with the GroupName column. However, to make this example interesting, I changed some of the GroupName values so that while similar, they do not exactly match each other. For example, the following set of GroupName values probably should all be set to Research and Development.
- Research and Development
- Research & Development
Similarly, similar related group names appear for Executive General and Administrative, Inventory Management, Quality Assurance, and Sales and Marketing.
Let’s create a domain for our department group names so I can define correct values and then use that domain to clean new datasets.
Open the Data Quality Services Client as I described several weeks ago so that the main dialog appears as shown below.
I could create a new knowledge base for our new data domain, or I could add the department domain to one of the existing knowledge bases that I previously created. In this case, let me choose the previously created Customers Knowledge base. I can do this either by clicking the Open Knowledge Base button from the above main dialog and then select Customers from the list of available knowledge bases followed by selecting Domain Management under the Select Activity and then click the Next button as shown in the next figure.
Or I could simply click on the Customers (Domain Management) link at the bottom of the first column on the main dialog. Either way, I will next see the Domain Management screen shown below.
In this dialog, I can see the domain created a few weeks ago called CustomerNames. This week, we want to create a new domain in the knowledge base. To do that, click the first of the icons on the left above the Domain list.
The Create Domain dialog, shown above, prompts me for a name for the new domain along with an optional description. One of the first properties is the domain type. For the GroupName, I will keep the default data type: String. I also need to choose Use Leading Values. Earlier I said that several group names have multiple similar values. Each of these values will be defined as synonyms. I could choose to allow each of the synonyms in the final corrected output dataset. However, if I want to use the group name to aggregate other data using a pivot table or pivot chart, have multiple values that mean the same thing will make interpreting the results difficult. Instead, I want to choose a leading value for each set of synonyms to replace all of the other synonyms. Therefore, I need to select this checkbox.
The Normalize String option allows me to ignore any punctuation in the values. For example, the string ‘Inv. Mgmt.’ would be treated the same as the string ‘Inv Mgmt’. While I may not have that problem in my test data set, it could occur in other data sets that I may want to clean so I will check this option.
The Format Option allows me to pass values through as is without any change. I could also change the string to all upper case, all lower case, or I could simply capitalize the values. If I choose English as the language, I can use the built in spell checker that is essentially the same as the one found in other Microsoft products such as Word to correct misspelled words for me. Afterall, I would not necessarily want to include as synonyms every possible misspelling of a value. However to use the spell checker, I also have to click the next check box, Enable Speller.
Finally, I can also disable syntax errors which I’ll save for a future blog discussion because it really doesn’t apply in my example.
Next, rather than use an existing table to discover possible values as we did with Customer Names a few weeks ago, I am going to build the domain manually by going to the Domain Values tab and selecting the icon to Add A New Domain Value.
I’ll start by adding my first value: Executive General and Administration
Next, I add the first of the synonyms as a new domain value. In fact, I continue to add all of my synonyms as shown below. I then select all of the rows and right click on the selected rows to display a popup menu of actions. From the actions, I select Set as Synonyms defining all four values as related values.
By default, DQS lists the synonyms alphabetically and makes the first item in the list the leading value. Remember that a leading value is the one that all the other synonyms translate into.
But as is often the case, the first item alphabetically is not the value I want as the leading value. To change the leading value, I right click the row I do want, in this case: Executive General and Administration and select the action: Set As Leading.
DQS shuffles the list of synonyms to put the leading value at the top of the group. I now have a set of synonyms that I can translate to a single leading value.
I can do the same thing for all the other group name values so that my final list of group names looks something like the following.
I’m not done yet defining this domain, but I will stop here for this week and I pick up the rest of the definition and show you how it works next time.
Until then, C’ya.