Last week I showed how to build a new domain for my knowledge base, but the domain contained a lot of similar values. In some cases the customer name was ordered as first name then last name. In others, the last name was first followed by the first name. Some names consisted of only the person’s last name. Others included a salutation like Mr., Ms., Miss, or Mrs. A few names included commas and many had obvious misspellings.
When I first began, it was difficult to pick a version of a name that might be correct. As you can see in the following figure, many of the names have multiple variations. One way to get started is to designate one version as being correct (leaving the green check mark) and changing all the other versions to the Type: Error (a red ‘X’). You can see and modify the value type by using the dropdown in the Type column. Remember from last week, that when you designate a value as an error, you need to specify a value in the fourth column to tell DQS what to change the current domain value to when it is found.
I started reviewing the name list and was immediately faced with some uncertainty about whether the customer’s name was Cicily Coney or Coney Cicily. Because alternate values can be spread throughout the domain that initially sorts the data by the original values, it is usually impossible to see all variations of a name at once. I initially made a guess that the name of one customer was Coney Cicily. To do this, I selected one of the variations and changed the Type to Error and then entered Coney Cicily in the Correct to column. Since this value did not exist in the domain, DQS created a new domain record with this value. However, also because this value did not exist before, the Frequency column shows a value of zero (0). Often this is a hint that I may be on the wrong track because I would expect to find at least once occurrence of the correct customer name in the table or spreadsheet. However, it is possible that the correct name was never entered previously.
In the figure below, you can see the new domain value for Coney Cicily has been added and you can also see two synonyms that have been updated to use this value as the Correct to value.
It is also possible to add a new domain value without going through the process of correcting an existing domain value. Click the icon as highlighted in the following figure to add a new domain record to the end of the grid table and enter the new value in the first column. In this figure, I added a domain value for the name: Amanda Patrick. In both cases, new domain values are automatically assigned the Type: Correct as shown by the green check.
If you make a mistake as I did with Coney Cicily, you can fix the mistake by right clicking on the row that defines the domain value and that is currently marked as correct. Then select Delete from the popup menu. By deleting the domain value marked as correct, DQS automatically deletes this value from the Correct to column of any synonym row in which it appears. Only domain values defined as correct can appear in the Correct to column. I can then add a new domain record for Cicily Coney and use this new value in the Correct to column for all the synonyms.
The process of manually identifying the correct values for the domain and then assigning one of the correct values to each of the other synonyms is time-consuming. But when you are done, the domain list will group synonyms together. I can then use the icons above the grid to expand and collapse the synonym list. By collapsing the list to show only the unique domain values as shown in the following figure, I can also see using the Frequency column the number of times that each domain value actually appears in the source data.
By clicking the expand icon, I display the synonyms associated with each domain value and again the Frequency column identifies the number of times that synonym appears in the source data.
After I finish reviewing all domain values and either designate them as correct or if they are an synonym, identifying them as and error with a Correct to value I can click the Finish button in the bottom right corner of the dialog.
I can also save my intermediate corrections perhaps so I can go home for the night and continue tomorrow by clicking the Close button in the bottom right corner of the screen. On the main screen of the DQS client in the Knowledge Base Management section, you will see the name of the knowledge base along with the step it is currently in within parentheses. In addition, the icon to the left of the knowledge base will display a gold lock indicating that the knowledge base is currently locked for editing. When you want to return to the finish managing the domain values, open the DQS client main screen and click on the link for the knowledge base and DQS returns you directly to the Manage Domain Values page to pick up where you left off.
When I am done managing the domain values and click Finish, DSQ displays a dialog asking if I want to publish the data to the Knowledge Base. Since I do, I click the Publish button. When I click the Publish button, I see the dialog shown below telling me that the Knowledge base has been successfully updated.
At this point, I can use the published domain to clean other data. However, I can also re-open the domain to add new values from a new data source or even manually add more domain values.
Next time, I will show how to use this new domain to clean data from an Excel file.