Last time I showed how I can manually create a domain that could be used to clean SQL or Excel data. Did anybody try using it? I modified the AdventureWork Department table by making a few changes to the group names. I then performed an interactive data cleansing using my new department group domain. My results follow:
Looking first at the tabs, I can see that I had 4 new values (the number in parenthesis after the tab name is the count) that my group name domain did not know anything about. I also had 6 values that were correct as found. The corrected tab tells me that five records were corrected. This is the tab that I opened in the previous figure so I could see what information was corrected.
The first column shows the original value found in the source table. The second column tells me the number of records that had this value. In this case, each value occurred only once. However, if a value occurs multiple times, rather than displaying multiple lines with that value, DQS shows only unique values in this grid but tells us how many times that value occurs in the source table.
The third column, Correct to, shows the new value assigned to the current value. You can go back to last week’s blog to see that these were the values defined as the leading values in each of the individual synonym groups. In fact, the fifth column tells us that the reason for the correction was to set the value to the leading value. Because these are direct matching replacements, DQS’s confidence in these substitutions is 100%.
Finally the last two columns contain radio buttons. As with any set of radio buttons, I can only one button in a set. In this case, a set consists of an Approve and a Reject button. By default, DQS preselects the Approve buttons. However, if I wanted to reject any of the replacements, I could simply click the Reject button for that row and the update would not occur. Note that by clicking the Reject button, I do not have to separately deselect the Approve button. The deselection is automatic as soon as I click the Reject button.
That data cleansing process fixed many of the records that had values different from the leading values. However, I still have a few new values which I can view by selecting the New tab.
These values appear to be acronyms for the group names. To provide corrections for these values, I can click in the row for any of the values. For example, if I click in the QA row, a second grid appears below the first one as shown in the following figure that displays all of the rows that have this value along with other column values. In this case, because the abbreviation occurs only once, the second grid only displays a single record. However, in other scenarios, there could be many possible rows that have this value.
Why display a top grid with unique values for the group name and a second grid with a separate row for each instance in the table where that value occurs? If I wanted to always change QA to another value such as Quality Assurance, I could simply click in the third column of the row with the value of QA and enter Quality Assurance and click the Approve button. When I click the Approve button in the upper grid, DQS makes the changes to all records that have a value matching the value in column 1. It then moves the record from the New tab to the Corrected tab.
Alternately, I could update individual occurrences of the source value by clicking on the Approve button associated with specific records in the lower grid. While this may not be common, it can occur when an acronym appears to be the same for two different expanded names such as the Central Intelligence Agency and the Culinary Institute of America. In this case, I could update individual records in the lower grid by select the Approve button for only selected rows after enter a corrected value in the upper grid. After approving some of the records, I could change the corrected value in the upper grid and approve the remaining records in the lower grid.
After correcting all of the abbreviations and clicking Next, DQS shows a final screen as shown below.
The left panel shows the original columns from the input source along with the new GroupName_Output column. You can see that the data now has consistently named groups that can better represent what each groups is doing in a pivot table or pivot chart. In the right panel, I can save the data back to SQL Server. I could replace the data in the existing table by selecting the same table name as the data source as shown in the next figure.
DQS prompts me whether I really want to replace the existing data, but at least I can update the existing table with the new data. Some of you may prefer to create a new table so that you can go back and forth between the original table and the new table to verify the corrections. Note also that replacing an existing table will also remove any custom property information. Alternately, you see the Output Format option has two radio buttons, Data Only and Data and Cleansing Info. If you export with the Data and Cleansing Info option selected, the resulting table or Excel spreadsheet will include columns that show you both the before and after values as well as the correct reasons.
After you gain confidence working the DQS, creating that extra table may not seem as important, but while you are getting accustom to working with DQS, I would recommend placing the corrected data in a new table. Again remember, that if you replace a table, you may need to reset properties and indexes on the ‘new’ table.
After writing the data back to the database, DQS display one final confirmation dialog:
Next time, we will take a look at how to replace common abbreviations even when they are embedded in a larger string value rather than creating synonyms for each value in which they occur.
Until then, C’ya.