With my initial customer domain created, I am ready to start cleaning customer names in Excel files or SQL tables. To begin, I open the DQS client home screen and select the option: New Data Quality Project from the middle group of options as shown in the following figure.
For a new project, I need to name the project and optionally provide a description to explain what the project will accomplish. I then specify the name of the Knowledge Base I want to use, in this case: Customers. DQS can now show the available domains in the panel on the right side of the page as shown in the following figure. For my activity, I select Cleaning because I want to use my selected domain to clean the values in my project source table.
After clicking Next, I need to select my data source and then map the source column to an existing domain. This screen is similar to the one used to discover the domain values that I covered two weeks ago. As with data discovery, cleansing can work against either SQL Server data or an Excel spreadsheet. In the case of an Excel spreadsheet, I must specify the location of the Excel file. For a SQL Server source, I need to specify the database and the table. Note that the database must be on the DQS Server itself.
Next, I need to define the column mappings beginning with the Source Column from my data source and associating it with a domain from my selected Knowledge Base. I can do this with multiple source columns each mapping to its own knowledge base domain. However, each domain can only be mapped to a single source column within a project, but you could have multiple projects to map a single domain to multiple source columns, one column per project.
Clicking Next to move to the next page of the project dialog, I can begin the data cleansing by clicking the Start button.
After a relatively small amount of time, the Cleanse page displays the results of the processing including the number of records processed, the number of correct values from the source, the number of corrected values, the number of suggestions and the number of invalid records. DQS uses an algorithm to determine how confident it is of the correction. This means that it sometimes can correct values that do not appear in the domain, but are similar enough that DQS thinks it may know how to correct the value. I can review the suggested changes as well as corrections on a subsequent screen.
Clicking Next takes me to the Manage and View results screen. On this screen, I can use the tabs in the right panel to see the suggested changes and decide whether to allow the change or to reject it. I can also see the number of new values and what those values are. (I may want to add the new values to my domain for future cleansing operations.) I can also see grids of the invalid values as well as corrected values and even all of the correct records that did not need to be updated.
Using each tab, I can review the changes made to the data and even substitute an updated Correct to value if necessary. After reviewing the results, I click the Next button one last time to display the Export page. On this page I begin by selecting the Destination. Again like readying source data I only have two choices currently, SQL Server and Excel. If I choose Excel as shown in the following figure, I need to specify the URL of the Excel file that DQS creates. By selecting only the option button Data Only, DQS replaces the existing values in the selected Excel column with new cleansed values. The option to export data and cleansing info adds columns to the output Excel spreadsheet to show before and after values as well as additional information about whether the data was already correct or corrected and the confidence level of any corrections.
Of course, clicking Export to complete the process creates the output Excel spreadsheet as shown in the following figure and the Finish button ends the project activity.
So now you can create a basic cleansing domain and use it to clean data sources consisting of spreadsheets and SQL tables. But DQS can do more than just cleanse data. In future weeks, I will return to DQS to show more features that you can use to ‘Clean Up Your Data’ so that BI analysis, reports, cubes, and pivot tables are more accurate and concise.
C’ya next time.