This week I want to take a look at how I can clean a column of names in an Excel spreadsheet using Data Quality Services (DQS). The fact that I am going to clean an Excel column instead of a SQL Server table column has little effect on the procedure or steps other than defining the source and destination files. I have a lot of ground to cover so let’s get started.
First, when I open the client component of DQS, I need to define specify the server where I installed the Data Quality Server. Last week I explained how to create the DQS server. Because I installed the server and client components on the same machine, my server name is: LOCALHOST.
On the main page of the client tool, there are three panels. I showed last week how to open the built-in domains included with DQS. However, this week, I need to create a new domain with data from my Excel spreadsheet. To do this, click the button: New Knowledge Base in the first panel.
This action opens the dialog shown below. You can have a separate knowledge base for each domain or you can have several domains in the same knowledge base. Which way is best? If you need multiple domains to validate different columns in the same database or even the same table, you may want to put these domains in a single common knowledge base. On the other hand, if you have a domain that can be used in different tables perhaps in different databases, you may want to store those domains in a separate knowledge base. The good news is the choice is yours, but for this week, it will not matter. I am going to name my first knowledge base Customers because it will contain domains used to validate customer data. Descriptions for your knowledge base are optional, but I do recommend them.
Since I am going to use the actual data from an existing spreadsheet to build my knowledge base domain, I select Knowledge Discovery from the Select Activity on the bottom right of the dialog before clicking Next.
On the next screen of the dialog, I can select the Data Source I want to use to discover my domain values. At this time, DQS only supports two data sources: SQL Server and Excel. For today’s example, I will use an Excel file. Next, I need to either enter the full filename of the Excel file or I can use the Browse button to find the file on my computer or a connected network drive.
Note: This version of DQS does not support using .xlsx format Excel files. It only supports .xls format files. If I need to begin from an .xlsx file, I can first load it into Excel and then save it back to a new .xls file which DQS can use.
After l select the Excel file, I need to create a domain that will hold my customer name values that I pull from the Excel file. In the area immediately above the Mappings grid there are a series of icons. The third icon from the left is the Create a domain icon as shown in the following figure.
The Create Domain dialog requires a name for the domain. The obvious name here is CustomerName. I can also include a description of the domain that could include addition information about either the source of unique values for the domain and/or where the domain can be used. The Data Type for the values in the domain is typically defined by the data source but it also determines where I can use the domain. In this case, the following figure shows that I identify customer names with the string data type.
There are some additional parameters such as leading values and normalization of the string which are optional. The Use Leading Values option allows you to define synonyms for a value and then replace the synonym with that value. That is what I am going to do here so I check that checkbox. The Normalize String option simply replaces special characters with spaces which should not matter in my case, but I will leave this checkbox checked anyway.
The Format Output To option allows you to force the capitalization to all uppercase, all lowercase, capitalization, or none. Because I am dealing with names, I have chosen Capitalization.
Of course my language is English and I also have enabled the Spell checker . However, I did not select the Disable Syntax Error Algorithms checkbox. This option is useful when checking complex id values.
After creating a domain to store my knowledge base values, I need to select the column from the Excel spreadsheet that I will use to fill that domain. In my case, the name of that column is Customer Name. Now with the mapping between my source column and my domain complete, I can click the Next button. However, be aware that you can map more than one column to individual domains to clean each of the columns at one time.
The following figure shows the next screen in the dialog that starts the data discovery process by clicking the Start button. At this point, the statistics information at the bottom of the screen will be empty. However, after I click the Start button I will see the statistics being updated live as the data from the source is read into DQS.
The following image shows the results after finishing the data discovery task. In the Source Statistics table and grid, you can see that 1341 records were read resulting in 1341 new values and 274 unique values. Note that in this case, I am loading my Excel data into a new domain, so all the records are new values by definition. However, if I were to come back later to this domain with a new Excel spreadsheet that included both new and old records, the number of read records would not be equal to the number of new records. The fact that the number of unique records is far less than the number of new records indicates that many of the unique record values appear multiple times.
Now that I have loaded my data, I can click Next to manage the domain values I loaded. In the next figure, you can see a small sample of the data added to my new domain. Each unique value in the domain appears once along with a count of the frequency of that value. This grid also contains columns to identify the value type which as we will see can either be correct, error, or invalid and a column for a corrected value.
First let me just say that the difference between an error value and an invalid value is that an error value, represented by a red ‘X’, can be corrected and should have a corrected value in the fourth column. An invalid value, represented by a yellow triangle, is a value for which there is no correction. A good example of an invalid value in my names table would be a record in which the customer name was a number. Unless I have a table that associates a customer id with a customer and somehow this information appears in my source data instead, a number value typically cannot be converted to a customer and is therefore invalid.
Next time I will cover more details about how to fine-tune my domain so that I can use it with other data sources to ‘fix’ invalid names.