Putting Quality Into Your Data – Part 1

Over the next several weeks, I want to explore with you a new feature of SQL Server 2012 called Data Quality Services.  To anyone who may ask whether this represents a digression from the months I have spent talking about PowerPivot and Analysis Services this past year, I just want to say that I do not see it that way.  You cannot get reliable results from your data analysis if you begin with bad data.  Some of you may be familiar with the acronym GIGO, meaning Garbage-In, Garbage-Out.  Performing analysis on data that has missing or incorrect data can lead to erroneous results and decisions.  Therefore cleaning your data first before performing your analysis becomes a natural and necessary first step before importing that data into your PowerPivot or Analysis Services model.

Many things can go wrong with your source data.  While I may not be able to identify all of the causes, allow me to categorize some of the more common reasons for data issues that could affect your analysis.

Missing data – Unless every field in a table is a required field, users are likely to skip those ‘optional’ fields either accidentally or on purpose.  On the other hand, the solution may not be as simple as making all fields required.  For example, suppose you have a table of customer information that includes the customer’s address.  One of the address fields is ‘State’.  Since most of your customers are from the United States, you might be tempted to think: ‘Why not just make the State field required?’  For most of your customers, this change would guarantee that each address had a state associated with it.  However, you also have sales that come from other countries, countries that do not have states or even their equivalent.  If we made the State field required, we would have to enter bogus information into this field for customers who did not have states and then somehow eliminate that bogus data in the reports without eliminating the customers themselves.  On the other hand, leaving the field blank means we could have missing information: a blank could be a United States customer who did not enter their state, or it could be a customer from a country without states.  The point is that we might not know which case applies, at least not without additional analysis time and perhaps expense.  Using bogus data in the report could lead to ‘odd’ results, and ignoring it could skew the analysis results as well.
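To make the point concrete, here is a minimal sketch of how you might flag suspect blanks. The record layout, field names, and the country list are my own illustrative assumptions, not anything from a real system: the key idea is that a blank State is only a problem when the customer’s country actually uses states, so the check has to look at both fields together.

```python
# Hypothetical customer records; field names are illustrative only.
customers = [
    {"name": "Ann",   "country": "US", "state": "FL"},
    {"name": "Bert",  "country": "US", "state": ""},   # missing: should have a state
    {"name": "Chloe", "country": "DE", "state": ""},   # blank is legitimate: no states
]

# A blank state is only suspect for countries that use states,
# so requiring the field outright would not help here.
COUNTRIES_WITH_STATES = {"US"}

suspect = [
    c["name"]
    for c in customers
    if c["country"] in COUNTRIES_WITH_STATES and not c["state"].strip()
]
print(suspect)  # only Bert is flagged; Chloe's blank state is fine
```

Bert gets flagged for follow-up while Chloe’s blank passes, which is exactly the distinction a simple “make it required” rule cannot express.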

Inconsistent data often results from fields in which users enter values using a textbox rather than selecting from a list of possible values.  It would be nice to limit users to a list of possible values in all cases, but at times the number of possible values does not make that feasible.  While the list of states in the United States is a relatively small list that lends itself to a dropdown control, suppose for a second that you left this field open for free-form input.  For my adopted state of Florida, I could get records with values such as: Florida, FL and Fla.  None of these values is, strictly speaking, wrong, but any analysis by state would require complex logic to group records with these three state name values together.

Invalid data is what most people think of when you mention data quality problems.  This is data that is just wrong.  Again using the state field in a customer table as a free-form input box, suppose that a user enters the value ‘ZZ’.  While we may recognize this value as one that does not represent a state, a computer program will not unless it can compare the entered value to a list of acceptable values.  Invalid data can consist of anything from misspellings to wrong words that are spelled correctly.  For example, until data types became smart enough to recognize valid dates without a complex user-written validation routine, systems may have allowed dates such as 02/29/13, even though 2013 was not a leap year.
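Both checks described above boil down to comparing a value against its acceptable domain. A minimal sketch, with the state list truncated for illustration: the domain-set lookup catches ‘ZZ’, and the standard library’s date parser rejects 02/29/13 because it enforces real calendar dates.

```python
from datetime import datetime

# Truncated for illustration; a real list would hold all 50 states.
VALID_STATES = {"AL", "AK", "AZ", "CA", "FL", "NY", "TX"}

def is_valid_state(code: str) -> bool:
    return code.strip().upper() in VALID_STATES

def is_valid_date(text: str) -> bool:
    # strptime enforces real calendar dates, so 02/29/13 fails:
    # 2013 was not a leap year.
    try:
        datetime.strptime(text, "%m/%d/%y")
        return True
    except ValueError:
        return False

print(is_valid_state("ZZ"))       # False
print(is_valid_date("02/29/13"))  # False
print(is_valid_date("02/12/13"))  # True
```

This is the point of the paragraph in code form: a program cannot call ‘ZZ’ invalid on its own; it can only compare the value against a domain someone has defined.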

Inaccurate data may seem to be the same thing as invalid data at first, but it is really different.  Inaccurate data is data that is valid for the possible data domain, just the wrong value.  For example, a date of 02/12/13 is a valid date, but it could be inaccurate if the correct date should be 02/21/13.  These types of errors are often the hardest to find, much less fix.

Duplicate data occurs when someone or some process enters the same record or group of records more than once.  Duplicate data might occur when a customer hits the Submit button more than once in the process of placing an order, or when they hit the Backspace key to go back a few pages and then re-execute the code on those pages, or when users or support staff upload or process the same data file multiple times.  Sometimes the duplicate data is easy to spot, especially when all of the fields in two or more records have the same values.  Other times, when only some fields remain the same, identifying duplicate data can become a challenge.
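The two cases above (exact duplicates versus records that only partially match) can be sketched like this. The order records and field names are hypothetical; the point is that exact duplicates fall out of a simple key comparison, while partial matches can only be flagged for review:

```python
from collections import Counter

# Hypothetical order records; field names are illustrative only.
orders = [
    {"customer": "Ann",  "item": "Widget", "qty": 2},
    {"customer": "Ann",  "item": "Widget", "qty": 2},  # exact duplicate (double submit)
    {"customer": "Bert", "item": "Gadget", "qty": 1},
    {"customer": "Bert", "item": "Gadget", "qty": 3},  # same customer/item, qty differs
]

# Exact duplicates: every field matches, so a whole-record key finds them.
seen, exact_dupes = set(), []
for o in orders:
    key = tuple(sorted(o.items()))
    if key in seen:
        exact_dupes.append(o)
    seen.add(key)

# Partial duplicates: only some fields match; these need human judgment.
partial = Counter((o["customer"], o["item"]) for o in orders)
review = [k for k, n in partial.items() if n > 1]

print(len(exact_dupes))  # 1 exact duplicate found
print(review)            # both customer/item pairs repeat and warrant review
```

Notice that Bert’s two orders are not exact duplicates (the quantities differ), yet they still surface in the review list; deciding whether that is a double entry or two legitimate orders is exactly the hard part the paragraph describes.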

These are just some of the types of errors that can creep into your data and subsequently distort your analysis results if you do not clean them out first.  Starting next time, I will explore the use of Data Quality Services to address these and other problems.

C’ya next time.