Have you heard the following phrase at your organization: "We need a single version of the truth around here"? Sure you have. It is often cited as one of the reasons organizations invest the time and effort to build enterprise data warehouses. Every organization has information created by different parts of the business: financial, manufacturing, sales, marketing, productivity, customer satisfaction, employee performance, and many more areas. The problem is that this information often resides in more than one place in the organization. Sometimes it is updated regularly; sometimes it is not. Sometimes the data is detailed; sometimes a department only has access to summary data. In any case, the sad fact is that reports generated by one group can be very different from reports generated by another group simply because they used a different version of the data or transformed it differently. Then there is the skill level of the person gathering the data and how well they understand the data schema. Finally, if everyone who needs the data has to determine how to gather and transform it for their own reports, they would all need the skills to access the different data sources, they would need to know the schemas in those sources, and they would need to know how to transform data in the languages of those individual systems.
Organizations use many of these arguments to justify creating a single team designated to build the Enterprise Data Warehouse and to populate it with the most recent data from all data sources. By centralizing the data gathering function, users can be assured that the data is collected and transformed in exactly the same way each time it is updated. The team can write procedures to correct invalid data, account for missing data, and dedup data that may appear in multiple sources. Even if the data comes from different sources such as an Oracle database, SAP, SQL Server, MySQL, or something else, this central team can focus and tune the skills needed to create the extract and transform functions. Once the data is centralized, users from any department in the organization can use simple reporting tools, or even end-user analysis tools like Excel and PowerPivot tables, to analyze and display their data.
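To make the idea concrete, here is a minimal sketch of the kind of centralized cleanse-and-dedup step described above. The record layout, field names, and the "most recently updated wins" rule are all illustrative assumptions, not any particular warehouse's design:

```python
from datetime import date

# Hypothetical records pulled from two source systems; field names are
# illustrative, not from any real schema.
crm_rows = [
    {"customer_id": "C001", "email": "Ann@Example.com ", "updated": date(2024, 3, 1)},
    {"customer_id": "C002", "email": "bob@example.com", "updated": date(2024, 2, 15)},
]
erp_rows = [
    {"customer_id": "C001", "email": "ann@example.com", "updated": date(2024, 4, 10)},
    {"customer_id": "C003", "email": None, "updated": date(2024, 1, 5)},
]

def cleanse(row):
    """Apply the same corrections on every load: trim and normalize text,
    substitute a placeholder for missing values (an assumed policy)."""
    row = dict(row)
    row["email"] = (row["email"] or "unknown@invalid").strip().lower()
    return row

def dedup(rows):
    """Keep one record per customer_id, preferring the most recently updated."""
    best = {}
    for row in map(cleanse, rows):
        key = row["customer_id"]
        if key not in best or row["updated"] > best[key]["updated"]:
            best[key] = row
    return list(best.values())

warehouse_rows = dedup(crm_rows + erp_rows)
```

Because these rules live in one place and run the same way on every refresh, every downstream report sees the same cleansed, deduplicated version of the data.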
After putting a system like this in place, it is often argued that no one outside of the data warehouse team should ever need to access the raw data to perform their analysis or build their reports. The argument goes that it is the responsibility of the data warehouse team to pull data from the raw systems and generate the sanitized, homogenized data store for everyone else to use. Because the end-user community only has to learn how to access a single system instead of potentially dozens, this reduces training time and lets users become productive analyzing the data much sooner.
But what if the data, although gathered the same way each time, is not correctly transformed when entered into the common data store? What if the dedup logic is not quite right? What if older data overwrites newer data by accident? If other groups are discouraged from gathering and transforming the raw data themselves, errors can remain hidden for a long time. Depending on the severity of the error, management decisions based on this data could adversely affect the future of the organization. A little extreme? Perhaps, but the point is that any centralized system that is never challenged and is taken as an absolute truth can harbor incorrect data that will come back to haunt you later.
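The "older data overwrites newer data" failure is easy to guard against at load time, and a simple check like the sketch below is one way a warehouse team might do it. The function name, record shape, and rejection policy are assumptions for illustration only:

```python
from datetime import date

def upsert(warehouse, row):
    """Insert or update a row, but never let an older record overwrite a
    newer one. Returns True if the row was applied, False if it was
    rejected as stale (a hypothetical guard, not any product's API)."""
    key = row["customer_id"]
    existing = warehouse.get(key)
    if existing is not None and row["updated"] < existing["updated"]:
        return False  # stale record: flag it for review instead of applying
    warehouse[key] = row
    return True

warehouse = {}
upsert(warehouse, {"customer_id": "C001", "updated": date(2024, 4, 10)})
# A late-arriving March record must not clobber the April one:
applied = upsert(warehouse, {"customer_id": "C001", "updated": date(2024, 3, 1)})
```

Without a guard like this, whichever source file happens to load last silently wins, and the error surfaces only when someone questions the numbers.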
In addition, some people in the industry recognize this problem and maintain that getting data to the masses for analysis cannot wait until the data is perfectly cleansed, because that could take too long, reducing or completely eliminating the time available to effect the change the data may indicate is needed. But could the 'I need the data and I need it now' approach of management result in decisions with negative impacts on the organization? The question is how the organization accounts for known data issues. Is a little bad data worth the time saved to make a decision now? Even if the data warehouse moves forward with these known data issues because some data is better than no data, what can we, as IT professionals, do?
While it is easy to say you can be more careful with the initial data extract and transformation processes, the simple truth is that no matter how careful we may be, errors do and will occur. On the other hand, isn't it possible to treat the entire data-handling process with an approach similar to the one used for systems design? What I'm talking about here is an iterative approach to data validation and cleansing.
Let's take as a given that everyone involved in the data extract and transform does their best to accurately move the data into the data warehouse. However, once the data is in the data warehouse, it should never be assumed that the processes used are perfect and never need to be re-evaluated. While developers outside of the data warehouse team should not spend all of their time reproducing the same data structures as the data warehouse team, discouraging them from occasionally pulling their own data from the raw sources to validate the data they are using can be a mistake. This is especially true when they 'think' something in the data is not making sense. Just the same, they should not be allowed to create a permanent alternative to the data warehouse, thus duplicating the effort.
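That occasional validation can be as lightweight as a spot-check that compares an aggregate computed independently from the raw source against the same aggregate in the warehouse. The sketch below assumes the two totals have already been pulled; the function name and the 0.5% tolerance are illustrative choices, not a standard:

```python
def reconcile(raw_total, warehouse_total, tolerance=0.005):
    """Return True when the warehouse total agrees with the raw-source
    total within a relative tolerance (0.5% here, an assumed threshold)."""
    if raw_total == 0:
        return warehouse_total == 0
    return abs(warehouse_total - raw_total) / abs(raw_total) <= tolerance

# e.g. last month's sales total, computed independently from each system
ok = reconcile(1_000_000.0, 1_003_000.0)    # within 0.5% of each other
drift = reconcile(1_000_000.0, 1_020_000.0) # 2% apart: worth investigating
```

A failed check does not prove the warehouse is wrong, only that someone should look; that is exactly the healthy skepticism the next paragraph argues for, without building a shadow warehouse.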
The bottom-line message is that healthy skepticism should never be discouraged. On the other hand, the creation of alternative data warehouses is a waste of resources and a direct violation of the goal of a single source of the truth. Deciding how much questioning of the data warehouse's accuracy to allow is a difficult tightrope to walk, and leaning too far one way or the other can be dangerous. In any case, a single version of the truth should always be the goal, but it should not be accomplished at the risk of silencing all voices that ask for proof that the truth is correctly represented by that one voice.
See you later this week for more on working with the SharePoint Chart Web Part.