Your Logical Data Model is not Normal

If you have been reading this blog over the last several years, you know that I have been a strong supporter of using PowerPivot and the Tabular model to perform data analysis, even for power users, not just DBAs. What you may not have realized is that I’ve been secretly teaching you a little about data modeling. What is data modeling? It is really nothing more than the collection of all the data in tables and the relationships between those tables in a database. Did you know that there are at least two major ‘types’ of data models when it comes to how your structure your data into tables? Each one serves a different purpose. Therefore, it can reasonably be argued that neither one is more correct than the other. But you need to understand when to use each type. That is what I want to discuss today.

Most DBAs and developers who work with databases like SQL Server, Oracle, and many common database are comfortable with the relational model for creating tables and defining the relationships that connect them. In fact, they immediately start to normalize a denomalized database in their heads within seconds of seeing the data schema. The relational model relies on the application of the rules of data normalization introduced by Edgar F. Codd in 1970. In essence, the goal of the rules of data normalization is to minimize the data redundancy which also has the effect of decreasing the overall size of the database while at the same time making it easier to maintain information that would otherwise be repeated through many records.

There are three primary rules that are used to determine whether a database, a collection of tables, has been normalized. These rules are:

First Normal Form: No two rows of data may contain repeating data or groups of data. Such repeating data must be split into a separate but connected table. For example, a sales order may contain the purchase of one or more items. Because the number of items included in a sales order is not predefined, it must be split into a separate table with one row for each item on the sales order. These two tables are then typically connected by an order id.

Second Normal Form: This rule only applies to tables which have a compound primary index, an index built from two or more fields. In this rule, all other fields in the table must depend on the entire compound index value, not only a portion of it. A possible example of this might be a table that includes students at a school in which the primary index combined the school name (or id) along with the student name (or id). Imagine that the table also included the address and phone number of the school. This information is not depended on the combination of the school and the student. It only depends on the school. Therefore, this violation of the second normal form requires that the data related only to the school be split into a second table that includes only school information.

Third Normal Form: This rule requires that every field not part of the primary index be depended on the primary index. Going back to my favorite Contoso database, one could argue that in the FactSales table, the Sales Amount field is redundant since this table also includes the sales quantity, sales price, and any sales discounts or returns. Why is it redundant? Because it can be calculated based on the other values of other columns in the record. Therefore, to fully achieve, third normal form, this field should be removed.

While there are other special case rules that can be applied to normalization of a database, most DBAs will be satisfied with a database that satisfies these three rules. They will then build the tables in their database corresponding to these rules and thus create the physical data model. It is called the physical data model because it defines the physical schema of the tables and the relationships between them.

However, business users of the data don’t look at the database that way. In fact most business users would be baffled by the large number of tables required and the relationships between them. In fact, they will not understand why they need to combine data from a half dozen to a dozen tables just to answer a single query question. As an example, let’s look at how a business user might think of just the products within the Contoso database.

The figure below shows the normalized data model for products and their subcategories and categories. As you can see from the schema, the only important piece of information in the subcategories table is the name of the subcategory. Similarly, the only important piece of information in the categories table is the name of the category.

Most business users would not think of category and subcategory names as part of separate tables, but as attributes of the product itself. In fact they would think of product information more like the following figure.

Imagine a database model with dozens of normalized tables and then try to envision how the typical business user sees the data with perhaps only a half dozen tables after denormalization.

In PowerPivot, we can address this issue in one of two ways. The first way would be to add the category and subcategory names to the product table in SQL Server before loading the data into PowerPivot. This would essentially make the physical data model the same as the logical data model used by the end-users. However, this data model would no longer be normalized.

Physically denomalizing the data tables is not the only solution. I could, as shown before in https://sharepointmike.wordpress.com/2012/07/21/power-pivot-hierarchies-part-2-of-powerpivot-2012-series/, build two calculated fields in the product table that use the RELATED() function to get the category and subcategory names and display them in the product table. I could then hide the two tables, dimCategory and dimSubcategory, from the user so that they would see a structure similar to their expected denormalized logical data model even though the data still is physically stored in a normalized structure.

The advantage of making the changes to the data model outside of PowerPivot is that it effectively reduces the amount of data that PowerPivot must store in memory thus potentially allowing a larger model to be created. The advantage of making the changes inside PowerPivot by using the RELATED() function and then hiding the technical tables that are still needed in the model but do not need to be seen by the client is that it preserves the sanity of those DBAs who cringe every time someone tries to their denormalize their data. Both methods will allow me to create similar Pivot tables in my Excel spreadsheet. Thus both methods can be considered correct.

The thing to remember is this. Data Normalization was developed to make data storage and maintenance easier and more efficient. However, data analysis and reporting often requires a fair amount of data denormalization. This is especially true when performing data analysis using reports, pivot tables and charts, and even third party data analysis tools. In fact, the logical view of the data schema aids in the performance of most data analysis. Just remember that your logical model is not a normalized model.

C’ya next time.

Advertisements

It’s Only the Role I’m Playing

This week I’m going to return to my favorite sample database, Contoso. If you remember, Contoso consists of sales data for a company that produces several different lines of customer electronics, from computers to phones to TVs. The data spans several years of sales. In past examples, I related the FactSales table (the table containing all of the sales records) to several tables including Channel (DimChannel), Date (DimDate), Product (DimProduct) and Product Sub-Category (DimProductSubCategory). In fact, the data source pre-defined these relations so that when I imported the data into my PowerPivot model in Excel, these relationships appeared by default as shown in the following figure.

Visually, I could switch to the Diagram view of the data model to see these relationships represented by solid lines connecting these five tables as shown below.

However, suppose I have additional information in my FactSales table. The DateKey field that I have used in the past identifies the sale date for each of the sales records. In the real world, there may be additional dates associated with each sale. For example, it is not hard to imagine that each sale would also have a Delivery Date and an Invoice Due Date. Now I might ask, does it make sense to report on sales based on the sale date or does it make more sense to report on sales based on the invoice due date or even the date of payment. Well, you might say, ‘That depends on who is asking the question.’ Very true. My Marketing/Sales manager might want to see sales by the sales date. He or she does not care about when the invoice is paid off. They just need to know if they hit their sales quotas for each month. Therefore, they might need a report that uses the default relationship and looks something like the following:

On the other hand, the Chief Financial Officer is not as concerned about when the sale took place, but when the invoice for the sale is due because only then is the income truly realized for the company. Thus the above report does not meet their needs.

Now suppose that I had a few additional columns, as mentioned earlier, that told me some other dates such as when the items were delivered and/or when the invoice for the sale was due. Let’s further suppose that these columns exist in the FactSales table and might appear like the following figure.

With these columns in my FactSales table, I would want to create relations between them and my date table (DimDate) so that I could generate reports using them. The following figure shows the Create Relationship screen in which I define a new relationship between the delivery date (DeliveryDate) column in FactSales and the date (DateKey) column in DimDate. When I click the Create button, the Power Pivot engine creates a second relationship between these two tables. (Remember the first relationship was between the sales date (Datekey) column in FactSales and the (Datekey) column in DimDate.

In a similar fashion, I create a third relation between these two tables to connect the invoice due date (InvDue) column in FactSales and the date (Datekey) column in DimDate. For the purposes of this demo, I am going to stop there. However, I could create additional relationships between any other date fields in the FactSales table and the date (DateKey) column in DimDate. Switching to the Diagram view, I could now see something like the following between these two tables.

Notice that there are three lines between FactSales and DimDate. One of these lines is solid and the other two lines are dashed. You can only have one active relationship between any two tables. That active relationship is represented by the solid line. The other two relations are inactive at the moment. These three relationships are collectively called role playing relations and DimDate is a role playing dimension because only one relation at a time can be active from the DimDate dimension. Think of it this way, the date column (DateKey) in DimDate can play one of three different roles. It can either play the part of the sales date, the delivery date, or the Invoice date. However, it can only play one role at a time. Unless all three dates were exactly the same, I have to choose which role I want DateKey in DimDate to play with the FactSales table by selecting one of the relationships.

The initial data import associate the DateKey column in DimDate to the DateKey value in FactSales which represented the date the item was sold. As long as the report requested wants to allocate sales to the sales date, I need to do nothing other than generate the Pivot Table report as shown below.

However, if I now have to generate a report for my CFO, I would have to go into the model and change the role that DateKey in DimDate plays. I might try to simply right click on the dashed line representing the connection between the DateKey in DimDate with the InvDue column in FactSales and select: Mark as Active.

This action would generate the following error message.

The problem is clearly stated. You can only have a single active relationship between the two referenced tables. Therefore, I must first right click on the current active relationship to deactivate it before activating a different relationship between the two tables.

Once the original relationship is deactivated, I can go back and activate the new relationship between DateKey and InvDue.

Knowing that only one role (relationship) can be active at a time, I am a little surprised that Microsoft did not automatically deactivate the original relationship when a new one is set active. However, like many things, they did not ask me. Plus they may have had other reasons for not automatically deactivating the current relationship. In any case, after making this change and returning to my pivot table (without making any changes to the pivot table itself), the data automatically updates using the new relationship to show the sum of sales by month of the invoice due date rather than the sales date.

If you examine the sales totals for either the years or individual months, you can quickly see that this role change for the DimDate dimension makes a significant change in the sales numbers reported each month.

That’s all for this week. I hope that now you have a better idea how you can use a single dimension to play different roles. (You could also have added the DimDate table three times to the model, once for each date column in FactSales so that each instance of the DimDate table could have a single relationship to FactSales. Then by changing which dimension you select to display in the Pivot Table, you can achieve essentially the same result. This may be easier for the end-user who does not have access to the data model to make the changes listed above. However, your power-users may prefer a simpler model with fewer tables and can change the role played by those dimensions as needed.)

C’ya next time.

Finding Out When Enough Is Enough

Last week I asked the question, ‘How many people must you ask in a survey?’  While I talked about the topic in generalities at that time, I also mentioned that it would be interesting to test the hypothesis that a survey only needs to query a small percentage of the population to get meaningful results.

To test that theory, I took some data from a recent survey that was conducted over several months and disguised the question, but kept the results.  The question was a basic Likert Scale type question in which the question itself postulates a specific position and asks the survey take whether they agree or disagree with the statement.  This survey was conducted using the SharePoint Survey list and was set to allow a user to only answer the question once so as to not pad the ballot box so to speak.

The  total possible population of respondents was around 12,000.  Of course the survey owners wanted to get as many respondents as possible which is why they conducted the survey over several months.  However, I have always been of the opinion that for this survey, anything more than about a month of making the survey really did not serious effect the overall results.  By that I mean that the responses after about a month were a true representation of the total population and that there was no need to try to get 100% participation.

However, the question I chose to use for this study contained five possible responses listed below:

  • Strongly Agree
  • Agree
  • Neither Agree nor Disagree
  • Disagree
  • Strongly Disagree

Since I was collecting the data with SharePoint, I also stored the date on which each survey was taken.  Therefore, I could tell on any given date, how many responses have been entered since the start of the survey.  Knowing the total population, I could very easily determine the percent participation.  By exporting the data from SharePoint to an Excel spreadsheet, an extremely valuable option from a SharePoint survey, I could load the data into a PowerPivot data model and then create a variety of tables and charts based on the data.

The first figure I show below is the final tabulated results  after three and half months of data collection.  You can see that the total count of responses was only 8,203 out of a possible 12,000.  This represents a little more than 67% of the population.  Of the people who responded to the question (Yes, the question was changed to protect the guilty), ‘I believe Pivot Tables help me analyze data at work, 63.7% of them strongly agreed with the statement.  In fact, over 96% agreed or strongly agreed with the statement.  But my question was, did I need to poll 67% of the population to discover that?

Survey01

Going back to my PowerPivot table, I added a report filter (For those that don’t have PowerPivot, this data set is small enough that a simple Excel Pivot table would also work fine.).

Survey02

When I opened the filter dropdown as shown in the next figure, I can expand the All node of the value tree to show all the possible values in the table.  Note each date is represented as a separate entry.

Survey03

In order to select multiple dates as my filter, I need to click the checkbox at the bottom of the list box: Select Multiple Items.  This action places a checkbox next to each date as well as the All node.  By default, all records (dates in my case) are selected.

Survey04

I first need to unselect the checkbox next to the All node.  Then I can select only the dates that I want to appear in my table.  For example, in the next figure, I select only the first three days of the survey.

Survey05

When I click OK, my table updates and shows a total count of 214 survey responses on which 76.64% strongly agreed with the statement.  While this is close to the final 63.7% at the end of the survey period, it is still 13% away.  Obviously 3 days of a survey are not enough.

Survey06

I then chose 10 more days through February 2nd.

Survey07

This time with 1103 responses, my results for strongly agree was 65.55% and my total for strongly agree and agree were 96.7%.  Now I am getting really close to my final results and after only 13 days rather than 3 and a half months.

Survey08

I added another 10 days bringing my survey count up to 4023, nearly half of the three and half month result and my Strongly Agree percent is starting to settle in at 63.81%, only a tenth of a percent off of the final result.

Survey09

So, just for fun, (statistics is fun isn’t it?) I decided to chart the percentage of Strongly Agree responses as a function of the survey date.  I noticed that by the time I hit a month into the survey, my results had flattened out to around 64% plus or minus less than a half percent.

Survey10

I then plotted the percent response rate assuming a maximum of 12,000 possible responders and to only about a 15-17% response rate.

Survey11

So after surveying only about 15% of the population, I could say that the additional survey results over the next two and half months would not significantly affect my results.  Therefore, I could also say that it would be reasonable to assume that even though I only surveyed 67% of the total population, getting responses from the remaining 33% would probably not significantly change my results.

That is the power of surveys.  The trick is determining when the survey results begin to flatten out.  Every survey can be a little different and the number of possible answers to the survey will also affect the result (something we can maybe test in a future blog entry).

If I were plotting this data on a daily basis, I would have been able see when my results began to flatten and be able to ‘declare a winner’ with a great degree of certainty after a month and half or perhaps less.  In fact, with greater experience with similar types of data and by using questions with fewer possible answers, the size of the survey can be greatly reduced while retaining a high level of accuracy in the result.

I hope you found this interesting.  I chose to give the Tuesday blog a bit more of a technical twist this week because I am about to go on a summer writing schedule.  What does that mean?  I may drop back to one blog entry a week for most weeks.  There is just so many other things to do in the summer that are more fun than writing a blog, like cutting the grass and pulling weeds from the garden or even trimming overgrown bushes.  Anyway, I’ll try to keep a few non technical blogs in the mix each month to lighten up the reading from the dry technical topics.  When fall comes, I will switch back to two entries a week.

C’ya later.