Dealing with Blanks in Your Data Model

Last time, I discussed some reasons why you might want to denormalize your data model when using PowerPivot ( As you may remember, one reason is to make the model easier for the end user to understand. A second reason is that if you denormalize the data outside of PowerPivot, you can substantially decrease the model size by eliminating many if not most of the lookup tables and just adding the lookup information you need in the entity table such as the Product table. I will talk more about denormalizing the data outside of PowerPivot, or at least during the process of importing the data next time. This time, I want to look at another issue that sometimes occurs in your data that could make your data harder to interpret by your end-users.

I’m talking about blank or null fields in the data. Often times a table will have one or more columns which are optional. Take for example a product table like the one we are using from Contoso. Not every product may be sold by color. Some products may not belong to a style or a class. Some products may not have a measured size or a weight. In some databases, the corresponding lookup tables have a special entry for no color, no style, or no class so that every product can point to one and only one value in the lookup. However, you may have a lookup table that only contains specific values. If a product does not exactly match one of those values, the reference from the product table to the lookup table may be left blank. When the user sees a blank value, they may wonder if the user just forgot to select a value or if no value from the lookup table applied. In other words, they do not really know the reason why the field is blank.

Suppose we start with the data model using Contoso data in the following figure. Note to anyone trying to reproduce this example, I went into the Product table ahead of time and removed the ClassID and ClassName from all products in which the color was ‘Blue’ just to create some records with blank values.

After bringing this data into an Excel PowerPivot data model, I create a simple pivot table to show the sales amount and sales counts by class. Notice that the three product classes that are defined include:

  • Economy
  • Regular
  • Deluxe

But you can see from the report that 87,597 of the over 2 million sales records had no class defined. From a user’s viewpoint, they do not know if someone forgot to enter the class for some of the products or if the definition of class does not apply to some products.

As you build your data model in Excel, you can fix this problem by replacing the blank values with something like the string: “[No Class]”. The difficulty is, however, after you load your data into the PowerPivot data model, you cannot edit individual cells. You can, on the other hand, create a new calculated column.

Since I want the new calculated column to have the name ClassName and because I cannot have two columns in the same table with the same name, I begin by renaming the existing ClassName column to ClassNameOriginal. The actual new name I use does not matter as long as it is different. Also note that changing the column names in the Excel data model has absolutely no effect on the name of the column in the original data source.

After renaming the original ClassName column, I add a new column to the end of the table called ClassName. For the calculation to replace the blanks from the original column, I need to know if those blanks are Null values or if they contain an empty string. If the original column used empty strings when a class value was not defined, I can use the following DAX expression to create my new class name value.

= IF(([ClassNameOriginal]=””),”[No Class]”, [ClassNameOriginal])

This expression uses a simple IF() function to see if the value in the column ClassNameOriginal contains the empty string. If it does, the function outputs the string: [No Class]. Otherwise, the function outputs the current row value of the column ClassNameOriginal. This is shown in the following figure.

On the other hand, if the original ClassName column uses a NULL value when there is no corresponding class, the expression needed here must be defined with the ISBLANK() function to test for NULL values as in:

= IF(ISBLANK([ClassNameOriginal]),”[No Class]”, [ClassNameOriginal])

Whichever method is appropriate for your table, you now have a column that you can use for the PivotTable. Before proceeding however, be sure to go back and hide the original class name column: ClassNameOriginal so as not to confuse users with two similar columns that they could use as dimensions in their tables.

Since I still have my original pivot table open, I can just switch over to the Pivot table and it should automatically update to show that the blank class now has the name [No Class]. (Which goes to show you that having no class is better than just being a blank. J )

That’s it for this week. By the time you read this I will be at Code Camp Orlando giving my presentation: Calculated Columns, Measures, and KPIs, Oh My!

C’ya next time when I will show you how to use the query designer while loading data to the PowerPivot model to fix some of the issues from this week and last week.

Your Logical Data Model is not Normal

If you have been reading this blog over the last several years, you know that I have been a strong supporter of using PowerPivot and the Tabular model to perform data analysis, even for power users, not just DBAs. What you may not have realized is that I’ve been secretly teaching you a little about data modeling. What is data modeling? It is really nothing more than the collection of all the data in tables and the relationships between those tables in a database. Did you know that there are at least two major ‘types’ of data models when it comes to how your structure your data into tables? Each one serves a different purpose. Therefore, it can reasonably be argued that neither one is more correct than the other. But you need to understand when to use each type. That is what I want to discuss today.

Most DBAs and developers who work with databases like SQL Server, Oracle, and many common database are comfortable with the relational model for creating tables and defining the relationships that connect them. In fact, they immediately start to normalize a denomalized database in their heads within seconds of seeing the data schema. The relational model relies on the application of the rules of data normalization introduced by Edgar F. Codd in 1970. In essence, the goal of the rules of data normalization is to minimize the data redundancy which also has the effect of decreasing the overall size of the database while at the same time making it easier to maintain information that would otherwise be repeated through many records.

There are three primary rules that are used to determine whether a database, a collection of tables, has been normalized. These rules are:

First Normal Form: No two rows of data may contain repeating data or groups of data. Such repeating data must be split into a separate but connected table. For example, a sales order may contain the purchase of one or more items. Because the number of items included in a sales order is not predefined, it must be split into a separate table with one row for each item on the sales order. These two tables are then typically connected by an order id.

Second Normal Form: This rule only applies to tables which have a compound primary index, an index built from two or more fields. In this rule, all other fields in the table must depend on the entire compound index value, not only a portion of it. A possible example of this might be a table that includes students at a school in which the primary index combined the school name (or id) along with the student name (or id). Imagine that the table also included the address and phone number of the school. This information is not depended on the combination of the school and the student. It only depends on the school. Therefore, this violation of the second normal form requires that the data related only to the school be split into a second table that includes only school information.

Third Normal Form: This rule requires that every field not part of the primary index be depended on the primary index. Going back to my favorite Contoso database, one could argue that in the FactSales table, the Sales Amount field is redundant since this table also includes the sales quantity, sales price, and any sales discounts or returns. Why is it redundant? Because it can be calculated based on the other values of other columns in the record. Therefore, to fully achieve, third normal form, this field should be removed.

While there are other special case rules that can be applied to normalization of a database, most DBAs will be satisfied with a database that satisfies these three rules. They will then build the tables in their database corresponding to these rules and thus create the physical data model. It is called the physical data model because it defines the physical schema of the tables and the relationships between them.

However, business users of the data don’t look at the database that way. In fact most business users would be baffled by the large number of tables required and the relationships between them. In fact, they will not understand why they need to combine data from a half dozen to a dozen tables just to answer a single query question. As an example, let’s look at how a business user might think of just the products within the Contoso database.

The figure below shows the normalized data model for products and their subcategories and categories. As you can see from the schema, the only important piece of information in the subcategories table is the name of the subcategory. Similarly, the only important piece of information in the categories table is the name of the category.

Most business users would not think of category and subcategory names as part of separate tables, but as attributes of the product itself. In fact they would think of product information more like the following figure.

Imagine a database model with dozens of normalized tables and then try to envision how the typical business user sees the data with perhaps only a half dozen tables after denormalization.

In PowerPivot, we can address this issue in one of two ways. The first way would be to add the category and subcategory names to the product table in SQL Server before loading the data into PowerPivot. This would essentially make the physical data model the same as the logical data model used by the end-users. However, this data model would no longer be normalized.

Physically denomalizing the data tables is not the only solution. I could, as shown before in, build two calculated fields in the product table that use the RELATED() function to get the category and subcategory names and display them in the product table. I could then hide the two tables, dimCategory and dimSubcategory, from the user so that they would see a structure similar to their expected denormalized logical data model even though the data still is physically stored in a normalized structure.

The advantage of making the changes to the data model outside of PowerPivot is that it effectively reduces the amount of data that PowerPivot must store in memory thus potentially allowing a larger model to be created. The advantage of making the changes inside PowerPivot by using the RELATED() function and then hiding the technical tables that are still needed in the model but do not need to be seen by the client is that it preserves the sanity of those DBAs who cringe every time someone tries to their denormalize their data. Both methods will allow me to create similar Pivot tables in my Excel spreadsheet. Thus both methods can be considered correct.

The thing to remember is this. Data Normalization was developed to make data storage and maintenance easier and more efficient. However, data analysis and reporting often requires a fair amount of data denormalization. This is especially true when performing data analysis using reports, pivot tables and charts, and even third party data analysis tools. In fact, the logical view of the data schema aids in the performance of most data analysis. Just remember that your logical model is not a normalized model.

C’ya next time.

It’s Only the Role I’m Playing

This week I’m going to return to my favorite sample database, Contoso. If you remember, Contoso consists of sales data for a company that produces several different lines of customer electronics, from computers to phones to TVs. The data spans several years of sales. In past examples, I related the FactSales table (the table containing all of the sales records) to several tables including Channel (DimChannel), Date (DimDate), Product (DimProduct) and Product Sub-Category (DimProductSubCategory). In fact, the data source pre-defined these relations so that when I imported the data into my PowerPivot model in Excel, these relationships appeared by default as shown in the following figure.

Visually, I could switch to the Diagram view of the data model to see these relationships represented by solid lines connecting these five tables as shown below.

However, suppose I have additional information in my FactSales table. The DateKey field that I have used in the past identifies the sale date for each of the sales records. In the real world, there may be additional dates associated with each sale. For example, it is not hard to imagine that each sale would also have a Delivery Date and an Invoice Due Date. Now I might ask, does it make sense to report on sales based on the sale date or does it make more sense to report on sales based on the invoice due date or even the date of payment. Well, you might say, ‘That depends on who is asking the question.’ Very true. My Marketing/Sales manager might want to see sales by the sales date. He or she does not care about when the invoice is paid off. They just need to know if they hit their sales quotas for each month. Therefore, they might need a report that uses the default relationship and looks something like the following:

On the other hand, the Chief Financial Officer is not as concerned about when the sale took place, but when the invoice for the sale is due because only then is the income truly realized for the company. Thus the above report does not meet their needs.

Now suppose that I had a few additional columns, as mentioned earlier, that told me some other dates such as when the items were delivered and/or when the invoice for the sale was due. Let’s further suppose that these columns exist in the FactSales table and might appear like the following figure.

With these columns in my FactSales table, I would want to create relations between them and my date table (DimDate) so that I could generate reports using them. The following figure shows the Create Relationship screen in which I define a new relationship between the delivery date (DeliveryDate) column in FactSales and the date (DateKey) column in DimDate. When I click the Create button, the Power Pivot engine creates a second relationship between these two tables. (Remember the first relationship was between the sales date (Datekey) column in FactSales and the (Datekey) column in DimDate.

In a similar fashion, I create a third relation between these two tables to connect the invoice due date (InvDue) column in FactSales and the date (Datekey) column in DimDate. For the purposes of this demo, I am going to stop there. However, I could create additional relationships between any other date fields in the FactSales table and the date (DateKey) column in DimDate. Switching to the Diagram view, I could now see something like the following between these two tables.

Notice that there are three lines between FactSales and DimDate. One of these lines is solid and the other two lines are dashed. You can only have one active relationship between any two tables. That active relationship is represented by the solid line. The other two relations are inactive at the moment. These three relationships are collectively called role playing relations and DimDate is a role playing dimension because only one relation at a time can be active from the DimDate dimension. Think of it this way, the date column (DateKey) in DimDate can play one of three different roles. It can either play the part of the sales date, the delivery date, or the Invoice date. However, it can only play one role at a time. Unless all three dates were exactly the same, I have to choose which role I want DateKey in DimDate to play with the FactSales table by selecting one of the relationships.

The initial data import associate the DateKey column in DimDate to the DateKey value in FactSales which represented the date the item was sold. As long as the report requested wants to allocate sales to the sales date, I need to do nothing other than generate the Pivot Table report as shown below.

However, if I now have to generate a report for my CFO, I would have to go into the model and change the role that DateKey in DimDate plays. I might try to simply right click on the dashed line representing the connection between the DateKey in DimDate with the InvDue column in FactSales and select: Mark as Active.

This action would generate the following error message.

The problem is clearly stated. You can only have a single active relationship between the two referenced tables. Therefore, I must first right click on the current active relationship to deactivate it before activating a different relationship between the two tables.

Once the original relationship is deactivated, I can go back and activate the new relationship between DateKey and InvDue.

Knowing that only one role (relationship) can be active at a time, I am a little surprised that Microsoft did not automatically deactivate the original relationship when a new one is set active. However, like many things, they did not ask me. Plus they may have had other reasons for not automatically deactivating the current relationship. In any case, after making this change and returning to my pivot table (without making any changes to the pivot table itself), the data automatically updates using the new relationship to show the sum of sales by month of the invoice due date rather than the sales date.

If you examine the sales totals for either the years or individual months, you can quickly see that this role change for the DimDate dimension makes a significant change in the sales numbers reported each month.

That’s all for this week. I hope that now you have a better idea how you can use a single dimension to play different roles. (You could also have added the DimDate table three times to the model, once for each date column in FactSales so that each instance of the DimDate table could have a single relationship to FactSales. Then by changing which dimension you select to display in the Pivot Table, you can achieve essentially the same result. This may be easier for the end-user who does not have access to the data model to make the changes listed above. However, your power-users may prefer a simpler model with fewer tables and can change the role played by those dimensions as needed.)

C’ya next time.

When Plain Text Isn’t

One of the problems many beginner SharePoint content editors have is pasting text into SharePoint from other sources. The problem has to do with the unseen or hidden information embedded in the text that formats the text in these other products.

Take for example a typical passage of text in Microsoft Word as shown below:

If I were to simply copy this text from Microsoft Word into SharePoint, the result might at first look like a fairly good approximation of the original text as displayed in Word as shown in the following image.

However, when I look at the HTML of this text, I see that it is littered with embedded commands that would make it very difficult to edit by just looking at the visible portion of the text. The problem is that when I select text that is visible to me, I may not include all of the tags that wrapped around that text. Furthermore, if I were to try to change the style of the selected text, it may not appear to change if my selection included the original formatting tags that I want to replace, but are left in the code. The selection of the text may not even grab the embedded start and end tags, but only one or the other resulting in tags that are not property nested. The browser then struggles to interpret exactly what I want to do and may add additional but superfluous <div> tags to try to isolate the problem. While I could directly edit the HTML to fix these issues, I would not recommend this to the typical content editor unless I was very sure that they were familiar with HTML. In any case, the challenges of editing formatted text directly may be more than I typically want to handle.

One solution that I have been told to use in the past is to copy the text first to NotePad. NotePad does not support most of the special formatting features of Microsoft Word and therefore it strips out these embedded tags from the text. The following image shows the same text copied first to NotePad and then copied to SharePoint.

The text looks fairly clean of any special formatting. I might think that I can now easily go into this text and begin changing the formatting of specific portions of the text to match what I had in Microsoft Word. For the most part, I would be safe doing this. In fact, I am much better off than if I pasted the Microsoft Word text directly onto the SharePoint page. However, looking at the HTML of even this page, I can see that some special formatting still exists although it is not nearly as bad as the HTML pasted directly from the original Microsoft Word text.

Another option has been to try to remove the formatting from the text after pasting it on the page. To do this, I would first select all of the text that I just pasted and then open the Format Text ribbon as shown in the image to the right and select the icon that looks like the letter ‘A’ with an eraser to its lower right. The text that appears when I hover over this button says that it will clear all formatting from the selection leaving only plain text. While that sounds promising, in actual use, the results are less than I hoped for as the image below shows. Notice that the HTML, while starting to look fairly clean with only the actual text, still has a lot of formatting tags, specifically <div> and <span> tags.

Fortunately, in SharePoint 2010 there is another option, one that very few people find. Recognizing the problem with Copy and Paste, a method used by many people, Microsoft added a second paste option. When you are ready to paste your text in SharePoint, rather than just pressing the CTRL-V to paste the text that you have on the clipboard to the current cursor position, you can click on the lower half of the Paste button in the Format Text ribbon which happens to be the leftmost button on this ribbon. Clicking on the lower half of the button displays a submenu that includes the two options shown in the image to the right. The one that I am interested in is the option: Paste plaintext.

This time when I look at the resulting HTML, I see a much cleaner version of the text. The only special formatting left is the <BR> tags used to end paragraphs and create line breaks. The resulting text should be easy to maintain in SharePoint with the formatting tools in the Format Text ribbon.

In addition to the methods described here, there are many other text editing tools that allow you to either cut and paste text from your favorite editor or may even serve as an alternative text editor. I cannot attempt to evaluate them all here. Rather I will mention a few that I’ve heard of but leave it up to you to test whether they serve your needs. Two that come to mind are PureText for the PC which is a free download and Flycut for the Mac.

If you insist on using a browser other than IE (remembering that SharePoint was designed around integration with Microsoft products, not products from other companies), you might try one or more of the browser extensions for FireFox or Chrome that support copy and paste of just text. Just be aware that while some of these may work fine to copy text from one web page to another, they may not work fine with other applications such as Microsoft Word.

Finally, FireFox and Chrome allows you to paste just text by using the key combination CTRL+SHIFT+V. Similarly, the Macintosh uses the keys: Shift + OPTION + COMMAND + V.

If all of this seems like too much of a bother, remember that you can always create your text directly in SharePoint rather than starting from another text tool and then having to copy and paste that text into SharePoint. The SharePoint interface for entering text is very similar to Microsoft Word, or perhaps you use WordPad, and provides a good WYSIWYG interface for entering and formatting your text. Entering and formatting text directly within SharePoint practically guarantees that you will not have future issues editing the text.

That’s it for this week. C’ya next time.