May 30 2015

A New View in Excel

I have used many of my blogs over the last couple of years to talk about PowerPivot in Excel 2010 and Excel 2013. By now you should have a pretty good idea how to build data models and analyze your data. You may even have used the Pivot Chart capability to display your data results visually. But did you know that there is another way to visualize your data within Excel 2013?

Just as PowerPivot now comes preinstalled with Microsoft Office Excel Professional Plus 2013, so does Power View. To use this tool, you must also have Internet Explorer and Silverlight installed on your computer. If you have those prerequisites, you can enable Power View by opening the Options dialog in Excel and selecting Add-Ins. At the bottom of the right panel is an option to manage different classes of add-ins. Using the drop-down menu, select COM Add-ins and click the Go button. From the list of available add-ins, locate Power View and make sure the checkbox to the left is selected. Then click OK.

Next open the Insert ribbon in Excel 2013 and click the Power View button that appears in the Reports group. If this is your first time using Power View, you will need to enable it by clicking the Enable button that appears. This opens a separate Power View worksheet in the current workbook which will look something like the following:

If you do not have Silverlight installed, you will be prompted across the top of the Power View worksheet to do this. Click “Install” to install Silverlight. When the process completes, click the “Reload” button in the message bar to return to Power View.

In addition to the Power View worksheet, you should see a Power View tab with several options as shown below:

Because I started from a blank spreadsheet, I have no data I can use in my Power View report yet. In fact, the Power View Fields panel tells me that I need to create and select a range of cells with data and then click Power View from the Insert tab to proceed. For today’s quick example, I will create a simple table as shown below:

Once I’ve added this table to my Power View report worksheet, the Fields panel now shows me my active tables and the fields within those tables. Notice that numeric fields are automatically treated as aggregated sums. I can select the fields that I want to appear in my table by either clicking on the checkbox before the field name in the top half of the Power View Fields panel or I can drag the fields that I want down to the FIELDS box in the lower half of the panel. The advantage of dragging fields to the FIELDS box is that I can arrange the order of the fields here by dropping new fields in the position I want and I can even drag and drop fields later to rearrange the order of the fields in the table. The following figure shows the result of including all of the fields from the original table in my Power View table.

Of course, if I don’t want to display all the fields, I can remove them by simply unchecking the check box in the upper portion of the Fields panel, or I can use the dropdown menu to the right of any field in the FIELDS box to remove the field or change the aggregate function used for that field. For example, suppose I was interested only in the total student populations of each of the schools. I can simply remove the Student Grade column to achieve that result.

Next, suppose I wanted to have a second table or a chart on the same page. To do that, I begin by clicking in any blank area (not defined by the rectangle created by the first table). I then select the fields I want to include in my table or chart. In the example below, I use only the school name and student population just as was done in the first table.

I can then go to the Design ribbon associated with the Power View worksheet whose tab is displayed to the immediate right of the Power View tab. The first group contains options to switch the visualization of the data. The fourth icon, Other Chart, displays a dropdown that includes options to display the data as a line, scatter, and pie chart as shown below. There are some other interesting options in this ribbon like Map and Tiles that I will cover in future installments of this blog, but for now, let’s display a Pie chart of this data.

When I select PIE, Power View attempts to determine which fields to use for each part of the pie chart. In this case with only two fields, the answer is easy since only one of the fields is numeric. The numeric field is chosen as the size field that determines the size of the pie slice while the text field becomes the identifier of the slice which Power View calls Color. Note that there is no way currently to change the individual colors used in the pie chart. However, you can use the Themes dropdown in the Themes group of the Power View ribbon to change the color set used. Keep in mind that this can also change the font used for text on the page.

So what happened to Student Grades? If I click anywhere within the pie chart and then select the checkbox next to Student Grade in the Fields panel as shown below, Power View creates dividers in each of the schools that represent ‘sub-slices’, one for each grade in the school.

Unfortunately these sub-slices are not labelled so it is not possible to determine visually which slice belongs to each grade. Suppose we went back to the first table in this worksheet and added the Student Grade column back in as shown below.

Now the table includes a row for each grade with the number of students in that grade. At first you might say that did not help much.

However, if we click on any of the sub-slices in the pie chart, the table automatically filters to the information for that sub-slice/grade as shown below.

Ok, I know we have a lot more to learn about Power View and over the coming weeks, I will attempt to introduce you to its many other features. I also know that the charts created today were not that dissimilar to the charts and tables you could create with PowerPivot. In fact, those tables and charts have a greater degree of formatting flexibility. However, we have to walk before we can run. In future segments, I will show how to create Power View charts that you could not create before.

Until then, C’ya!

May 23 2015

Can You FILTER() That Down For Me

The last several weeks I have been looking at how PowerPivot in Excel works with Row Context and Filtered Context. I showed that most column expressions use a row context while measures use a filtered context although I could add and remove filters using certain expressions that allowed me to define a filter as a Boolean expression in one of the parameters. Last time we even looked at how to remove the filters by using the ALL() function. This time I will explore the FILTER() function which allows me to define a permanent filter condition to a measure no matter what dimensions or slicers the user chooses for the pivot table. In fact, in the case I am going to show you today, I need to do this because I need one measure to use all filters defined by the dimensions in the pivot table, and I need another measure to obey those filters plus one more.

Again I will use my Contoso data model that I’ve been using for all the examples in this set. I want to look at the number of orders that have returns and compare that to the total number of orders. I initially will want to show this information by sales channel and year/month. However, once I have my pivot table defined, I could of course change the dimensions I want to explore.

Let’s begin with a basic Sales pivot table as shown below.

I built this table using my basic data model with no additional calculated columns or measures except the calculated column in the date dimension that I use to order the name of the months correctly. I can use any of the columns in the FactSales table as my value field as long as I change the aggregate function from SUM to COUNT. By default, Pivot tables assume that numeric fields are summed and non-numeric fields are counted. But as long as I change the aggregate function for numeric fields to COUNT, I will get my desired results. I also modified the formatting to get rid of any decimal places and to add a thousands separator. Other than that, I did nothing special to build this table.

However, now I am going to return to the FactSales table and add a simple measure to count the total number of sales. The expression I will use is shown in the following figure.

I use the COUNT() function which has a single parameter, the name of the column I want to count. Again I could choose any column, but I chose the column [ReturnQuantity]. I will come back to format this measure in a moment, but you can see that the count is a little over two and a quarter million sales records. In fact, I know that this is correct by simply looking at the number of rows in my FactSales table.
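A minimal sketch of a counting measure of the kind described above might look like the following. The measure name OrderCount is an assumption for illustration only, since the text does not give the name actually used. The `--` lines are annotations for the reader and can be left out when typing the formula.

```dax
-- Hypothetical measure name: counts the rows of FactSales that hold a number in [ReturnQuantity].
-- In a pivot table, the count is evaluated within the filter context of each cell.
OrderCount:=COUNT(FactSales[ReturnQuantity])
```

Because COUNT() skips blank cells, this equals the number of rows in FactSales only if [ReturnQuantity] is populated on every row, which the matching row count suggests is the case here.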

Next, I want to count the number of sales records that have returns. This I can do by comparing either the [ReturnQuantity] column or the [ReturnAmount] column to 0. Only sales records which have values greater than 0 for these two columns represent orders which had returns. How can I do this?

One way I could do this is to use the SUMX() function with a second measure named [ReturnCount2]. This function has two parameters. The first parameter must be a table and the second parameter is an expression of what I want to count. So I might think that I could do something like the following expression:

ReturnCount2:=SUMX(FactSales,IF(FactSales[ReturnAmount]>0, 1, 0))

The theory is that I want to compare the column [ReturnAmount] to 0 and if it is greater than zero to add one to my ReturnCount2 value. I cannot simply sum the [ReturnAmount] because this column represents the dollar value of the return. Nor can I use [ReturnQuantity] because the buyer may have returned more than one of the item from the order and summing the quantity would over count the total number of orders with returns.

I could also use the COUNTX() function. However, if I simply replace SUMX() with COUNTX(), I will get the total number of orders in the FactSales table because COUNTX() counts every row for which the expression returns a non-blank value, and both 1 and 0 are non-blank. But I can trick the IF() into returning a blank for orders without returns by using the following expression:

ReturnCount2:=COUNTX(FactSales,IF(FactSales[ReturnAmount]>0, 1, BLANK()))

But both of these solutions used the entire FactSales table. There is one other way I want to show you today. I can use the FILTER() function to apply a filter to the FactSales table and return a subtable that only has rows with returns by using the following expression:

FILTER(FactSales, FactSales[ReturnAmount]>0.0)

I can now replace the first parameter in COUNTX() with this FILTER() result which is a filtered table. I can then use any column in FactSales as the column I want to count. Well, almost any column. Actually, I cannot reuse the [ReturnAmount] column which is used in the FILTER() expression because this confuses DAX, but as I said before, I can count on any column in the table. Therefore, my [ReturnCount2] measure expression is shown below.
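Putting those pieces together, a sketch of the measure could read as follows. The choice of [ReturnQuantity] as the column to count is an assumption; the text only says that a column other than [ReturnAmount] was used.

```dax
-- FILTER() returns only the FactSales rows that have a return.
-- COUNTX() then counts the non-blank values of the chosen column in those rows.
-- [ReturnQuantity] is an assumed choice of column to count.
ReturnCount2:=COUNTX(
    FILTER(FactSales, FactSales[ReturnAmount]>0),
    FactSales[ReturnQuantity]
)
```

Since the measure still starts from the filter context of the pivot table, FILTER() adds the "has a return" condition on top of whatever channel, year, or month applies to the current cell.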

In this image you can see that I already formatted my measures as numbers without decimals but with thousands separators. Why do I format the numbers here? Simply because it saves time from having to format the numbers in each pivot table in which I use the measures. If I display these two measures in my pivot table side by side, I can see the total number of orders by channel in each month along with the number of orders that had returns.

Suppose I wanted to show this information to management and rather than look at the raw counts which could take a bit of time to interpret, I decide to calculate the percent of orders that have returns. I can create a third measure as shown in the following figure that uses the results of the first two measures. I can then format this measure as a percentage prior to using it in my pivot table.
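As a rough sketch, the percentage measure might be written like this. The name [Percent_Returns] and the measure [ReturnCount2] come from the text; the denominator is written out as a COUNT() here rather than by measure name, because the name of the total-count measure is not given.

```dax
-- Share of orders that had a return, evaluated in the filter context of each pivot cell.
-- The denominator repeats the total-count expression; referencing that measure by name would work equally well.
Percent_Returns:=[ReturnCount2]/COUNT(FactSales[ReturnQuantity])
```

Formatting the measure as a percentage in the data model means the result, a fraction between 0 and 1, displays as a percent wherever the measure is used.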

Returning to my pivot table, I remove the counts which I no longer need to display and replace them with the [Percent_Returns] measure so that management can quickly see that Catalog sales result in the most returns and Store sales in the fewest returns. Returns do not vary greatly by month, something that I will leave up to you to explore with a Pivot Chart.

Well, I hope you learned some new ways to apply different filters in your measures from this discussion. C’ya next time.

May 9 2015

I Want It ALL()

This week I am going to reverse direction on applying filters to my pivot table and show you how and why you might want to remove all filters instead of adding filters to an expression.

Again working with the Contoso dataset, I am going to start by looking at sales by product category. In addition, I want to be able to slice my data by channel or combinations of different channels. To do this, I want to use the visually friendly slicer tool as shown in the following figure. Note that in this case, I already have selected only the Online channel which results in a total of $2.6776 billion in sales. Keep in mind that the total of all sales across all channels is a little more than $12.4 billion.

Now for each product category, I want to see the total sales in that channel compared to the total sales across all product categories and all channels. If I refer back to my earlier blog on the different ways pivot tables can represent data by using built-in features, I might try looking there first to see if there is a fast way to accomplish my task. By right clicking on any of the rows in the Sales Total column and selecting the option: Show Values As, I can pick from a dropdown menu of different built-in calculations. Many of these options calculate percentages of row, column or grand totals of either the entire pivot table or a group level. These options also provide difference and running total calculations.

For example, if I were to select % of Grand Total, I would get some interesting percentages. However, these values would be based on the total sales of the slicer filter, in other words, the total sales for online sales as shown in the following figure.

So let’s play a little with a different pivot table that shows total sales by each of the channels. In the figure below, I’ve included the channel as my row filter and I have two columns which both show the total sales amount. Note that each row of the total sales amount is filtered by the channel. This is an example of filtered context when calculating a measure.

Now I’ve labeled the first of the two columns: Total Sales Amount and will therefore let the pivot table display the sum of the measure filtered by the channel.

However, I’ve labeled the second column: % Sales by Channel. I can right click on any of the values in this column to select one of the other built-in calculations. In this case, calculating a percent of the grand total will show me the percent of sales that come from each of the four channels as shown below. The value in the Grand Total row displays 100% because all sales are represented by one of the four channels. Note here that it is clear that online sales account for only 21.57% of the total sales. I can use this information to validate what I’m about to do in the next step.

I am going to create a new measure named: Percent_of_Total_Sales. To generate a value for this measure, I want to sum the column FactSales[SalesAmount] for the filtered context of each place this measure appears in my pivot table. However, to get a percentage of total sales, I need to calculate the sum of FactSales[SalesAmount] for all sales, not just sales for a channel or product category, or any other filter criteria. In effect, I want to calculate the total sales as if there was no filter context in the pivot table at all. I can do this by creating a ‘new’ table for the SUMX() function (remember SUMX() has two parameters, the first of which must reference a table of values). There is another function I must use to eliminate the filter context for this ‘new’ table. This function is aptly named: ALL(). When I use ALL(FactSales) (and yes, the parameter for the ALL function must be the name of the table and it returns a table with all filters removed), I can get a ‘copy’ of the FactSales table without applying the filter context of the pivot table. In other words, all of the records in the original FactSales table will be included in ALL(FactSales). If I use this ‘new’ table as the data source for my SUMX() function and then simply sum the Sales Amount column using the SUM() function as shown below, I can return the total sales of the unfiltered FactSales table which then can be used as my denominator in my calculation. The numerator is a SUM() function of the Sales Amount also, but is calculated on the filtered context which in my case is filtered for online sales and product category.
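One way to write the measure just described is sketched below. The measure, table, and column names come from the text; the exact layout of the original expression is not shown, so treat this as an approximation rather than the original formula.

```dax
-- Numerator: sales within the current filter context (for example, online sales for one product category).
-- Denominator: sales over ALL(FactSales), which returns every row of FactSales with filters removed.
Percent_of_Total_Sales:=
    SUM(FactSales[SalesAmount])
    / SUMX(ALL(FactSales), FactSales[SalesAmount])
```

With no filters applied, the numerator and denominator are the same value, so the measure evaluates to 1 in the data model's calculation area.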

Initially the measure returns a value of 1 because in the data model there is no filter context so the sum of the ‘filtered’ sales amount total divided by the ‘unfiltered’ sales amount total will be equal to 1. Rather than go directly to the pivot table, let’s first format this value as a percent by right clicking on the measure definition cell and selecting Format.

This option displays a dialog that lets me select the data category which is: Number. I then select the format of the number as Percentage with 2 decimal places. I then click OK to accept the format for the measure.

My measure calculation in the data model now displays a value of 100.00%. However, if I place this measure in my pivot table that displays sales by product category and uses a slicer to include only online sales, I can see my sales percentages as shown below. These values are now correctly dividing the product category sales for online sales by the total sales of my company. I can feel confident that the calculation is correct because the percent in the Grand Total line which represents sales from all product categories made through online sales is 21.57% which is the same percent I calculated in the pivot table earlier that only looked at sales by channel.

Using the channel slicer, I can select store sales instead of online sales. My previous pivot table told me that this should be 55.93% of the total sales. As you can see in the following figure, the Grand Total of my pivot table that displays sales by product category has percentages that also (accounting for rounding) add up to 55.93% since this pivot table uses the channel slicer.

In case any reader is wondering, the second pivot table that displays sales by channel, while on the same spreadsheet as the first table that displays sales by category does not use the slicer. If you have a slicer and multiple pivot tables or pivot charts, you must associate the slicer with each table and/or chart to which you want it to apply. It is not automatic nor implemented by spreadsheet page.

I hope you are starting to get a better feel for how row context and filter context work together with DAX expressions and functions to calculate values. Being able to correctly apply and remove filter contexts is essential in calculating values in many pivot tables.

C’ya next time.

May 2 2015

Getting the Right Context – Part 2

Last time, I introduced the concept of context within DAX expressions used by PowerPivot to calculate columns and measures. We saw that the default context for column calculations was row context while the default context for measures was filtered context. However, I ended the discussion by showing that within a measure calculation, I could use row context with certain aggregate functions like SUMX(), which can be used to apply an expression across all the rows of a table. However, before that expression is evaluated, that table is automatically filtered since I used it in a measure, which begins by applying the filtered context of the pivot table to the rows used by the expression.

So this time, let’s dive a little deeper. Let me begin by going to the FactSales table of my Contoso data model and calculate a measure for total sales. This is not difficult and can be achieved by using the SUM() function as shown in the following figure.

Now let’s assume that I want a measure that shows only the sales made through the store channel. At first, I might try to use the SUMX() function along with the FILTER() function. The FILTER() function also has two parameters. The first parameter is the name of the table that I want to apply a filter on. The second parameter is a filter expression. I may at first assume that because I created a relationship between the FactSales table and the DimChannel table, that I can simply reference the column [ChannelName] and compare it to the string “Store” to filter the FactSales table. However, as you can see in the following figure, this expression would result in an error.

The reason for the error is that while the FILTER() function references the table FactSales, there is no context to link records in FactSales to DimChannel. I know you might ask, “Doesn’t the relation between these two tables define that context?” The answer is that the relationship between the tables defines the ‘mechanism’ for connecting them, but it does not activate a context between the rows in FactSales and a row in DimChannel. When pointing from the many side of a relationship to the one side of the relationship, we must use the RELATED() function to activate the context within the expression. I show this in the following figure.
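The difference between the failing and the working expression can be sketched as follows. The measure name StoreSales is an assumption for illustration; the table and column names are the ones used in the text.

```dax
-- Fails: DimChannel[ChannelName] has no row context while FILTER() iterates FactSales.
--   SUMX(FILTER(FactSales, DimChannel[ChannelName]="Store"), FactSales[SalesAmount])

-- Works: RELATED() follows the relationship from each FactSales row (many side)
-- to its single matching DimChannel row (one side).
StoreSales:=SUMX(
    FILTER(FactSales, RELATED(DimChannel[ChannelName])="Store"),
    FactSales[SalesAmount]
)
```

RELATED() only works here because a relationship between FactSales and DimChannel already exists in the data model.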

You can see that now I have total sales for just the stores as a measure. If I use this measure in a pivot table that displays sales by month and by product category, I will have the additional filter of sales by store in each of the pivot table cells.

What if, however, I wanted to create a calculated column in the table DimChannel that displayed the total sales for that channel? Again you might start with the SUMX() function because you want to calculate an expression from another table. In this case, the first parameter of the SUMX() function would be the FactSales table and the column that we want to sum would be the [SalesAmount] column. However, if we were to create this column, we might be surprised by the result, shown in the following figure.

All of the values in the column are exactly the same. Furthermore, if I refer to the image earlier in this blog for the total sales across all channels, I would see that the value displayed here in each cell of the column is actually the total sales. Again the problem is context. There is no context to refer back to FactSales from DimChannel. Therefore, when SUMX() evaluates the [SalesAmount] in table FactSales, it pulls values from all the sales records, not just the sales from the channel represented by the current row in DimChannel.

In this case, because I am going from the one side of the DimChannel relationship to the many side of FactSales, I need to return a table that contains just the rows from FactSales that represent sales from the channel in the context of the current row. I can do this by using the RELATEDTABLE() function which uses a single parameter, the name of the table on the many side of the relationship. I must also have the relationship explicitly defined between DimChannel and FactSales. I have already done this. So Power Pivot can use the relationship to create a subset of rows from FactSales for the current channel. I can then use this resulting table in the SUMX() function to sum the [SalesAmount] column as shown in the following figure.
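As a sketch, the two versions of the calculated column in DimChannel would look something like this. Calculated columns are entered without a name prefix, and the `--` lines are annotations only.

```dax
-- Naive version: repeats the grand total on every row, because FactSales is not
-- restricted to the channel of the current DimChannel row.
--   =SUMX(FactSales, FactSales[SalesAmount])

-- RELATEDTABLE() returns only the FactSales rows related to the current DimChannel row.
=SUMX(RELATEDTABLE(FactSales), FactSales[SalesAmount])
```

For the row that represents the store channel, this should return the same total as a store-only sales measure defined in FactSales.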

As you can see in the figure, the first row which represents the store channel displays the same sales total as we calculated from the measure in FactSales earlier.

So again, you can see that there are many different ways to define the context of an expression. When dealing with multiple tables, it is important to understand whether you can perform a row context calculation by using the RELATED() function to extend the row context to the related table on the lookup side of the relationship or whether you need to use the RELATEDTABLE() function to filter the rows used in an aggregate function like SUMX() to calculate values for a column which uses the row context to define the link to the many side table.

Next time, I’ll look at some functions that let you turn off a filtered context and show where you might use it.

C’ya next time.

Apr 4 2015

Populating Your Data Model with A Query

Over the last two weeks we talked about normalizing your logical data model in PowerPivot to help make it easier for users to understand and work with (https://sharepointmike.wordpress.com/2015/03/21/your-logical-data-model-is-not-normal/) and (https://sharepointmike.wordpress.com/2015/03/28/dealing-with-blanks-in-your-data-model/). However, in each of these models, we did all the data manipulation inside the data model. That left a lot of technical fields and tables that we could not remove from the model, but which took extra memory which could limit the size of our data model. So how do we solve that problem? We could do much of the data manipulation outside of the data model or as part of the import step. Let’s see how.

I’m going to start by importing product data from the Access database version of Contoso into the Power Pivot data model.

In the Table Import Wizard, I’m going to select the option to write a query that will specify the data to import. This option allows me to combine data from the dimProduct, dimProductCategory, and dimProductSubCategory tables into a single table. I can also embed the calculation to fix the blank Class fields we visited last time.

Clicking Next, I see the Specify a SQL Query dialog. This is an intimidating dialog. I could manually enter a SQL query here. However, I would have to know all the field names from all the tables that I want to bring together. Not impossible, but there has got to be an easier way. Wait. What is this Design button in the bottom right of the dialog?

When I click the Design button, I get an equally intimidating dialog that looks like it has an area at the top to again manually enter a SQL statement and a grid area at the bottom to display the results of the SQL statement if I were to test run it by clicking the exclamation point (!). Obviously, this screen is not very friendly toward building Access database queries. Nowhere near as friendly as the query builder inside Access itself.

If I go back into Access and build my query there using the query builder, I can visually design my query. First, I add the tables I want to use from the Show Table dialog that appears when I open the Query Designer from the Create ribbon. Then I can add the fields I want to include from any of the included tables by double clicking on their names in the tables at the top of the dialog.

I can even create my calculated column here by entering a new name in the row of the design grid followed by a colon (:). Then I can use the same expression as I did last time to test the value in the ClassName field.

NewClassName: IIF(([ClassName]=""),"[No Class]", [ClassName])

Note that the IIF() function in Access has a double ‘I’. Otherwise, it works the same. Also note that I’m using a different name for the new column here to keep it separate from the field in the table dimProduct.

After I have finished creating the query, I can test it by clicking the Run button in the Query Tools Design window. If the query executes correctly, I am ready to copy my query to my Power Pivot data load screen. If I open the dropdown menu under View in the Query Tools Design window, I see an option: SQL View as well as some other views.

This option displays the SQL statement which I can copy and then paste back in the Excel Data Model Table Import Wizard as shown below.

If I were to try to run this query directly, it would fail. Admittedly, it took me a while to figure out why. I finally found that it did not like the one field definition: dimProduct.Size. To test what was going on, I clicked on the Design button to take me to the screen which displays both the SQL query and a grid of results from testing the query. After some testing, I found that if I redefined this field as: dimProduct.[Size], the parsing engine was happy and gave me the results I expected as shown below. The square brackets are used to clarify to the parsing engine that the text within the brackets represents a field name, not a command or reserved word.

I then clicked OK for this screen and Finish on the next screen. I was rewarded with the data engine reading and loading my Product data. Note that it is a single table now. By including the name of the product category and the product subcategory in my query, I no longer need those two additional tables thus reducing the overall size of my data model a bit.

Next I still need to load the Sales data from Contoso along with the date and channel tables. The following screen shows that I can open the Access database a second time to extract the other tables, leaving out the product and related tables.

After the FactSales table is loaded, I still need to create a relationship with the new Product table I pulled in with the query. Even though FactSales and dimProduct were linked in the original Access database, by loading the tables separately, the relationship is not automatically discovered. Therefore, I have to use the Create Relationship dialog to link these two tables.

If I were to display the diagram view of my Power Pivot data model, you can see that the model has been simplified down to only four tables rather than six as we used before.

If I now build the same Pivot table that I did last week to show sales by class, you can see in the following figure that the results are the same.

I can even create a hierarchy in my product table to go from category to subcategory to product. I covered building hierarchies before. After building the hierarchy, I strongly recommend hiding the supporting columns so as to not confuse the end-users. (See: https://sharepointmike.wordpress.com/2012/07/21/power-pivot-hierarchies-part-2-of-powerpivot-2012-series/)

Denormalizing your data as you import it may sound a bit more complex, but it can save you memory. Power Pivot’s main limitation is that it must hold the entire data model in memory, so the smaller you can make that model, the more data you can load into it.

Some people might ask how far I should denormalize the model. Of course, in theory, you could denormalize to a perfectly flat single table. However, this is not practical as it can seriously limit the flexibility of the model and it can even cause other data errors. I tend to think in terms of model objects. For example, Products are an object in the model. Therefore, collapsing the Product Subcategory and Product Category tables into Product makes sense to me. If I had tables Stores, Cities, Countries, and Geography, I might be tempted to collapse them together as well since they represent a Location object for sales. On the other hand, I would not denormalize all the product information into the FactSales table.

So if you are thinking that there is no hard and fast answer to when should you denormalize a model, that is somewhat true. It is a bit of an art. However, if you have a table that is used just as a lookup for a name or perhaps one or two other associated fields and it is not a role playing table (See: https://sharepointmike.wordpress.com/2012/07/21/power-pivot-hierarchies-part-2-of-powerpivot-2012-series/), denormalize that table.

That’s it for this series. C’ya next time.

Mar 28 2015

Dealing with Blanks in Your Data Model

Last time, I discussed some reasons why you might want to denormalize your data model when using PowerPivot (https://sharepointmike.wordpress.com/2015/03/21/your-logical-data-model-is-not-normal/). As you may remember, one reason is to make the model easier for the end user to understand. A second reason is that if you denormalize the data outside of PowerPivot, you can substantially decrease the model size by eliminating many if not most of the lookup tables and just adding the lookup information you need in the entity table such as the Product table. I will talk more about denormalizing the data outside of PowerPivot, or at least during the process of importing the data next time. This time, I want to look at another issue that sometimes occurs in your data that could make your data harder to interpret by your end-users.

I’m talking about blank or null fields in the data. Often a table will have one or more columns which are optional. Take for example a product table like the one we are using from Contoso. Not every product may be sold by color. Some products may not belong to a style or a class. Some products may not have a measured size or a weight. In some databases, the corresponding lookup tables have a special entry for no color, no style, or no class so that every product can point to one and only one value in the lookup. However, you may have a lookup table that only contains specific values. If a product does not exactly match one of those values, the reference from the product table to the lookup table may be left blank. When users see a blank value in a report, they may wonder whether the person who entered the data just forgot to select a value or whether no value from the lookup table applied. In other words, they do not really know the reason why the field is blank.

Suppose we start with the data model using Contoso data in the following figure. Note to anyone trying to reproduce this example, I went into the Product table ahead of time and removed the ClassID and ClassName from all products in which the color was ‘Blue’ just to create some records with blank values.

After bringing this data into an Excel PowerPivot data model, I create a simple pivot table to show the sales amount and sales counts by class. Notice that the three product classes that are defined include:

Economy
Regular
Deluxe

But you can see from the report that 87,597 of the over 2 million sales records had no class defined. From a user’s viewpoint, they do not know if someone forgot to enter the class for some of the products or if the definition of class does not apply to some products.

As you build your data model in Excel, you can fix this problem by replacing the blank values with something like the string: “[No Class]”. The difficulty is, however, after you load your data into the PowerPivot data model, you cannot edit individual cells. You can, on the other hand, create a new calculated column.

Since I want the new calculated column to have the name ClassName and because I cannot have two columns in the same table with the same name, I begin by renaming the existing ClassName column to ClassNameOriginal. The actual new name I use does not matter as long as it is different. Also note that changing the column names in the Excel data model has absolutely no effect on the name of the column in the original data source.

After renaming the original ClassName column, I add a new column to the end of the table called ClassName. For the calculation to replace the blanks from the original column, I need to know if those blanks are Null values or if they contain an empty string. If the original column used empty strings when a class value was not defined, I can use the following DAX expression to create my new class name value.

= IF(([ClassNameOriginal]=""),"[No Class]", [ClassNameOriginal])

This expression uses a simple IF() function to see if the value in the column ClassNameOriginal contains the empty string. If it does, the function outputs the string: [No Class]. Otherwise, the function outputs the current row value of the column ClassNameOriginal. This is shown in the following figure.

On the other hand, if the original ClassName column uses a NULL value when there is no corresponding class, the expression needed here must be defined with the ISBLANK() function to test for NULL values as in:

= IF(ISBLANK([ClassNameOriginal]),"[No Class]", [ClassNameOriginal])
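If you are not sure which kind of blank the source uses, or if the column mixes both, the two tests can be combined into one calculated column. This is only a sketch under that assumption, not something taken from the original example. The TRIM() call is an extra assumption that values containing only spaces should also count as blank. The // lines are explanatory comments and can be left out when typing the formula.

```dax
// Calculated column ClassName in the Product table.
// Treats NULL (blank) values, empty strings, and strings made up
// only of spaces as "no class"; any other value passes through unchanged.
= IF(
    ISBLANK([ClassNameOriginal]) || TRIM([ClassNameOriginal]) = "",
    "[No Class]",
    [ClassNameOriginal]
)
```

With this version, I do not have to inspect the source data first to decide which of the two expressions above applies.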

Whichever method is appropriate for your table, you now have a column that you can use for the PivotTable. Before proceeding however, be sure to go back and hide the original class name column: ClassNameOriginal so as not to confuse users with two similar columns that they could use as dimensions in their tables.

Since I still have my original pivot table open, I can just switch over to the Pivot table and it should automatically update to show that the blank class now has the name [No Class]. (Which goes to show you that having no class is better than just being a blank. :-) )

That’s it for this week. By the time you read this I will be at Code Camp Orlando giving my presentation: Calculated Columns, Measures, and KPIs, Oh My!

C’ya next time when I will show you how to use the query designer while loading data to the PowerPivot model to fix some of the issues from this week and last week.

Mar 21 2015

Your Logical Data Model is not Normal

If you have been reading this blog over the last several years, you know that I have been a strong supporter of using PowerPivot and the Tabular model to perform data analysis, even for power users, not just DBAs. What you may not have realized is that I’ve been secretly teaching you a little about data modeling. What is data modeling? It is really nothing more than the collection of all the data in tables and the relationships between those tables in a database. Did you know that there are at least two major ‘types’ of data models when it comes to how you structure your data into tables? Each one serves a different purpose. Therefore, it can reasonably be argued that neither one is more correct than the other. But you need to understand when to use each type. That is what I want to discuss today.

Most DBAs and developers who work with databases like SQL Server, Oracle, and many other common databases are comfortable with the relational model for creating tables and defining the relationships that connect them. In fact, they immediately start to normalize a denormalized database in their heads within seconds of seeing the data schema. The relational model relies on the application of the rules of data normalization introduced by Edgar F. Codd in 1970. In essence, the goal of the rules of data normalization is to minimize data redundancy, which also has the effect of decreasing the overall size of the database while at the same time making it easier to maintain information that would otherwise be repeated through many records.

There are three primary rules that are used to determine whether a database, a collection of tables, has been normalized. These rules are:

First Normal Form: No row of data may contain repeating fields or repeating groups of fields. Such repeating data must be split into a separate but connected table. For example, a sales order may contain the purchase of one or more items. Because the number of items included in a sales order is not predefined, the items must be split into a separate table with one row for each item on the sales order. These two tables are then typically connected by an order id.

Second Normal Form: This rule only applies to tables which have a compound primary index, an index built from two or more fields. In this rule, all other fields in the table must depend on the entire compound index value, not only a portion of it. A possible example of this might be a table that includes students at a school in which the primary index combines the school name (or id) with the student name (or id). Imagine that the table also included the address and phone number of the school. This information is not dependent on the combination of the school and the student. It only depends on the school. Therefore, this violation of the second normal form requires that the data related only to the school be split into a second table that includes only school information.

Third Normal Form: This rule requires that every field not part of the primary index depend only on the primary index and not on any other non-key field. Going back to my favorite Contoso database, one could argue that in the FactSales table, the Sales Amount field is redundant since this table also includes the sales quantity, sales price, and any sales discounts or returns. Why is it redundant? Because it can be calculated from the values of other columns in the record. Therefore, to fully achieve third normal form, this field should be removed.

While there are other special case rules that can be applied to normalization of a database, most DBAs will be satisfied with a database that satisfies these three rules. They will then build the tables in their database corresponding to these rules and thus create the physical data model. It is called the physical data model because it defines the physical schema of the tables and the relationships between them.

However, business users of the data don’t look at the database that way. In fact, most business users would be baffled by the large number of tables required and the relationships between them. They will not understand why they need to combine data from a half dozen to a dozen tables just to answer a single question. As an example, let’s look at how a business user might think of just the products within the Contoso database.

The figure below shows the normalized data model for products and their subcategories and categories. As you can see from the schema, the only important piece of information in the subcategories table is the name of the subcategory. Similarly, the only important piece of information in the categories table is the name of the category.

Most business users would not think of category and subcategory names as part of separate tables, but as attributes of the product itself. In fact they would think of product information more like the following figure.

Imagine a database model with dozens of normalized tables and then try to envision how the typical business user sees the data with perhaps only a half dozen tables after denormalization.

In PowerPivot, we can address this issue in one of two ways. The first way would be to add the category and subcategory names to the product table in SQL Server before loading the data into PowerPivot. This would essentially make the physical data model the same as the logical data model used by the end-users. However, this data model would no longer be normalized.

Physically denormalizing the data tables is not the only solution. I could, as shown before in https://sharepointmike.wordpress.com/2012/07/21/power-pivot-hierarchies-part-2-of-powerpivot-2012-series/, build two calculated fields in the product table that use the RELATED() function to get the category and subcategory names and display them in the product table. I could then hide the two tables, dimCategory and dimSubcategory, from the user so that they would see a structure similar to their expected denormalized logical data model even though the data still is physically stored in a normalized structure.
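As a rough sketch of those two calculated fields, the expressions might look like the following. The table and column names used here (DimProductSubcategory[ProductSubcategoryName] and DimProductCategory[ProductCategoryName]) are assumptions based on the usual Contoso sample schema, so substitute the names that appear in your own model. Both expressions also assume that the relationships from product to subcategory and from subcategory to category already exist in the model.

```dax
// Calculated column "Subcategory" added to the product table.
// RELATED() follows the existing many-to-one relationship
// from the product table to the subcategory table.
= RELATED(DimProductSubcategory[ProductSubcategoryName])

// Calculated column "Category" added to the product table.
// RELATED() can follow the chain product -> subcategory -> category
// as long as each link is a many-to-one relationship.
= RELATED(DimProductCategory[ProductCategoryName])
```

Each expression is entered separately as its own calculated column. Once both columns exist, the two lookup tables can be hidden from client tools without breaking the formulas.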

The advantage of making the changes to the data model outside of PowerPivot is that it effectively reduces the amount of data that PowerPivot must store in memory, thus potentially allowing a larger model to be created. The advantage of making the changes inside PowerPivot by using the RELATED() function and then hiding the technical tables that are still needed in the model but do not need to be seen by the client is that it preserves the sanity of those DBAs who cringe every time someone tries to denormalize their data. Both methods will allow me to create similar Pivot tables in my Excel spreadsheet. Thus both methods can be considered correct.

The thing to remember is this. Data normalization was developed to make data storage and maintenance easier and more efficient. However, data analysis and reporting often require a fair amount of data denormalization. This is especially true when performing data analysis using reports, pivot tables and charts, and even third party data analysis tools. In fact, the logical view of the data schema aids in the performance of most data analysis. Just remember that your logical model is not a normalized model.

C’ya next time.

Mar 14 2015

It’s Only the Role I’m Playing

This week I’m going to return to my favorite sample database, Contoso. If you remember, Contoso consists of sales data for a company that produces several different lines of consumer electronics, from computers to phones to TVs. The data spans several years of sales. In past examples, I related the FactSales table (the table containing all of the sales records) to several tables including Channel (DimChannel), Date (DimDate), Product (DimProduct) and Product Sub-Category (DimProductSubCategory). In fact, the data source pre-defined these relations so that when I imported the data into my PowerPivot model in Excel, these relationships appeared by default as shown in the following figure.

Visually, I could switch to the Diagram view of the data model to see these relationships represented by solid lines connecting these five tables as shown below.

However, suppose I have additional information in my FactSales table. The DateKey field that I have used in the past identifies the sale date for each of the sales records. In the real world, there may be additional dates associated with each sale. For example, it is not hard to imagine that each sale would also have a Delivery Date and an Invoice Due Date. Now I might ask, does it make sense to report on sales based on the sale date, or does it make more sense to report on sales based on the invoice due date or even the date of payment? Well, you might say, ‘That depends on who is asking the question.’ Very true. My Marketing/Sales manager might want to see sales by the sales date. He or she does not care about when the invoice is paid off. They just need to know if they hit their sales quotas for each month. Therefore, they might need a report that uses the default relationship and looks something like the following:

On the other hand, the Chief Financial Officer is not as concerned about when the sale took place, but when the invoice for the sale is due because only then is the income truly realized for the company. Thus the above report does not meet their needs.

Now suppose that I had a few additional columns, as mentioned earlier, that told me some other dates such as when the items were delivered and/or when the invoice for the sale was due. Let’s further suppose that these columns exist in the FactSales table and might appear like the following figure.

With these columns in my FactSales table, I would want to create relations between them and my date table (DimDate) so that I could generate reports using them. The following figure shows the Create Relationship screen in which I define a new relationship between the delivery date (DeliveryDate) column in FactSales and the date (DateKey) column in DimDate. When I click the Create button, the Power Pivot engine creates a second relationship between these two tables. (Remember the first relationship was between the sales date (DateKey) column in FactSales and the (DateKey) column in DimDate.)

In a similar fashion, I create a third relation between these two tables to connect the invoice due date (InvDue) column in FactSales and the date (DateKey) column in DimDate. For the purposes of this demo, I am going to stop there. However, I could create additional relationships between any other date fields in the FactSales table and the date (DateKey) column in DimDate. Switching to the Diagram view, I could now see something like the following between these two tables.

Notice that there are three lines between FactSales and DimDate. One of these lines is solid and the other two lines are dashed. You can only have one active relationship between any two tables. That active relationship is represented by the solid line. The other two relations are inactive at the moment. These three relationships are collectively called role playing relations and DimDate is a role playing dimension because only one relation at a time can be active from the DimDate dimension. Think of it this way: the date column (DateKey) in DimDate can play one of three different roles. It can play the part of the sales date, the delivery date, or the invoice due date. However, it can only play one role at a time. Unless all three dates were exactly the same, I have to choose which role I want DateKey in DimDate to play with the FactSales table by selecting one of the relationships.

The initial data import associated the DateKey column in DimDate with the DateKey value in FactSales, which represents the date the item was sold. As long as the requested report allocates sales to the sales date, I need to do nothing other than generate the Pivot Table report as shown below.

However, if I now have to generate a report for my CFO, I would have to go into the model and change the role that DateKey in DimDate plays. I might try to simply right click on the dashed line representing the connection between the DateKey in DimDate with the InvDue column in FactSales and select: Mark as Active.

This action would generate the following error message.

The problem is clearly stated. You can only have a single active relationship between the two referenced tables. Therefore, I must first right click on the current active relationship to deactivate it before activating a different relationship between the two tables.

Once the original relationship is deactivated, I can go back and activate the new relationship between DateKey and InvDue.

Knowing that only one role (relationship) can be active at a time, I am a little surprised that Microsoft did not automatically deactivate the original relationship when a new one is set active. However, like many things, they did not ask me. Plus they may have had other reasons for not automatically deactivating the current relationship. In any case, after making this change and returning to my pivot table (without making any changes to the pivot table itself), the data automatically updates using the new relationship to show the sum of sales by month of the invoice due date rather than the sales date.

If you examine the sales totals for either the years or individual months, you can quickly see that this role change for the DimDate dimension makes a significant change in the sales numbers reported each month.
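As an alternative to switching the active relationship in the model, and one that is not part of the steps above, DAX also provides the USERELATIONSHIP() function. It lets a single measure use one of the inactive relationships for just that calculation while the model's active relationship stays untouched. The sketch below assumes the sales column is named FactSales[SalesAmount] and uses made-up measure names, so adjust both to your own model. It also assumes the inactive InvDue relationship has already been created as described earlier.

```dax
// Measure that uses the active relationship
// (FactSales[DateKey] to DimDate[DateKey]), i.e. the sale date.
Sales by Sale Date := SUM(FactSales[SalesAmount])

// Measure that activates the inactive relationship on InvDue
// only while this expression is evaluated.
Sales by Invoice Due Date :=
CALCULATE(
    SUM(FactSales[SalesAmount]),
    USERELATIONSHIP(FactSales[InvDue], DimDate[DateKey])
)
```

With both measures in the same pivot table, the Sales manager's view and the CFO's view could sit side by side by month. The trade-off is one extra measure per role rather than a single measure whose meaning changes with the model.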

That’s all for this week. I hope that now you have a better idea how you can use a single dimension to play different roles. (You could also have added the DimDate table three times to the model, once for each date column in FactSales so that each instance of the DimDate table could have a single relationship to FactSales. Then by changing which dimension you select to display in the Pivot Table, you can achieve essentially the same result. This may be easier for the end-user who does not have access to the data model to make the changes listed above. However, your power-users may prefer a simpler model with fewer tables and can change the role played by those dimensions as needed.)

C’ya next time.

Aug 4 2012

A PowerPivot Sort By Any Other Name

The past several weeks I have been looking at the new features of PowerPivot 2012. I also mentioned this past Tuesday that the new Microsoft Office 2013 version of Excel includes both PowerPivot and Power View as part of the install. They just need to be turned on. So this week I’m continuing my Contoso example with Excel 2013 and PowerPivot running on Windows 8. Therefore, you may notice some differences in the look of the Excel spreadsheet. Don’t fear. The techniques I discuss here work equally well if you are using Vista with Excel 2010 and PowerPivot 2012.

Today I am going to look at a problem that we solved a different way several months ago. That problem is how to sort month labels chronologically rather than alphabetically. Back on November 6, 2011 (Wow, was it really that long ago?) I described a way to create a custom sort order for months. While that method still works in PowerPivot 2012, there is a new way to sort a column by using another column. Let’s see how that works.

I begin by opening my Contoso 2012 sample data and creating a simple pivot table that shows total sales by month. As shown in the figure below, when I drag the calendar month label field into the rows, the rows appear by default sorted alphabetically. Unfortunately, that is not a good sort order for months.

To fix this sort order, return to the PowerPivot window and open the tab (table) containing the dates. In Contoso, this is the dimDate table. In the Home ribbon find the Sort and Filter option group. A new button in this group is labeled Sort by Column. You could immediately click on this button to open the sort dialog. However, I recommend first clicking on the column for which you want to define the new column sort, then click the Sort by Column button.

This action opens the following dialog and pre-populates the Sort column with the column that you selected. This saves you a step. However, if you forget, you can always use the dropdown arrow to display all fields in the current table and select the column from the list.

Next select the column you want to sort by and click the OK button. You may be tempted to sort by the existing column CalendarMonth. However, this will generate an error because there are multiple values for CalendarMonth for each value of CalendarMonthLabel because CalendarMonth has the format YYYYMM. In other words, July would match not just to 07 or 7, but rather to 200507, 200607, 200707, and 200807. You cannot match the sort column to multiple values in the sort by column.
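If you want to check ahead of time whether a candidate column would trigger this error, one possible sketch is a temporary calculated column in the date table that counts how many distinct sort values exist for each label. The table name DimDate is an assumption here, and the check itself is not part of the original walkthrough.

```dax
// Temporary calculated column in the date table.
// CALCULATE() turns the current row into a filter, and ALLEXCEPT()
// then removes every filter except the one on CalendarMonthLabel,
// so the count covers all rows that share this row's label.
= CALCULATE(
    DISTINCTCOUNT(DimDate[CalendarMonth]),
    ALLEXCEPT(DimDate, DimDate[CalendarMonthLabel])
)
```

Any result greater than 1 means the label maps to more than one value in the candidate column, so that column cannot serve as the sort-by column. Delete the temporary column once you have your answer.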

Therefore, I created a simple calculated column named Month which only consists of the month number as shown in the next figure.
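The exact formula appears only in the figure, so the following are merely two plausible ways to build such a column, not necessarily the one I used. The first assumes DateKey in the date table is a date-typed column. The second assumes CalendarMonth is stored as a whole number in the YYYYMM format described above.

```dax
// Option 1: take the month number (1 to 12) from a date-typed column.
= MONTH([DateKey])

// Option 2: keep only the last two digits of a numeric YYYYMM value,
// so that 200507 becomes 7.
= MOD([CalendarMonth], 100)
```

Either option produces exactly one value per month label, which is what the Sort by Column feature requires.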

Now I can return to the Sort By Column dialog and create a sort relationship between CalendarMonthLabel and Month as shown in the following dialog.

When I click OK and return to my pivot table in the Excel spreadsheet, I will see that the table automatically refreshes and now displays the month labels in the expected chronological order that makes sense for most users.

That’s it for this week. A short, simple, but very useful way to sort any column by another column in PowerPivot 2012 or Excel 2013.

C’ya next time.

SharePointMike's Blog

Your one stop place to learn more about ADA, SQL, Excel, Power Pivot, and Power BI too, especially for Data Analysis
