Calculating Percentages in PowerBI

Last time I showed how you can pivot a multivalued column from a survey downloaded from SharePoint to Excel. After some transformations, I created the following report that shows the number of career responses in each of the 18 categories. While interesting in of itself, it would be more interesting to see how the career choices break down by different dimensions such as the school or grade level, the gender of the student, or perhaps even by individual school.

In my report page, I am going to reserve an area along the left side of the page to list several of the dimensions I’m interest in and to display the total number of students that took the survey along with the number of students in the filtered subset when I filter by one of the dimensions. Basically that means building something like the following:

This image actually consist of four separate visualizations. The top visualization for Total Students uses the Card visualization and initially displays the count of the ID column in the survey table that has one record for each survey. I duplicate this visualization and drag it below the first one. Then for the top visualization, I open the Format option panel in the visualization section and make the following changes:

  • I turn off the Category Label
  • I turn on the Title and then open the Title section to add the Title Text: Total Students
  • I change the font size and color to make this information stand out.

I then repeat the process for the second visualization but with the Title Text: Sampled Students because this count will represent the number of students included in the visualizations taking into account any filters applied to the page.

I then add two single column tables, one for gender and one for grade level. Because I want to use the values in these table as page filters, I change the visualization of these two tables to the Slicer.

Along with the table I created in my last blog, I can work with the slicer values to explore how the career choices change based on gender and grade level by clicking on the check boxes. When I do this, I see that the chart automatically adjusts the columns based on the changes I make to the slicers. Also the number of sampled students in the second card visualization displays the count of students included after applying the slicer. Unfortunately, the total student count also changes. This I do not want. I want the total number of students to always represent all the students in the entire survey.

I can achieve this goal with a little DAX and a custom measure back in the data page for the survey table. The custom measure needs to count the IDs for all the records in the table ignoring any filters applied to the report page. I can do this by passing the survey table name to the ALL() function. This function ignores all other page filters. Then I use the COUNTAX() function which defines the data source as the output from the ALL() function and then performs a count the number of IDs. While this may sound complicated, it is as simple as the following equation:

Notice that I name the measure Students. I must provide a unique measure name for each measure I create. However, I can then use that measure in any visualization such as the card visualization for the total students in the survey.

Back on my report page, I select the top card visualization (for Total Students) and change the field used from ID to my new measure, Students.

Now if I select any of the values in the slicer visualizations, my sampled students card displays the number students included in the filter while the total students card displays the total number of student surveys taken as shown below.

I then add on the right some additional column visualizations to display other data fields such as which subject the student finds most interesting in school or charts that display career choices by gender, by grade level or by other criteria from the survey. Each of these charts begins with a simple table visualization in which I add the columns I want to use. I then convert the visualization to a column chart.

In the image below, you see the final result of the first page of my report. Notice that I also added a vertical line shape to separate the two card visualizations and the slicers from the other column charts.

Since each student was allowed to select one or more careers from the list of possible careers, the total number of career choices is significantly larger than the number of students. In the above figure, the count of career interests, if I were to add the values in each of the columns, would total the number of career selections which is over 17,000, not the number of students. Therefore, I might decide to display the same charts as a percent of all the students rather than a count of all career selections.

Again I need another custom measure to calculate the percent based on total students. Fortunately, I already have a measure that calculates the total student count ignoring any slicer selection or page filter. Therefore, I can generate a percentage using a formula similar to the following:

With this formula, I count the number of surveys filtered by the visualizations and slicers on the page divided by the total number of students who took the survey. The maximum value of this percentage would be 100% if all the students who took the survey selected the same career, such as computers, as one of their choices. Similarly, the minimum value would be 0% if no student selected a specific career as one of their choices. Because students could select more than one career of interest, the sum of the percentages of each of the columns will not add up to 100%, but some value greater than 100%. (Lesson learned: Next time ask for their preferred career choice, then their second career choice and finally their third career choice.)

Next I take the first page of my report and duplicate it by right clicking on the page tab and selecting the option to duplicate the page. (This is a lot faster than recreating the same visualizations on a new page, isn’t it?) On the duplicated page, I modify each of the column charts to display the Percent measure just created rather than the count of ID.

Why does this work? Well, because each student can only select any specific career one time even though they may select two, three, or more careers, I can simply count the number of filtered students in each column by the total number of students in the survey. Each chart already divides the students by career choice, one career for each column. Then additional filters from the slicers may limit the gender or grade level or both. Therefore, I can count students that match all those filter criteria and divide by the total student count to get a percentage of students who have an interest in that career. In fact, this measure also correctly calculates the percent of students interested in each school subject (of which they can only select one subject each) which I can verify by summing these percentages from the chart in the upper left and getting a total of 100%.

The following figure shows the percent page of my report.

I can then proceed to add other visualizations if I want on additional pages. But I’ll suspend this example at this point for now.

C’ya next time for more exciting ways to use Power BI.

Getting the Right Context – Part 2

Last time, I introduced the concept of context within DAX expressions using by PowerPivot to calculate columns and measures. We saw that the default context for column calculations was row context while the default context for measures was filtered context. However, I ended the discussion by showing that within a measure calculation, I could use column context with certain aggregate functions like SUMX which can be used to apply an expression across all the rows of a table. However, before that expression is evaluated, that table is automatically filtered since I used it in a measure which begins by applying the filtered context of the pivot table to the rows used by the expression.

So this time, let’s dive a little deeper. Let me begin by going to the FactSales table of my Contoso data model and calculate a measure for total sales. This is not difficult and can be achieved by using the SUM() function as shown in the following figure.

Now let’s assume that I want a measure that shows only the sales made through the store channel. At first, I might try to use the SUMX() function along with the FILTER() function. The FILTER() function also has two parameters. The first parameter is the name of the table that I want to apply a filter on. The second parameter is a filter expression. I may at first assume that because I created a relationship between the FactSales table and the DimChannel table, that I can simply reference the column [ChannelName] and compare it to the string “Store” to filter the FactSales table. However, as you can see in the following figure, this expression would result in an error.

The reason for the error is that while the FILTER() function references the table FactSales, there is no context to link records in FactSales to DimChannel. I know you might ask, “Doesn’t the relation between these two tables define that context?” The answer is that the relationship between tables while defining the ‘mechanism’ of how to connect the two tables, it does not activate a context between the rows in FactSales with a row in DimChannel. When pointing from the many side of a relationship to the one side of the relationship, we must use the RELATED() function to activate the context within the expression. I show this in the following figure.

You can see that now I have a total sales for the just the stores as a measure. If I use this measure in a pivot table that displays sales by month and by product category, I will have the additional filter of sales by store in each of the pivot table cells.

What if, however, I wanted to create a calculated column in the table DimChannel that displayed the total sales for that channel? Again you might start with the SUMX() function because you want to calculate an expression from another table. In this case, the first parameter of the SUMX() function would be the FactSales table and the column that we want to sum would be the [SalesAmount] column. However, if we were to create this column, we might be surprised by the result, shown in the following figure.

All of the values in the column are exactly the same. Furthermore, if I refer to the image earlier in this blog for the total sales across all channels, I would see that the value displayed here in each cell of the column is actually the total sales. Again the problem is context. There is no context to refer back to FactSales from DimChannel. Therefore, when SUMX() evaluates the [SalesAmount] in table FactSales, it pulls values from all the sales records, not just the sales from the channel represented by the current row in DimChannel.

In this case, because I am going from the one side of the DimChannel relationship to the many side of FactSales, I need to return a table that contains just the rows from FactSales that represent sales from the channel in the context of the current row. I can do this by using the RELATEDTABLE() function which uses a single parameter, the name of the table on the many side of the relationship. I must also have the relationship explicitly defined between DimChannel and FactSales. I have already done this. So Power Pivot can use the relationship to create a subset of rows from FactSales for the current channel. I can then use this resulting table in the SUMX() function to sum the [SalesAmount] column as shown in the following figure.

As you can see in the figure, the first row which represents the store channel displays the same sales total as we calculated from the measure in FactSales earlier.

So again, you can see that there are many different ways to define the context of an expression. When dealing with multiple tables, it is important to understand whether you can perform a row context calculation by using the RELATED() function to extend the row context to the related table on the lookup side of the relationship or whether you need to use the RELATEDTABLE() function to filter the rows used in an aggregate function like SUMX() to calculate values for a column which uses the row context to define the link to the many side table.

Next time, I’ll look at some functions that let you turn off a filtered context and show where you might use it.

C’ya next time.

Can You Put That Into Context For Me?

Over the past several years, I’ve conducted many sessions on how to use DAX in PowerPivot to create columns and measures. Throughout this time, the one thing that has confused many people with whom I have talked with is the concept of context for an expression. Many do not understand why some equations appear to work across for defining new columns while other equations only appear to work when calculating measures within a pivot table. This week, and perhaps for another week or two, I am going to take a shot at explaining the concept of context. Until you get a good understanding of context, many of the DAX expressions for columns and measures may appear to be magic that somehow get the correct answer but you may just not understand why.

Let me start by defining the two major context types that DAX expressions have to deal with:

  1. Row Context – This is probably the easiest context to understand. When you define a formula to calculate the value for a new column, the formula is applied to each row individually. In fact, the value calculated can depend only on the value of other columns in the current row. This is much like the way you calculate a new column value in Excel. However, unlike Excel, it is more restrictive. Using Row Context, it is not possible to directly access the values in a prior or successive row because those terms do not really have meaning within a table in the data model of a pivot table.
  2. Filtered Context – This is probably easiest to understand when you think of the entire set of rows in a table. If you sum a numeric column or simply count any column in the table, you are applying a filtered context. Of course, in its simplest form, the filter is really the same as all of the rows in the table, or no filter at all. But let’s assume for a moment that you have a sales table that includes sales from each of your stores in each of the states of United States. Furthermore, you only want to know the sales for stores in the state of Florida. In database terms, you apply a filter using a WHERE clause to include only those rows from the sales table in which the sales came from a store in Florida. Of course you can have more than one filter. You might filter not only on the state, but also on a specific product that was sold during a specific month of a specific year. That would require four filters to be applied to the table to reduce the total number of rows of sales to only those you wanted to sum. Filtered context is often used in expressions that involve summing, counting, averaging, finding a minimum or finding a maximum. These are also referred to as aggregate functions.

So let’s see how this works in DAX. Using my Contoso sales table, suppose I want to calculate the profit for each sale. Since I want to perform the calculation per sale and because my sales table has one row for each sale, I can use a calculated column. The formula for my calculated column begins by taking the value of the column [SalesAmount] which is the money received from the sale. This is not the profit. To calculate profit, I need to subtract costs from [SalesAmount]. To do this, I can subtract the value from the column [TotalCost]. This gives me a pretty good idea of what my potential profit might be. However, in the Contoso data, I also have to consider any returned item. These are items returned because they are damaged or do not work. The policy is to refund the customer their money. However, the returned product it not worth anything and is considered trash. Therefore, I need to also subtract from [SalesAmount] the [ReturnAmount]. If I define a new column in my sales table, I can use the formula shown in the following figure to calculate the profit for each of the sales.

As you know, the PowerPivot data model applies the column formula to each record in the table. I can then use this column in pivot tables to display profit for sales by any of the appropriate dimensions such as time, location, product, etc.

Fortunately, in PowerPivot for Excel, displaying profit in the pivot table is as simple as moving the column [Profit] to the Values area in the field list. PowerPivot lets us use any numeric field in the values area without any additional work. However, if I attempt to create the same model in Analysis Services, I must create a measure as shown below.

The SUM() function is an aggregate function and therefore defines a measure, never a column. Note however, I must still first create the column [Profit] as previously defined. I cannot simply create a measure that sums the result of the expression as shown in the following figure.

This is because the calculation: [SalesAmount] – [TotalCost] – [ReturnAmount] is a row context expression since it must be calculated on every row of the table. However the SUM() function is a filter context expression (even if the filter is all the rows). I cannot combine the two in this expression without getting the Semantic Error shown above. The SUM() function can only aggregate the values of a single column, not aggregate the values of a calculation.

What I can do however, is use one of the other DAX functions, SUMX(). This function has two parameters, the first parameter is the name of the table I want to perform a calculation on. In this case, the table name is FactSales. The second parameter is an expression that can include one or more columns from the designated table. In this case, I can use the expression I previously used to calculate the [Profit] column as shown in the following figure.

As you can see in this figure, the resulting total profit across all the records in the FactSales table is exactly the same whether I use the expression SUM([Profit]) which sums the calculated column or if I use the expressions SUMX(FactSales, [SalesAmount] – [TotalCost] – [ReturnAmount]). This second expression essentially calculates the row context expression for profit on each row in the table FactSales. However, as a measure rather a calculated column, the expression is filtered by any dimensions used in the pivot table.

In the following figure, the only dimension is sales channel. Therefore, with four sales channels, there are four rows. Note that both calculations of total profit result in the same values for each channel.

What is the difference between the two? Since all of the sales records were used to build this table since all four sales channels were included, the same number of calculations were effectively performed in both cases, calculating the profit for each sales row and then summing the profit for each channel. The only real difference is that the first one required the prior creation of a [Profit] column which takes some of the precious data model memory while the second method using the SUMX() function does not require that column.

But what if we applied other dimensions any perhaps by using a slicer displayed the profit from only some of the sales. In the following example, the row dimension displays countries. However, only countries in Asia appear based on a slicer that limits which continent I want to see profits. I also included a slicer on year to limit the years in my analysis to only two years. Therefore, in the following figure, the profit amounts represent only a subset of the total profits for the Contoso organization.

How does this affect the calculation? To calculate Total_Profit, I still need the column [Profit] to calculate value for every sales row in the FactSales table, even rows that I will not use. On the other hand, to calculate Total_Profit_2, the filter on continent and year is applied first limited the total number of rows that are needed to be calculated from the FactSales table. The calculation of profit is then only performed on this filtered subset of rows and then aggregated appropriately based on the dimensions shown.

Thus, you might say for now that column calculations are generally row context and measures are filter context expressions. However some functions when used in measures let you perform a row context expression after applying a filter context like the SUMX() function and other aggregate functions that end in ‘X’ like COUNTX(), AVERAGEX(), MAXX(), and MINX().

Next time, I will dive a little further into understanding context within DAX.

C’ya next time.

It’s a Fiscal Thing

In past blog articles, I’ve talked about the need for a date table, not just for BI/PowerPivot analysis, but for any application. You might be hard pressed to come up with many applications you have written or have used in your organizations that do not include some aspect of time as one of the dimensions by which you collect data and/or report on it. Therefore, building a Date table once that you can use for many different applications can save you a great deal of time when trying to determine how to group your data for different reporting period or to display different date labels along one of your dimensions.

One of the easiest ways to build a date table that you can use anywhere is to begin with Excel. It is a simple matter to create a column called DateKey and enter the first two dates of a large general purpose date range as shown in the following figure.

Then to create any size table, simply select the first two cells and drag the bottom right corner down to any number of rows you want. Given Excel’s limit of a million rows, this should cover you for most applications.

Once you have the first column, you can add columns for other ways of displaying date information. The following table shows a few of the more common expressions I have used to create additional columns.

<insert table here>

If you use a standard calendar, January 1 through December 31, the functions provided work well to calculate the names of months, days of the week, quarters, years, etc. However, if your organization uses a fiscal calendar, some of these calculations are not as easy. For example, suppose your organization uses a fiscal year that begins October 1 and goes through September 30th. The calculation for month number is not as simple as using the MONTH() function. Nor is the calculation of year simple either because the fiscal year for 2015 may in that case go from October 1, 2014 through September 30, 2015. So I thought you might like to see a few functions that could help you calculate fiscal columns in your data table.

Let’s start with the calculation of the fiscal year. Using the example date range above, how can I calculated the fiscal year from any given date. If I were to add 3 months to any date and then calculate the year using the YEAR() function, I would get the fiscal year. Why three months? Because there are three months in the prior calendar year. If my fiscal year started July 1, 2014 and went to June 30, 2015, I would add 6 months to the date before calculating the fiscal year using the YEAR() function. The following figure shows the column expression I need to create a CompanyFiscalYear column.

Calculating the year was not that difficult. However, calculating the month number of the fiscal year is a bit more complex. Why do I need a fiscal month number? Remember how I used the month number column as the sort by column for the month names so that the months appear in the correct order. I again have a similar problem when I am dealing with fiscal calendars. In the above example fiscal year beginning October 1, 2014, the order of the month names must be:

  • October
  • November
  • December
  • January
  • February
  • March
  • April
  • May
  • June
  • July
  • August
  • September

I can calculate the fiscal month number using the expression shown in the following figure. Note that it is a bit more complex than the expression to calculate the fiscal year. Where I’m using the number ‘9’, I’m really using ’12 – 3′ in which ’12’ is the number of months in the year and ‘3’ is the number of fiscal months in the first calendar year which in this case is October, November, and December. Therefore, if my fiscal calendar started in July, I would use ’12-6′ or ‘6’. A similar argument would explain why I need a ‘3’ at the end of argument for months in the second calendar year to increase the fiscal month number of these months appropriately. Again I would use a ‘6’ for a fiscal year starting in July because there are 6 months in the first calendar year that are part of the current fiscal year.

The final column I’m going to show today is the calculation of the fiscal quarter. In my October-September scenario, the first quarter of the fiscal year includes the months October, November, and December. Since I already have calculated fiscal month numbers, I can simplify my calculation of the quarter by simply dividing the fiscal month number by ‘3’ and rounding the result up to the next whole number if necessary using the ROUNDUP() function. Thus the following figure shows an easy calculation of the fiscal quarter.

That’s all for this time. I hope this helps you create useful date tables with not just traditional calendar dates and calculated fields, but also fiscal calendar fields that you can use in your pivot tables and reports.

C’ya next time.

Your Logical Data Model is not Normal

If you have been reading this blog over the last several years, you know that I have been a strong supporter of using PowerPivot and the Tabular model to perform data analysis, even for power users, not just DBAs. What you may not have realized is that I’ve been secretly teaching you a little about data modeling. What is data modeling? It is really nothing more than the collection of all the data in tables and the relationships between those tables in a database. Did you know that there are at least two major ‘types’ of data models when it comes to how your structure your data into tables? Each one serves a different purpose. Therefore, it can reasonably be argued that neither one is more correct than the other. But you need to understand when to use each type. That is what I want to discuss today.

Most DBAs and developers who work with databases like SQL Server, Oracle, and many common database are comfortable with the relational model for creating tables and defining the relationships that connect them. In fact, they immediately start to normalize a denomalized database in their heads within seconds of seeing the data schema. The relational model relies on the application of the rules of data normalization introduced by Edgar F. Codd in 1970. In essence, the goal of the rules of data normalization is to minimize the data redundancy which also has the effect of decreasing the overall size of the database while at the same time making it easier to maintain information that would otherwise be repeated through many records.

There are three primary rules that are used to determine whether a database, a collection of tables, has been normalized. These rules are:

First Normal Form: No two rows of data may contain repeating data or groups of data. Such repeating data must be split into a separate but connected table. For example, a sales order may contain the purchase of one or more items. Because the number of items included in a sales order is not predefined, it must be split into a separate table with one row for each item on the sales order. These two tables are then typically connected by an order id.

Second Normal Form: This rule only applies to tables which have a compound primary index, an index built from two or more fields. In this rule, all other fields in the table must depend on the entire compound index value, not only a portion of it. A possible example of this might be a table that includes students at a school in which the primary index combined the school name (or id) along with the student name (or id). Imagine that the table also included the address and phone number of the school. This information is not depended on the combination of the school and the student. It only depends on the school. Therefore, this violation of the second normal form requires that the data related only to the school be split into a second table that includes only school information.

Third Normal Form: This rule requires that every field not part of the primary index be depended on the primary index. Going back to my favorite Contoso database, one could argue that in the FactSales table, the Sales Amount field is redundant since this table also includes the sales quantity, sales price, and any sales discounts or returns. Why is it redundant? Because it can be calculated based on the other values of other columns in the record. Therefore, to fully achieve, third normal form, this field should be removed.

While there are other special case rules that can be applied to normalization of a database, most DBAs will be satisfied with a database that satisfies these three rules. They will then build the tables in their database corresponding to these rules and thus create the physical data model. It is called the physical data model because it defines the physical schema of the tables and the relationships between them.

However, business users of the data don’t look at the database that way. In fact most business users would be baffled by the large number of tables required and the relationships between them. In fact, they will not understand why they need to combine data from a half dozen to a dozen tables just to answer a single query question. As an example, let’s look at how a business user might think of just the products within the Contoso database.

The figure below shows the normalized data model for products and their subcategories and categories. As you can see from the schema, the only important piece of information in the subcategories table is the name of the subcategory. Similarly, the only important piece of information in the categories table is the name of the category.

Most business users would not think of category and subcategory names as part of separate tables, but as attributes of the product itself. In fact they would think of product information more like the following figure.

Imagine a database model with dozens of normalized tables and then try to envision how the typical business user sees the data with perhaps only a half dozen tables after denormalization.

In PowerPivot, we can address this issue in one of two ways. The first way would be to add the category and subcategory names to the product table in SQL Server before loading the data into PowerPivot. This would essentially make the physical data model the same as the logical data model used by the end-users. However, this data model would no longer be normalized.

Physically denomalizing the data tables is not the only solution. I could, as shown before in https://sharepointmike.wordpress.com/2012/07/21/power-pivot-hierarchies-part-2-of-powerpivot-2012-series/, build two calculated fields in the product table that use the RELATED() function to get the category and subcategory names and display them in the product table. I could then hide the two tables, dimCategory and dimSubcategory, from the user so that they would see a structure similar to their expected denormalized logical data model even though the data still is physically stored in a normalized structure.

The advantage of making the changes to the data model outside of PowerPivot is that it effectively reduces the amount of data that PowerPivot must store in memory thus potentially allowing a larger model to be created. The advantage of making the changes inside PowerPivot by using the RELATED() function and then hiding the technical tables that are still needed in the model but do not need to be seen by the client is that it preserves the sanity of those DBAs who cringe every time someone tries to their denormalize their data. Both methods will allow me to create similar Pivot tables in my Excel spreadsheet. Thus both methods can be considered correct.

The thing to remember is this. Data Normalization was developed to make data storage and maintenance easier and more efficient. However, data analysis and reporting often requires a fair amount of data denormalization. This is especially true when performing data analysis using reports, pivot tables and charts, and even third party data analysis tools. In fact, the logical view of the data schema aids in the performance of most data analysis. Just remember that your logical model is not a normalized model.

C’ya next time.

It’s Only the Role I’m Playing

This week I’m going to return to my favorite sample database, Contoso. If you remember, Contoso consists of sales data for a company that produces several different lines of customer electronics, from computers to phones to TVs. The data spans several years of sales. In past examples, I related the FactSales table (the table containing all of the sales records) to several tables including Channel (DimChannel), Date (DimDate), Product (DimProduct) and Product Sub-Category (DimProductSubCategory). In fact, the data source pre-defined these relations so that when I imported the data into my PowerPivot model in Excel, these relationships appeared by default as shown in the following figure.

Visually, I could switch to the Diagram view of the data model to see these relationships represented by solid lines connecting these five tables as shown below.

However, suppose I have additional information in my FactSales table. The DateKey field that I have used in the past identifies the sale date for each of the sales records. In the real world, there may be additional dates associated with each sale. For example, it is not hard to imagine that each sale would also have a Delivery Date and an Invoice Due Date. Now I might ask, does it make sense to report on sales based on the sale date or does it make more sense to report on sales based on the invoice due date or even the date of payment. Well, you might say, ‘That depends on who is asking the question.’ Very true. My Marketing/Sales manager might want to see sales by the sales date. He or she does not care about when the invoice is paid off. They just need to know if they hit their sales quotas for each month. Therefore, they might need a report that uses the default relationship and looks something like the following:

On the other hand, the Chief Financial Officer is not as concerned about when the sale took place, but when the invoice for the sale is due because only then is the income truly realized for the company. Thus the above report does not meet their needs.

Now suppose that I had a few additional columns, as mentioned earlier, that told me some other dates such as when the items were delivered and/or when the invoice for the sale was due. Let’s further suppose that these columns exist in the FactSales table and might appear like the following figure.

With these columns in my FactSales table, I would want to create relations between them and my date table (DimDate) so that I could generate reports using them. The following figure shows the Create Relationship screen in which I define a new relationship between the delivery date (DeliveryDate) column in FactSales and the date (DateKey) column in DimDate. When I click the Create button, the Power Pivot engine creates a second relationship between these two tables. (Remember the first relationship was between the sales date (Datekey) column in FactSales and the (Datekey) column in DimDate.

In a similar fashion, I create a third relation between these two tables to connect the invoice due date (InvDue) column in FactSales and the date (Datekey) column in DimDate. For the purposes of this demo, I am going to stop there. However, I could create additional relationships between any other date fields in the FactSales table and the date (DateKey) column in DimDate. Switching to the Diagram view, I could now see something like the following between these two tables.

Notice that there are three lines between FactSales and DimDate. One of these lines is solid and the other two lines are dashed. You can only have one active relationship between any two tables. That active relationship is represented by the solid line. The other two relations are inactive at the moment. These three relationships are collectively called role playing relations and DimDate is a role playing dimension because only one relation at a time can be active from the DimDate dimension. Think of it this way, the date column (DateKey) in DimDate can play one of three different roles. It can either play the part of the sales date, the delivery date, or the Invoice date. However, it can only play one role at a time. Unless all three dates were exactly the same, I have to choose which role I want DateKey in DimDate to play with the FactSales table by selecting one of the relationships.

The initial data import associate the DateKey column in DimDate to the DateKey value in FactSales which represented the date the item was sold. As long as the report requested wants to allocate sales to the sales date, I need to do nothing other than generate the Pivot Table report as shown below.

However, if I now have to generate a report for my CFO, I would have to go into the model and change the role that DateKey in DimDate plays. I might try to simply right click on the dashed line representing the connection between the DateKey in DimDate with the InvDue column in FactSales and select: Mark as Active.

This action would generate the following error message.

The problem is clearly stated. You can only have a single active relationship between the two referenced tables. Therefore, I must first right click on the current active relationship to deactivate it before activating a different relationship between the two tables.

Once the original relationship is deactivated, I can go back and activate the new relationship between DateKey and InvDue.

Knowing that only one role (relationship) can be active at a time, I am a little surprised that Microsoft did not automatically deactivate the original relationship when a new one is set active. However, like many things, they did not ask me. Plus they may have had other reasons for not automatically deactivating the current relationship. In any case, after making this change and returning to my pivot table (without making any changes to the pivot table itself), the data automatically updates using the new relationship to show the sum of sales by month of the invoice due date rather than the sales date.

If you examine the sales totals for either the years or individual months, you can quickly see that this role change for the DimDate dimension makes a significant change in the sales numbers reported each month.

That’s all for this week. I hope that now you have a better idea how you can use a single dimension to play different roles. (You could also have added the DimDate table three times to the model, once for each date column in FactSales so that each instance of the DimDate table could have a single relationship to FactSales. Then by changing which dimension you select to display in the Pivot Table, you can achieve essentially the same result. This may be easier for the end-user who does not have access to the data model to make the changes listed above. However, your power-users may prefer a simpler model with fewer tables and can change the role played by those dimensions as needed.)

C’ya next time.

Show Report Filter Pages

One of the interesting things to see when working with Pivot tables is to look at the differences between regular pivot tables that Excel has been able to create since about 1997 with Excel 97 (actually Excel 5 in1993 had basic pivot table functionality, but no wizard yet) and Power Pivot tables which has been a relatively recent addition (Excel 2010 with an add-in). Standard pivot table functionality still exists in Excel 2013 and serves as an alternatively tool for simple pivot tables that only require a single data source and less than a million rows of data.

In regular pivot tables, you might want to create a pivot table and use one of the dimensions as a filter. In the following example, I use a version of the FactSales table ripped from Contoso and I load it directly into Excel as one of the worksheets. Then I click on the Pivot Table command from the Insert ribbon to create a basic pivot table.

I added the Sales Amount field to my Values area. Being a numeric value, Sales Amount makes a good choice as a measure for a pivot table. I then added the dimensions for store and product as my horizontal and vertical dimensions generating a reasonable pivot table. However, I also added channel as the filter.

By default, after adding a filter, the pivot table still displays all values for the filter in the table. However, by using the dropdown, I can view the pivot table with one or more of the filter values at a time.

With only four possible values, it would not seem to be a big deal to view each of the channels one at a time. However, Excel provides another way to display the pivot table showing the pivot table on a series of pages with each page representing a different filter value. To do this, I can go to the PivotTable Tools menu group and select the Analyze ribbon. Then select Show Report Filter Pages… from the Options dropdown menu as shown in the image below.

When I select this option, Excel displays a dialog that let me pick the filter I want to expand. In this case, I only have a single filter on the Channel field so I select that filter and click the OK button.

Excel then generates a separate worksheet for each of the filter values and labels the worksheet tab with the filter value so you can easily click on a table to view the pivot table with the selected filter value applied.

This option makes it easier to go back and forth between views of the pivot table with different filter values applied. If you give a copy of the workbook to someone who might not be as familiar with pivot tables as you are, you can expand out the pivot table by the filter values and then lock the workbook so they cannot accidentally make changes to it. It also makes it easier to simply print a series of the pages to your printer if hardcopy is necessary.

So using the same data, I attempted to add the original data to the Power Pivot Data Model by using the Add to Data Model option in the PowerPivot ribbon

I then defined a Pivot Table from the data model using the same fields as before. When I then opened the Options dropdown menu from the Analyze ribbon of the PivotTable Tools group as I did before, I was surprised by the fact that the option to Show Report Filter Pages was greyed out as you can see below.

Apparently, this option is not available when displaying the data through the Power Pivot model, but is available from the basic Pivot table model. I’m sure there is a way to brute force some code to expand each filter value into a table in a new worksheet and rename that worksheet appropriately, but that kind of defeats the purpose of creating a “BI Tool for the Masses” which is the goal of most of my Pivot Table blogs.

C’ya next time.

The Role of Role Playing Dimensions

Perhaps you have heard the term Role Playing Dimension in regards to PowerPivot and/or Analysis Services cubes. This terms refers to the ability of one dimension to be linked to more than one fact in the fact table. Now strictly speaking, there is still only a single primary link between the dimension and the fact table and that link is used as the default link when aggregating data by that dimension. However, some dimensions, such as date dimension, often can be used against multiple fields in the fact table. For my example, I will use the Adventure Works DW data set that I use in many of my SQL Saturday presentations since it represents a typical sales database. Another good sample database is the Contoso dataset. Using the reseller sales table as my fact table (FactResellerSales), I can proceed to pull into my Excel PowerPivot table this fact table along with the dimensions for Product, Product Category, Product SubCategory, and Date. Notice that there is at least one relationship between each table and one other. This primary relationship is represented by a solid line in the figure below. However, there are three relationships between the FactResellerSales table and the DimDate table. One of the relationships is represented by a solid line and the other two are represented by dotted lines. Back in the source database, these three relationships are represented by regular indexes between these two tables. There is no indication to define which relationship is more important than the other. So how does PowerPivot select which one is the primary relationship between the two tables? The best I can figure out is that it is based on the order of the three different date keys in the fact table. In this case the OrderDateKey appears first and is therefore selected as the primary or Active relationship.

Looking at the raw sales data for a moment, I can see that the dates in the OrderDateKey, DueDateKey, and ShipDateKey are different with the order date occurring first followed by the due date and then the ship date. This would make sense in the real world. Unfortunately, if I did not have the ability to use role playing dimensions for each of these dates, I would either have to deal with only a single relationship between the date dimension table and one of these dates such as the order date or have multiple date dimension tables, one for each relationship. If I only used a single date dimension table, I would have to pick one date in my sales table to relate that dimension. Then any sum I calculated such as the sum of the order amounts, the sum of the ship amounts, or the sum of the due amounts would be associated with a single date such as the order date even though the due date may not occur until the next month or even next year. Unfortunately, this sample data does not show many such cases. However, I will show you one case at the end of the blog.

Our pivot table would then look something like this:

However, that would not be correct since we did not ship the items on the same day as the order nor was the amount due on the same day as the order. Therefore, we need another way to relate these two table when performing a summing calculation. That other way includes the use of the USERRELATIONSHIP() function which lets us define for the purposes of a single calculation which relationship we want to use. For example, to calculate the sum of the amounts ordered, we can sum the Sales Amount field by the OrderDateKey field as shown in the following equation:

When placed in our measure area of the FactResellerSales table, I can see that this calculation returns a value of over 80 million. (I will format as currency this in a moment.)

Similarly, the Shipped Amount can be calculated by summing the Sales Amount using the ShipDateKey as shown in this equation:

Finally the Due Amount can be calculated with the following equation using the DueDateKey field.

After formatting the measures as currency (which saves me time by not having to format this data in the resulting pivot tables separately), the measures appear as this:

Now I can proceed to build the pivot table by clicking on the PivotTable button in the Home ribbon and then selecting

PivotTable from the dropdown.

Excel prompts me to create the PivotTable in either a new worksheet or an existing worksheet. I will select a new worksheet here.

After clicking OK, Excel creates my empty PivotTable and opens the Field List to let me begin defining my PivotTable.

I choose a very simple layout with my calculated measures going across the top of the table and a hierarchy of time coming down the rows of the table.

By default when displaying the names of the months, Excel displays them alphabetically. Typically this will not please your manager. Therefore using a technique I discussed in an earlier blog, I define a column that contains the month number as the column by which I want to sort the column names.

Keep in mind that you only have to do this once. All subsequent PivotTables and PivotCharts will use the sort order assigned to the column EnglishMonthName without my having to do anything else.

Now the months appear in the correct order.

But more importantly for this demonstration, if I drill down into any of the months, I will see the sum of the order amounts, shipped amounts, and due amounts correctly summed by actual dates. Some positions will naturally be empty if there was no order, ship, or due activity for that day.

As promised, the following figure shows an example where orders were taken on two different days within the month but shipping and due dates only occurred once. The point being that the individual sums are correct for the days in which they appear.

Looking at the orders in April of the above figure, you may question the math saying that $2,204,542 plus $82 is $2,204,624, not $2,204,623. This is the result of rounding each individual amounts to whole dollars and is not an indication that Excel cannot perform math. All sums are calculated on the actual dollar amounts and then rounded.

So this is one simple example of how to use role playing dimensions. Dimensions other than dates can serve as role playing dimensions, but most people will encounter the need for role players when working with dates.

C’ya next time.

More Hidden Excel Gems

To follow up on last week’s discussion of hidden gems in Excel, this week I want to introduce to you something called Quick Analysis. According the Microsoft, Quick Analysis is a tool that lets you easily analyze your data using some of Excel’s built-in analysis tools such as charts, conditional formatting, pivot tables and sparklines. While the Quick Analysis shortcuts may not get you everything you need, they can get you started creating visualizations of your data, especially in regards to building charts.

So how do you use Quick Analysis? First you have to enter your data into a spreadsheet. Unlike last weeks tip which required that the data be converted into a ‘table’, this gem does not. In fact, after you enter the data into a spreadsheet, all you need to do is to highlight that portion of the data by dragging through the rows and columns that you want to analyze. When you have done this, an icon box for the Quick Analysis tools appears in the bottom right corner of the selected area as shown in the following figure.

This icon will appear when you release the mouse button after dragging through at least two cells. When you click on the icon, the following dialog box appears which shows five different analysis tools across the top of the dialog. Within each of these areas, you will see 1 or more analysis options in the lower portion of the dialog. The number of options will vary depending on the structure of your data and the area that you selected.

The figure below shows the options available to format the selected cells. This tool is essentially a shortcut to many of the more common features found in Conditional Formatting. For example, the first formatting tool displays data bars whose width defines a relative percentage of the total value of all the values in the selected column If you have more than one column selected, the relative width is based on the largest value in all of the columns. It is not calculated on a column by column basis. If you want to display relative widths based only on values within a column, you must select the data in each column separately and apply the Data Bars formatting tool to each column individually.

Similar rules apply to using the color scale formatting tool along with all of the others. You can apply more than one formatting to an individual range of cells. However, note that applying multiple formatting to a range of cells can lead to confusion. Therefore, the last icon (tool) in the formatting set is to clear all of the formatting from the selected cells.

Also keep in mind that even after you use one of the Quick Analysis tools, you can often go to the commands in the ribbon and provide further customizations. For example, suppose you select the Top 10% to highlight the top ten percent of the cells by value compared to the overall value. However, you really want to highlight the top 20%. These is no icon to do this in the Quick Analysis dialog. However, you can change the percent by following these steps:

From the Home ribbon, click on the Conditional Formatting button to open the dropdown menu. From this menu, select the last option in the list, Manage Rules.

This displays the Conditional Formatting Rules Manager as shown in the next image. Click on the rule that you want to modify keeping in mind that if you applied more than one formatting, each format option appears as a separate rule in this dialog. In this case, I want to click on the Top 10% rule to select it and then click Edit Rule to make changes to it.

In the Edit Formatting Rule dialog that appears next, look at the bottom half of the dialog. Note that you can change the rule to format either the top or the bottom group of values. You can also change the percent from the default 10% to 20% by simply replacing the value found in the dialog. You can even change the formatting colors if you do not like the default pink background with read characters.

After you have made your changes, click the OK button to apply them. You will be returned to the Conditional Formatting Rules Manager as shown below which now shows the title Top 20%.

The following figure shows what your data may look like with the top 20$ of the data conditionally formatted.

Another formatting tool is to display the selected data as a chart. Again, remember to select the data you want to appear in the chart first. It will show one or more possible charts it thinks are possible with the existing data. When you click on a chart type, a preview of the chart appears on the screen. The following chart shows a line graph created from US debt data (in 2005 dollars)

A slightly different grouping of the data allows more chart options as shown in the following image in which a clustered column chart was selected. Note that the preview appears automatically when you move your mouse over any of the chart options. The chart is not physically added to the spreadsheet until you click on the one of the chart options. Note also that many of the chart labels are either not defined or have been given generic names. After you add the chart to your spreadsheet, you can then select the various parts of the chart to edit what appears for example to add a chart title, x- or y- axis labels, a different legend, etc. by using the commands in the Design and Format ribbon of the Chart Tools that now becomes available at the top of the screen when the chart is selected.

You can even add sparklines to your spreadsheet. A sparkline is a miniature graph of the data and can either be a line, column or a win/loss chart. By default, Quick Analysis only creates sparklines for horizontal data series.

If you use the commands in Insert ribbon, you can create sparklines for a vertical series of data as well as shown in the image below.

Another interesting feature is the ability to create a pivot table from the data using Quick Analysis. I’m going to save a detailed discussion of this shortcut for a future blog. In the meantime, please try the different Quick Analysis tools the next time you work within an Excel Spreadsheet.

One last trick, when you select any range of cells, the average, sum and count of the values in that range will appear in the status bar at the bottom of the Excel window as shown in the following figure. So if all you need is a quick average or sum of values, there is no need to do anything other can just select the cells you want.

C’ya next time.



By sharepointmike Posted in Excel

Let Excel Parse For You

I’m a little short of time this week because I will be in Jacksonville by the time you read this presenting at the Jacksonville SQL Saturday, but I wanted to give you something to chew on this weekend. So I found this interesting feature of Excel which appears to have been available since Excel 2007, but is probably know by very few of you.

Have you ever had a list of data that you needed to parse? For example, suppose you have a list of email addresses as shown in the following figure from which you wanted to extract the person’s first name.

I suppose you could begin in the first row and enter ‘Mike’, then enter ‘Bill’ in the second row, ‘William’ in the third row, etc. to enter the first name of each person manually. You might also use a combination of Excel functions to find the period that separates the first and last names and then take all of the characters to the left of the period. To do that, you might use a formula like the following:

= LEFT(A2, SEARCH(“.”,A2)-1)

But what you may not know is that there is an easier way.

First, make sure that your data is a table with or without a header.  Then simply enter the first name of the person in the first row; ‘Mike’ and then begin to enter the first name of the person in the second row, “B”.

Notice that in ghost print (light grey), you can see that Excel has already figured out the first names of everyone in the list. Simply press the ENTER key to complete your entry (do not finish typing Bill’s name). Excel fills in the rest of the rows with the person’s first name.

Cool? What if we add another column for last name

This appears to work as well. Excel appears to be able to determine which delimiters surrounded the portion of the name we want to extract from the email address and is able to reproduce that across the rest of the rows.

Similarly, we can parse the rest of the email address to get the following columns.

That’s great, but I had a very specific format to the username. It was:

<<first name>>.<<last name>>

As you can see, the first name was separated from the last name by a period. What would happen if some of the names were separated by an underscore instead of a period? Of course I had to try this, but the initial results were not promising as you can see in the following figure.

I changed two of the addresses further down the list to use underscores instead of periods and when Excel tried to parse the first name, it picked up everything to the left of the first period which was found just before the top level domain portion of the address. My first thought was that this method may not suit my needs. So I tried a case in which the first two names used underscores between the first and last name as in:

<<First Name>>_<<Last Name>>

I was surprised that this worked for both names separated by periods and names separated by underscores (Jason) and periods (William, Sara, Fran, Annie, and Lisa).

I then tried the following case in which the first two names were separate first by an underscore and then by a period.

As you can see, the first name for Sam did not stop at the period, but continued on to an underscore I added between the domain server and the top level domain.  (Ok, don’t tell me in a comment that putting an underscore before the top level domain is not a valid format for an email address.  I know that, but was merely testing different patterns for parsing.)

Now I was confused. It appears that the pattern is based on the first row defining the separators which will have a preference for the separate used in that first row. For example, if I change the separator on this last example from an underscore to a period, Sam is parsed correctly, but Kyle is not.

This confirms to me that the separator in the first row defines a preference no matter where it is found in subsequent strings. However, if that separator is not found in the subsequent string, the first non alpha character will be used to parse the string as in the case of Jill above.

Of course you can use this technique to parse other column information. For example, I used it to parse address information as in the following figure.

However, I did have a problem pulling out the state name as you can see. I have not yet determined why yet other than the space character appears in many other places in the address and Excel may not be able to develop an appropriate formula to do this.  On the other hand, perhaps someone out there reading this can come up with an elegant solution and share it with the rest of us.

In the meantime, hope you were able to make it to Jacksonville. I’ll c’ya next week.

By sharepointmike Posted in Excel