PowerPivot Charts and Trendlines

Last time, I defined what a moving average is and why you might need one.  I then showed you how to calculate a moving average value in your Pivot table and then display it in a chart.  I close by asking the question whether there was a easier way to draw the moving average on your chart.  If all you need is to display the moving average on your chart and do not care what the values are, you can simply plot the [SalesAmount] field as a function of the day on the chart and then follow the simple steps that follow here.

If I right click inside the chart’s plot area (where the data is), I see a popup menu.  Near the bottom of this menu I next click the Add Trendline option which displays the following panel.

The Trendline options begin with six different ways you can ask Excel to fit a line through the data points including everything from a straight line through polynomial, exponential, logarithmic, and power functions.  However, the one I am interested in today is the last one, Moving Average.

Notice that after selecting this option, you have to define the number of period to include in the moving average where period is define by the units of the chart’s x-axis.  If we were to select 4, the resulting line would exactly duplicate the line I already displayed using my calculated 4-day average.  However, as I increase the number of days in the moving average, the fluctuations even out.  The following figure shows a 15-day moving average as a green line which now clearly shows the sales rising through the end of July and then dropping off in August.

So if I can create a Moving Average by simply using the Trendline options for a chart, why would I ever go through the process of calculating the moving average numbers manually?  The simple reason is that the Trendline option does not display the moving average values.  On the other hand, I can always display a table of my daily sales along with the moving average values as shown in this figure.

Well, I know this topic was a little short today, but I’m at SQL Saturday Orlando conducting a presentation on the new features in PowerPoint 2012.  If you have never been to a SQL Saturday event, you owe it to yourself to go to http://www.sqlsaturday.com and find out when the next one is in your area.  It is a great way to get a day of free training on SQL Server and related topics (like PowerPivot) and to network with your peers of similar interest.

C’ya next time.

Advertisements

Calculating a Moving Average in PowerPivot

Two weeks ago I promised to talk about how to generate a moving average in PowerPivot, but then last week I got sidetracked by telling you about a cool way to display YouTube videos on your SharePoint pages using a web part found on CodePlex that some of my work team members found.  It was so easy to implement, I just had to share it with you all.

However, returning back to the topic of calculating a moving average, the first question might be what is a moving average and then why would you want to use one.  A moving average is simply the sum of two or more time-dependent values in which the sum is then divided by the number of values used.  For example, if I was talking about stock prices, I may want to use something like a 7-day moving average to dampen the effect of individual day spikes or drops in the stock price that are not indicative of the overall stock trend.  (Some long term investors use even longer period moving averages.)  That does not mean that if a stock plummets or soars that I would sit back until the moving average tells me to act.  Any good stock investor will tell you there are many other factors both internal and external to a company that could force your hand to sell or buy any particular stock.  But the point is, and this is the answer to the second question, a moving average dampens randomness so I can more easily see the overall pattern of the numbers that I am tracking.

Ok, so suppose I work for Contoso and wanted to know whether sales are rising, falling or generally flat.  If I look at daily sales, the numbers are likely to fluctuate up and down in no particular pattern preventing me from spotting an overall trend.  The following figure shows Contoso daily Contoso sales over a 3 month period during the summer of 2008.  I chose to show the data as a chart to help show how sales fluctuate by day revealing information that I might not be able to see as easily had I created a table of the same values.

Of course, I could chart an entire year or more, but to see individual days, I would have to widen the chart substantially.  However, even with this smaller time period, I can see that the sales fluctuate quite nicely.  But I might ask are sales increasing, decreasing or staying the same.  If I have a good eye, I might say that sales peak toward the end of July and then fall back a little as the chart moves into August.  But that is not as obvious as the fact that there is a great deal of daily fluctuation.

So how can I visually display trends?  With Moving Average Sales.  Now for the purpose of this illustration, I’m going to create a four-day moving average, but honestly, there is no one right number of periods in a moving average.  In fact, I should experiment with different time periods to see which time period allows me to spot not only overall trends, but also in this case where I am displaying store sales, at seasonal changes.

I already know that if I display data by day, I can use the following formula to calculate the daily sales of just our ‘store’ channel. (Yes, I could simply use [SalesAmount] and apply a channel slicer to use only ‘store’ sales, but let’s stick with the example.)

StoreSales:=CALCULATE(SUM([SalesAmount]),DimChannel[ChannelName]=”Store”)

I can then use this calculated measure to calculate the previous day’s sales for any day by creating the following measure.

StoreSales1DayAgo:=CALCULATE([StoreSales], DATEADD(DimDate[DateKey], -1, day))

You might be able to guess that the formula for calculating sales for two days ago and three days ago respectively are:

StoreSales2DayAgo:=CALCULATE([StoreSales], DATEADD(DimDate[DateKey], -2, day))

StoreSales3DayAgo:=CALCULATE([StoreSales], DATEADD(DimDate[DateKey], -3, day))

With these four values calculated for each day, I can calculate the sum of these values and divide by 4 to get a 4 day moving average using the following calculated value:

FourDayAverage:=([StoreSales] + [StoreSales1DayAgo] + [StoreSales2DayAgo] + [StoreSales3DayAgo]) / 4.0

Now if I switch back to my chart page, I should see that Excel updates the field list to include the new calculated measures.  If I then add the field [FourDayAverage] to the Values box creating a second series in the chart, I now have both the actual daily sales and the four day moving average displayed in the same chart.  The only problem is, I would also want to change the chart format to display the daily sales (my first data series) as columns and my moving average (my second data series) as a line.  When I right click on the chart and select Change Chart Type, I can select Combo as the chart type as shown in the following figure.  In this case, the Clustered Column – Line chart is exactly what I want.  Because I added the moving average series to the Values area last, it becomes by default the Line and all other data series appear as clustered columns.  Since I only have one value for each day, the chart shows an individual column per day.

If I had entered my data series into the Values area in the wrong order, I could simply use this dialog to select the chart type for each series.  When I click OK in this dialog, my chart now looks like the following which more clearly shows the more of the overall trend and less daily fluctuation.

But wait, is there an easier way to do this? Why yes there is.  But to learn how to do that, you will have to wait until next week.

C’ya next time.

Statistics: PowerPivot Style, Part 3: The ‘X’ functions

This week I continue my look at some of the new statistical functions in PowerPivot 2012.  Last time I looked at the difference between regular statistical functions like AVERAGE, COUNT, MIN and MAX and the ‘A’ versions of the same functions.  I saw that the basic difference between these two sets was the way they treated non-numeric and empty values.  However, both versions of these functions assume that I want to display the resulting value in the same table as the values they aggregate.  Therefore, if I want to create an AVERAGE function against student scores, I need to create a calculated measure in the same table that holds the student scores using an expression like:

AvgScore := AVERAGE([AssignmentScore])

But what if I want to calculate the average of the assignment scores in a different table than the Assignments3 table.  The answer to that requires the use of the ‘X’ functions.  I suppose the ‘X’ represents the fact that rather than supplying a single column from the current table as the argument to the function, these functions require me to define a mathematical eXpression from another table.  Well, it does not actually have to be another table, and it does not have to be any more of a mathematical expression other than a single column.  For example, the following expression defines in the same assignment score values as defined by AvgScore defined in the previous expression.

Avg := AVERAGEX(Assignments3, [AssignmentScore])

The point however is that I can take that same expression and drop it into the calculation area of any of the other tables such as my Grades table and the calculation works exactly the same since the function is directed to a specific table and a specific calculated expression (or a single column in this case).

To show a more practical example of this technique, suppose I want to list students in order not alphabetically by their name, but by their total scores.  My pivot table displays the student names along with the sum of their assignment scores.  Calculating the sum of the students scores would be simple by itself since I could simply display their scores as the (Sum of AssignmentScore) measure in the pivot table and drag the student name to the row label area.  (Actually, I also added a calculated column to the Students table named StudentName to concatenate the student’s last name and first name using the expression:  =[StudentLName] & “, “ & [StudentFName]).  The following figure shows a portion of the resulting pivot table.

Note, the students appear alphabetically by their last name, not by the sum of their scores, which is what I want.  How can we sort the rows by the sum of the values?  The secret lies in the use of the Sort by Column feature we mentioned two weeks ago.  However, the column that I want to sort by must exist in one of the model tables, not as a generated measure ‘column’ in the resulting pivot table.  In other words, I cannot tell the pivot table to sort the rows by the Sum of Student Scores column.

Since the column I want to sort, StudentName, is in the Students table, I need to create a calculated column in Students that displays the sum of the student’s scores.  I can do that with the column DAX expression:

SUMX(RELATEDTABLE(Assignments3), Assignments3[AssignmentScore])

In this expression, instead of just using the name of the table as the first parameter, I used the expression RELATEDTABLE(Assignments3).  Why?  Because when I use this expression in the Students table, I want to sum the assignments for the related student, not all of the students.  Furthermore, there could be multiple records in Assignment3 that have scores for the current student.

In fact, RELATEDTABLE is exactly what I want because it returns a temporary table that only contains the rows for the current student.  Why do I need a temporary table?  Because I only need the table to exist long enough to perform a calculation, summing all the AssignmentScore values from each record in this temporary table.

On the other hand, when I define a calculated measure in a measure table that references fields in a dimension table, I can usually use just the RELATED function to reference a column from the single related dimension row since there is only one dimension record corresponding to each record in the fact table.

Of course, the second parameter in the SUMX function, as you may guess, defines the expression I want to sum.  In this case, that expression is simply a reference to a single column: [AssignmentScore].

Now I can select the StudentName column and click the Sort by Column option in the Sort and Filter group of the Home ribbon.  Returning to the pivot table that I previously built, I see that the order of the rows automatically have been updated to display the students by the Sum of StudentScore field which I’ve included to show that the order is correct.   Interesting that the default order is from the largest value to the smallest value when using the Sort by Column.

How do I change the sort to show the lowest score students at the top and the highest student scores at the button.  I could right click in any cell in the Sum of StudentScore and select the Sort Smallest to Largest option.  Perhaps you are thinking that you could have achieved the same result without using a sort by column in the Student table and just using the sort property of the measure column in the pivot table to show the students in ascending order or descending order.

And you would be right.  In fact, you could also use the dropdown menu for the StudentName field and define a sort using the following dialog that references the Sum of StudentScores field as either an ascending or descending (shown) sort.

However, these methods only works when you actually display the Sum of StudentScore field.  What would you do if you did not display that field in the pivot table.  Aaahhhh!  That is why the Sort by Column option in the Student table using a calculated column to total each student’s scores can be important.

Next time, I’ll show you another option to rank the students by their scores by using the RANKX function with creating the intermediate Sum of StudentScore column.  Furthermore, the RANKX function provides some additional options that could be important.

C’ya next time.

Statistics: PowerPivot Style, Part 2: The ‘A’ functions

This week I will continue with a look at some of the new statistical functions in PowerPivot 2012.  Perhaps you have noticed that several of the statistical functions have one version with the regular function name, another function with an ‘A’ appended to the end of the function and a third version with  an ‘X’.  This week I will look at the ‘A’ functions and next week I’ll look at the ‘X’ functions.

The following functions are both a ‘regular’ version and an ‘A’ version:

AVERAGE            –>           AVERAGEA

COUNT                 –>           COUNTA

MAX                      –>           MAXA

MIN                       –>           MINA

Let’s begin by looking at the AVERAGE function.  First, DAX functions typically act on columns or tables.  Furthermore, we all know that the AVERAGE function returns the arithmetic mean of a series of numbers.  Combining these two pieces of information, it would be a correct assumption that the AVERAGE function calculates the mean of a series of column values based on the aggregate level of the row.

Take the example from last week as shown again in the following figure.

Suppose we wanted to calculate the class average for assignment Homework 1 for the Bears.  We would add the number: 8 + 9 + 6 + 7 + 10 + 7 + 10 + 8 + 6 + 6 arriving at the total of 77.  Since there are 10 students in this class, the class average on this homework assignment is 77./10. or 7.7.  In fact, this is the value that you can see in the third figure from last week’s blog topic.

What may not be obvious is how PowerPivot treats non-numeric data.  In fact, the function returns blanks if the column contains any text values because the aggregation does not know how to handle text.

Furthermore, any cell that is either empty or contains a logical value will be ignored even though logical values are represented by either a ‘0’ or ‘1’.  If Wilson did not turn in Homework 1 and the teacher left the cell empty, the sum of the other student’s grades would be 71, but AVERAGE divides this value of 71 by 9 instead of 10 because only 9 grades were entered resulting in a value of 7.89.

However, if the teacher enters a value of zero for Wilson’s missing homework assignment, that cell is counted not only in the sum (which of course still adds nothing to the sum) as well as to the count of values.  Then our total of would again be 71 instead of 77, but we would divide the sum by 10 to get an average of 7.1.

Thus, whether the teacher enters a zero for a missing assignment or just leaves the value blank can have a major impact on the average class grade.

So what happens if I use the AVERAGEA function instead of the AVERAGE function?  First, rest assured that if all of the students had numeric grades entered for the assignment, the AVERAGEA function would return the same result as the AVERAGE function.  What happens when Wilson does not turn in her assignment?  If the teacher enters a ‘0’, the average using AVERAGEA is still 7.1.  If the teacher skips over Wilson leaving the assignment grade blank, AVERAGEA treats the blank as a zero and still calculates the AVERAGEA value of 7.1

However, some other side effects of using AVERAGEA is that logical functions are also included in the calculation of the mean.  A logical true evaluates to a value of ‘1’ and a logical false evaluates to a value of ‘0’.  Furthermore, any non-numeric text also evaluates as ‘0’.  In each of these cases, the number of values in the series is incremented by 1 just like a regular numeric value.  The net result is that for any column that could contain non-numeric text or logical values, AVERAGEA may result in a lower average than if it ignored these values.  On the other hand, AVERAGEA treats a blank field as a value of ‘0’ which in many cases may be exactly what you want.

If you understand this fundamental difference between AVERAGE and AVERAGEA, you might be able to deduce the difference between the other functions listed at the top of this article.  This difference can be especially tricky when using the MINA and MAXA functions when logical values exist in the column.  In the case of using MINA with a series of positive numbers, a logical value of FALSE could result in a zero return value and a logical value of TRUE could result in a return value of ‘1’ assuming that all the other values in the series are greater than ‘1’.

This is a case where knowing that application that generates the data file can be very important.  If you know that the application that teachers use to enter grades will allow only numeric values in the grades field and if the program requires the teacher to enter something, even if it is a zero for a missing assignment, you may want to stick with the functions AVERAGE, MIN, MAX, and COUNT.  On the other hand, if you know the program allows the teacher either to enter a numeric grade or to leave the entry blank, you may want to use the AVERAGEA, MINA, MAXA, or COUNTA functions since a blank value acts exactly like a value of zero.

C’ya next time when I look at the ‘X’ functions.

Statistics: PivotTable Style, Part 1

In my continuing exploration of the new PowerPivot 2012 running on Windows 8 with Excel 2013, I ran across the addition of a collection of statistics functions that have been added to DAX.  In the past, you could display a few basic statistics on a pivot table such as SUM, COUNT, AVG, MIN, and MAX, but these are not always enough to tell you want is really going on.  For example, it is entirely possible to have two totally different series of numbers such as grades on a test that have the same average value (mean) but have totally different distributions.  What do I mean by distributions.  Well, let’s take a simple example of only 5 values.  Suppose you had this set of test scores:

70, 75, 80, 85, 90

I kept the numbers simple so that you can easily see that the average of these numbers comes to 80.  However, I could also have this set of test scores:

60, 70, 80, 90, 100

This series also has an average of 80.  However, it is plain to see that the distribution of the first set of numbers is tighter around the average than the second set of numbers.  Therefore merely reporting the average score does not really tell you the entire story about how students may have performed on the test.

That is where the concept of Standard Deviation comes into the picture.  I’m not going to go into the way standard deviation is calculated other than to say that it sums the difference between each number and the mean of the series, squares it to eliminate negative values and divides by the number of values to come up with a statistic called the variance.  The square root of the variance is the standard deviation.  The smaller this number is, the tighter the series is around the mean and of course vice versa.

Knowing the standard deviation, you can also say something about the probability of additional values being close to the mean.  For example, the probability of being within 1 standard deviation of the mean is 68.2% divided evenly on both sides of the mean.  If you want a 95% confidence, you can use 2 standard deviations units from the mean.   Actually, that gives you about 95.4% confidence.

So let’s see how you can include standard deviation in a pivot table.  I’m going to begin with a table I generated for some other SQL Saturday events that includes student grades by class within a grade level as shown in the following figure.

It is important to include as columns the number series that you want to calculate statistics for.  In this case, I clearly want to calculate statistics on individual test scores.  If I wanted to instead calculate statistics across test scores for individual students, I would have to exchange my rows and columns so that the students would appear as the column.

To simplify what we are looking at (and because the average grade for an assignment by a single student does not really make sense), I’ve collapsed the student name level to show only the grade and class levels.

So where do we calculate the score averages?  I am going to create a new measure in the scores table using the DAX function AVERAGE as shown in the following figure.  (Don’t know how to create a new calculated measure?  Look at last Saturday’s blog entry.)  When you begin typing the AVERAGE function, you will see in the context help that it expects a column name as the one and only parameter.  There are some other variations on the AVERAGE function that I will cover next time.  For now, lets just stick to this version.

Similarly, I can use STDDEV.P function to find the standard deviation of the population.  There is also a STDDEV.S function to find the standard deviation of a sample.  What is the difference?  If you are including all of the data, that is a population.  If you are including only a sample of the data, that is a sample.  Since I am including all of the students in the class and want the standard deviation of the class, I can use STDDEV.P.  If I used this class as a sample for all classes within the school, I would have to use STDDEV.S.

Once I create a new calculated measure, I could switch over to my pivot table, find my calculated measures in the table that contains the test values and include it them in the pivot table instead of other measures as shown in the following figure.

Just for the sake of completeness this week, I’ll also calculate the variance of the scores with the following expression.

In the following figure, I’m showing my three calculated measures: AVERAGE, STDDEV.P, and VAR.P for the four homework assignments by class and grade.

You might notice that a couple of columns are missing (columns  B, C, and D).  Actually, I’m using another new feature of PowerPivot 2012 to freeze the first column while I scroll to the right to see my additional columns.  In the prior version, it was annoying to lose the first column which defined the row as you scrolled to the right to see the rest of your columns.  Excel has had a feature to freeze columns and/or rows for some time.  In PowerPivot 2012, this feature now can be used within your pivot table as well.  To do this, just navigate to the View ribbon and select Freeze First Column as shown below.

Now let’s calculate a 95% confidence range for each of the classes and grades.  As I mentioned earlier, you need two standard deviations on either side of the mean to define a 95% confidence level.  We can easily calculate the lower bounds and the upper bounds of this limit by using a simple expression based on the DAX functions directly as shown below.

Or we can use the calculated measures directly in a new expression as in:

LowLimit:=[AvgScore] – 2 * [StdDev]

Which method is better?  The latter of course since any changes to the expression can be made in one place and because the calculation is performed once while the cached value is reused.

Note: You must use the square brackets around the measure name or you will get an error when trying the  save the expression definition.

I would of course use a similar expression for the upper limit on the 95% confidence range adding the product of two times the standard deviation to the mean.

I could then use these measures in my pivot table to show the average, standard deviation and 95% confidence interval for each class and grade in my test score database as shown below.

That’s it for this time.  Next week I’ll take a look at some of the other statistical functions available in DAX for PowerPivot 2012.

C’ya next time.

A PowerPivot Sort By Any Other Name

The past several weeks I have been looking at the new features of PowerPivot 2012.  I also mentioned this past Tuesday, that the new Microsoft Office 2013 version of Excel includes both PowerPivot and Power View as part of the install.  They just need to turned on.  So this week I’m continuing my Contoso example with Excel 2013 and PowerPivot running on Windows 8.  Therefore, you may notice some differences in the look of the Excel spreadsheet.  Don’t fear.  The techniques  I discuss here work equally well if you are using Vista with Excel 2010 and PowerPivot 2012.

Today I am going to look at a problem that we solved a different way several months ago.  That problem is how to sort month labels chronologically rather than alphabetically.  Back on November 6, 2011 (Wow, was it really that long ago?) I described a way to create a custom sort order for months.  While that method still works in PowerPivot 2012, there is a new way to sort a column by using another column.  Let’s see how that works.

I begin by opening my Contoso 2012 sample data and create a simple pivot table that shows total sales by month.  As shown in the figure below, when I drag the calendar month label field into the rows, the rows appear by default sorted alphabetically.  Unfortunately, that is not a good sort order for months.

To fix this sort order, return to the PowerPivot window and open the tab (table) containing the dates.  In Contoso, this is the dimDate table.  In the Home ribbon find the Sort and Filter option group.  A new button in this group is labeled Sort by Column.    You could immediately click on this button to open the sort dialog.  However, I recommend first clicking on the column for which you want to define the new column sort, then click the Sort by Column button.

This action opens the following dialog and pre-populates the Sort column with the column that you selected.  This saves you a step.  However, if you forget, you can always use the dropdown arrow to display all fields in the current table and select the column from the list.

Next select the column you want to sort by and click the OK button.  You may be tempted to sort by the existing column CalendarMonth.  However, this will generate an error because there are multiple values for CalendarMonth for each value of CalendarMonthLabel because CalendarMonth has the format YYYYMM.  In other words, July would match not just to 07 or 7, but rather to 200507, 200607, 200707, and 200807.  You cannot match the sort column to multiple values in the sort by column.

Therefore, I created a simple calculated column named Month which only consists of the month number as shown in the next figure.

Now I can return to the Sort By Column dialog and create a sort relationship between CalendarMonthLabel and Month as shown in the following dialog.

When I click OK and return to my pivot table in the Excel spreadsheet, I will see that the table automatically refreshes and now displays the month labels in the expected chronological order that makes sense for most users.

That’s it for this week.  A short, simple, but very useful way to sort any column by another column in PowerPivot 2012 or Excel 2013.

C’ya next time.