In my continuing exploration of the new PowerPivot 2012 running on Windows 8 with Excel 2013, I ran across the addition of a collection of statistics functions that have been added to DAX. In the past, you could display a few basic statistics on a pivot table such as SUM, COUNT, AVG, MIN, and MAX, but these are not always enough to tell you want is really going on. For example, it is entirely possible to have two totally different series of numbers such as grades on a test that have the same average value (mean) but have totally different distributions. What do I mean by distributions. Well, let’s take a simple example of only 5 values. Suppose you had this set of test scores:
70, 75, 80, 85, 90
I kept the numbers simple so that you can easily see that the average of these numbers comes to 80. However, I could also have this set of test scores:
60, 70, 80, 90, 100
This series also has an average of 80. However, it is plain to see that the distribution of the first set of numbers is tighter around the average than the second set of numbers. Therefore merely reporting the average score does not really tell you the entire story about how students may have performed on the test.
That is where the concept of Standard Deviation comes into the picture. I’m not going to go into the way standard deviation is calculated other than to say that it sums the difference between each number and the mean of the series, squares it to eliminate negative values and divides by the number of values to come up with a statistic called the variance. The square root of the variance is the standard deviation. The smaller this number is, the tighter the series is around the mean and of course vice versa.
Knowing the standard deviation, you can also say something about the probability of additional values being close to the mean. For example, the probability of being within 1 standard deviation of the mean is 68.2% divided evenly on both sides of the mean. If you want a 95% confidence, you can use 2 standard deviations units from the mean. Actually, that gives you about 95.4% confidence.
So let’s see how you can include standard deviation in a pivot table. I’m going to begin with a table I generated for some other SQL Saturday events that includes student grades by class within a grade level as shown in the following figure.
It is important to include as columns the number series that you want to calculate statistics for. In this case, I clearly want to calculate statistics on individual test scores. If I wanted to instead calculate statistics across test scores for individual students, I would have to exchange my rows and columns so that the students would appear as the column.
To simplify what we are looking at (and because the average grade for an assignment by a single student does not really make sense), I’ve collapsed the student name level to show only the grade and class levels.
So where do we calculate the score averages? I am going to create a new measure in the scores table using the DAX function AVERAGE as shown in the following figure. (Don’t know how to create a new calculated measure? Look at last Saturday’s blog entry.) When you begin typing the AVERAGE function, you will see in the context help that it expects a column name as the one and only parameter. There are some other variations on the AVERAGE function that I will cover next time. For now, lets just stick to this version.
Similarly, I can use STDDEV.P function to find the standard deviation of the population. There is also a STDDEV.S function to find the standard deviation of a sample. What is the difference? If you are including all of the data, that is a population. If you are including only a sample of the data, that is a sample. Since I am including all of the students in the class and want the standard deviation of the class, I can use STDDEV.P. If I used this class as a sample for all classes within the school, I would have to use STDDEV.S.
Once I create a new calculated measure, I could switch over to my pivot table, find my calculated measures in the table that contains the test values and include it them in the pivot table instead of other measures as shown in the following figure.
Just for the sake of completeness this week, I’ll also calculate the variance of the scores with the following expression.
In the following figure, I’m showing my three calculated measures: AVERAGE, STDDEV.P, and VAR.P for the four homework assignments by class and grade.
You might notice that a couple of columns are missing (columns B, C, and D). Actually, I’m using another new feature of PowerPivot 2012 to freeze the first column while I scroll to the right to see my additional columns. In the prior version, it was annoying to lose the first column which defined the row as you scrolled to the right to see the rest of your columns. Excel has had a feature to freeze columns and/or rows for some time. In PowerPivot 2012, this feature now can be used within your pivot table as well. To do this, just navigate to the View ribbon and select Freeze First Column as shown below.
Now let’s calculate a 95% confidence range for each of the classes and grades. As I mentioned earlier, you need two standard deviations on either side of the mean to define a 95% confidence level. We can easily calculate the lower bounds and the upper bounds of this limit by using a simple expression based on the DAX functions directly as shown below.
Or we can use the calculated measures directly in a new expression as in:
LowLimit:=[AvgScore] – 2 * [StdDev]
Which method is better? The latter of course since any changes to the expression can be made in one place and because the calculation is performed once while the cached value is reused.Note: You must use the square brackets around the measure name or you will get an error when trying the save the expression definition.
I would of course use a similar expression for the upper limit on the 95% confidence range adding the product of two times the standard deviation to the mean.
I could then use these measures in my pivot table to show the average, standard deviation and 95% confidence interval for each class and grade in my test score database as shown below.
That’s it for this time. Next week I’ll take a look at some of the other statistical functions available in DAX for PowerPivot 2012.
C’ya next time.