Back to School

For many, it is time for back to school.  Maybe you yourself are not going back to school, but you have children who are going back to school.  How are you going to make this year different from previous years?  How are you going to help your children succeed?  Or do you take the point of view that educating your children is the responsibility of the school and the teachers at the school, and you don’t want to interfere lest you undermine some of the things they are teaching?

This is the first year, after 18 years of sending our daughter off to school, that we do not have to think about these questions.  She graduated with a Ph.D. in Pharmacy this past Memorial Day weekend.  Even better, she has a job working as a resident at a local clinic.  Just like medical doctors, pharmacists who want to work in hospitals must complete a one-year residency before they can get a full-time hospital job.  The good thing is that the residency does not have to be at the same hospital where they eventually want to work.  It can be anywhere.  Of course, many wind up staying where they serve their residency.

Thinking back on last week’s discussion about certifications, I am reminded that even to work as a resident, she had to pass both a national and a state pharmacy exam to show basic competency.  However, it does not end there.  She will be required to earn continuing education credits for the rest of her career.  Well, I heard that after 50 years of being a pharmacist, they do relax that requirement.  However, the point is the requirement to keep learning, because the pharmacy industry, like the computer industry, is in a constant state of change, and what you knew at the time of your initial certification test will soon be out of date as new drugs replace old drugs and therapy options keep advancing.

Getting back to the topic of going back to school, have you thought about taking time to learn something new yourself this year?  As your kids go back to school, it may be a good time to set a good example by picking something you want to know more about and either registering for a class at a local college or at least going out to your favorite bookstore and picking up one or two books on the topic.  It could be work related or just a personal area of interest.  The challenge is simply to find something that you want to learn more about.

After you have picked a subject area that you want to study, you need to develop a plan of action.  Perhaps you want to cover a chapter a week or perhaps you want an even finer level of motivation and set a goal of x pages a day.  The specific goal does not matter.  Having a goal and following through with it on a regular basis will show your kids how to apply discipline in their own studies.

If there is a certification for the area of study you have chosen, plan to take the test.  Even if the certification does not matter to your current work, it is a way to show your kids that your study efforts were serious and had a goal in mind.  Then, should that certification help you achieve a better position with your current employer, or perhaps help you get a better job with more money, better benefits, or simply work you enjoy more, you are still teaching your kids the valuable lesson that study and learning new things can be rewarding.

This year Microsoft is making it easy for you to pick from a variety of new and updated software to study and become an expert in.  You can choose from operating systems to PowerShell to Microsoft Office 2013 to SharePoint 2013 to SQL Server 2012 and more.  Some of these technologies don’t have books available yet.  However, there are many websites with information, including whitepapers and videos on Microsoft’s site, to get you started until the books become available.  As you dive into your studies, consider writing an occasional blog post to talk about one or two things that you learned and try to explain to others how they can use what you learned.  WordPress.com, where this site is hosted, offers free blog sites to anyone.  Blogging is also a way to improve your communication skills, and if you do start looking for a new position, you can always point potential employers to your site to show what you know, rather than merely point to a certification that says you crammed a bunch of knowledge into your head long enough to take a test.

Think about that until next time.  C’ya.


Statistics: PowerPivot Style, Part 3: The ‘X’ functions

This week I continue my look at some of the new statistical functions in PowerPivot 2012.  Last time I looked at the difference between regular statistical functions like AVERAGE, COUNT, MIN and MAX and the ‘A’ versions of the same functions.  I saw that the basic difference between these two sets was the way they treated non-numeric and empty values.  However, both versions of these functions assume that I want to display the resulting value in the same table as the values they aggregate.  Therefore, if I want to create an AVERAGE function against student scores, I need to create a calculated measure in the same table that holds the student scores using an expression like:

AvgScore := AVERAGE([AssignmentScore])

But what if I want to calculate the average of the assignment scores in a different table than the Assignments3 table?  The answer requires the use of the ‘X’ functions.  I suppose the ‘X’ represents the fact that rather than supplying a single column from the current table as the argument to the function, these functions require me to define a mathematical eXpression evaluated over a named table.  Well, the table does not actually have to be a different table, and the expression does not have to be any more complicated than a single column.  For example, the following expression returns the same average of assignment score values as the AvgScore measure defined in the previous expression.

Avg := AVERAGEX(Assignments3, [AssignmentScore])

The point however is that I can take that same expression and drop it into the calculation area of any of the other tables such as my Grades table and the calculation works exactly the same since the function is directed to a specific table and a specific calculated expression (or a single column in this case).
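As a mental model (this is illustrative Python, not DAX; the table and column names just mirror the example above), an ‘X’ function iterates the rows of the table you hand it, evaluates an expression for each row, and then aggregates the results:

```python
# Illustrative Python, not DAX: an 'X' function iterates the rows of the
# table named in its first argument, evaluates the expression in its
# second argument for each row, and aggregates the results.

def averagex(table, expression):
    """Evaluate the expression per row, then average the results."""
    values = [expression(row) for row in table]
    return sum(values) / len(values)

# A made-up stand-in for the Assignments3 table.
assignments3 = [
    {"AssignmentScore": 8},
    {"AssignmentScore": 9},
    {"AssignmentScore": 7},
]

# In the same spirit as Avg := AVERAGEX(Assignments3, [AssignmentScore])
avg = averagex(assignments3, lambda row: row["AssignmentScore"])
print(avg)  # 8.0
```

Because the table is named explicitly in the call, the same expression works no matter which table hosts the measure, which is exactly the portability point made above.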

To show a more practical example of this technique, suppose I want to list students in order, not alphabetically by their names, but by their total scores.  My pivot table displays the student names along with the sum of their assignment scores.  Calculating the sum of the students’ scores would be simple by itself, since I could simply display their scores as the (Sum of AssignmentScore) measure in the pivot table and drag the student name to the row label area.  (Actually, I also added a calculated column to the Students table named StudentName to concatenate the student’s last name and first name using the expression:  =[StudentLName] & “, “ & [StudentFName]).  The following figure shows a portion of the resulting pivot table.

Note that the students appear alphabetically by their last name, not by the sum of their scores, which is the order I actually want.  How can we sort the rows by the sum of the values?  The secret lies in the use of the Sort by Column feature we mentioned two weeks ago.  However, the column that I want to sort by must exist in one of the model tables, not as a generated measure ‘column’ in the resulting pivot table.  In other words, I cannot tell the pivot table to sort the rows by the Sum of Student Scores column.

Since the column I want to sort, StudentName, is in the Students table, I need to create a calculated column in Students that displays the sum of the student’s scores.  I can do that with the column DAX expression:

SUMX(RELATEDTABLE(Assignments3), Assignments3[AssignmentScore])

In this expression, instead of just using the name of the table as the first parameter, I used the expression RELATEDTABLE(Assignments3).  Why?  Because when I use this expression in the Students table, I want to sum the assignments for the related student, not all of the students.  Furthermore, there could be multiple records in Assignments3 that have scores for the current student.

In fact, RELATEDTABLE is exactly what I want because it returns a temporary table that only contains the rows for the current student.  Why do I need a temporary table?  Because I only need the table to exist long enough to perform a calculation, summing all the AssignmentScore values from each record in this temporary table.

On the other hand, when I define a calculated measure in a measure table that references fields in a dimension table, I can usually use just the RELATED function to reference a column from the single related dimension row since there is only one dimension record corresponding to each record in the fact table.

Of course, the second parameter in the SUMX function, as you may guess, defines the expression I want to sum.  In this case, that expression is simply a reference to a single column: [AssignmentScore].
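Putting the pieces together, here is a rough Python sketch (not DAX) of what the calculated column does for each row of the Students table; the data is made up to match the shape of the blog’s model:

```python
# Rough Python sketch of the calculated column above: RELATEDTABLE
# narrows Assignments3 to the rows for the current student, and SUMX
# totals an expression over that temporary table.

students = [
    {"StudentID": 1, "StudentName": "Cook, Tracy"},
    {"StudentID": 2, "StudentName": "Wilson, Pat"},
]

assignments3 = [
    {"StudentID": 1, "AssignmentScore": 8},
    {"StudentID": 1, "AssignmentScore": 9},
    {"StudentID": 2, "AssignmentScore": 6},
]

def relatedtable(table, student_id):
    """Temporary table holding only the rows related to one student."""
    return [row for row in table if row["StudentID"] == student_id]

def sumx(table, expression):
    """Sum the expression evaluated against each row of the table."""
    return sum(expression(row) for row in table)

# The calculated column: one total per row of the Students table.
for student in students:
    related = relatedtable(assignments3, student["StudentID"])
    student["SumScores"] = sumx(related, lambda row: row["AssignmentScore"])

print(students[0]["SumScores"])  # 17
print(students[1]["SumScores"])  # 6
```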

Now I can select the StudentName column and click the Sort by Column option in the Sort and Filter group of the Home ribbon.  Returning to the pivot table that I previously built, I see that the order of the rows has automatically been updated to display the students by the Sum of StudentScore field, which I’ve included to show that the order is correct.  Interestingly, the default order when using Sort by Column is from the largest value to the smallest value.

How do I change the sort to show the lowest student scores at the top and the highest at the bottom?  I could right-click any cell in the Sum of StudentScore column and select the Sort Smallest to Largest option.  Perhaps you are thinking that you could have achieved the same result without using Sort by Column in the Students table, just by using the sort property of the measure column in the pivot table to show the students in ascending or descending order.

And you would be right.  In fact, you could also use the dropdown menu for the StudentName field and define a sort using the following dialog that references the Sum of StudentScores field as either an ascending or descending (shown) sort.

However, these methods only work when you actually display the Sum of StudentScore field.  What would you do if you did not display that field in the pivot table?  Aaahhhh!  That is why the Sort by Column option in the Students table, using a calculated column to total each student’s scores, can be important.

Next time, I’ll show you another option to rank the students by their scores by using the RANKX function without creating the intermediate Sum of StudentScore column.  Furthermore, the RANKX function provides some additional options that could be important.

C’ya next time.

Certifiable

Does your company or organization require certifications for jobs, either to keep existing jobs or to get hired in the first place?  Over the years I’ve had different opinions about the entire certification process and whether it really leads to better IT folks or whether it is used merely as a way to cut through the hundreds of resumes submitted for that new position in your company.

The problem has always been that most certification exams test memorization of key facts (or trivial facts, depending on which side of the argument you sit on).  There is an industry of books, websites, training classes, and more that promise to help you prepare for some of the certification exams.  Some of the so-called ‘boot camps’ even promise that if you do not pass the exam, you can come back and attend the class again at no cost, or at least a reduced cost, to improve your odds the second time.

If that was the only problem, it would not be so bad (or maybe it would).  The real problem has been the abuse of the certification system when some people desperate for a passing score hire someone else to take the test for them.  Testing centers have tried to prevent this by requiring photo IDs of the person taking the test.  Some even fingerprint the person going into the testing room and for long tests, each time they leave for a restroom break.  Some testing centers even have people who are paid to do nothing else but to sit and watch you take the test.  Don’t you think that adds to the anxiety of taking the test to have someone watching you?  How does that represent the real world and what a person can do?

Another problem with the certification process, even for those who legitimately pass the exam, is that they probably spent a few weeks cramming to take the test.  If you were to spot test them a few months (or even a few weeks) later, you might be surprised at how much of the information they have forgotten.  Any good college student knows how to study for an exam so they can remember the facts long enough to take the test (especially for Organic Chemistry), knowing that they will not need to remember those facts after the test is over.  Besides, they know that in their day job, half or more of the ‘stuff’ they are required to learn for the test they will never need to use.

So what is the solution?  I’ve been thinking recently about the advantages of a trades-like approach in which new programmers, DBAs, etc. would apprentice under a person with several years of experience in the area where they want to work.  There might even be several levels of competency, and each level might require a combination of time spent using the particular technology along with some provable practical success.

Take for example SQL Server.  In SQL Server 2008 and 2012 there are several tests that can be taken.  Each test has a slightly different focus on different skills with different parts of the product.  In most companies that have more than a single DBA, it is not unusual to have different DBAs with different expertise.  Not everyone needs to know how to set up a mirrored database or a parallel database, but everyone probably needs to understand some basic SQL syntax, the way rights work, why indexes are important, etc.  Some people are pretty good at matching the exam goals with their job description and focus only on the certification exams that make sense for them.

But then there are the certification collectors.  You know who I mean.  The people who try to take every certification exam that is ever created, not just for the main product they use, but for every other related product.  We may all know someone who has certifications in one or more Windows products, SQL Server, Office, Exchange, a DotNet programming language, and SharePoint.  The questions I have for them are: ‘How much do you really know about each of those areas?  Did you just visit a ‘braindump’ site to obtain the training materials you needed to pass the test (let’s not debate the ethics of such sites this time)?’

If you use certifications as a qualification for those you hire, may I remind you that there is more to a good employee than whether they were able to study a bunch of facts and remember them long enough to take an hour-long exam.  Can they work well with teams?  Can they explain how a system works to an end-user without the end-user’s eyes glazing over?  Can they write documentation for a system that someone else can read a year later and use to make changes to the system without pulling all of their hair out?

My point is that there are a lot of other considerations in what makes a good employee besides whether they can pass a test.  Technology changes.  A certification test taken today may not have much value in two or three years.  Unless your company or organization is willing to continue to invest in employees to maintain their certifications as new products come and go, using certifications as a hiring criterion is a false requirement.  If you hire an employee who previously held several certifications and your organization does not support continued training, they will likely leave on their own to pursue other opportunities that not only recognize the certifications they have, but will help them maintain a reasonable level of certification in the future.

For now, certifications are about the only way to measure technical competency.  However, I still go back to wondering whether you could begin with a general exam to get ‘admitted’ into the IT profession which shows a basic understanding of generic concepts.  Then, after a defined period of time ‘apprenticing’ under a professional, a group of the person’s peers could bestow on them a master’s certificate for that particular set of skills.

In full disclosure, many years ago I obtained the designation of Microsoft MSDE.  However, most of those exams have since expired or become unimportant in today’s market.  So recently I’ve been looking at taking the BI certification exams for SQL Server, not SharePoint, because of my interest in business intelligence and the fact that only the SQL Server side of certification has a BI track.  However, for me, the journey is of more value than the destination.  Do you have thoughts on certification?  Comment back to me.

C’ya next time.

Statistics: PowerPivot Style, Part 2: The ‘A’ functions

This week I will continue with a look at some of the new statistical functions in PowerPivot 2012.  Perhaps you have noticed that several of the statistical functions have one version with the regular function name, another version with an ‘A’ appended to the end of the function name, and a third version with an ‘X’.  This week I will look at the ‘A’ functions and next week I’ll look at the ‘X’ functions.

The following functions have both a ‘regular’ version and an ‘A’ version:

AVERAGE -> AVERAGEA

COUNT -> COUNTA

MAX -> MAXA

MIN -> MINA

Let’s begin by looking at the AVERAGE function.  First, DAX functions typically act on columns or tables.  Furthermore, we all know that the AVERAGE function returns the arithmetic mean of a series of numbers.  Combining these two pieces of information, it would be a correct assumption that the AVERAGE function calculates the mean of a series of column values based on the aggregate level of the row.

Take the example from last week as shown again in the following figure.

Suppose we wanted to calculate the class average for assignment Homework 1 for the Bears.  We would add the numbers: 8 + 9 + 6 + 7 + 10 + 7 + 10 + 8 + 6 + 6, arriving at a total of 77.  Since there are 10 students in this class, the class average on this homework assignment is 77 / 10, or 7.7.  In fact, this is the value that you can see in the third figure from last week’s blog topic.

What may not be obvious is how PowerPivot treats non-numeric data.  In fact, the function returns blanks if the column contains any text values because the aggregation does not know how to handle text.

Furthermore, any cell that is either empty or contains a logical value will be ignored, even though logical values are represented by either a ‘0’ or a ‘1’.  If Wilson did not turn in Homework 1 and the teacher left the cell empty, the sum of the other students’ grades would be 71, but AVERAGE divides that 71 by 9 instead of 10, because only 9 grades were entered, resulting in a value of 7.89.

However, if the teacher enters a value of zero for Wilson’s missing homework assignment, that cell is counted not only in the sum (where, of course, it adds nothing) but also in the count of values.  Our total would again be 71 instead of 77, but we would divide the sum by 10 to get an average of 7.1.

Thus, whether the teacher enters a zero for a missing assignment or just leaves the value blank can have a major impact on the average class grade.
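To double-check the arithmetic above, here is a small Python sketch that mimics how AVERAGE skips blanks (represented here by None) but counts explicit zeros:

```python
# Mimicking DAX AVERAGE's treatment of blanks: a blank cell (None here)
# is skipped entirely, shrinking the divisor, while an explicit zero
# still counts toward it.  The scores are the Bears' Homework 1 grades.

scores = [8, 9, 6, 7, 10, 7, 10, 8, 6, 6]     # all ten grades entered
blank  = [8, 9, 6, 7, 10, 7, 10, 8, 6, None]  # last grade left blank
zeroed = [8, 9, 6, 7, 10, 7, 10, 8, 6, 0]     # last grade entered as 0

def average(column):
    """Skip blanks, like DAX AVERAGE."""
    present = [v for v in column if v is not None]
    return sum(present) / len(present)

print(average(scores))           # 7.7   (77 / 10)
print(round(average(blank), 2))  # 7.89  (71 / 9)
print(average(zeroed))           # 7.1   (71 / 10)
```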

So what happens if I use the AVERAGEA function instead of the AVERAGE function?  First, rest assured that if all of the students had numeric grades entered for the assignment, the AVERAGEA function would return the same result as the AVERAGE function.  What happens when Wilson does not turn in her assignment?  If the teacher enters a ‘0’, the average using AVERAGEA is still 7.1.  If the teacher skips over Wilson, leaving the assignment grade blank, AVERAGEA treats the blank as a zero and still calculates an average of 7.1.

However, another side effect of using AVERAGEA is that logical values are also included in the calculation of the mean.  A logical TRUE evaluates to a value of ‘1’ and a logical FALSE evaluates to a value of ‘0’.  Furthermore, any non-numeric text also evaluates to ‘0’.  In each of these cases, the count of values in the series is incremented by 1, just like a regular numeric value.  The net result is that for any column that could contain non-numeric text or logical values, AVERAGEA may return a lower average than if it ignored those values.  On the other hand, AVERAGEA treats a blank field as a value of ‘0’, which in many cases may be exactly what you want.
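The coercion rules just described can be sketched in Python (again, an illustration of the behavior, not DAX itself):

```python
# A sketch of AVERAGEA's coercion rules: blanks, FALSE, and non-numeric
# text all count as 0, TRUE counts as 1, and every such cell still adds
# one to the divisor.

def averagea(column):
    def coerce(value):
        if value is None:              # blank -> 0
            return 0
        if isinstance(value, bool):    # TRUE -> 1, FALSE -> 0
            return 1 if value else 0
        if isinstance(value, str):     # text -> 0
            return 0
        return value
    values = [coerce(v) for v in column]
    return sum(values) / len(values)

# Blank or zero, the result is now the same:
print(averagea([8, 9, 6, 7, 10, 7, 10, 8, 6, None]))  # 7.1
print(averagea([8, 9, 6, 7, 10, 7, 10, 8, 6, 0]))     # 7.1
```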

If you understand this fundamental difference between AVERAGE and AVERAGEA, you might be able to deduce the difference between the other functions listed at the top of this article.  This difference can be especially tricky when using the MINA and MAXA functions when logical values exist in the column.  In the case of using MINA with a series of positive numbers, a logical value of FALSE could result in a zero return value and a logical value of TRUE could result in a return value of ‘1’ assuming that all the other values in the series are greater than ‘1’.
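The same coercion explains the MINA/MAXA surprise; here is a quick Python sketch of the rule described above:

```python
# The coercion in a MINA/MAXA setting: a FALSE hiding in a column of
# positive numbers drags the minimum down to 0.  Illustrative Python,
# not DAX.

def coerce(value):
    if value is None:
        return 0
    if isinstance(value, bool):
        return 1 if value else 0
    if isinstance(value, str):
        return 0
    return value

series = [8, 9, 6, 7, False]
print(min(coerce(v) for v in series))  # 0 (the FALSE)
print(max(coerce(v) for v in series))  # 9
```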

This is a case where knowing the application that generates the data file can be very important.  If you know that the application teachers use to enter grades will allow only numeric values in the grades field and that the program requires the teacher to enter something, even if it is a zero for a missing assignment, you may want to stick with the functions AVERAGE, MIN, MAX, and COUNT.  On the other hand, if you know the program allows the teacher either to enter a numeric grade or to leave the entry blank, you may want to use the AVERAGEA, MINA, MAXA, or COUNTA functions, since a blank value then acts exactly like a value of zero.

C’ya next time when I look at the ‘X’ functions.

Bringing Computing to the Masses

For the past year or more, I’ve been calling PowerPivot the BI tool for the masses.  I don’t know who I heard say it first.  It very well could have been someone at one of the SQL Saturday events I’ve attended.  However, the point is that I recently read an article that talked about how a new product at the time, WebMatrix, would revolutionize the way the average person could build their own web site.  Microsoft released WebMatrix as a tool that would enable the average businessperson to create their own web site.  This article was several years old, but I believe that we can say that this prediction has not come true, at least not yet.

The author went on to relate how similar products were introduced to similar fanfare and ultimately failed as well.  He started with COBOL, a business-oriented computer language built to read like English.  Salespeople promoted COBOL as a way for the average businessperson to write their own reports if not their own applications.  Decades after making this promise, we now know that the potential was never realized and COBOL has fallen to become a minor product in today’s world of new application development.

COBOL was not the only product to fail.  Countless report writers have come and some have gone like Crystal Reports and SQL Server Reporting Services.  Each time we would again hear the promise that this tool was the one that would now allow end-users to create their own self-service reports.  I think it is safe to say that these tools never became end user tools.  However, they did become very powerful and useful tools in the hands of the professional software developers who could now develop reports faster and build more complex reports than they could before these tools.

Another good example is the long list of fourth-generation languages that were supposed to make software development so easy that anyone could do it.  These tools also found their way into the hands of professional developers to make them more efficient, but failed to turn the masses into program developers.  I’ve always said, ‘Real Programmers Do Use Wizards, They Just Don’t Rely On Them.’

Will PowerPivot and Power View be any different?  Perhaps not.  Recently I have been emphasizing in my PowerPivot presentations at SQL Saturday events that PowerPivot does greatly enhance the ability of Power Users in your organization to analyze data in ways they could not easily do before.  However, the task of building the basic pivot table model, the work that is done in the PowerPivot window to connect to one or more data sources, define custom columns, define hierarchies, create new measure calculations, etc. will most likely remain in the hands of the DBAs or database analysts.  The idea of understanding table relationships, DAX expressions, and other BI topics will probably not filter down to the average end user.

With a little work and some training, it may be possible to teach some of the more technically adept department members how to manipulate dimension attributes on a pivot table with different measures, add a bit of formatting to the page to make it look pleasing and save the result to a web page for collaborative sharing or to PowerPoint for a presentation to management.  However, I really don’t see even power users building the connections to different databases, linking tables, building new fields, and performing other tasks to make the data available for reporting.

Does this detract from the statement that PowerPivot is BI for the Masses?  From my point of view, it does not.  Compared to building cubes in SSAS (SQL Server Analysis Services) and reporting on them with SSRS (SQL Server Reporting Services), PowerPivot tables in Excel are easy to use and easy to format into custom reports.  However, I’m not an end user and you probably are not either.  Think of PowerPivot as a new tool in your toolbox that lets you respond to user requests.  If the tool helps you to answer their questions faster, it is a winning tool.  If you can train users to make simple changes to the pivot table in a tool they already know, Excel, they might be able to generate some of their own reports by themselves.  If nothing else, you have at least opened their eyes to what is possible, even if you have to do it for them.

However, they will still need you for those reports that don’t fit the current pivot table model you have given them.  They still need you to add that new table from a new application into the model so they can ask additional questions with the new measures and/or dimensions.   They may even still need you to build a more robust model in SSAS and display it in SSRS because they need features and capabilities not found in PowerPivot.

Still, if PowerPivot makes more people aware of how to use business intelligence concepts to support their day-to-day business needs, it still moves us all forward to be just a bit more efficient and a bit more intelligent in our decisions and that is what BI is about.

C’ya next time.

Statistics: PivotTable Style, Part 1

In my continuing exploration of the new PowerPivot 2012 running on Windows 8 with Excel 2013, I ran across a collection of statistical functions that have been added to DAX.  In the past, you could display a few basic statistics on a pivot table such as SUM, COUNT, AVG, MIN, and MAX, but these are not always enough to tell you what is really going on.  For example, it is entirely possible to have two totally different series of numbers, such as grades on a test, that have the same average value (mean) but totally different distributions.  What do I mean by distributions?  Well, let’s take a simple example of only 5 values.  Suppose you had this set of test scores:

70, 75, 80, 85, 90

I kept the numbers simple so that you can easily see that the average of these numbers comes to 80.  However, I could also have this set of test scores:

60, 70, 80, 90, 100

This series also has an average of 80.  However, it is plain to see that the distribution of the first set of numbers is tighter around the average than the second set of numbers.  Therefore merely reporting the average score does not really tell you the entire story about how students may have performed on the test.

That is where the concept of standard deviation comes into the picture.  I’m not going to go into the way standard deviation is calculated other than to say that it takes the difference between each number and the mean of the series, squares each difference to eliminate negative values, sums those squares, and divides by the number of values to come up with a statistic called the variance.  The square root of the variance is the standard deviation.  The smaller this number is, the tighter the series is around the mean, and of course vice versa.

Knowing the standard deviation, you can also say something about the probability of additional values being close to the mean.  For example, the probability of being within 1 standard deviation of the mean is about 68.3%, divided evenly on both sides of the mean.  If you want 95% confidence, you can use 2 standard deviation units from the mean.  Actually, that gives you about 95.4% confidence.
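You can verify these claims about the two test-score series with Python’s statistics module (pstdev computes the population standard deviation):

```python
# Both series above average 80, but their spreads differ, and the
# population standard deviation captures that.
from statistics import mean, pstdev

tight = [70, 75, 80, 85, 90]
wide  = [60, 70, 80, 90, 100]

assert mean(tight) == mean(wide) == 80

print(round(pstdev(tight), 2))  # 7.07
print(round(pstdev(wide), 2))   # 14.14

# Roughly 95% of values lie within two standard deviations of the mean:
low  = mean(wide) - 2 * pstdev(wide)
high = mean(wide) + 2 * pstdev(wide)
print(round(low, 1), round(high, 1))  # 51.7 108.3
```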

So let’s see how you can include standard deviation in a pivot table.  I’m going to begin with a table I generated for some other SQL Saturday events that includes student grades by class within a grade level as shown in the following figure.

It is important to include, as columns, the number series that you want to calculate statistics for.  In this case, I clearly want to calculate statistics on individual test scores.  If I wanted instead to calculate statistics across test scores for individual students, I would have to exchange my rows and columns so that the students appear as the columns.

To simplify what we are looking at (and because the average grade for an assignment by a single student does not really make sense), I’ve collapsed the student name level to show only the grade and class levels.

So where do we calculate the score averages?  I am going to create a new measure in the scores table using the DAX function AVERAGE as shown in the following figure.  (Don’t know how to create a new calculated measure?  Look at last Saturday’s blog entry.)  When you begin typing the AVERAGE function, you will see in the context help that it expects a column name as its one and only parameter.  There are some other variations on the AVERAGE function that I will cover next time.  For now, let’s just stick to this version.

Similarly, I can use the STDEV.P function to find the standard deviation of a population.  There is also a STDEV.S function to find the standard deviation of a sample.  What is the difference?  If you are including all of the data, that is a population; if you are including only a subset of the data, that is a sample.  (Computationally, the sample version divides the summed squared deviations by n-1 instead of n.)  Since I am including all of the students in the class and want the standard deviation of the class, I can use STDEV.P.  If I used this class as a sample for all classes within the school, I would have to use STDEV.S.
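Python’s statistics module draws the same population-versus-sample distinction, which makes the numeric difference easy to see:

```python
# The population version divides the summed squared deviations by n;
# the sample version divides by n-1, so it comes out slightly larger.
from statistics import pstdev, stdev

scores = [70, 75, 80, 85, 90]
print(round(pstdev(scores), 3))  # 7.071  (population: divide by n)
print(round(stdev(scores), 3))   # 7.906  (sample: divide by n-1)
```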

Once I create a new calculated measure, I can switch over to my pivot table, find my calculated measures in the table that contains the test values, and include them in the pivot table instead of other measures, as shown in the following figure.

Just for the sake of completeness this week, I’ll also calculate the variance of the scores with the VAR.P function, using an expression analogous to the STDEV.P measure above.

In the following figure, I’m showing my three calculated measures: AVERAGE, STDEV.P, and VAR.P for the four homework assignments by class and grade.

You might notice that a couple of columns are missing (columns B, C, and D).  Actually, I’m using another new feature of PowerPivot 2012 to freeze the first column while I scroll to the right to see my additional columns.  In the prior version, it was annoying to lose the first column, which defined the row, as you scrolled to the right to see the rest of your columns.  Excel has had a feature to freeze columns and/or rows for some time.  In PowerPivot 2012, this feature can now be used within your pivot table as well.  To do this, just navigate to the View ribbon and select Freeze First Column as shown below.

Now let’s calculate a 95% confidence range for each of the classes and grades.  As I mentioned earlier, you need approximately two standard deviations on either side of the mean to define a 95% confidence range.  We can easily calculate the lower and upper bounds of this range by using a simple expression based on the DAX functions directly as shown below.
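A sketch of the lower limit written directly against the base functions might look like this (the Scores table and Score column names are assumptions):

```dax
// Lower bound of the 95% confidence range, calling the base functions directly
LowLimit:=AVERAGE(Scores[Score]) - 2 * STDEV.P(Scores[Score])
```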

Or we can use the calculated measures directly in a new expression as in:

LowLimit:=[AvgScore] - 2 * [StdDev]

Which method is better?  The latter, of course, since any changes to the expression can be made in one place, and because the calculation is performed once and the cached value is reused.

Note: You must use the square brackets around the measure name or you will get an error when trying to save the expression definition.

I would, of course, use a similar expression for the upper limit of the 95% confidence range, adding two times the standard deviation to the mean instead of subtracting it.
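Under the same assumptions about the measure names, the upper limit simply flips the sign on the standard deviation term:

```dax
// Upper bound of the 95% confidence range, reusing the cached measures
HighLimit:=[AvgScore] + 2 * [StdDev]
```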

I could then use these measures in my pivot table to show the average, standard deviation, and 95% confidence interval for each class and grade in my test score database as shown below.

That’s it for this time.  Next week I’ll take a look at some of the other statistical functions available in DAX for PowerPivot 2012.

C’ya next time.

Slowly Changing Dimensions are Not Just for BI

In every book you read about Business Intelligence (BI), you will find mention of a concept called slowly changing dimensions.  The concept is that every attribute in every dimension falls into one of three possible change types.  How you handle these attributes as they change determines how robust your solution is.  While I maintain that these concepts existed long before BI became popular, let me first refresh your memory about what defines each change type.

The first change type is referred to as Type 0.  In many ways that is very appropriate because it represents attributes that never change.  You might also think of these attributes as having fixed values.  In one book I was reading, gender was given as an attribute that never changes.  While that book is not that old, it could be argued that two gender values no longer apply, and furthermore that when a record references one of those genders, it is not necessarily fixed forever.  Perhaps a better example of an attribute that never changes is an employee’s birthdate in the Employee table.

The second change type refers to attributes that can change, and when they do change, we really do not care about the old value.  A good example of these Type 1 attributes might be the phone number of a customer in your order entry system.  When a customer gets a new phone number, you update the customer record by simply replacing the old number with the new one.  You probably have no reason to keep track of what the old number was.  You never run reports based on sales by phone number, and obviously, you would never need to call the old number again.  In this case, one attribute value simply replaces another; there is no need to track history, only to keep the current data.

A Type 2 change type differs from a Type 1 change type in that you do want to track the changing values of the attribute.  The best example I can think of for a Type 2 attribute is a product price.  In any sales system, I can guarantee that the price of products will change over time.  Some organizations may change prices more frequently than others, but whether they change daily or once every other year, the fact remains that you must be able to reference the price of each product based on when the product was sold so that historical reports accurately represent sales.

Typically, the DBA or programmer accomplishes this tracking of different sales prices by adding a new record to the product table each time the sales price changes.  Each record includes two additional fields not generally found with Type 0 or Type 1 attributes: a start date and an end date.  When a product is first entered into the product table, the start date is generally set to the date the record is added and the end date is left blank or NULL, or is sometimes set to a far-future date like 12/31/2999 to avoid having to deal with NULL and blank values.  Then, when the product price changes, the date of the change is written to the end date field of the original record and a new record is added for the product with the current date as the start date.  Because a product can have multiple records in the table, the product ID is generally not used as the primary key.  Rather, an identity column (auto-increment) is added to the table to provide uniqueness, while the current price of a product is the record with the blank, NULL, or future end date.
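To sketch the idea in DAX terms (the Products table and its EndDate column are hypothetical), a calculated column in PowerPivot could flag the current record of each product by testing for the blank end date:

```dax
// Calculated column on the hypothetical Products table:
// TRUE only for the row holding the current price of a product
=ISBLANK(Products[EndDate])
```

If your design uses a far-future end date like 12/31/2999 instead of a blank, the test would compare against that date rather than use ISBLANK.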

Sounds straightforward, right?  I’ve been creating programs since… well, let’s just say a long time, and this concept existed even back in my early database work with FoxBase.  We had to have ways to track changes to certain key values in some of the tables we used.  We may not have had the formal definition of a ‘slowly changing dimension’ as it appears in today’s business intelligence systems like SSAS (SQL Server Analysis Services), but I remember having to use the concept of a start date and an end date to identify which record to use for reports based on the date in the ‘fact’ table, such as a sales table.

There is actually another type, not detailed here, that is similar to Type 2 and may be called Type 3.  It also requires that the database track changes by adding a record each time a change occurs and using a start date and an end date to specify the valid time range for each record.  The difference is only that it truncates history after a specific number of changes.  In other words, the table retains only the last x changes to the dimension values.  Frankly, I am not a big fan of this technique because I feel uneasy when data is lost like that.  However, if a business is only required to report on data from the past seven years, then rather than counting the number of changes the system retains, it may make sense to drop records that are older than seven years in order to conserve space.

Do you have experience with ‘slowly changing field values’ before the concept was popularized in Business Intelligence books?  If so, add your comments to this entry.

C’ya next time.

A PowerPivot Sort By Any Other Name

The past several weeks I have been looking at the new features of PowerPivot 2012.  I also mentioned this past Tuesday that the new Microsoft Office 2013 version of Excel includes both PowerPivot and Power View as part of the install.  They just need to be turned on.  So this week I’m continuing my Contoso example with Excel 2013 and PowerPivot running on Windows 8.  Therefore, you may notice some differences in the look of the Excel spreadsheet.  Don’t fear.  The techniques I discuss here work equally well if you are using Vista with Excel 2010 and PowerPivot 2012.

Today I am going to look at a problem that we solved a different way several months ago.  That problem is how to sort month labels chronologically rather than alphabetically.  Back on November 6, 2011 (Wow, was it really that long ago?) I described a way to create a custom sort order for months.  While that method still works in PowerPivot 2012, there is a new way to sort a column by using another column.  Let’s see how that works.

I begin by opening my Contoso 2012 sample data and creating a simple pivot table that shows total sales by month.  As shown in the figure below, when I drag the calendar month label field into the rows, the rows appear by default sorted alphabetically.  Unfortunately, that is not a good sort order for months.

To fix this sort order, return to the PowerPivot window and open the tab (table) containing the dates.  In Contoso, this is the dimDate table.  In the Home ribbon, find the Sort and Filter option group.  A new button in this group is labeled Sort by Column.  You could immediately click this button to open the sort dialog.  However, I recommend first clicking on the column for which you want to define the new sort order, then clicking the Sort by Column button.

This action opens the following dialog and pre-populates the Sort column with the column that you selected.  This saves you a step.  However, if you forget, you can always use the dropdown arrow to display all fields in the current table and select the column from the list.

Next, select the column you want to sort by and click the OK button.  You may be tempted to sort by the existing column CalendarMonth.  However, this generates an error because there are multiple values of CalendarMonth for each value of CalendarMonthLabel, since CalendarMonth has the format YYYYMM.  In other words, July would match not just 07 or 7, but 200507, 200607, 200707, and 200807.  The sort column cannot map to multiple values in the sort-by column.

Therefore, I created a simple calculated column named Month which only consists of the month number as shown in the next figure.
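If dimDate has a date column such as Datekey (the column name is an assumption; it may differ in your copy of Contoso), the calculated column can be as simple as:

```dax
// Calculated column named Month on the dimDate table:
// extracts just the month number (1-12) from the date
=MONTH(dimDate[Datekey])
```

Because every row with the same CalendarMonthLabel now has exactly one matching Month value, the sort-by relationship works.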

Now I can return to the Sort By Column dialog and create a sort relationship between CalendarMonthLabel and Month as shown in the following dialog.

When I click OK and return to my pivot table in the Excel spreadsheet, the table automatically refreshes and now displays the month labels in chronological order, which makes sense for most users.

That’s it for this week.  A short, simple, but very useful way to sort any column by another column in PowerPivot 2012 or Excel 2013.

C’ya next time.