Using DAX to Join Tables on Multiple Fields


Last year, October 29th to be exact, I published a blog that I called “Getting Started With DAX Functions”.  It was originally suppose to be the start of a series of articles on DAX, but then I got side-tracked.  Between work issues and personal issues that served as great fodder for discussion, the rest of the DAX series got lost as I covered several Excel and SharePoint related topics.  In fact, I was about to launch into a another set of SharePoint web parts for displaying data different ways when I remembered that I never finished the DAX series.  To make completing this series even more important, I just received word that I will be doing a session at the Orlando Code Camp at the end of March on using DAX to create Date and Time dimension data.  As I went back to the DAX article from last October, I realized that I may have started at the wrong point.

In that first article, I jumped right into the date and time functions supported by DAX.  But perhaps we should have covered some additional fundamentals like why does DAX even exist and where did it come from?

The simple answer to that question is that DAX is a set based language that helps you manipulate data in Microsoft’s Vertipaq engine.  That’s what makes PowerPivot work in Excel, but we will get back to that a little later.  The purpose of pivot tables is to allow the user to easily analyze facts and compare the value of those facts as you change either horizontal or vertical parameters also known as dimension.  For example, if you look at football scores, the total points made by each team is a fact.  Some obvious dimensions are the names of the teams, the dates of the games, and even the names of the players who scored the points.  By using different dimensions to define the vertical columns and horizontal rows, you can analyze how teams performed as a function of the date, how players performed as a function of the date, or even how players performed as a function of the team they were on or even better, the team they played against.   For a company, a fact may be sales while dimensions might include the date, the product, the type of store the product was sold in, the state, or even the region of the country where the stores are located.  You might even want to compare sales to weather.  For example, Disney probably sells a lot more rain ponchos on rainy summer days in Florida than dry spring days.

So that is what a pivot table is.  How did they get started?  Actually pivot tables were not initially created by Microsoft.  Rather a simple form of a pivot table first appeared in Lotus Improv for the NeXT computer in 1991.  Other versions became available for PCs but it was not until 1993 that Microsoft Excel 5 introduced a new feature called a pivot table.

Without going through all of the versions of Excel to see how pivot tables have evolved becomming more powerful, let’s just jump to Excel 2010.  By this time, Excel supported worksheets with up to a million records.  This also defined the limit for the number of rows a pivot table could use as its source.  Excel basic pivot tables can work with data from only a single table, although that table could come from a variety of data sources.  But getting data from a single table probably means a trip to your friendly DBA with a box of fresh donuts to ask them to build for you a single table with all of the measures and dimensions you would need in your pivot table.  If they had to gather data from multiple data sources or even multiple databases, you might have to bring two or more boxes of donuts.

PowerPivot for Excel 2010 solves many of these problems by letting you work with individual tables with more than a million records.  In fact, some Excel experts such as Bill Jelen claim that it is possible to have a pivot table with 900 million records.  Of course there still are limits, but these limits are based more on current physical file storage limitations of the compacted Excel workbook rather than limits of the underlying Microsoft engine for PowerPivot named Vertipaq.  This strange sounding name actually has meaning in that the data for a PowerPivot table is compressed based on common column values across the rows rather than the typical database compression that tries to compress individual rows.  Since columns are considered vertical in a grid, the name Vertipaq for vertical packing/compression seemed like a natural choice.  BTW, Rob Collie can probably give you a much better explaining of how Vertipaq compresses data.

But PowerPivot does not stop there.  It allows you to use multiple data tables eliminating the need to build custom single tables with all the data.  That in itself should save you a lot of money by not having to buy all those boxes of donuts.  In fact, you can connect to and link these tables together using relationships just as you would in a regular relational database.

But the fun doesn’t stop there. These multiple tables can come from different data sources.  You can combine data from a SQL Server database with data from Oracle, Access, Excel spreadsheets, text files and at least a dozen other common data sources.  All you need is a common linking field from each table to create the relationship.

This last statement actually hints at one of the limitations of the current PowerPivot implementations.  Relationships between tables can only be created using a single column from each table.  What if the relationship is more complex than a single column requiring two or more columns?  That is one of the ways DAX can come to your rescue.

The following figure shows a portion of a table that holds the schedule for college football games.  No one column defined a unique or primary key.  For example, there are multiple records for each institution (team) since they play more than one game.  Also there are multiple records for each game date since more than one game is played on most days, especially weekends.

If we had two or more text columns, we could just concatenate the two columns to create a calculated column.  But in this case, we have a text column and a date column.  Fortunately for us, PowerPivot can perform some automatic data type changes (typically to the text type) so that we can create a calculated column from two fields and PowerPivot automatically converts the game date to a string before appending the institution name to it to provide a unique value (any institution can only have one game on any one day).

To create a calculated column, scroll horizontally to the last column on the right of the table and click in the header row of the first empty column.  Here you can enter the name for the new field.  Since this is just going to be a connecting column, I’ll name it: InstitutionGame.  To perform the concatenation, click in the first row beneath the header and enter an equal sign “=”.  This tells PowerPivot you want to define a calculation.  For this simple calculation we can simple enter the field references for the two fields we want to concatenate.  While there are several ways to enter field references, I prefer to begin by entering the table name.  Immediately a context sensitive dropdown appears with the field names within all tables that begin with the letter(s) you entered so far.  Simply enter more letters or scroll down through the list to highlight the field you want and press the Tab key to select that field.

After selecting the first field, enter the concatenation operator “&” and then repeat the previous steps to select the other field.  In my case, I see the following expression being built in the Expression box.

=DIVISIONBSchedules[Game_Date]&DIVISIONBSchedules[Institution]

When I press Enter, PowerPivot calculates the values for the entire column.  The calculation is really a set operation across the entire table rather than a cell-by-cell calculation as it would be in tranditional Excel, even simple Excel pivot tables.  However, the calculation is amazingly fast.  Next I would switch to the Division Offense table to get information on points scored.  Here I would create a similar calculated column using the corresponding game date and institution name fields for this table.

With both calculated columns defined, I can click on the Design ribbon and click on Create Relationship in the Relationships group

Since I left the focus of my cursor in the first row of the calculated InstitutionGame column of the DIVISIONBOffense table, the relationship dialog prefills this information into the first table and column name as shown below.

Next, select DIVISIONBSchedules as the related lookup table.  Since we named the calculated column the same, the Create Relationships wizard automatically supplies this field name as the Related Lookup Column. If the related column names do not match, you must use the dropdown to select the related lookup column.

In this case, we defined the direction of the relationship correctly.  However, even if we had started with the DIVISIONBSchedules table and tried to relate it to the DIVISIONBOffense table, the wizard is smart enough to recognize this situation and automatically correct it after popping up an informational warning that you got the relationship backwards.  (What a friendly wizard!)

At this point, you have used DAX to relate two tables using more than one column to define the relationship.  But that is not all DAX can do.  In the coming weeks, we will look at some other ways you can use DAX.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s