Power BI Desktop – Your First Hour

For those of you who have been reading my blog for a while, you know my interest in PowerPivot, Power Chart, and Power View goes back several years. (If you don’t, you have a lot of old blog entries to catch up on.) Anyway, I’ve been spending some time recently looking for the next big thing in data analysis. Several others have suggested that Power BI might be it, but I spent some time looking around before agreeing and picking up Power BI to see what it could do. I’m still getting acquainted with it, but I like it enough to take you along on my journey. So first I need to help you catch up a little.

The first thing you will need to do is to get a copy of Power BI Desktop installed on your local computer. Note that I said local computer, not your server. In fact, I’ve been running it quite successfully on a Surface Pro 3 with Windows 10 and SQL Server 2014. If you do not have SQL Server on your desktop but have it on a server that you can access, that will work just fine. For today though, you will not even have to worry about SQL at all.

To download and install a copy of Power BI Desktop, go to https://powerbi.microsoft.com/desktop.

You can also go directly to the download page at: http://www.microsoft.com/en-us/download/details.aspx?id=45331. You will see the system requirements on this page and you may notice that Windows 10 and SQL Server 2014 are not listed. That was all the more reason to try it and I can tell you that it appears to work just fine.

However you get to the download page, start the download and then install the application on your desktop.

After the desktop is installed, it should start automatically. If it does not, go to your Start menu and look for an entry that says Power BI Desktop. On Windows 10, this appears in a separate section along the left side under Recently Added. The installer may also create an icon on your desktop and possibly even add the application to your taskbar across the bottom of the screen in Windows 8 through 10. However you start it, you should see the following dialog, which appears on top of the desktop application filling the background:

Along the left side are options to get data. We will see and explore some of the many ways you can get data from different sources over the coming weeks. You can also return to recent sources that you used. Next is a menu item that says ‘Open Other Reports’. Desktops with their various charts, tables, and other visualizations are referred to as Reports and are stored with a .pbix extension. If you have previous reports that you worked on and saved, they also appear in this section, allowing you to click on them to open them directly.

There are also several videos and tutorials listed in this dialog to help you get started. I strongly recommend that you watch these videos and perhaps read some of the tutorials. In fact, there are many more videos and tutorials for Power BI Desktop. Another good page to help you get started is the Microsoft Power BI Support page at: https://support.powerbi.com/knowledgebase/articles/471664.

Microsoft provides several sample databases and in future weeks, I will probably use the ContosoDW databases that I previously used for many of the PowerPivot blogs. But for today, let’s just try something to amaze your co-workers.

Almost everyone knows the site www.imdb.com. Most people use it to look up information about movies and the actors in them, but the site also has information on television shows. Using the new Edge browser, I went to the site and discovered two new pages of interest: one displays the top 250 movies and another shows the top 250 TV shows. Just because everyone does movies, I went to the top 250 TV shows page to find out which year had the most highly rated shows. The following image shows a little of that page.

But what I’m really interested in is the URL. I can select the page URL from the Address box at the top of my browser. It should look something like the following:

Next I go back to my Power BI Desktop and click the Get Data icon in the External Data group of the home ribbon. Like many buttons, this one has a top portion and a lower portion. The bottom portion will open a dropdown of common data sources as shown below:

However, I can also click on the top portion of the button to display the following dialog box. This box shows categories of external data sources on the left and the names of specific data source types on the right. Since I want to get data from a web page, I need to specify a Web source. I could check each category just to see what was available and eventually find Web in the Other grouping as shown in the following figure, or I could have selected Web from the previous figure. Either way takes me to the same place. (But it is interesting to see all the possible data sources, isn’t it?)

After selecting Web, I am prompted to enter the URL of the web page that holds my data. This is where I paste the URL that I captured previously when I displayed the page of top 250 TV shows.

When I click OK, Power BI analyzes the page and displays objects that might be something I’m interested in.

Of course, names like Table 0 and Table 1 have little meaning. However, by selecting any of the objects, Power BI Desktop displays a preview of the contents of the object on the right side. This is how I ‘discovered’ that Table 0 contains the data representing the top 250 TV shows, as shown below.

When I click Load, Power BI Desktop loads the data into its local data model. In many cases, that model may contain columns that we do not need. It may also contain columns of concatenated data, as in this case, where the show ranking, show name, and show year appear concatenated together. We will need to fix this. Ultimately, I want to get to a table that looks something like this:

However, that will be next time. Have a great Labor Day weekend and get Power BI Desktop installed on your machines because next time we are going to start manipulating the data and creating visualizations.

C’ya.

Your Logical Data Model is not Normal

If you have been reading this blog over the last several years, you know that I have been a strong supporter of using PowerPivot and the Tabular model to perform data analysis, even for power users, not just DBAs. What you may not have realized is that I’ve been secretly teaching you a little about data modeling. What is data modeling? It is really nothing more than the collection of all the data in tables and the relationships between those tables in a database. Did you know that there are at least two major ‘types’ of data models when it comes to how you structure your data into tables? Each one serves a different purpose. Therefore, it can reasonably be argued that neither one is more correct than the other. But you need to understand when to use each type. That is what I want to discuss today.

Most DBAs and developers who work with databases like SQL Server, Oracle, and many other common databases are comfortable with the relational model for creating tables and defining the relationships that connect them. In fact, they immediately start to normalize a denormalized database in their heads within seconds of seeing the data schema. The relational model relies on the application of the rules of data normalization introduced by Edgar F. Codd in 1970. In essence, the goal of the rules of data normalization is to minimize data redundancy, which also has the effect of decreasing the overall size of the database while making it easier to maintain information that would otherwise be repeated through many records.

There are three primary rules that are used to determine whether a database, a collection of tables, has been normalized. These rules are:

First Normal Form: A table may not contain repeating groups of data within a row. Such repeating data must be split into a separate but connected table. For example, a sales order may contain the purchase of one or more items. Because the number of items included in a sales order is not predefined, the items must be split into a separate table with one row for each item on the sales order. These two tables are then typically connected by an order id.

Second Normal Form: This rule only applies to tables which have a compound primary index, an index built from two or more fields. Under this rule, all other fields in the table must depend on the entire compound index value, not just a portion of it. A possible example of this might be a table of students at a school in which the primary index combines the school name (or id) with the student name (or id). Imagine that the table also includes the address and phone number of the school. This information does not depend on the combination of the school and the student; it depends only on the school. Therefore, this violation of the second normal form requires that the data related only to the school be split into a second table that includes only school information.

Third Normal Form: This rule requires that every field not part of the primary index depend only on the primary index and not on other non-key fields. Going back to my favorite Contoso database, one could argue that in the FactSales table, the Sales Amount field is redundant since this table also includes the sales quantity, sales price, and any sales discounts or returns. Why is it redundant? Because it can be calculated from the values of the other columns in the record. Therefore, to fully achieve third normal form, this field should be removed.
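
If you did remove the stored Sales Amount column, you could always recompute it on the fly in the PowerPivot model instead. Here is a minimal DAX sketch of such a measure; the column names (SalesQuantity, UnitPrice, DiscountAmount, ReturnAmount) are my assumptions based on a Contoso-style FactSales layout, so adjust them to your own schema:

    -- Recompute the sales amount row by row instead of storing it as a column.
    -- Column names are assumptions based on a Contoso-style FactSales table.
    Sales Amount :=
    SUMX (
        FactSales,
        FactSales[SalesQuantity] * FactSales[UnitPrice]
            - FactSales[DiscountAmount]
            - FactSales[ReturnAmount]
    )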

While there are other special case rules that can be applied to normalization of a database, most DBAs will be content with a database that satisfies these three rules. They will then build the tables in their database corresponding to these rules and thus create the physical data model. It is called the physical data model because it defines the physical schema of the tables and the relationships between them.

However, business users of the data don’t look at the database that way. In fact, most business users would be baffled by the large number of tables required and the relationships between them. They will not understand why they need to combine data from a half dozen to a dozen tables just to answer a single query question. As an example, let’s look at how a business user might think of just the products within the Contoso database.

The figure below shows the normalized data model for products and their subcategories and categories. As you can see from the schema, the only important piece of information in the subcategories table is the name of the subcategory. Similarly, the only important piece of information in the categories table is the name of the category.

Most business users would not think of category and subcategory names as part of separate tables, but as attributes of the product itself. In fact they would think of product information more like the following figure.

Imagine a database model with dozens of normalized tables and then try to envision how the typical business user sees the data with perhaps only a half dozen tables after denormalization.

In PowerPivot, we can address this issue in one of two ways. The first way would be to add the category and subcategory names to the product table in SQL Server before loading the data into PowerPivot. This would essentially make the physical data model the same as the logical data model used by the end-users. However, this data model would no longer be normalized.

Physically denormalizing the data tables is not the only solution. I could, as shown before in https://sharepointmike.wordpress.com/2012/07/21/power-pivot-hierarchies-part-2-of-powerpivot-2012-series/, build two calculated fields in the product table that use the RELATED() function to get the category and subcategory names and display them in the product table. I could then hide the two tables, dimCategory and dimSubcategory, from the user so that they would see a structure similar to their expected denormalized logical data model even though the data is still physically stored in a normalized structure.
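
For what it is worth, those two calculated columns look roughly like the sketch below. The column names SubcategoryName and CategoryName are my assumptions, so substitute whatever your subcategory and category tables actually use:

    -- Calculated column 'Subcategory' added to the product table
    = RELATED ( dimSubcategory[SubcategoryName] )

    -- Calculated column 'Category' added to the product table
    -- (RELATED can follow the chain product -> subcategory -> category)
    = RELATED ( dimCategory[CategoryName] )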

The advantage of making the changes to the data model outside of PowerPivot is that it effectively reduces the amount of data that PowerPivot must store in memory, thus potentially allowing a larger model to be created. The advantage of making the changes inside PowerPivot by using the RELATED() function and then hiding the technical tables that are still needed in the model but do not need to be seen by the client is that it preserves the sanity of those DBAs who cringe every time someone tries to denormalize their data. Both methods will allow me to create similar Pivot tables in my Excel spreadsheet. Thus both methods can be considered correct.

The thing to remember is this. Data Normalization was developed to make data storage and maintenance easier and more efficient. However, data analysis and reporting often requires a fair amount of data denormalization. This is especially true when performing data analysis using reports, pivot tables and charts, and even third party data analysis tools. In fact, the logical view of the data schema aids in the performance of most data analysis. Just remember that your logical model is not a normalized model.

C’ya next time.

It’s Only the Role I’m Playing

This week I’m going to return to my favorite sample database, Contoso. If you remember, Contoso consists of sales data for a company that produces several different lines of consumer electronics, from computers to phones to TVs. The data spans several years of sales. In past examples, I related the FactSales table (the table containing all of the sales records) to several tables including Channel (DimChannel), Date (DimDate), Product (DimProduct) and Product Sub-Category (DimProductSubCategory). In fact, the data source pre-defined these relationships so that when I imported the data into my PowerPivot model in Excel, they appeared by default as shown in the following figure.

Visually, I could switch to the Diagram view of the data model to see these relationships represented by solid lines connecting these five tables as shown below.

However, suppose I have additional information in my FactSales table. The DateKey field that I have used in the past identifies the sale date for each of the sales records. In the real world, there may be additional dates associated with each sale. For example, it is not hard to imagine that each sale would also have a Delivery Date and an Invoice Due Date. Now I might ask, does it make sense to report on sales based on the sale date, or does it make more sense to report on sales based on the invoice due date or even the date of payment? Well, you might say, ‘That depends on who is asking the question.’ Very true. My Marketing/Sales manager might want to see sales by the sales date. He or she does not care about when the invoice is paid off. They just need to know if they hit their sales quotas for each month. Therefore, they might need a report that uses the default relationship and looks something like the following:

On the other hand, the Chief Financial Officer is not as concerned about when the sale took place, but when the invoice for the sale is due because only then is the income truly realized for the company. Thus the above report does not meet their needs.

Now suppose that I had a few additional columns, as mentioned earlier, that told me some other dates such as when the items were delivered and/or when the invoice for the sale was due. Let’s further suppose that these columns exist in the FactSales table and might appear like the following figure.

With these columns in my FactSales table, I would want to create relationships between them and my date table (DimDate) so that I could generate reports using them. The following figure shows the Create Relationship screen in which I define a new relationship between the delivery date (DeliveryDate) column in FactSales and the date (DateKey) column in DimDate. When I click the Create button, the Power Pivot engine creates a second relationship between these two tables. (Remember, the first relationship was between the sales date (DateKey) column in FactSales and the (DateKey) column in DimDate.)

In a similar fashion, I create a third relation between these two tables to connect the invoice due date (InvDue) column in FactSales and the date (Datekey) column in DimDate. For the purposes of this demo, I am going to stop there. However, I could create additional relationships between any other date fields in the FactSales table and the date (DateKey) column in DimDate. Switching to the Diagram view, I could now see something like the following between these two tables.

Notice that there are now three lines between FactSales and DimDate. One of these lines is solid and the other two are dashed. You can only have one active relationship between any two tables, and that active relationship is represented by the solid line. The other two relationships are inactive at the moment. These three relationships are collectively called role playing relationships, and DimDate is a role playing dimension because it can play a different role depending on which relationship is active. Think of it this way: the date column (DateKey) in DimDate can play one of three different roles. It can play the part of the sales date, the delivery date, or the invoice due date. However, it can only play one role at a time. Unless all three dates were exactly the same, I have to choose which role I want DateKey in DimDate to play with the FactSales table by selecting one of the relationships.
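
As an aside, you do not always have to flip the active relationship to use one of the inactive ones. A DAX measure can reach through a specific inactive relationship with the USERELATIONSHIP() function. Here is a minimal sketch, where SalesAmount is an assumed column name and InvDue and DateKey are the columns described above:

    -- Sum sales by the invoice due date without changing the active relationship.
    -- SalesAmount is an assumed column name; InvDue and DateKey come from the text above.
    Sales by Invoice Due Date :=
    CALCULATE (
        SUM ( FactSales[SalesAmount] ),
        USERELATIONSHIP ( FactSales[InvDue], DimDate[DateKey] )
    )

For the rest of this post, though, I will stick with switching the active relationship in the model.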

The initial data import associated the DateKey column in DimDate with the DateKey value in FactSales, which represents the date the item was sold. As long as the requested report allocates sales to the sales date, I need to do nothing other than generate the Pivot Table report as shown below.

However, if I now have to generate a report for my CFO, I would have to go into the model and change the role that DateKey in DimDate plays. I might try to simply right click on the dashed line representing the connection between the DateKey in DimDate and the InvDue column in FactSales and select Mark as Active.

This action would generate the following error message.

The problem is clearly stated. You can only have a single active relationship between the two referenced tables. Therefore, I must first right click on the current active relationship to deactivate it before activating a different relationship between the two tables.

Once the original relationship is deactivated, I can go back and activate the new relationship between DateKey and InvDue.

Knowing that only one role (relationship) can be active at a time, I am a little surprised that Microsoft did not automatically deactivate the original relationship when a new one is set active. However, like many things, they did not ask me. Plus they may have had other reasons for not automatically deactivating the current relationship. In any case, after making this change and returning to my pivot table (without making any changes to the pivot table itself), the data automatically updates using the new relationship to show the sum of sales by month of the invoice due date rather than the sales date.

If you examine the sales totals for either the years or individual months, you can quickly see that this role change for the DimDate dimension makes a significant change in the sales numbers reported each month.

That’s all for this week. I hope that now you have a better idea how you can use a single dimension to play different roles. (You could also have added the DimDate table three times to the model, once for each date column in FactSales so that each instance of the DimDate table could have a single relationship to FactSales. Then by changing which dimension you select to display in the Pivot Table, you can achieve essentially the same result. This may be easier for the end-user who does not have access to the data model to make the changes listed above. However, your power-users may prefer a simpler model with fewer tables and can change the role played by those dimensions as needed.)

C’ya next time.

Show Report Filter Pages

One of the interesting things to see when working with pivot tables is the difference between regular pivot tables, which Excel has been able to create since about 1997 with Excel 97 (actually Excel 5 in 1993 had basic pivot table functionality, but no wizard yet), and Power Pivot tables, which are a relatively recent addition (Excel 2010 with an add-in). Standard pivot table functionality still exists in Excel 2013 and serves as an alternative tool for simple pivot tables that only require a single data source and less than a million rows of data.

In regular pivot tables, you might want to create a pivot table and use one of the dimensions as a filter. In the following example, I use a version of the FactSales table ripped from Contoso and I load it directly into Excel as one of the worksheets. Then I click on the Pivot Table command from the Insert ribbon to create a basic pivot table.

I added the Sales Amount field to my Values area. Being a numeric value, Sales Amount makes a good choice as a measure for a pivot table. I then added the dimensions for store and product as my horizontal and vertical dimensions generating a reasonable pivot table. However, I also added channel as the filter.

By default, after adding a filter, the pivot table still displays all values for the filter in the table. However, by using the dropdown, I can view the pivot table with one or more of the filter values at a time.

With only four possible values, it would not seem to be a big deal to view each of the channels one at a time. However, Excel provides another way to display the pivot table: on a series of pages, with each page representing a different filter value. To do this, I can go to the PivotTable Tools menu group and select the Analyze ribbon. Then select Show Report Filter Pages… from the Options dropdown menu as shown in the image below.

When I select this option, Excel displays a dialog that lets me pick the filter I want to expand. In this case, I only have a single filter on the Channel field, so I select that filter and click the OK button.

Excel then generates a separate worksheet for each of the filter values and labels the worksheet tab with the filter value so you can easily click on a tab to view the pivot table with the selected filter value applied.

This option makes it easier to go back and forth between views of the pivot table with different filter values applied. If you give a copy of the workbook to someone who might not be as familiar with pivot tables as you are, you can expand out the pivot table by the filter values and then lock the workbook so they cannot accidentally make changes to it. It also makes it easier to simply print a series of the pages to your printer if hardcopy is necessary.

So using the same data, I attempted to add the original data to the Power Pivot Data Model by using the Add to Data Model option in the PowerPivot ribbon.

I then defined a Pivot Table from the data model using the same fields as before. When I then opened the Options dropdown menu from the Analyze ribbon of the PivotTable Tools group as I did before, I was surprised by the fact that the option to Show Report Filter Pages was greyed out as you can see below.

Apparently, this option is not available when displaying the data through the Power Pivot model, but is available from the basic Pivot table model. I’m sure there is a way to brute force some code to expand each filter value into a table in a new worksheet and rename that worksheet appropriately, but that kind of defeats the purpose of creating a “BI Tool for the Masses” which is the goal of most of my Pivot Table blogs.

C’ya next time.

The Role of Role Playing Dimensions

Perhaps you have heard the term Role Playing Dimension in regards to PowerPivot and/or Analysis Services cubes. This term refers to the ability of one dimension to be linked to more than one field in the fact table. Now strictly speaking, there is still only a single primary link between the dimension and the fact table, and that link is used as the default when aggregating data by that dimension. However, some dimensions, such as a date dimension, can often be used against multiple fields in the fact table. For my example, I will use the Adventure Works DW data set that I use in many of my SQL Saturday presentations since it represents a typical sales database. Another good sample database is the Contoso dataset.

Using the reseller sales table as my fact table (FactResellerSales), I can proceed to pull this fact table into my Excel PowerPivot model along with the dimensions for Product, Product Category, Product SubCategory, and Date. Notice that there is at least one relationship between each table and one other. This primary relationship is represented by a solid line in the figure below. However, there are three relationships between the FactResellerSales table and the DimDate table. One of the relationships is represented by a solid line and the other two are represented by dotted lines. Back in the source database, all three of these relationships exist between the two tables, and there is no indication of which one is more important than the others. So how does PowerPivot select the primary relationship between the two tables? The best I can figure out is that it is based on the order of the three different date keys in the fact table. In this case the OrderDateKey appears first and is therefore selected as the primary, or Active, relationship.

Looking at the raw sales data for a moment, I can see that the dates in the OrderDateKey, DueDateKey, and ShipDateKey columns are different, with the order date occurring first followed by the due date and then the ship date. This would make sense in the real world. Unfortunately, if I did not have the ability to use role playing dimensions for each of these dates, I would either have to deal with only a single relationship between the date dimension table and one of these dates, such as the order date, or have multiple date dimension tables, one for each relationship. If I only used a single date dimension table, I would have to pick one date in my sales table to relate that dimension to. Then any sum I calculated, such as the sum of the order amounts, the sum of the ship amounts, or the sum of the due amounts, would be associated with a single date such as the order date, even though the due date may not occur until the next month or even next year. Unfortunately, this sample data does not show many such cases. However, I will show you one case at the end of the blog.

Our pivot table would then look something like this:

However, that would not be correct since we did not ship the items on the same day as the order, nor was the amount due on the same day as the order. Therefore, we need another way to relate these two tables when performing a summing calculation. That other way is the USERELATIONSHIP() function, which lets us define, for the purposes of a single calculation, which relationship we want to use. For example, to calculate the sum of the amounts ordered, we can sum the Sales Amount field by the OrderDateKey field, along the lines of the measure sketched below.
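
What follows is my reconstruction of that measure, assuming the standard AdventureWorks DW column names (SalesAmount in FactResellerSales and DateKey in DimDate); adjust the names if your model differs. Since the OrderDateKey relationship is already the active one, the USERELATIONSHIP() call here is technically redundant, but spelling it out keeps the three measures symmetric:

    -- Ordered Amount: sum of sales tied to the order date
    Ordered Amount :=
    CALCULATE (
        SUM ( FactResellerSales[SalesAmount] ),
        USERELATIONSHIP ( FactResellerSales[OrderDateKey], DimDate[DateKey] )
    )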

When placed in the measure area of the FactResellerSales table, I can see that this calculation returns a value of over 80 million. (I will format this as currency in a moment.)

Similarly, the Shipped Amount can be calculated by summing the Sales Amount using the ShipDateKey as shown in this equation:
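
Again, this is a sketch with the same assumed column names, this time switching to the inactive ShipDateKey relationship:

    -- Shipped Amount: same pattern, using the ShipDateKey relationship
    Shipped Amount :=
    CALCULATE (
        SUM ( FactResellerSales[SalesAmount] ),
        USERELATIONSHIP ( FactResellerSales[ShipDateKey], DimDate[DateKey] )
    )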

Finally, the Due Amount can be calculated the same way using the DueDateKey field.
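
And the last of the three, under the same assumptions:

    -- Due Amount: same pattern, using the DueDateKey relationship
    Due Amount :=
    CALCULATE (
        SUM ( FactResellerSales[SalesAmount] ),
        USERELATIONSHIP ( FactResellerSales[DueDateKey], DimDate[DateKey] )
    )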

After formatting the measures as currency (which saves me time by not having to format this data in the resulting pivot tables separately), the measures appear as this:

Now I can proceed to build the pivot table by clicking on the PivotTable button in the Home ribbon and then selecting PivotTable from the dropdown.

Excel prompts me to create the PivotTable in either a new worksheet or an existing worksheet. I will select a new worksheet here.

After clicking OK, Excel creates my empty PivotTable and opens the Field List to let me begin defining my PivotTable.

I choose a very simple layout with my calculated measures going across the top of the table and a hierarchy of time coming down the rows of the table.

By default, Excel displays the names of the months alphabetically. Typically this will not please your manager. Therefore, using a technique I discussed in an earlier blog, I specify a column that contains the month number as the column by which to sort the month names.

Keep in mind that you only have to do this once. All subsequent PivotTables and PivotCharts will use the sort order assigned to the column EnglishMonthName without my having to do anything else.

Now the months appear in the correct order.

But more importantly for this demonstration, if I drill down into any of the months, I will see the sum of the order amounts, shipped amounts, and due amounts correctly summed by actual dates. Some positions will naturally be empty if there was no order, ship, or due activity for that day.

As promised, the following figure shows an example where orders were taken on two different days within the month but shipping and due dates only occurred once. The point being that the individual sums are correct for the days in which they appear.

Looking at the orders in April of the above figure, you may question the math, saying that $2,204,542 plus $82 is $2,204,624, not $2,204,623. This is the result of rounding each individual amount to whole dollars and is not an indication that Excel cannot perform math. All sums are calculated on the actual dollar amounts and then rounded.

So this is one simple example of how to use role playing dimensions. Dimensions other than dates can serve as role playing dimensions, but most people will encounter the need for role players when working with dates.

C’ya next time.

Truth in Advertising

I stopped at a local gas station the other day.  It is one of those that print coupons on the back side of the gasoline receipt.  I’m not going to tell you who it was, but I will say that it is a major chain that has been pushing hard in the Orlando area for the past year or more.  Anyway, after filling up my tank with gasoline, I finished the transaction and asked for a receipt.  On the back of the receipt they have pre-printed coupons usually for food items in the store.  Sometimes, I will immediately go into the store to use one or more of the coupons so of course I turned over the receipt to look at what I could get.  On the back were three items that I would consider getting since it was early evening and I was looking for a snack anyway.  But when I looked at the coupons, I saw that the expiration date had already passed.

Huh?

I just got the gasoline and printed the receipt. How could the coupons have already expired? Sure enough, the expiration date was December 31, 2013 and the current date was January 17, 2014, both clearly printed on either side of the receipt. Of course this had to be a mistake, so I went into the store anyway to see if they would still honor the coupons since they had clearly just been printed. However, when I finally got the clerk’s attention away from the phone game she was playing, ‘Candy Crush’ I believe is the name, she said that she was not allowed to take expired coupons and had indeed been reprimanded previously for taking a coupon that had expired.

I tried in vain to explain that I had just purchased gasoline and this receipt was just spit out from the pump. She said that it does not matter. Rules are rules. Ok, I get that, but why was the paper in the pump filled with expired coupons? What does that say about how they really feel about their customers? After all, this was more than two weeks after the expiration date. I do kind of get that they did not want to throw out perfectly good receipt paper (except for the fact that the coupons had all expired). But would they ever catch up to the point where the coupons were not only still valid, but had a few days before they would no longer be good?

Sometimes it is the little things like this that cause people to like or dislike stores or vendors.  It is not just the big mistakes, but the lack of attention to details that make customers feel unwelcome and unappreciated.  After all, if it wasn’t these coupons, it would have been other coupons.  Isn’t the purpose of the coupons to get customers to first buy their gasoline at their store and then hopefully come into the store to buy other things besides the items on the coupons?  Good marketing and good customer service might say that it would be best to just honor the coupons regardless of the date and either replace the paper used by the pumps with new paper or just continue to honor the old coupons until the paper was used.

This weekend I celebrated a birthday.  Maybe I should not say celebrated as much as I observed the passing of another year.  I realize that I have not posted as often as usual because this past year has been quite a roller coaster ride and will try to get back on track with regular blog entries over the next couple of weeks, especially now that I have an external keypad for my iPad.

However, I would like to remind all of my local readers that the Tampa SQL Saturday is coming up later this month, February 22nd to be exact.  I hope to see at least some of you there.  The tracks should cover most areas of SQL Server.  However, it should be no surprise that I will be involved in the Business Intelligence tracks again.  For more information, go to: http://www.sqlsaturday.com and check for the Tampa event in the upcoming events list on the home page.  Hope to see you there.

C’ya next time.

What Have I Been Up To?

Last week I was at the PASS Summit in Charlotte, North Carolina. It was my second time attending the summit and also my second time speaking at the summit. I think they said that this was something like the 15th year for the summit. There is nothing like getting together with a couple thousand other professionals who have professional and/or personal interests in the same things you do. It is an opportunity to talk with colleagues that you may only communicate with through email or twitter throughout the rest of the year. It is also a chance to reconnect with existing friends and to make new friends who share the same interests as you.

This year, the number of sessions dedicated to business intelligence and the Microsoft BI stack has grown. Part of that is that the BI stack itself has expanded, with everything from the tabular model in SSAS to augment the multidimensional model, to Data Quality Services (DQS), Master Data Services (MDS), Power Pivot, Power View, Power Map, Power Query and Power Q&A. (If you would like to try some of the latest members of the BI stack, go to: http://office.microsoft.com/en-us/excel/power-bi-FX104080667.aspx ). Of course, PASS also supports a separate BI conference that is scheduled for the spring. However, the growth in business intelligence was especially interesting to me because of my personal interest in the entire stack of tools since the introduction of PowerPivot several years ago.

Personally, I’m looking forward to one more SQL Saturday this year, the Tampa SQL Saturday, BI Edition (http://www.sqlsaturday.com/248/eventhome.aspx ) which happens on November 9th.  I will also be speaking there so if any of you have an interest in BI topics, come out to Tampa in a little over 2 weeks and say, ‘Hi!’.  The BI edition of SQL Saturday was founded in Tampa several years ago by Jose Chinchilla and I’m glad to have been a part of each subsequent annual event.

Other events that are coming up are the monthly meetings of both Orlando SQL groups.  The south group, MagicPASS (http://magicpass.sqlpass.org/ although I don’t think the web site has been updated), will be meeting in Celebration, FL on Wednesday, October 23rd and the north group, OrlandoPASS (http://orlando.sqlpass.org/ ), will be meeting in Sanford, FL on Thursday, October 24th.  If you are more into SharePoint, there is a meeting of the Tampa SharePoint group on Tuesday, October 22nd at 6:00 PM to 8:30 PM at the Microsoft Office, Tampa (https://www.eventbrite.com/event/8853808981/).  It is a busy week as you can see.

The other news that you may have already heard is that CTP2 of SQL Server 2014 was released last week during the PASS Summit. I have not looked into 2014 yet so I don’t have a recommendation one way or the other. However, if you have processes that would benefit from improved performance, the new Hekaton engine in SQL 2014 may be able to provide some performance improvement with very little effort (read that as no program changes). During the Summit, the Thursday keynote was presented by David DeWitt, who tried to explain the technology behind how Hekaton achieves its performance gains. Most people left the keynote with their heads hurting, trying to understand the magic behind the technology. However, it seems like the magic can be summarized as the elimination of latches and locks, which reduces contention when accessing data, combined with the new column store method, which allows more data to be read into memory and reduces disk access for many operations. Of course that may be too simple and probably incomplete. After all, my head started to hurt as well.

Well, that’s it for this week.  I guess it was more of a summary of what I’ve been up to and why I did not have much time to post any blogs for the last two weeks.  I’m sure my text here does not even begin to do justice to the excitement of the summit,  so perhaps it would be better if I just left you to go out to the summit site and watch some of the interviews and keynote talks here: http://www.sqlpass.org/summit/2013/PASStv.aspx.

C’ya next time.

A Short Guide To Installing MDS 2012

Before I can talk about using MDS (Master Data Services), I feel a small responsibility to help you get MDS up and running. Although MDS has been available since SQL Server 2008 R2, the method of installation has changed from the use of a separate installer file (.msi) to being baked into the installation wizard of SQL Server itself. Fortunately, that change makes installation easier. However, I will focus on the SQL Server 2012 installation here.

If you did not install all Shared Features, you may want to reinsert your SQL installation disk and step through the screens until you reach the Feature Selection screen as shown in the following figure. If the Master Data Services option is already selected, that means that you previously installed MDS as part of your current SQL Server installation. If it is not selected, you must select this Shared Feature and then proceed through the rest of the installation screens to add the necessary bits to your server.

After MDS has been installed, you must configure it. To do this, open the Master Data Services Configuration Manager. You can find this application in the Microsoft SQL Server 2012 folder under the Master Data Services folder. Note that before you can run the MDS configuration, a check for prerequisites is made on the machine. MDS requires Windows PowerShell as well as Internet Information Services. In fact, if IIS is not installed, selecting MDS during the SQL install may not install MDS correctly and you will be prompted to reinstall it.

Assuming that your machine has all of the necessary prerequisites, you can proceed to the next step. The configuration manager consists of two pages of properties. The first page is the Database Configuration. Just like DQS (Data Quality Services), you must select or create an MDS database that MDS can use for its working area. To do this, you must define the database instance and the specific database to be used for Master Data Services. For now, we can leave the other System Settings at their default values.

The second page of the configuration manager prompts you to create the web site to be used with MDS (which is why you need IIS on the machine running MDS). You must also identify the database created on the first page of the configuration manager. Finally, you choose whether you want to integrate your MDS installation with Data Quality Services (DQS). This option only exists in SQL Server 2012 because DQS is new to SQL Server 2012.

At this point, you should be able to start up MDS. The main screen shown below adds a feature in 2012 that did not exist in the 2008 R2 version, the ability to use an Add-in for Microsoft Excel to help you load and administer your data.

WOW, it sure seems like Microsoft Excel is getting used throughout more and more of the Microsoft BI stack. First PowerPivot and Power View, then the tabular model in SSAS and now MDS. If you never took the time to learn Excel because you thought you were a SQL guru and did not need Excel, perhaps now is the time to brush up on your Excel knowledge.

Well, I’m going to stop here for this week so you can get MDS set up on your machine. Besides, it is Labor Day weekend and most of my United States readers probably have better things to do this weekend.

C’ya next time.

MDS: The New DQS? Not Really!

First, you need to know that the MDS I’m talking about here is Master Data Services, introduced in SQL Server 2008 R2. I’ve already talked about Data Quality Services (DQS), which I believe was only introduced in SQL Server 2012. So the natural questions are: what is MDS, why do I need it together with the newer DQS, and is DQS just a replacement for MDS so that we can forget about learning MDS now that we have DQS?


Previously I talked about DQS and showed how you can use it to create individual knowledge base domains that can then be used to clean data. Applying DQS on a regular basis to clean data going into a production database should improve your overall efficiency compared to manually cleaning the same data. This is because DQS remembers the corrections you made previously. You can also define rules for ‘fixing’ individual domains. Efficiency improves over time because the knowledge base understands how to handle an increasing number of problems without requiring manual intervention. However, at the same time, users have to be realistic. There will probably never come a time when absolutely all possible data issues have been addressed, allowing for fully automatic data cleansing.


You could think of DQS as the detergent that helps you clean bad data from a data source, getting rid of invalid and misspelled data. It can also help you reduce duplicate records within that individual data source. It can even output a data file of cleansed information with duplicate records removed or at least marked. However, DQS provides no assistance in merging the clean output data file with another data source. The ability to combine a freshly cleaned data source with a master data file is what Master Data Services can provide. To some extent, both products have the ability to cleanse and match data from an input data source. However, the rules in DQS are domain (column) centric while the rules in MDS are typically more record centric. Also, MDS, as the name implies, helps you import data from new sources into a common master table. Another way of saying this is that MDS might be the gatekeeper of your enterprise data system that controls what data can or cannot be added to your data warehouse.


Because both DQS and MDS provide the ability to define cleansing rules, users may jump to the conclusion that these two products are mutually exclusive. However, the purpose of the cleansing rules in the two products needs to be clarified. The purpose of the cleansing rules in DQS is primarily to correct and validate values added to each domain. DQS rules can determine which values are allowed or not allowed in the domain. These rules are not record based, but column based.


On the other hand, the business rules created in MDS primarily exist to prevent invalid records from being added to a master data table.   In other words, it keeps new records from entering or updating the master set of records which may sit in your enterprise data warehouse. 


Based on this description, you should see that DQS and MDS are clearly different products.  Furthermore, you may not always need to use both of them.  If your purpose is merely to clean bad data from a data entry system and possibly remove duplicate records, you may be perfectly satisfied with using only DQS.  Similarly, if your task is to merge clean data from multiple data sources such as sales data from different countries, you may be more interested in the abilities of MDS to combine records based on business rules that prevent record duplication while allowing data to be merged.  Of course if the data sources that you want to merge have not first been cleaned, you may need to first run them through a DQS project with a corresponding knowledgebase to ‘fix’ any domain issues such as different product codes for the same product in different countries or perhaps even differences in the way customer address information is recorded.


In future weeks, I will take you on an exploration of MDS in between some other technical topics that I want to cover. Before I end today, I want to remind all my readers of two events coming up very soon where I can meet you.

The first is the SQL Saturday Orlando event coming up on September 14th. SQL Saturday events were started right here in Orlando back in November 2007 by Andy Warren. I’ve been fortunate to have been a speaker at every Orlando event. This year I will be presenting an introduction to DQS that I call ‘Putting Quality Back Into Your Data’.


The second big event coming up is the 2013 PASS Summit that this year is being held in Charlotte, North Carolina during the week of October 15-18.  This will be my second year speaking at the PASS Summit.  If you have never been to a PASS Summit, you are missing the opportunity to meet, talk with and listen to presentations from SQL professionals from around the world.  I feel very honored to be able to give two presentations this year, one on DQS and the other on using SSAS’s Tabular Model to build Cubes that you can manipulate as Excel PowerPivot models.

If you have the opportunity to get to either of these events, please stop by to say Hello.  I would love to meet you all.


C’ya next time.


Create a Matching Policy – 2

Last week I left you with the thought that an exact match can be either a prerequisite or just one of several domains used to match records. The purpose of using an exact match as a prerequisite is that the domain is so crucial to determining whether two or more records match that if the value of this domain is different between two records, there is no possibility for the records to match. Therefore, there is no need to check any of the other domains when this prerequisite domain is different. This may seem counter-intuitive at first, but we must remember that we are not looking for unique records here, but matching records, and eliminating records early by using a prerequisite will speed up the matching process.


While you can have more than one prerequisite domain, not every domain that you are trying to match on should be a prerequisite field.  In fact, DQS requires that you have at least one domain which is not a prerequisite.  Fortunately, fields like business name and business address should not be prerequisites because there is a high probability that these could be misspelled or formatted differently.  For example, Adani & Baez Computers Inc. is probably the same business as Adani and Baez Computers Inc. and the same as Adani & Baez Inc. even though they are not exact matches.  If we used the Business Name field as a prerequisite exact match, these three names would not be considered a match.  A similar argument could be made for Business Address and even Business City.

Therefore, we would add these additional domain elements in our matching rule.  Each domain element other than any prerequisite domains must have a weight factor.  The weight factor determines how important matching domain values are in determining whether two records are a match.  Initially, you may have no real preference for how to weight the additional domain values.  A good place to start however is to make all the domain weights equal.  Therefore, if you have five additional domains, make each one 20%.  Note the total of the weighting factors must add to 100%.


After evaluating all the domains for each pair of records, DQS calculates a number between 0 and 100% to represent the chance that the two records are a match. A score of 100 would indicate an exact match across all the domains used in the rule, but most pairs of records will have a value less than 100. How much less than 100 still indicates a match? DQS uses a Minimum Matching Score to filter the potential matches. For reasons beyond the scope of this blog, DQS places a minimum score limit of 80% on matching records. While I can change the Minimum Matching Score to any value between 80 and 100, I cannot set it any lower than 80. This value can be found below the Rule Name and Description as seen in the following figure.

Another option you must set is whether a record can appear in only a single matching set of records or whether it can belong to more than one set of matching records. The first option is called non-overlapping clusters and the second option is called overlapping clusters. With non-overlapping clusters, each record appears in only one cluster or is considered a unique record. This is the option I would usually select if my goal were to eliminate duplicate records. However, using overlapping clusters might be a good way to analyze other relationships between records. For my analysis here, I will use non-overlapping clusters.


Next, I click the Start button shown in the previous figure to begin the analysis. This process can take a substantial amount of time if I am working with a large database table. Therefore, I recommend creating a smaller representative table to test out my matching rule definition before running the final rule against my full table. After the matching process completes, a grid appears at the bottom of the screen. In this grid, which is filtered on the matching records, you can see how DQS has grouped similar records based on the matching rule. Each matching group is called a cluster and is given a cluster id that has the same value as the first record id added to the cluster.


Clicking on the Profiler tab provides some interesting results.  In the figure below, I can see that my test table had nearly 8000 records.  In the grid to the right, I can see the number of new and unique values for each of my domains.  New values are domain values that I may want to go back to add into the knowledge databases for the domains if they truly are unique and not just bad data values.  The Unique column tells me how many of the 8000 records had unique domain values.  I can see that there are only 5129 unique business names.  That means that I should have nearly 3000 potential duplicate records based only on the Business Name.  Looking at Business Address, there are 6683 unique addresses.  This tells me that there are at least some business records with the same business name, but different addresses.  This is entirely possible if the business has more than one location and may not be an error.  On the other hand, of 8000 records, there are only 224 unique city values, 32 unique states, and 465 unique zip codes.


The low rate of unique values for these three fields tells me that their contribution in determining matching records is low.  At the same time, it makes each of these domains candidates for an exact match or even a prerequisite match.  In fact, that is why I selected Business States to be an exact prerequisite match.

If I previously cleaned my data elements to remove inconsistent abbreviations and spelling problems, I would want to look at the fields with the most unique values as potentially more important when determining matching records. However, just because a domain has a high uniqueness value does not make it a good matching value. Take Business Phone as an example of a poor domain for determining matching records. Because a business could have more than one phone number at a given address, different phone numbers do not necessarily mean that the records are not a match. Even a different address may not be as important as having the same business name. Therefore, I use the uniqueness values as a guide, not an absolute, to determine how I might want to reassign weights to each domain in the rule instead of weighting them all equally.

Another thing to consider is to examine each cluster and determine why some of the matching scores are less than 100%.  In the figure below, I have highlighted a record with only a 96% match score.  To find out why this score is not 100% I can right click on the row to display the details.


The Matching Score Details dialog shows the two records that are being compared at the top, along with the overall matching score. In the second grid, I see the other domains with their contributing scores. In the figure below, I can see that the phone number score was only 79%. However, looking at the two phone numbers, I can see that the difference was most likely a key entry mistake. In any case, the 79% score on this domain is only worth 20% of the total matching score. Since the other domains are exact matches, I have:

0.2 * 100 + 0.2 * 100 + 0.2 * 100 + 0.2 * 100 + 0.2 * 79 = 95.8, which rounds to the 96% shown

But perhaps more importantly, this tells me that the phone number domain may not have been cleaned.  If I do not have a reliable way to do this without spending a lot of time or money, I may simply want to deemphasize the importance of the Business Phone domain and rerun the matching rule.


Finally, when I am done, I can proceed to the next page of defining the matching policy, where I will run the rule one last time.


Clicking Finish will allow me to publish the matching rule back to the Knowledge Base as shown in the following dialog.


After the publishing process completes, the final dialog informs me of its success as shown below.


I now have a matching policy saved with the business address domains that I can use in a project with the full database table.

C’ya next time.