Appending Data Sources Manually in PowerBI

Last week I looked at how I can load into PowerBI an entire library of source data files and let PowerBI determine how to append the resulting files together. This method works great if all the data source files are of the same time and have the same schema/structure. But what if some of the files come from an Access data source, some from CSV files, some from Excel, etc. Furthermore, what if the data column order is different or perhaps if the column headers are different. These are cases in which you may need to import each data source individually. Then edit the query for each source to align the column, column names, data types, etc. before combining the data sets together. Let’s see how that looks.

For simplicity sake, I am going to start from the same folder as last week that holds 10 CSV files each representing a different sales region.

Although I am not going to show images from each data load, I would proceed to load each CSV file into my PowerBI data model one file at a time rather than loading the folder like last time. This method creates a separate table within PowerBI for each data set as shown in the following figure.

I can view the data in each table by selecting the table name in the Fields panel as shown above. The first thing I would want to do is to ‘fix’ any inconsistencies in the data schema in these tables by selecting the table and then clicking on the Edit Queries button in the External Data group of the Data Tools Modeling ribbon. I basically want to build a common table structure across all of the data sources including a consistent column name for each column that appears in multiple tables.

You should note that if I have a data source with a unique column that does not appear in the other tables, I can keep it. Later when I append the individual tables into a single table, the unique column will be brought into the final data model, but the field will be blank for all the other tables that do not have that column.

In my first example here, all of my data schemas from each data source are the same but they would not have to be. So after simply loading the data from each data source, I am ready to start combining the data into a single table for analysis. To do this, I click the Edit Queries button in the External Data group of the Home ribbon and then select one table to start the process.

There is a button on the far right of the Home tab while in Edit Query mode in the Combine group called Append Queries. I can use this button to begin the process of combining the tables.

The Append Queries button opens the dialog shown below that lets me select which table to append. I can only append one table at a time so to append together all 10 tables, I need to do this step 9 times.

As shown in the figure below, each appended query gets the generic name of: Appended Query followed by a number (after the first one which has no number).

This default name is not descriptive enough to help me identify which appended query refers to which data set. If I want to remove one of the data sets from the final table, I would have to click the settings button on the far right of the applied step (the gear icon) to reopen the Append dialog to see which table is being appended. Then if I want to remove that dataset, I could click the ‘X’ to the left of the applied step to delete that one step.

However, a better option is to right click on the default applied step name and select Rename from the popup menu that appears.

This option allows me to select the current applied step name and replace it with a more meaningful name.

The following figure shows a much clearer picture of which data set is being referenced in each step. It also shows that I have finished appending my 9 additional data sets.

Note that any steps applied to the individual data sets are still applied first prior to the data being appended to the final dataset.

Note: One thing that I did not do here but probably should have is to begin by making a duplicate copy of the table that I wanted to begin with so that I could preserve the original table with only its own custom transforms. Then using the duplicated table, I could append the rest of the data sets.

When all the data is appended into a single dataset, I can close and apply this transformation so that any data refreshes can repeat these steps.

I can now go to my Report tab and start to build the visualizations that I want. In the final figure for this session shown below, I create a table of Sales Amount by Sales Territory and included the Sales Territory Group, Region, and Country. I then click on the Sales Amount column to sort the table by this column in ascending order. You can see very quickly that most of the sales occur in Australia and the least sales are in the Central Region of the United States.

Beneath the table, I create another table with the Sales Territory Group and the Sales Amount. PowerBI automatically sums the sales for each Territory Group and displays a chart with only three segments. After creating the table, I change the visualization to the Donut chart to create the appearance shown below.

As with all PowerBI visualizations, the data in the table is linked to the data in the donut chart. If I click on the North America segment of the donut, the table on top refreshes to display only the 6 rows representing sales in North America. Similarly, if I click on the Pacific segment of the donut chart, the table above immediately updates and displays only the one line for Australia.

That’s it for this week. Come back next week for more PowerBI fun.

C’ya.

Advertisements

Loading and Combining Multiple CSV Files in Power BI

Suppose my job is to collect sales information for my company and I current receive text files of that data from each of our major sales offices around the world. Before I can do any analysis using Power BI, I will need to both load the data from each sales office and then combine the data into a single file. For today, let me assume that the format of the data from each sales office is exactly the same and the order of the fields is also exactly the same. As each sales office sends a copy of their data to me, I store their CSV file in a common folder called CSV_Sales as shown below, eventually getting data from all ten locations.

I am now ready to open Power BI and load my data to begin my analysis. I start by selecting the Get Data option after opening Power BI. This displays the following dialog which lets me specify the type of data I want to load. In the past, I showed several different ways to access individual files from different sources including CSV files. Indeed, I could again load each of the CSV files separately and then ‘somehow’ combine them into a single table for analysis purposes.

However, I notice an option that I had not selected before, Folder.

When I click the folder option, Power BI prompts me for the URL of the folder. If I am not sure of the path, I can click the Browse button and navigate to the folder and Power BI will figure out the path for me. Either way, I click on the OK button to continue.

Power BI then shows me the contents of the folder. For each file in the folder it provides metadata about the file such as its name, extension, date last accessed, date last modified and more. There is also a column at the far left of the grid named Content which in all cases has the word ‘Binary’ in it. This mysterious field in each row actually represents the data in the file. In fact, it is in most cases the only field that I care about.

If I click the Load button, highlighted the previous figure, Power BI loads the folder information into a table as shown in this figure. This is not what I want.

So instead, I click the Edit button on the previous screen which loads the data directly into the Query editor. (Yes, I could just click the Edit Queries button in the Home ribbon to get to the same place, but why go through two steps when one will do.)

However, as I said previously, I don’t need all these other columns that provide information about the data files. I only care about the data inside the files. Therefore, I select the first column, the Content column and from the menu select the submenu under Remove Columns and click the option to Remove Other Columns. This is a faster way of getting rid of columns I do not want rather than selecting each column and then clicking the Remove Columns option.

Once I have only the Content column, I can focus on the button on the right side of the column header. Notice that it is a little different than the buttons on the right side of the other columns. Instead of just a single arrowhead pointing down, this button has two arrows pointing down to a line. This button means that I want to download the actual data from within the binary files into a separate table. Therefore I want to click on it.

As you can see in the following figure, I now have all of the columns from the sales table. Because Power BI only presents a preview of the data in this mode, it is not clear whether this file just contains data from the first table or whether it contains data from all of the data tables in the selected folder.

Because I am still in Edit Query mode, I need to close and apply my transformations to the folder by clicking the Close & Apply button in the Home ribbon.

Now when I return to the Data page of the Power BI Desktop, I can see my data and in the Fields dialog on the right side, I can open the CSV_Sales table definition to see all the fields in the table. But I still don’t truly know if I have the data from all the CSV files or not.

Next I open the Report page and create a simple table that displays several of the columns from the CSV_Sales table. I select the fields: SalesTerritoryGroup, SalesTerritoryCountry, SalesTerritoryRegion, and SalesAmount. I can quickly see now that the table does indeed include all my regional data from all 10 sales regions.

Just for fun, I can build a second table that includes only the fields SalesTerritoryRegion and SalesAmount and then convert the table into a TreeMap chart by clicking the TreeMap visualization. This visualization shows the contribution by sales territory region to the total sales by creating proportional rectangles within a larger rectangle that represents total sales. If I hover over any of the boxes, a popup displays the name of the sales region along with the sales amount for that region.

But wait a minute, I previously said that there were 10 regions, but I only see 7 colored rectangles representing only seven of the regions. What happened to the other three? Well, they are actually there, but they are so small compared to the total sales that they are nothing more than slivers at this scale. In fact, even as I expanded the size of the chart. It was difficult to see these last three sales regions which even added together represent less than 0.1% of the total sales. However, they do exist along the right edge of the chart as you can see as I zoom into the bottom right corner.

Well, I hope you found that interesting. I’ll look at some more Power BI goodies next time.

C’ya.

Power BI Knows Dates

Back in PowerPivot days, if I wanted to be able to drill down through a report by dates (Year, Quarter, Month, Day), I had to build the hierarchy and typically I had to build custom columns to hold these date parts. I would even have to do some fancy ‘Sort by Column’ changes to get the months of the year to display in the correct order. Power BI takes a lot of that work away and makes it easy to drill down through your data.

Let me start by loading some data from the Adventure Works Data Warehouse database. In the figure below, you can see that I’ve loaded several of the tables from this database, but for today I’m going to focus on just two tables, FactInternetSales and DimDate. The first thing I notice is that these two tables are connected with three relations represented by three lines, one solid and two dotted. This is because DimDate is a role playing dimension since there are multiple dates in the FactInternetSales table that could be associated with the date table to create reports.

By clicking on any of the relationships, I can see which fields are associated with that relationship since the field names will be enclosed in a box. In the figure below, you can see that the active relationship (the solid line) connects the DateKey field in DimDate with the DueDateKey in the FactInternetSales table.

However, I want to use one of the other relationships. To change which relationship is active, I can open the Data Tools Modeling toolbar and select the Manage Relationships button.

You can see that 6 active relationships have been set between the tables and two are currently inactive.

Suppose I want to use the OrderDateKey for my report. It would be nice if I could just click on the relationship that I want and like radio buttons (or option buttons) the previous relationship would be marked inactive and the new relationship would become active. Unfortunately, this just results in an error screen as shown here.

You have to first mark the relationship that you do not want as inactive by clicking on the yellow checkbox to the left of the relationship. Then you can select the relationship you want as shown in the next figure.

Now when I return to my Relationships page, the active connection is now linking the DateKey field in DimDate with the OrderDateKey field in the FactInternetSales table

I did this just to show that I can still work with the tables like I did in PowerPivot, but I really did not need to do any of this. If I open the Reports page and in the right side fields panel select first Sales Amount and then Order Date which is defined as the data type Date/Time, I will get a table that looks something like the following:

If I then switch to a stacked column chart, I will see a chart like the following (after a few formatting changes) that displays the Sales Amount by Year.

If I click on any of the columns, a dialog box appears with some details about that column. However, that is not what I want. I want to be able to drill down on any given year when I click on it.

In a relatively recent enhancement to PowerBI Desktop, several buttons have been added to the header. If I click on the icon third from the right as shown in the next figure, I can toggle the Drill Down feature. Drill down is turned on when the icon is mostly black instead of white.

Now when I click on the third column from the left (the 3rd quarter) the chart automatically updates to display sales by month and it displays the months July, August, and September in the correct order as shown next.

Clicking on the August column of this chart drills down to the next date level which is days in the month as shown here.

At any time, I can move back up the hierarchy by clicking the icon on the far left side of the header which is the Drill Up icon.

In fact, if I click on this icon twice, I can return to the Sales by Quarter chart. Now suppose I want a chart that displays all the months in the year. The button second from the left lets me drill down for all the current column on the current chart to the next level. So if I am displaying quarters, it show me the sales by month.

As you can see in the following chart, all the months in the year now appear. But perhaps more surprising is that the months appear in the correct chronological order, something we would have had to do with a Sort by Column option in a standard PowerPivot chart.

I hope you found that ability to drill down through a date hierarchy automatically without having to build that hierarchy first as interesting as I did. I’m at a SQL Saturday today in Tampa presenting an introduction to Power BI to the attendees. Maybe I’ll see you there.

C’ya next time.

Using Power BI On-line

During the last several months, I’ve taken you through an initial exploration of Power BI Desktop, a program that you can download directly from Microsoft. However, there is another version of Power BI, one that resides in the cloud. Many of its features are similar to Power BI Desktop, but there are also significant differences. Let’s begin the new year by taking a look.

First you have to sign up to use Power BI. Click on the following link to go to the self-service signup page:

https://powerbi.microsoft.com/en-us/documentation/powerbi-service-self-service-signup-for-power-bi/

You should see a link on the page that says: self-service signup process which takes you to the link:

https://powerbi.microsoft.com/en-us/

The screen (shown below) tells you that you can sign up and get started with Power BI for free. That is almost true. There are some notable exceptions. For example, you cannot sign up with an email address provided by consumer email services or telecommunication providers, this includes the fact that you cannot sign up with a Microsoft Hotmail account. Also .gov and .mil addresses are not allowed at this time. I guess that surprised me a bit because government and military intelligence is something that we are in great need of. However, I guess if you want to call it BI, then strictly speaking it is business intelligence only. J

Another reason you may not be able to signup is because your IT department has disabled self-service signup for Power BI even though your organization may have the Office 365 SKU that supports Power BI. Again Microsoft provides instructions for your administrator on how to fix that problem should they so decide to let you use Power BI. Again, I can only assume you must work for an organization that has no interest in business intelligence if they tell you no.

Of course, Microsoft does provide a way for you to personally register for a new Office 365 trial subscription and use that ID to sign up to use Power BI.

You will get an email from PowerBI/Microsoft asking you to confirm your email address.

After you have successfully navigated all the steps to get a Power BI account, you should see an initial power BI screen like the following:

If it is not already there, you can downloaded the financial sample file from: http://go.microsoft.com/fwlink/?linkID=619356. I just don’t remember how I got it originally. To install the data after downloading it, I used the Get Data button at the bottom of the left navigation panel

I can then click the Get button within the Files box to load the CSV file I just downloaded.

This will provide me with a flat file dataset to work with as a sample. The next figure shows what appears to be an entry for both the Database group and the Dashboard group with one entry in each named Financial Sample.

By opening the Dataset Financial Sample, I can choose which fields I want to work with from the Fields list.

I can simply click the check box to the left of the field name or I can click on a field name and drag it over to the working area. For this demo, I will use Sales and Month Name.

While I do have the ability to customize some features such as the formatting of the title text and even to highlight one of the columns as shown below, I quickly realize that starting from a raw data file like this in Power BI, I can not define a custom order for the Month dimension to force the months to appear in chronological order.

Rather than being in chronological order, it appears that the month columns are ordered from the largest to the smallest in sales value. While possible, this is typically not what I would want to display.

I’m going to leave that problem as it is for now and attempt to click on the Dashboard entry for Financial Sample. I am told that I have a report with unsaved changes. When did I work on a report since I never clicked inside the Reports group? Well, it appears that when I define visualizations in the dataset area, I am actually creating a report and must save it if I want to keep it.

When I click Save, Power BI prompts for a report name. For now, I am just going to call it: TestReport.

With the report saved, I now can see the Financial Sample default dashboard.

The first thing I notice at the top of the page is a text box which is asking me to enter a question about the data. This is something new that we have not seen in the Power BI Desktop. The interesting thing is, I can enter a question (or even a statement since the syntax of a question is not as important as the words used in the sentence) that references one or more of the data fields in my dataset. In this case, I can ask a question like: What are the total sales? (Interestingly, I can just type: Total sales and get the same result.) As I type enough of the question for Power BI to guess at what I want, it starts displaying results

I can continue to drill down to get sales by product by just adding the words by product to my question/statement from the above example. Note that Power BI automatically decides to switch from a simple text answer for a single value to a bar chart with the bars representing my product dimension.

I can just as easily request to see total sales by month. Note in this image, the months are displayed in the correct chronological order even though the report I created earlier did not show them in that order by default.

I can even display sales by country as shown below.

By simply changing the visualization, however, I can create a map of sales by country which might be a better way to present geographic data to management.

So I’m going to stop here in this week’s tour of Power BI. I’ve discussed some of the issues you may need to deal with to get access to Power BI and have shown how to load and use a simple flat data file in the form of a CSV file. In future weeks, I will explore more of the features of Power BI and especially how you can use Power BI Desktop or even Excel to create the datasets you need to display in Power BI.

On last point that I will cover in more detail at a later time, but which you might want to be aware of at least for now. One of the advantages of using Power BI to display your data analysis is that you can save your dashboards and make them available to others within your organization as a simple download. Yes, with Power BI Desktop you can save your model and copy the file from your machine to others. However, if you make a change, you must redistribute the changed file each time. Also refreshing the data is a manual process at this time when using Power BI Desktop. We will see in future blogs how these issues are addressed by Power BI to make your life simpler.

C’ya next time I’ll take our Contoso data and show how easy it is to use it from a One Drive file.

The Power of the Power BI Query Editor

The power of the Query Editor in Power BI is more than that it allows you to load data from a variety of sources, transform that data to meet your analysis needs, create new columns of calculated results and define special views of the data. One of the things that I am impressed with is that it will remember all of the steps you put into getting your data ready for use and it can rerun all those steps the next time you reload raw data from your data source. So before we roll into a new year and begin looking at Power BI Desktop’s big brother, PowerBI.com, let’s take a look at the Power BI Desktop Query Editor.

In the past, when I first opened Power BI, I would immediately select the Get Data button from the Home ribbon. What if instead I immediately clicked on the Edit Queries button in the External Data group.

The Query Editor opens its own window with 4 tabs: Home, Transform, Add Column, and View. Looking at the Home ribbon, you should note that many of the most commonly used actions appear on this menu as well as on the other ribbons. At this time, there does not appear to be a way to customize the actions that appear on this and the other ribbon, but as with many things Power BI, that may come in time.

The first think I need to do is to open a data set to work with. When I click on Get Data, I get the same drop down of data source options that I would see from the Home ribbon of the Power BI Desktop screen.

In this case, I am going to load my standard Contoso dataset including the FactSales table with related product and date data. I will reference my local host SQL Server. Of course, you may have to reference a network SQL server. I generally do not enter the database name on the first dialog, but select the database from the second screen. Note that on the second screen, I can just click the down arrow to the left of the database name to open the database and display a list of the tables within it. This is in my mind a lot easier than spelling the name of the database correctly before selecting the tables I want to work with.

Because I am using a SQL Server database, I have the option of either importing the data or using DirectQuery which does not download a copy of the data to my local data model, but rather reads the data directly from the SQL Server instance.

Because I selected 5 tables from the database, the Query Editor list 5 queries along the left side of the screen, a preview of the data in the center and the query properties and any steps applied to the data on the right side of the screen.

I can click on any of the five queries to switch the one that I want to work on. In the above image, I elected the DimDate query. Now suppose that I want to make some changes to this data. Perhaps the first thing I want to do is to eliminate all the fiscal calendar fields from the table. One way to do this is the select the columns I want to remove. I can click on the first column and then while pressing the CTRL key click on each of the other columns I want to remove. I can also click on the first column I want to move and if the columns I want to remove are sequential, I can move to the last column I want to remove while press the SHIFT key and click on the column header of that last column.

Still on the home ribbon of the Query Editor, I can click on the Remove Columns button to get rid of the columns I do not want to keep in my final model. I notice that as I do this, another step appears in the Applied Steps section in the right column.

Next I want to add a new column to the date table that I will use to sort the months chronologically rather than alphabetically. That column will simply consist of the month number (January =1, February = 2, etc.). I can do this by clicking on the Add Column tab and selecting the Add Custom Column from the General group. Note the option Add Index Column. This option allows you to add a sequential number in a new column for each row that either begins with a ‘0’ or a ‘1’. You can also add a custom index that begins with any number and has a custom increment which may be different from ‘1’. This could be an interesting way to add a surrogate key to a table when you have to merge data from two difference sources (maybe domestic and international) into a single table for reporting purposes and you still need create a ‘new’ index because the domestic index and the international index may have overlapping values.

When I choose to add a custom column, the Query Editor displays a dialog to let me define the name for column as well as the formula. All of that is well and good, but the Query Editor provides an easier solution. Instead of clicking the Add Custom column, note that the Add Column ribbon has sections for working with text, numbers and dates. In this case, all I need do is to select the column [DateKey] and then click on the lower portion of the Date button to display the menu of options. If I select Month, I get a second set of options of different calculations related to month. The first option actually creates the new column and calculates the month number while the other options return the first day of the month and the last day of the month as date or the number of days in the month as an integer. That is pretty cool too.

Now the default name for the column is just Month. However, I can easily change that by right clicking on the column header and selecting Rename from the dropdown menu and entering the column name I want.

As I said before, each of these steps have been recorded in the Applied Steps section to the right of the screen. If I make a mistake or want to try something different, all I need do is click the ‘X‘ to the left of the step name to remove that step and then try something else. Remember that I am only working with a preview of the data at this point. I have not downloaded the data yet.

I could proceed with other data changes for this query as well as the other four queries, but I hope you get the idea that you can perform any transformation, create new columns, etc. that you may want. I will come back in the future to discuss some of the other features of the Query Editor that could save you time.

Let’s assume that I am done making changes to my data. It is now time to load the data from SQL Server into my model. From the Home ribbon of the Query Editor, I click the button: Close & Apply.

This will close the Query Editor window and begin loading the data from SQL Server into my local in-memory data model using the transformations in the applied steps of each query. This is the secret to how to minimize the data you bring into your model from a data source. Any excess columns will be removed. You can even get rid of specific rows (like we did in PowerPivot to eliminate some of the product categories when we loaded data many weeks ago). You can also create new columns, rename new or existing columns, and basic structure the data they way you want to use it. All of these transformations will happen as the data is loaded into your local data model. Then each time you refresh your data, these same query steps are performed consistently with each refresh.

Well, it is the beginning of a new year, at least a new calendar year. I think it is time that we move on to the big brother of PowerBI, PowerBI.com next time so that we can show how and when each tool should be used and how they can work together.

C’ya next time.

Power BI – Measures and Slicers

Sorry about missing last week but it was a holiday here in the states where parents encourage their kids to go up to strangers and ask for candy. Also people of all ages, not just kids, get dressed up in costumes that range from the cute Elsa princesses to the DBA zombies and the slutty managers. Yeah, some of them were really scary especially since it wasn’t always a costume.

Anyway, last time I talked about creating custom columns and added the column TotalProfit to the FactSales table of my model. (If you missed that discussion, go back to my blog from two weeks ago.) Column expressions are easy to create because they largely resemble row context expressions that you might add to an Excel spreadsheet. In fact, most of the syntax and functions are exactly the same as in Excel.

But we can also add custom measures. The difference between a column and a measure is that most columns can be used as dimensions in a pivot or matrix table especially the columns that contain alphanumeric data. Typically the columns that have numeric data appear as aggregated values, summed, averaged, minimized or maximized depending on the goal and the dimensions used. In fact, the TotalProfit column, while calculated for each row in the FactSales table, is typically displayed as a summed value such as in the following image taken from where we left off last time.

In this matrix table, the TotalProfit is first calculated for each row in the FactSales table that will be included in the final matrix table, but the individual calculated values are then summed by one of five channels that exist for sales. Thus with my sample data, the calculation for TotalProfit occurs a little over 3 million times, but these calculations are made only when the column is created, not for each visualization. Then those values are summed by channel for the above visualization. As this happens on my Surface computer in less than a second, that is still pretty impressive. However, there is another way we can calculate total profit for this table.

We can create a measure to calculate total profit. A measure is calculated each time it is used in a table. In a table like the one above, a total profit measure would only be calculated four times, once for each channel, but each calculation would include three sums over potentially hundreds of thousands of rows and these calculations occur when the visualization is created and in each visualization the data might be needed. To create a measure, find and click on the New Measure button in the calculations group of the Modeling ribbon.

In the expression box, I can see the start of the expression. Note that unlike PowerPivot, the measure name is followed by just the equal sign rather than a colon and equal sign. The rest of the expression is similar to what I used before except that I have to aggregate (in this case sum) each of the values in the expression (you cannot sum an expression). In order to distinguish this expression from the prior one, I include an underscore between ‘total’ and ‘profit’.

After defining the measure, it appears in the field list on the right. While it is still selected, I go to the formatting section of the ribbon and adjust the data format to $ English (United States) which gives me comma separators and two decimal places by default. Note that I can and should do this for any other table field that I will use in a visualization.

Now I can add this measure to my matric table from above replacing the TotalProfit calculated column with the Total_Profit measure. In order to calculate the total profit for channel sales, Power BI has to sum the SalesAmount column for all channel sales and subtract from it the sum of channel total costs and the sum of channel discount amounts.

Your first thought might be to ask how is this better than calculating the total profit for each row in the FactSales table and then simply summing the included rows into a single value. Well like a lot of things in the real world, it depends. Typically a calculated column increases data load times each time you load the model because only the expression is saved, not the individual values (or at least so I’ve been led to believe), but for many visualizations that use most if not all of the data, the impact on calculating the values for the visualization are no worse and possibly even less than using a calculated measure. On the other hand, for a calculated measure that only appears in a few visualizations or when the visualization has been filtered to include a smaller subset of the total data, a calculated measure can improve both data load times and dashboard display times.

But perhaps more importantly, not all expressions can be written as either a calculated column or a measure interchangeably as I have done here. For example, if I wanted to calculate the percent that each channel sales represents from the total sales, I would have to use a measure because there is no way to aggregate percentages over multiple rows. I want you to think about that and I may return with a detailed example in a future week. In the meantime, I want to explore one additional feature of this model.

Those of you who have followed me through my travels in using PowerPivot remember the concept of using slicers to allow the user to analyze different segments of the data by clicking on dimension values. For example, using the above table, I might want to see the profit numbers for different product categories. In PowerPivot, I could create a slicer and if I had more than one visualization on a page, I could associate each chart or table with that slicer.

In PowerBi, I can also define slicers for the reports on the page. To do so, I click in a blank area and select the field from the dimension table that I want to use as the slicer/filter. In this example, I will use the ProductName field from the dimProductCategory table. This dimension only has eight values. The table is relatively short and appears initially as shown below.

To convert this table list into a slicer, I need to select the Slicer visualization as shown in the following figure while the above table has focus. This tells PowerBI to use the table as a slicer rather than just displaying the values of the field. (Yes, I could add a filter to the visualization directly, but a slicer can automatically apply to multiple visualizations on the page.)

After being defined as a slicer, each value of ProductCategoryName has a selection box to its immediate left and two additional entries have been added to the top of the list to select all values or to select only those records that have a blank for the product category.

If I click on any one of the categories, the data in the other visualizations on the same page automatically filters out all records from other categories as shown below.

I can also select multiple categories by clicking on several of the checkboxes to include in the matrix sales from all of the selected categories. Note that in a case like this, the measure I defined actually will perform fewer calculations because the sum function only acts on the filtered records of the slicer, not on all records in the channel.

I can return to displaying all categories either by clicking the Select All option, by clicking on each of the individual category names, or by using the Erase icon to the right of the table name.

C’ya next time.

Deriving Columns from What You Know in Power BI

Before I get started with the topic of the day, I want to remind you that Power BI is still being updated by Microsoft. In fact, there was an update just last week on October 20th that appears to have added some missing features that I mentioned before. So if you haven’t updated Power BI recently, be sure to do that before reading the rest of this blog.

A few weeks ago when I was looking at defining relations between tables in the Relationship dialog I was disappointed in the fact that when I display my tables in what I would have previously called a Dialog view in PowerPivot, I could not use drag and drop to define my relations. Well, my disappointment is over. The latest update, among other things, adds this capability.

Since last Wednesday, October 21, was ‘Back to the Future’ day, I want to go back to last week’s blog just after I added the DimProductCategory table. This time however, rather than using the Management Relationships dialog, I am just going to the Relationships view and drag the field ProductCategoryLabel to ProductCategoryKey as shown in the following image.

Now I must say that I am use to just identifying the fields from my two tables that I want to connect to form the relationship and expect PowerPivot to figure out which is the one and which is the many table. Unfortunately, Power Bi still is not quite this smart. Because when I attempt to define this relationship from the one site to the many side, the relationship definition fails as shown in the following figure.

Maybe the next update will automatically reverse the direction of the relationship for me. However, for now I can click the OK button in the message box shown above and then redefine the relationship correctly from the DimProductSubcategory to the DimProductCategory table.

Now the relation is created and I am ready to continue. By the way, when I click on the relationship, Power BI highlights the two tables as well as the relationship line. It also encloses the connecting fields in a box to make it easy to identify the relationship fields.

If instead of left clicking on the relationship, suppose I right click on the relationship. Now I get an expected option menu which in this case consists of only a single option, Delete. Clicking this option will of course delete the relationship.


On the other hand, double clicking on the relationship opens the Edit Relationship dialog shown below. I can use this dialog to view the relationship or a sampling of the data, or make changes to the relationship.

That’s great. But like an infomercial, there’s more! In the Relationship view (Diagram view) I can also now right click on a field and delete a field, hide a field from the Report view or rename the field.

As with PowerPivot, if I delete a field, it is gone for good. If I realize later that I need the field, I would need to delete the table and reload it to get the missing field. Of course there are other consequences to doing this like needing to redefine the relationships and possibly rebuilding reports (visualizations) that used that table. Because at this time, I cannot visually define the fields to include or exclude during my original data load from my data sources, it is important that I know my data and delete any fields that I know that I will not need before I begin creating new columns, measures, or reports. In PowerPivot, we called these columns Useless columns.

I can also hide some columns from the Report view. For example, I can often hide columns used to define relationships between tables because end users typically do not include these columns in reports. Hidden columns are referred to as Technical columns because they are required in the data model and cannot be deleted without destroying relationships or perhaps calculations of other fields. I never want to show more columns to a user than they know what to do with.

Finally, renaming columns can be very beneficial. Often column names in databases have cryptic or abbreviated names. End users may not be comfortable with these shorten names. Use the Rename feature to make names user-friendly and descriptive.

Not only can I delete, hide, or rename columns in a table, but by right clicking on the table header, I can perform these same actions on an entire table. For example, I may not like dimension tables that begin with the letters ‘Dim’ like many DBAs prefer but end users may have no idea why the table is Dim. Similar to columns, I don’t delete tables from my model unless I am positive that I do not need them. Hiding tables only makes sense if I want to use only one or two fields from a table. In this case, I may need a column from another table for a calculation, but I would never display those columns directly in reports.

Returning to the data view of my tables, I can also right click on any of the column headers and delete, hide or rename the column. There are also several other options ranging from sorting, to creating new columns or new measures.

I can also right click on the column names in the fields list along the right side of the Data view. The dropdown list of options shown below is similar to options in the context menu above. So I have several different ways to manage columns

Let’s try something new, a new column in fact. In my FactSales table, I can find several columns like SalesAmount, TotalCost, and several others, but there is not Total Profit column. I can easily calculate that value from other values in the table. To begin, I need to create a new column in my data model. That means clicking the New Column button in the Modeling ribbon.

New columns are created at the right end of the table. By default, the column name is cleverly called Column. Of course I can change that by entering a new name. Then after an equal sign which indicates that an expression will be used to define the column, I can begin entering the column definition using DAX expressions. Yes, DAX is still alive and well. If you need a review of using DAX, I’ve covered multiple DAX topics over the last couple of years of this blog.

As in PowerPivot, I can select a column from the current table by just typing the left square bracket. This action opens a dropdown of all column names in the current table listed alphabetically. I can scroll down through the list and select a column by double clicking on its name. I can also type a few characters of the column name to narrow down my list as shown below.

My full expression to calculate TotalProfit is shown below

When I click the Enter key, the entered expression is used to calculate the values for all the table rows.

Before moving off the column, I might want to define custom formatting for the values. The formatting definition here is carried forward to all reports generated with the data. In the above example, I might want to only display the dollar amounts to two decimal places. (Actually, because of the size of aggregated data, I might later decide to format the values with no decimal places.)

Suppose that I want to display some of this data using a standard table with Channel names for the rows and a few select columns from the table. Notice that the values displayed here obey the formatting definition set on the Data page.

That’s it for this week. Next time I cover creating New Measures and why you might want to do so.

C’ya.

Creating Relationships Manually in Power BI

Last time we loaded a BI model with data from several different data sources, but the data sources were carefully selected so that the columns that would relate one table to another table would already have the same name and data type before attempting to upload the data to Power BI Desktop. Because of this assumption, I was able to let Power BI auto detect the relations between the tables. Unfortunately, in the real world, that does not always happen. So this time I will show you how to create relationships manually and how you may need to massage the data a little first before defining those relationships.

I am going to use the Contoso data again and will begin with the following tables already loaded into the model from my SQL Server instance: FactSales, DimChannel, DimDate, DimProduct, and DimProductSubCategory. To complete my model, I need the Product Category data which I am going to load from an Access table in this case.

Since we covered how to load data from SQL Server last time, I am going to pick this discussion up after I have loaded the SQL data and am about to load the Product Category data. The image below shows that I have already click the Get Data button and have selected Access database from the list of possible data sources and have defined where my Access database could be found. As you can see in this figure, Power BI only finds a single table in my Access database named DimProductCategory. When I select this table, I see a preview of the data.

In the table preview (on the right), I see that there are only two columns named ProductCategoryLabel and ProductCategoryName. You might be able to guess from the way the data in the column is formatted that the ProductCategoryLabel data has been defined as text, not numbers because it is left justified.

After clicking the Load button to import this data into my data model, I switch to the relationships view and see the table diagram shown below.

I see that there is no relationship defined with the DimProductCategory table because there are no connecting lines leading into or from this table. Therefore, I first want to select the Manage Relationships button. I can see the four relationships that tie together the other five tables, but I am missing a relationship. I may first try to click the auto detect button which I talked about last time to see if Power BI can find the missing relationship

Unfortunately, Power BI very quickly responds back that it cannot. The problem is that while the DimProductSubcategory table has a field named ProductSubCategoryKey, the new table, DimProductCategory, does not have a field with that name.

So I might try to create the relationship manually by clicking the New button on the Manage Relationships screen (after closing the Autodetect message screen of course).

On this screen I have to specify the names of the tables and the linking columns to define the relationship. I can begin with either table on the top. After I select the table, I see a grid of the available fields along with a couple of rows of data. To select the field that I want to use in the relationship, I merely need to click the column header to select the field.

I then specify the name of the linking table along with selecting the column that I want to link to.

There are some advanced options which I am going to skip over for now and simply click the OK button. Power BI then attempts to create the relationship. But wait a minute, you might say, the column names are different and the data types of these two fields are different. That is true. I am not surprised that I can link two tables on fields with a different name, but different data types? In fact, I expected Power BI to reject this relationship because of the different data types, but it did not. It created the relationship as shown in the following image

To find out if this relationship is really valid, I next go to the report desktop and select the ProductCatgoryName field from the DimProductCategory table and the ProductSubcategoryName field from the DimProductSubcategory table. You can see from the figure below that the data makes sense. The general category Audio should include subcategories like Headphones, Radios, and Speakers, but not Camcorders, Cameras, or Cell Phones. Somehow, Power BI was able to transform automatically one of the data types to the other (I am thinking it converted the string field to an integer) and then formed the relationship.

Yes, I could prove that this relationship is correct another way by creating a simple table that lists ProductCategoryLabel from the DimProductCategory table and ProductCategoryKey from the DimProductSubcategory table. You can clearly see that the relationship links these two tables correctly.

Surprised by this, I wanted to try something else that may not be as easy for Power BI to automatically convert. I took another table, the Stores table, and modified the Excel version of the table to change StoresKey to prefix the store number with the letter ‘S’. I then loaded that table into my model as shown below.

Next I went to the Manage Relationships dialog as before and attempted to add a relationship between the FactSales table and the Stores table as shown in the following figure.

Again Power BI did not complain about creating the relationship. However, when I went to the reports page this time and attempted to display a report of sales amount (from FactSales) against the StoreID from the Stores table, I got the result shown below which indicates a total sales for each year, but no StoreID value at all. (I used the Matrix visualization here with StoreID for the rows and YearLabel for the columns and SalesAmount for the value.) So clearly this relationship did not work.

Therefore, I need to edit one table or the other to ‘fix’ this relationship. I choose to edit the Stores table. Remember from last time, to edit a table, I click the Edit Queries button to open a new window where I can edit the tables.

In this case, the solution is to remove the ‘S’ from the front of each of the StoreID values. I can do this with the Split Column transformation and specifically to split the column based on the number of characters from the left rather than splitting the column based on any specific character or character string like I did in an earlier blog example.

This action results in two columns, the first, StoreID.1, contains only the first character of the StoreID field, the ‘S’. The second field, StoreID.2, contains the numeric portion of the store ID. As you can see in the formatting of this column in the following figure, Power BI also treats StoreID.2 as a numeric value.

I then removed the StoreID.1 column as something I will not need. I also renamed StoreID.2 to just StoreID.

I then clicked Close and Apply.

Now I can create a new relationship between FactSales and Stores using StoreKey and StoreID respectively as shown below. Note that I did not change the field names to match. If I had, I could probably use the auto-detect feature to find and create the relationship for me.

This time when I attempt to create the same report on sales by store and year, I get reasonable results as shown in the table below.

Well, I hope you found that interesting. Next time I plan to probe a little deeper on creating calculated columns within a query.

Till next time, c’ya!

Power BI can Load Data from Multiple Data Sources

This week I’ll show you how to use Power BI to pull together data from multiple data sources that you can then use to create visualizations.

The first step is to open your copy of Power BI Desktop. Then click on the Get Data button. I’m going to begin with some data from SQL Server so I do not have to click the More… button to see all the available data sources, SQL Server is right on the main dropdown.

To work with a SQL Server database, I must first enter the name of the SQL Server that I want to connect to. Note that while the dialog shown below prompts for the name of the database, this information is optional. You might think, ‘How can the database name be optional?’ In a moment, I will show you that Power Bi first needs to connect to the server, but then it can retrieve the database names that you have access to and display them for you to select. This protects you from the possible misspelling of the database name in this dialog. Since I am running everything on a single machine, my SQL Server name is LocalHost.

Of course before you can connect to the server, you must provide credentials. I am running everything on a Surface Pro 3 so I will use my Windows credentials. However, if you need to connect to another server on your network, you may need separate credentials.

The following dialog may or may not appear depending on the database you are connecting to and your version of SQL Server. However, in most cases, clicking the OK button allows you to connect to your data as long as you have rights to the SQL Server box.

When Power BI connects to the server, it first begins with a list of the possible databases along the left side of the screen. By opening a database (clicking on the arrow to the left of the database symbol, I can see the tables in the database. I can select multiple tables from this database. I can even select multiple tables from multiple databases. With each table I select, I will see a sample of the data in the right panel. Note that unlike PowerPivot, there is no way to limit the columns or rows from the table at this point. Maybe in the next major release.

You’ll notice that this week I return to my favorite sample dataset, Contoso.

If I select multiple tables to load, they all appear in the following dialog that shows me the progress of loading my data. In the image below, I can see that I’ve selected five tables from the database. As the data load progresses, Power BI shows me the number of records loaded in the line beneath the table name.

After the data loads, I notice that the table names appears in the Fields column on the far right of the desktop. But you say that these are table names? If I click the arrow to the left of each table name, the table expands to show the names of the fields essentially grouped by table.

Next I can click on one of the icons on the left side of the desktop. The first icon is the report surface. I typically work there after I have loaded all my data and made any changes to the columns and/or rows. So some might think that this icon should be on the bottom. The second icon is the data icon. I can use this icon to edit the column definitions, add new columns, or remove columns that I no long need. I showed some of these techniques in prior weeks of this blog.

However, I usually start with the bottom icon, Relationships. Relationships are the heart of any multi-table analysis. If I do not define the relationships between the tables first, I really cannot do any analysis. I might not be able to create some column either. So let’s start there.

The diagram that Power BI creates shows the tables that have been loaded into the model and if relationships were defined in the source data, it tries to create those relationships for me automatically. Notice that the diagram clearly identifies the one side of the relations with a ‘1’ and the many side of the relationship with an asterisk.

Unlike the diagram view in Power BI, there is no ability to edit, delete, or create new relationships from this dialog. Nor can I delete columns, define hierarchies, or rename columns here. I can hide individual columns or entire tables from the report view of the model and I can maximize the view of the individual table diagram, but that’s it. Again, maybe in the next major release of Power BI these additional features will appear so that the functionality matches that of PowerPivot diagram views.

Next I’m going to load some additional data into my model beginning with some data from an Access database and then data from Excel. I’m not going to take the space here to show screen images of these processes because between the above description and prior week blog discussions, I believe that you can figure out how to load data from different data sources. The only problem you may encounter is the need for new or updated connectivity tools to access your other data sources.

Once I have loaded all my data, my first action is to go to the Relationships view. You can see in the following figure that Power BI automatically shows my additional tables that I’ve loaded, but they show no connection to the rest of the tables in the data model. I need to fix this before going forward.

Find the Manage Relationships button in the Relationships group in the ribbon and click on it.

A dialog appears that displays the current relationships between the tables. This view is similar to the manage relationships dialog in Power Pivot. I can add, edit and even delete relationships here. However, the interesting button is the one labelled Autodetect

When I click Autodetect, Power BI attempts to find and create relationships between the tables that are not currently connected to the data model. It does this by analyzing the field names in the new tables with names in the tables of the data model to find other fields with the same name and the same data type definition. Of course, one table must serve as the parent (one-side) and the other table must serve as the child (many-side). In most cases, you will start with a fact table and create branching child tables so they usually serve as the one side in each of the new relationships. If Power IB is successful, it will attempt to create that relationship and will first tell you in a dialog box such as the one shown below how many relationships it detected.

After closing the information dialog reporting the number of relationships found, the relationships are created and the results for this model appear in the updated Manage Relationships dialog shown here.

Note at this point, I could edit any of the relationships that were not correctly created, I could also add relationships between tables that perhaps had different field names that could not be automatically detected. It is a good practice to check the Relationships diagram to insure that every table is connected to the model through at least one relationship.

I am now ready to go to the Reports desktop to create some simple charts.

Creating a report is as simple. First click in a blank area of the Reports desktop. Then click in the checkbox to the left of the value field you want to count, sum, or average such as Sales Amount. Then click on the field or fields by which you want to report Sales Amount (or any other value field). For example, I might want to see sales by year or by Product category or by Channel. You can create as many visualization as you want on the desktop although you should keep in mind that they need to be large enough to be readable. Let’s start with the three shown below.

Not a bad start. Here is a fun feature that Power BI supports that Power Chart does not. Just click on the Year 2009 column in the chart in the upper left. Immediately, all charts on the page change to shows the sales related to the year 2009 in a darker color and sales from other years in a light color.

Note that I did not have to do anything special to get this to work. I did not have to create a filter, a slicer, or anything. Similarly, I could click on columns in other charts to slice and dice the data different ways.

Filtering the data on the page is actually done entirely differently. To filter the data that appears on the page, you have to define a page filter by dragging the field or fields that you want to filter by into the area beneath Page Level Filters found in the column to the immediate left of the column with all the table fields. Filter fields typically are string values and typically, there are not a large number of unique value. To filter the page, just click in the checkbox to the left of each value that you want to include on the page. You can filter on one or many values at a single time.

After selecting a page level filter, the visualizations on the page change automatically to represent only the filtered data rather the entire dataset. The following desktop image represents only the data from the year 2009.

Well, that’s all for this week, but if you have been following along, save your data from this week because I’ll pick up this sample data from this point next time.

C’ya.

Formatting Power BI Visualizations

Being able to mash together data from multiple sources to create meaning full analysis in Power BI or even PowerPivot can be lots of fun, at least to my data geek friends, but presenting visualizations of that data to management without the ability to customize the look would fall flat on its face at your next executive presentation. Therefore, I am taking a side trip this week to show you a little about formatting your visualization, enough to get you started exploring how to make individual visualizations look good.

I also want to introduce you to a set of simple data table that you can use for free while learning how to use Power BI. A team from the Paris Technical college put together some data and perform some basic clean-up of the data. The team includes: Petra Isenberg, Pierre Dragicevic and Yvonne Jansen and the data can be found at: http://www.perso.telecom-paristech.fr/~eagan/class/as2013/inf229/labs/datasets.

For today’s blog, I am going to use their CARS dataset. While it has over 400 cars with 8 characteristics, it does not have the new Tesla X model L, but it is a good dataset to show some basic visualization formatting technics.

Thus I start by importing the CARS.CSV file by using the Get Data option.

After loading the dataset (shown below), you can see a preview of the data in which the first two data rows are not cars. In fact, the first row looks like it could serve as column headers or in terms of our data set, column names. While the second row looks like it defines the preferred data type for that column. To begin loading the data, click the Load button in the bottom right of the screen.

After loading the data, the first thing I need to do is to make some quick changes to the dataset by clicking the Edit Queries button.

In the Transform group of the Home ribbon of the Query Editor, I can click on the option: Use First Row As Headers. This is similar to the PowerPivot load technique in which you could define whether your data included headers in the first row.

Next I want to get rid of the row with the data type (which is now the first row). But before going further, I take note of the data type of each column. I will need that information in just a moment. In the Reduce Rows option group, there is an option to Remove Rows.

Click the bottom half of the Remove Rows button. This action opens a dropdown letting you specify how you want to remove rows. I am going with Remove Top Rows because the data type row is now the first row in the dataset. This option now opens a dialog that lets me specify how many rows I want to remove from the top of the dataset (Do you see why I had to deal with the header row first?). In this case, I only need to remove 1 row, but you may encounter other datasets that require you to remove more than a single row.

Next I noticed the car column appears to be a concatenation of the car make and model.

Looking through the data, the car model appears to always be a single word. Therefore, there is an easy way to split this column by click on the Split Column button and then select By Delimiter.

The obvious delimiter here is the left-most space character.

After splitting this column and renaming the columns as Make and Model, my data table looks like the following image.

Next, I noticed that all the numeric data was loaded as text. This is a direct result of reading data from the CSV file. Remember the second data row which told us the preferred data type? We can use that information to change the data type (in the Formatting group of the Modeling ribbon) of each column as shown in the following image.

I also want to change the year column to a four digit year. Because this data does not include cars from the 21st century, I can get away with simply concatenating all the values in the YEAR column with a prefix of ’19’.

After completing all my data preparation, I can click on Close and Apply so I can begin creating reports (visualizations) with this data.

Let’s begin with a simple table of average MPG for each year. On a new report page, I can drag the columns MPG and YEAR from my list of columns for the table. Note that by default, numeric values are summed. To change MPG to calculate the average MPG, open the dropdown for MPG in the Values section of the Visualizations column on the right and click on Average. If YEAR is being treated as a number, you can select ‘Do Not Summarize‘ from its dropdown. My table now looks like the following image.

I would rather see the year first, then the MPG so I need to change the order of the columns. You can do this by clicking on and dragging the field name in the Values section of the Visualizations (where I changed the aggregation types) and dragging the field up or down as appropriate to place the columns in the order I want.

While this table is interesting and shows a general increase in the average MPG of cars in the dataset, it is not as clear to see how the average rises and falls from one year to the next. Therefore, I might want to change to a column report.

While this is a nice basic chart, it needs labels to explain to someone viewing it for the first time what they are seeing. Let’s begin by clicking on the Format icon just below the Visualization selection grid.

Now instead of seeing the Values and Filters groups under the Visualizations, we see various chart components. Each of these components have a dropdown arrow on their left which will open that section to expose additional properties for that chart element. They also include a simple sliding switch to turn that element on or off within the chart. Note that currently the X-Axis and Y-Axis are turned on, but Data Labels, Title, and Background are not.

If we open the properties for the X-Axis, we see formatting options for this chart element. Although not needed here, I can change the axis type, the axis scale (linear or logarithmic), a start and end value for the axis, and a Title, Style, and Color. For the X-Axis, I only want to turn the Title property on by moving its slider to the On position.

Next I open the Y-Axis properties. I change my axis scale to begin at 10 and default to automatic for the end value. Often when data values are very similar, plotting their values with a y-axis that starts at zero de-emphasizes the variations from one value to the next. Setting the starting point of the y-axis to a value closer to the smallest y-axis value will emphasize those differences. Think of a stock that trades in the lower $200s range with a daily fluctuation of a single dollar or less. Those fluctuations are masked if the chart has a y-axis data range of 0 to $250. On the other hand, changing the y-axis to begin span only the range 200 to 250 emphasizes the daily fluctuations.

As you can see below, my chart is starting to look a little better.


Next I open the data colors group. I can change the default color which is an aqua shade to any of the other colors by selecting either from the Theme Colors provided in the Default Color dropdown by going to the Custom Color option and defining my own unique color for the chart.


However, this changes the color of all the columns to that one custom color. What if I want to display different columns using different colors? If I slide the Show All slider to On, the property dialog now shows a separate color selection for each column in my chart as shown below.


Did you notice that the years are not displayed in chronological order? Did you wonder why? The years are actually displayed in order by the size of the Average MPG so that you could, if you want, define a progressive shift in color shade as the value of Average MPG increases. That being said, I decided to customize the colors by year as shown in the following image.


Finally I went back to turn on the Y-Axis title to arrive at the visualization shown below.


To add detail to the chart, I also turned on the Data Labels group which causes the data value to display at the top of each bar.


Now my chart shows both the actual data values as well as the relative change from year to year.


Finally, I turn the title element on. Here, unlike the x-axis or y-axis, I can specify the text to display as the title. For the x-axis and y-axis labels, the text is taken directly from the column names. I can also specify the font color, a background color and the alignment of the title across the top of the chart.


My chart is starting to look pretty good but it is still missing something.


I open the last chart element group, Background, and turn on the background giving it a custom color.


Now I have a final version of the chart that I might want to publish and add to a dashboard (which I will talk about in a future blog.)


Okay, while I did not show everything that formatting can do, I did want to give you a start. Please note that if you use other visualizations, they have their own formatting options as well which work the same or similar to these. Have fun playing with visualizations until next time.

C’ya.