Calculating Percentages in PowerBI

Last time I showed how you can pivot a multivalued column from a survey downloaded from SharePoint to Excel. After some transformations, I created the following report that shows the number of career responses in each of the 18 categories. While interesting in and of itself, it would be more interesting to see how the career choices break down by different dimensions such as grade level, the gender of the student, or perhaps even the individual school.

In my report page, I am going to reserve an area along the left side of the page to list several of the dimensions I'm interested in and to display the total number of students who took the survey along with the number of students in the filtered subset when I filter by one of those dimensions. Basically, that means building something like the following:

This image actually consists of four separate visualizations. The top visualization, for Total Students, uses the Card visualization and initially displays the count of the ID column in the survey table, which has one record for each survey. I duplicate this visualization and drag it below the first one. Then, for the top visualization, I open the Format option panel in the Visualizations section and make the following changes:

  • I turn off the Category Label
  • I turn on the Title and then open the Title section to add the Title Text: Total Students
  • I change the font size and color to make this information stand out.

I then repeat the process for the second visualization but with the Title Text: Sampled Students because this count will represent the number of students included in the visualizations taking into account any filters applied to the page.

I then add two single-column tables, one for gender and one for grade level. Because I want to use the values in these tables as page filters, I change the visualization of these two tables to the Slicer.

Along with the table I created in my last blog, I can work with the slicer values to explore how the career choices change based on gender and grade level by clicking on the check boxes. When I do this, I see that the chart automatically adjusts its columns based on the changes I make to the slicers. Also, the number of sampled students in the second card visualization displays the count of students included after applying the slicers. Unfortunately, the total student count also changes. This I do not want. I want the total number of students to always represent all the students in the entire survey.

I can achieve this goal with a little DAX and a custom measure back on the data page for the survey table. The custom measure needs to count the IDs for all the records in the table, ignoring any filters applied to the report page. I can do this by passing the survey table name to the ALL() function, which ignores all page filters. Then I use the COUNTAX() function, which takes the output of the ALL() function as its data source and counts the IDs. While this may sound complicated, it is as simple as the following equation:
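Based on the description above, the measure likely looks something like the following DAX (the table name K8_Survey comes from the diagram view later in this post; the exact names in your model may differ):

```dax
-- Count every ID in the survey table, ignoring any report-page filters.
-- ALL(K8_Survey) removes all filters from the table before COUNTAX counts.
Students = COUNTAX(ALL(K8_Survey), K8_Survey[ID])
```

Because ALL() strips the filter context, this measure returns the same total no matter which slicer values are selected on the page.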

Notice that I name the measure Students. I must provide a unique measure name for each measure I create. However, I can then use that measure in any visualization such as the card visualization for the total students in the survey.

Back on my report page, I select the top card visualization (for Total Students) and change the field used from ID to my new measure, Students.

Now if I select any of the values in the slicer visualizations, my sampled students card displays the number of students included in the filter while the total students card displays the total number of student surveys taken, as shown below.

I then add some additional column visualizations on the right to display other data fields, such as which subject the student finds most interesting in school, or charts that display career choices by gender, by grade level, or by other criteria from the survey. Each of these charts begins as a simple table visualization in which I add the columns I want to use. I then convert the visualization to a column chart.

In the image below, you see the final result of the first page of my report. Notice that I also added a vertical line shape to separate the two card visualizations and the slicers from the other column charts.

Since each student was allowed to select one or more careers from the list of possible careers, the total number of career choices is significantly larger than the number of students. In the figure above, if I were to add the values in each of the columns, the total would be the number of career selections, which is over 17,000, not the number of students. Therefore, I might decide to display the same charts as a percent of all students rather than as a count of career selections.

Again I need another custom measure to calculate the percent based on total students. Fortunately, I already have a measure that calculates the total student count ignoring any slicer selection or page filter. Therefore, I can generate a percentage using a formula similar to the following:
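Based on that description, the percentage measure would look something like the following DAX, reusing the Students measure defined earlier (the measure name Pct Students is my own):

```dax
-- Filtered count of surveys divided by the unfiltered total.
-- COUNT(K8_Survey[ID]) respects the slicers and page filters,
-- while [Students] ignores them because of its ALL() call.
Pct Students = DIVIDE(COUNT(K8_Survey[ID]), [Students])
```

DIVIDE() is used here instead of the / operator so that a filter combination returning no students produces a blank rather than a divide-by-zero error.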

With this formula, I take the number of surveys filtered by the visualizations and slicers on the page and divide it by the total number of students who took the survey. The maximum value of this percentage would be 100% if every student who took the survey selected the same career, such as computers, as one of their choices. Similarly, the minimum value would be 0% if no student selected a specific career as one of their choices. Because students could select more than one career of interest, the percentages of the individual columns will not add up to 100%, but to some value greater than 100%. (Lesson learned: next time ask for their preferred career choice, then their second career choice, and finally their third career choice.)

Next I take the first page of my report and duplicate it by right clicking on the page tab and selecting the option to duplicate the page. (This is a lot faster than recreating the same visualizations on a new page, isn’t it?) On the duplicated page, I modify each of the column charts to display the Percent measure just created rather than the count of ID.

Why does this work? Because each student can select any specific career only once, even though they may select two, three, or more careers, I can simply divide the number of filtered students in each column by the total number of students in the survey. Each chart already divides the students by career choice, one career per column. Then additional filters from the slicers may limit the gender, the grade level, or both. Therefore, I can count the students that match all those filter criteria and divide by the total student count to get the percentage of students who have an interest in that career. In fact, this measure also correctly calculates the percent of students interested in each school subject (of which each student can select only one), which I can verify by summing the percentages from the chart in the upper left and getting a total of 100%.

The following figure shows the percent page of my report.

I can then proceed to add other visualizations on additional pages if I want, but I'll stop this example here for now.

C’ya next time for more exciting ways to use Power BI.


The Need To UnPivot Data

Today I am going to take a look at some interesting issues I encountered when pulling data from a survey into Power BI to perform some analysis. First some background. The survey existed on our SharePoint site and all submissions were stored within SharePoint. However, as a list, it is always possible to download the details from the survey to an Excel file. When I open the resulting Excel file, I see the following data.

Most of the columns are straightforward text columns. But looking closely at the rightmost column, I see that the data structure is a bit more complex than usual. This is the result of a survey question in which the respondent is allowed to select one or more items from the list of possible careers in which they have an interest. While some may respond with only a single career that they are focusing on, others may not be so sure yet, and so they have selected multiple careers in which they may have an interest.

You can see that the format of the data consists of the responses along with the corresponding ID values of the selected responses, separated by semicolons.

Obviously, I cannot perform analysis on such a column with multiple values. I need to get this data into a single vertical column with one row for each career choice selected by the respondent. The first step to achieve this result is to split the individual career values into separate columns. I can do this in Excel by selecting the column with the multiple values and then clicking the Text to Columns button in the Data ribbon. This function splits the text in each cell of the column into multiple columns each time a specific delimiter is found. In this case, I split the original text at each semicolon to create the new columns.

After splitting the data into columns, there are some columns that I no longer need and can delete. These are the columns that contain the ID values of the selected careers; I have no need for the ID values. I can also apply a series of Replace operations to clean up the remaining career values, removing extra characters that are not part of the career name itself. After a few transformations, I'm left with the following set of spreadsheet columns, which identify the respondent by ID along with the careers in which they have an interest.

This is still not exactly what I want, because I really need to normalize this structure to get multiple rows for each respondent with one career choice per row. However, as I will show in a moment, I can perform that transformation much more easily in Power BI than in Excel. So let me proceed to open Power BI and select Get Data.

When I open Power BI and choose Get Data, I select the Excel data type and click the Connect button. In the next screen (not shown) I will either enter the path to the Excel file or I can browse to the file using the Browse button. Once I select the Excel file I want to work with, Power BI opens the file and displays all the tables found in that file. In the figure below, you can see that there are several tables. Some of these tables consist of the original raw data and the transformation steps I used to create the datasets that I want to work with. Each table corresponds to a different tab in the Excel workbook otherwise known as an Excel Worksheet or simply Sheet.

After selecting the tables I want to work with, I click the Load button to import the data into my Power BI model. Some transformations might be needed to 'convert' some of the columns to user-friendly names. I could also have done this in Excel by changing the column header text. However, the transformation I want to focus on is in the career choice table, which I call PreCareers. A portion of this table appears below. This table consists of the ID column used to identify the respondent along with 18 columns representing their potential career choices. While most respondents entered three or fewer career choices, some entered significantly more. By default, the data is sorted with all respondents who had only a single career choice displayed first, followed by the respondents who selected two career choices, and so on.

In order to create a table that has the ID of the respondent and a separate row for each career choice they selected, I need to unpivot the 18 career choice columns. To do that, I select the Edit Queries button in the Power BI Home ribbon and then select the PreCareers table. Next, I select all 18 career choice columns. Then, opening the Transform ribbon, I find and select the Unpivot Columns command as shown in the following figure.

After selecting this option, Power BI performs the transformation to create the new table shown below, which now has a column named Attribute containing the previous column headers and a column named Value containing the career choices. Of course, I can rename these columns and probably will, but let me save my work as a safety precaution before continuing.
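For reference, the Unpivot Columns command writes a Power Query (M) step behind the scenes roughly like this one, assuming ID is the only column left in place (step and column names here are illustrative):

```m
// Keep ID as-is; rotate every other column into Attribute/Value rows.
// Rows whose career cell is null are skipped automatically.
UnpivotedCareers = Table.UnpivotOtherColumns(PreCareers, {"ID"}, "Attribute", "Value")
```

Using Table.UnpivotOtherColumns (rather than listing all 18 career columns explicitly) also means the step keeps working if a later survey adds a 19th career column.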

To save my changes, I would select the Close and Apply button from the Home ribbon. Remember that I can save my transformations multiple times and return back to Edit Queries to insert additional transformations at a later time.

However when I attempt to close and apply my changes, I get the following error message.

Reading this error message, I see that there is a problem with the ID column in my transformed PreCareers table. If I open the tables in diagram view, as shown below, the problem is evident.

You can see that the PreCareers table is linked to the K8_Survey table using a one-to-one relationship which, if I were to examine it, attempts to connect the ID column of one table to the ID column of the other. The problem is that now that I have unpivoted the selected careers in the PreCareers table, there are multiple records with the same ID value, not just one.

To solve this problem, I must remove the current relationship by right-clicking on it and selecting Delete as shown in the following figure.

You may have also noticed a yellow bar across the top of the page, shown below, saying that there are pending changes to your queries that have not been applied. With the relationship deleted, I can try to apply the changes again by clicking the Apply Changes button.

As the changes apply, I see that the field names in the PreCareers table are updated, as are the field names in the GradeLevel table.

I can now use drag and drop to create new relationships between the tables. For example, I can click and drag the ID field from the PreCareers table to the K8_Survey table as shown below. I can also connect the School field in GradeLevel with the School field in K8_Survey, relating these two tables.

With the relationships on my restructured tables back in place, I can go to the Report page and begin to create a report table that displays the possible career choices along with a count of the number of times that career choice was selected.

With a simple change of the visualization from a table to a column chart, I can visually show which careers were selected the most by the respondents. Note that because each respondent can select any number of careers, and the careers can be selected in any order, there is no way to reliably rank the career preferences and say that one career or another is truly the top choice. The chart can only show which career choice was selected most often of all the careers the respondents had an interest in.

Another important point to remember here is that the sum of the number of times each career choice was selected is NOT equal to the number of respondents because any respondent could select any number of career choices. Therefore if you want to know the percent of respondents who selected computers as a potential career, you need to know the total number of respondents, not the total number of responses.

I’ll delve more into the analysis of this data next time, so save this file.

BTW, this weekend I am at the Orlando Code Camp held at Seminole State College in Sanford, FL. If you happen to attend, please stop by to say hello. I’ll be conducting two BI related sessions, one for Power Pivot and one for Power BI.

C’ya.

Appending Data Sources Manually in PowerBI

Last week I looked at how I can load an entire library of source data files into PowerBI and let PowerBI determine how to append the resulting files together. This method works great if all the data source files are of the same type and have the same schema/structure. But what if some of the files come from an Access data source, some from CSV files, some from Excel, etc.? Furthermore, what if the column order is different, or the column headers are different? These are cases in which you may need to import each data source individually and then edit the query for each source to align the columns, column names, data types, etc. before combining the data sets. Let's see how that looks.

For simplicity's sake, I am going to start from the same folder as last week, which holds 10 CSV files, each representing a different sales region.

Although I am not going to show images from each data load, I would proceed to load each CSV file into my PowerBI data model one file at a time rather than loading the folder like last time. This method creates a separate table within PowerBI for each data set as shown in the following figure.

I can view the data in each table by selecting the table name in the Fields panel as shown above. The first thing I would want to do is to ‘fix’ any inconsistencies in the data schema in these tables by selecting the table and then clicking on the Edit Queries button in the External Data group of the Data Tools Modeling ribbon. I basically want to build a common table structure across all of the data sources including a consistent column name for each column that appears in multiple tables.

You should note that if I have a data source with a unique column that does not appear in the other tables, I can keep it. Later when I append the individual tables into a single table, the unique column will be brought into the final data model, but the field will be blank for all the other tables that do not have that column.

In my first example here, all of my data schemas from each data source are the same but they would not have to be. So after simply loading the data from each data source, I am ready to start combining the data into a single table for analysis. To do this, I click the Edit Queries button in the External Data group of the Home ribbon and then select one table to start the process.

There is a button on the far right of the Home tab while in Edit Query mode in the Combine group called Append Queries. I can use this button to begin the process of combining the tables.

The Append Queries button opens the dialog shown below, which lets me select the table to append. I can only append one table at a time, so to append all 10 tables together, I need to repeat this step 9 times.
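Each click of the Append Queries button adds an M step equivalent to a Table.Combine call, something like the following (the regional table names here are my own guesses):

```m
// Append one regional table onto the current query.
// Columns are matched by name; a column missing from one
// table is filled with nulls in the combined result.
AppendedQuery = Table.Combine({Sales_Region1, Sales_Region2})
```

Table.Combine actually accepts a list of any length, so in principle all ten tables could be combined in a single step, but the dialog described here appends them one at a time.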

As shown in the figure below, each appended query gets the generic name Appended Query followed by a number (after the first one, which has no number).

This default name is not descriptive enough to help me identify which appended query refers to which data set. If I want to remove one of the data sets from the final table, I would have to click the settings button on the far right of the applied step (the gear icon) to reopen the Append dialog to see which table is being appended. Then if I want to remove that dataset, I could click the ‘X’ to the left of the applied step to delete that one step.

However, a better option is to right click on the default applied step name and select Rename from the popup menu that appears.

This option allows me to select the current applied step name and replace it with a more meaningful name.

The following figure shows a much clearer picture of which data set is being referenced in each step. It also shows that I have finished appending my 9 additional data sets.

Note that any steps applied to the individual data sets are still applied first prior to the data being appended to the final dataset.

Note: One thing that I did not do here but probably should have is to begin by making a duplicate copy of the table that I wanted to begin with so that I could preserve the original table with only its own custom transforms. Then using the duplicated table, I could append the rest of the data sets.

When all the data is appended into a single dataset, I can close and apply this transformation so that any data refreshes can repeat these steps.

I can now go to my Report tab and start to build the visualizations that I want. In the final figure for this session, shown below, I create a table of Sales Amount by Sales Territory and include the Sales Territory Group, Region, and Country. I then click on the Sales Amount column to sort the table by this column in ascending order. You can see very quickly that the most sales occur in Australia and the fewest in the Central Region of the United States.

Beneath the table, I create another table with the Sales Territory Group and the Sales Amount. PowerBI automatically sums the sales for each Territory Group and displays a chart with only three segments. After creating the table, I change the visualization to the Donut chart to create the appearance shown below.

As with all PowerBI visualizations, the data in the table is linked to the data in the donut chart. If I click on the North America segment of the donut, the table on top refreshes to display only the 6 rows representing sales in North America. Similarly, if I click on the Pacific segment of the donut chart, the table above immediately updates and displays only the one line for Australia.

That’s it for this week. Come back next week for more PowerBI fun.

C’ya.

Loading and Combining Multiple CSV Files in Power BI

Suppose my job is to collect sales information for my company and I currently receive text files of that data from each of our major sales offices around the world. Before I can do any analysis using Power BI, I need to load the data from each sales office and then combine the data into a single table. For today, let me assume that the format of the data from each sales office is exactly the same and that the order of the fields is also exactly the same. As each sales office sends me a copy of its data, I store the CSV file in a common folder called CSV_Sales as shown below, eventually getting data from all ten locations.

I am now ready to open Power BI and load my data to begin my analysis. I start by selecting the Get Data option after opening Power BI. This displays the following dialog which lets me specify the type of data I want to load. In the past, I showed several different ways to access individual files from different sources including CSV files. Indeed, I could again load each of the CSV files separately and then ‘somehow’ combine them into a single table for analysis purposes.

However, I notice an option that I had not selected before, Folder.

When I click the folder option, Power BI prompts me for the URL of the folder. If I am not sure of the path, I can click the Browse button and navigate to the folder and Power BI will figure out the path for me. Either way, I click on the OK button to continue.

Power BI then shows me the contents of the folder. For each file in the folder, it provides metadata such as the file's name, extension, date last accessed, date last modified, and more. There is also a column at the far left of the grid named Content, which in every row contains the word 'Binary'. This mysterious field actually represents the data in the file and is, in most cases, the only field I care about.

If I click the Load button, highlighted in the previous figure, Power BI loads the folder information into a table as shown in this figure. This is not what I want.

So instead, I click the Edit button on the previous screen, which loads the data directly into the Query Editor. (Yes, I could just click the Edit Queries button in the Home ribbon to get to the same place, but why go through two steps when one will do?)

However, as I said previously, I don't need all these other columns that provide information about the data files; I only care about the data inside the files. Therefore, I select the first column, the Content column, open the submenu under Remove Columns, and click the option Remove Other Columns. This is a faster way of getting rid of the columns I do not want than selecting each unwanted column and clicking the Remove Columns option.

Once I have only the Content column, I can focus on the button on the right side of the column header. Notice that it is a little different from the buttons on the other column headers. Instead of a single arrowhead pointing down, this button has two arrows pointing down to a line. This button means that I want to extract the actual data from within the binary files into a table, so I click on it.
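The steps so far correspond to a Power Query (M) script along these lines (the folder path, step names, and default CSV parsing options are assumptions; Power BI generates its own version of these steps for you):

```m
let
    // List every file in the folder along with its metadata
    Source = Folder.Files("C:\CSV_Sales"),
    // Keep only the Content column, which holds each file's binary data
    ContentOnly = Table.SelectColumns(Source, {"Content"}),
    // Parse each binary as a CSV table and append the results into one table
    Combined = Table.Combine(List.Transform(ContentOnly[Content], each Csv.Document(_)))
in
    Combined
```

This is why the combine button works well here: because every regional file shares the same column layout, appending the parsed tables produces one consistent dataset.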

As you can see in the following figure, I now have all of the columns from the sales data. Because Power BI only presents a preview of the data in this mode, it is not clear whether this table contains data from just the first file or from all of the files in the selected folder.

Because I am still in Edit Query mode, I need to close and apply my transformations to the folder by clicking the Close & Apply button in the Home ribbon.

Now when I return to the Data page of the Power BI Desktop, I can see my data and in the Fields dialog on the right side, I can open the CSV_Sales table definition to see all the fields in the table. But I still don’t truly know if I have the data from all the CSV files or not.

Next I open the Report page and create a simple table that displays several of the columns from the CSV_Sales table. I select the fields: SalesTerritoryGroup, SalesTerritoryCountry, SalesTerritoryRegion, and SalesAmount. I can quickly see now that the table does indeed include all my regional data from all 10 sales regions.

Just for fun, I can build a second table that includes only the fields SalesTerritoryRegion and SalesAmount and then convert the table into a TreeMap chart by clicking the TreeMap visualization. This visualization shows the contribution by sales territory region to the total sales by creating proportional rectangles within a larger rectangle that represents total sales. If I hover over any of the boxes, a popup displays the name of the sales region along with the sales amount for that region.

But wait a minute: I previously said that there were 10 regions, but I only see 7 colored rectangles representing seven of the regions. What happened to the other three? Well, they are actually there, but they are so small compared to the total sales that they are nothing more than slivers at this scale. In fact, even as I expanded the size of the chart, it was difficult to see these last three sales regions, which even added together represent less than 0.1% of the total sales. However, they do exist along the right edge of the chart, as you can see as I zoom into the bottom right corner.

Well, I hope you found that interesting. I’ll look at some more Power BI goodies next time.

C’ya.

Power BI Knows Dates

Back in PowerPivot days, if I wanted to be able to drill down through a report by dates (Year, Quarter, Month, Day), I had to build the hierarchy and typically I had to build custom columns to hold these date parts. I would even have to do some fancy ‘Sort by Column’ changes to get the months of the year to display in the correct order. Power BI takes a lot of that work away and makes it easy to drill down through your data.

Let me start by loading some data from the Adventure Works Data Warehouse database. In the figure below, you can see that I've loaded several of the tables from this database, but for today I'm going to focus on just two tables, FactInternetSales and DimDate. The first thing I notice is that these two tables are connected by three relationships, represented by three lines, one solid and two dotted. This is because DimDate is a role-playing dimension: there are multiple dates in the FactInternetSales table that could be associated with the date table to create reports.

By clicking on any of the relationships, I can see which fields are associated with that relationship since the field names will be enclosed in a box. In the figure below, you can see that the active relationship (the solid line) connects the DateKey field in DimDate with the DueDateKey in the FactInternetSales table.

However, I want to use one of the other relationships. To change which relationship is active, I can open the Data Tools Modeling toolbar and select the Manage Relationships button.

You can see that 6 active relationships have been set between the tables and two are currently inactive.

Suppose I want to use the OrderDateKey for my report. It would be nice if I could just click on the relationship that I want and like radio buttons (or option buttons) the previous relationship would be marked inactive and the new relationship would become active. Unfortunately, this just results in an error screen as shown here.

You have to first mark the relationship that you do not want as inactive by clicking on the yellow checkbox to the left of the relationship. Then you can select the relationship you want as shown in the next figure.

Now when I return to my Relationships page, the active connection links the DateKey field in DimDate with the OrderDateKey field in the FactInternetSales table.

I did this just to show that I can still work with the tables like I did in PowerPivot, but I really did not need to do any of this. If I open the Reports page and in the right side fields panel select first Sales Amount and then Order Date which is defined as the data type Date/Time, I will get a table that looks something like the following:

If I then switch to a stacked column chart, I will see a chart like the following (after a few formatting changes) that displays the Sales Amount by Year.

If I click on any of the columns, a dialog box appears with some details about that column. However, that is not what I want. I want to be able to drill down on any given year when I click on it.

In a relatively recent enhancement to PowerBI Desktop, several buttons have been added to the header. If I click on the icon third from the right as shown in the next figure, I can toggle the Drill Down feature. Drill down is turned on when the icon is mostly black instead of white.

Now when I click on the third column from the left (the 3rd quarter) the chart automatically updates to display sales by month and it displays the months July, August, and September in the correct order as shown next.

Clicking on the August column of this chart drills down to the next date level which is days in the month as shown here.

At any time, I can move back up the hierarchy by clicking the icon on the far left side of the header which is the Drill Up icon.

In fact, if I click on this icon twice, I can return to the Sales by Quarter chart. Now suppose I want a chart that displays all the months in the year. The button second from the left lets me expand all the columns on the current chart to the next level. So if I am displaying quarters, it shows me the sales by month.

As you can see in the following chart, all the months in the year now appear. But perhaps more surprising is that the months appear in the correct chronological order, something we would have had to do with a Sort by Column option in a standard PowerPivot chart.

I hope you found the ability to drill down through a date hierarchy automatically, without having to build that hierarchy first, as interesting as I did. I'm at a SQL Saturday today in Tampa presenting an introduction to Power BI to the attendees. Maybe I'll see you there.

C’ya next time.

Using Power BI On-line

During the last several months, I’ve taken you through an initial exploration of Power BI Desktop, a program that you can download directly from Microsoft. However, there is another version of Power BI, one that resides in the cloud. Many of its features are similar to Power BI Desktop, but there are also significant differences. Let’s begin the new year by taking a look.

First you have to sign up to use Power BI. Click on the following link to go to the self-service signup page:

https://powerbi.microsoft.com/en-us/documentation/powerbi-service-self-service-signup-for-power-bi/

You should see a link on the page that says: self-service signup process which takes you to the link:

https://powerbi.microsoft.com/en-us/

The screen (shown below) tells you that you can sign up and get started with Power BI for free. That is almost true; there are some notable exceptions. For example, you cannot sign up with an email address provided by a consumer email service or telecommunications provider, which means you cannot sign up with a Microsoft Hotmail account. Also, .gov and .mil addresses are not allowed at this time. I guess that surprised me a bit because government and military intelligence is something that we are in great need of. However, I guess if you want to call it BI, then strictly speaking it is business intelligence only.

Another reason you may not be able to sign up is that your IT department has disabled self-service signup for Power BI, even though your organization may have the Office 365 SKU that supports Power BI. Microsoft provides instructions for your administrator on how to fix that problem, should they decide to let you use Power BI. Again, I can only assume you must work for an organization that has no interest in business intelligence if they tell you no.

Of course, Microsoft does provide a way for you to personally register for a new Office 365 trial subscription and use that ID to sign up to use Power BI.

You will get an email from PowerBI/Microsoft asking you to confirm your email address.

After you have successfully navigated all the steps to get a Power BI account, you should see an initial Power BI screen like the following:

If it is not already there, you can download the Financial Sample file from: http://go.microsoft.com/fwlink/?linkID=619356. I just don’t remember how I got it originally. To install the data after downloading it, I used the Get Data button at the bottom of the left navigation panel.

I can then click the Get button within the Files box to load the CSV file I just downloaded.

This will provide me with a flat file dataset to work with as a sample. The next figure shows what appears to be an entry in both the Datasets group and the Dashboards group, with one entry in each named Financial Sample.

By opening the Dataset Financial Sample, I can choose which fields I want to work with from the Fields list.

I can simply click the check box to the left of the field name or I can click on a field name and drag it over to the working area. For this demo, I will use Sales and Month Name.

While I do have the ability to customize some features such as the formatting of the title text, and even to highlight one of the columns as shown below, I quickly realize that, starting from a raw data file like this in Power BI, I cannot define a custom order for the Month dimension to force the months to appear in chronological order.

Rather than being in chronological order, it appears that the month columns are ordered from the largest to the smallest in sales value. While possible, this is typically not what I would want to display.

I’m going to leave that problem as it is for now and attempt to click on the Dashboard entry for Financial Sample. I am told that I have a report with unsaved changes. When did I work on a report since I never clicked inside the Reports group? Well, it appears that when I define visualizations in the dataset area, I am actually creating a report and must save it if I want to keep it.

When I click Save, Power BI prompts for a report name. For now, I am just going to call it: TestReport.

With the report saved, I now can see the Financial Sample default dashboard.

The first thing I notice at the top of the page is a text box asking me to enter a question about the data. This is something new that we have not seen in Power BI Desktop. The interesting thing is, I can enter a question (or even a statement, since the syntax of a question is not as important as the words used in the sentence) that references one or more of the data fields in my dataset. In this case, I can ask a question like: What are the total sales? (Interestingly, I can just type: Total sales and get the same result.) As I type enough of the question for Power BI to guess at what I want, it starts displaying results.

I can continue to drill down to get sales by product by just adding the words by product to my question/statement from the above example. Note that Power BI automatically decides to switch from a simple text answer for a single value to a bar chart with the bars representing my product dimension.

I can just as easily request to see total sales by month. Note in this image, the months are displayed in the correct chronological order even though the report I created earlier did not show them in that order by default.

I can even display sales by country as shown below.

By simply changing the visualization, however, I can create a map of sales by country which might be a better way to present geographic data to management.

So I’m going to stop here in this week’s tour of Power BI. I’ve discussed some of the issues you may need to deal with to get access to Power BI and have shown how to load and use a simple flat data file in the form of a CSV file. In future weeks, I will explore more of the features of Power BI and especially how you can use Power BI Desktop or even Excel to create the datasets you need to display in Power BI.

One last point that I will cover in more detail at a later time, but which you might want to be aware of for now. One of the advantages of using Power BI to display your data analysis is that you can save your dashboards and make them available to others within your organization. Yes, with Power BI Desktop you can save your model and copy the file from your machine to others. However, if you make a change, you must redistribute the changed file each time. Also, refreshing the data is a manual process at this time when using Power BI Desktop. We will see in future blogs how these issues are addressed by Power BI to make your life simpler.

C’ya next time, when I’ll take our Contoso data and show how easy it is to use it from a OneDrive file.

The Power of the Power BI Query Editor

The power of the Query Editor in Power BI goes beyond allowing you to load data from a variety of sources, transform that data to meet your analysis needs, create new columns of calculated results, and define special views of the data. One of the things that I am impressed with is that it remembers all of the steps you put into getting your data ready for use, and it can rerun all those steps the next time you reload raw data from your data source. So before we roll into a new year and begin looking at Power BI Desktop’s big brother, PowerBI.com, let’s take a look at the Power BI Desktop Query Editor.

In the past, when I first opened Power BI, I would immediately select the Get Data button from the Home ribbon. What if, instead, I immediately clicked on the Edit Queries button in the External Data group?

The Query Editor opens its own window with four tabs: Home, Transform, Add Column, and View. Looking at the Home ribbon, you should note that many of the most commonly used actions appear on this menu as well as on the other ribbons. At this time, there does not appear to be a way to customize the actions that appear on this and the other ribbons, but as with many things Power BI, that may come in time.

The first thing I need to do is to open a dataset to work with. When I click on Get Data, I get the same drop-down of data source options that I would see from the Home ribbon of the Power BI Desktop screen.

In this case, I am going to load my standard Contoso dataset, including the FactSales table with related product and date data. I will reference my localhost SQL Server; of course, you may have to reference a network SQL Server. I generally do not enter the database name on the first dialog, but select the database from the second screen. Note that on the second screen, I can just click the down arrow to the left of the database name to open the database and display a list of the tables within it. This is, in my mind, a lot easier than spelling the name of the database correctly before selecting the tables I want to work with.

Because I am using a SQL Server database, I have the option of either importing the data or using DirectQuery which does not download a copy of the data to my local data model, but rather reads the data directly from the SQL Server instance.

Because I selected 5 tables from the database, the Query Editor lists 5 queries along the left side of the screen, a preview of the data in the center, and the query properties and any steps applied to the data on the right side of the screen.

I can click on any of the five queries to switch to the one that I want to work on. In the above image, I selected the DimDate query. Now suppose that I want to make some changes to this data. Perhaps the first thing I want to do is to eliminate all the fiscal calendar fields from the table. One way to do this is to select the columns I want to remove: I can click on the first column and then, while pressing the CTRL key, click on each of the other columns I want to remove. Alternatively, if the columns I want to remove are sequential, I can click on the first column, then press the SHIFT key and click on the column header of the last column.

Still on the home ribbon of the Query Editor, I can click on the Remove Columns button to get rid of the columns I do not want to keep in my final model. I notice that as I do this, another step appears in the Applied Steps section in the right column.

Next I want to add a new column to the date table that I will use to sort the months chronologically rather than alphabetically. That column will simply consist of the month number (January = 1, February = 2, etc.). I can do this by clicking on the Add Column tab and selecting Add Custom Column from the General group. Note the option Add Index Column. This option allows you to add a sequential number in a new column for each row that begins with either a ‘0’ or a ‘1’. You can also add a custom index that begins with any number and has a custom increment which may be different from ‘1’. This could be an interesting way to add a surrogate key to a table when you have to merge data from two different sources (maybe domestic and international) into a single table for reporting purposes and you still need to create a ‘new’ index because the domestic index and the international index may have overlapping values.
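The surrogate-key idea can be sketched outside the Query Editor. Here is a small Python illustration (not Query Editor code; the table contents, starting value of 1000, and increment of 10 are all invented for the example):

```python
# Two sources with overlapping id values are merged, then a custom
# index is generated as a surrogate key, mimicking the Add Index
# Column option with a custom start and increment.
domestic = [{"id": 1, "region": "US"}, {"id": 2, "region": "US"}]
international = [{"id": 1, "region": "DE"}, {"id": 2, "region": "JP"}]

merged = domestic + international
for offset, row in enumerate(merged):
    row["surrogate_key"] = 1000 + 10 * offset  # 1000, 1010, 1020, ...
```

The original `id` columns still collide across the two sources, but the new column gives every row a unique key for reporting.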

When I choose to add a custom column, the Query Editor displays a dialog to let me define the name for the column as well as the formula. All of that is well and good, but the Query Editor provides an easier solution. Instead of clicking Add Custom Column, note that the Add Column ribbon has sections for working with text, numbers, and dates. In this case, all I need do is select the [DateKey] column and then click on the lower portion of the Date button to display the menu of options. If I select Month, I get a second set of options of different calculations related to month. The first option actually creates the new column and calculates the month number, while the other options return the first day of the month and the last day of the month as dates, or the number of days in the month as an integer. That is pretty cool too.
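As a rough sketch of what that Month calculation produces (in Python rather than the Query Editor’s own language, with made-up dates), the new column is just the month number of each date, which then serves as a chronological sort key:

```python
from datetime import date

# Hypothetical date rows; the Month column mirrors what the Query
# Editor's Date > Month option computes from DateKey.
dates = [date(2007, 7, 15), date(2007, 1, 3), date(2007, 4, 22)]
rows = [{"DateKey": d, "MonthName": d.strftime("%B"), "Month": d.month}
        for d in dates]

# Sorting by the numeric Month column puts the names in calendar
# order instead of alphabetical order.
rows.sort(key=lambda r: r["Month"])
```

After the sort, the month names come out as January, April, July rather than the alphabetical April, January, July.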

Now the default name for the column is just Month. However, I can easily change that by right clicking on the column header and selecting Rename from the dropdown menu and entering the column name I want.

As I said before, each of these steps has been recorded in the Applied Steps section to the right of the screen. If I make a mistake or want to try something different, all I need do is click the ‘X‘ to the left of the step name to remove that step and then try something else. Remember that I am only working with a preview of the data at this point. I have not downloaded the data yet.

I could proceed with other data changes for this query as well as the other four queries, but I hope you get the idea that you can perform any transformation, create new columns, etc. that you may want. I will come back in the future to discuss some of the other features of the Query Editor that could save you time.

Let’s assume that I am done making changes to my data. It is now time to load the data from SQL Server into my model. From the Home ribbon of the Query Editor, I click the button: Close & Apply.

This will close the Query Editor window and begin loading the data from SQL Server into my local in-memory data model using the transformations in the applied steps of each query. This is the secret to minimizing the data you bring into your model from a data source. Any excess columns will be removed. You can even get rid of specific rows (like we did in PowerPivot to eliminate some of the product categories when we loaded data many weeks ago). You can also create new columns, rename new or existing columns, and basically structure the data the way you want to use it. All of these transformations will happen as the data is loaded into your local data model. Then each time you refresh your data, these same query steps are performed consistently with each refresh.

Well, it is the beginning of a new year, at least a new calendar year. I think it is time that we move on to the big brother of Power BI Desktop, PowerBI.com, next time so that we can show how and when each tool should be used and how they can work together.

C’ya next time.

Bubbles – A Power BI Visualization

This week I am going to take another look at a custom visualization. This one is called Bubbles and is in some ways similar to last week’s look at the Word Cloud visualization. At least it looks similar in that the bubbles can appear in different colors, they are grouped together in a random pattern like the words were, and the size of each bubble, like the size of the text in the Word Cloud, is a function of another numeric parameter. This time, rather than the frequency of the word as in the Word Cloud, I can select any other numeric measure to define the relative size. In this case, I’ll use sales of the items within a product category and let each bubble represent one of the categories.

I’ll start by going to the Power BI Visualizations page found at: https://app.powerbi.com/visuals and click on the Bubbles tile. This displays the following dialog which tells me a little about the visualization including giving credit to its creator. Since I want to use this visualization, I next click on the Download Visual button.

While I will not show it here, there is an intermediate dialog that displays the licensing terms for most visualizations. If you accept the license terms and proceed, your browser should download the visualization to one of your local folders. For me that is the Downloads folder. After it finishes downloading, I will move it to a folder where I keep other Power BI visualization files.

Next I can open my desktop version of Power BI. In the background, I will load sales and product data from my Contoso sample dataset that I often use and then switch to the Report page. On this page, I can click on the ellipsis (three dots) at the bottom of the Visualizations group, which is the Import from File button.

I will be shown a dialog which lets me locate the visualization file that I downloaded. Before loading the file, Power BI warns me about the perils of importing custom visualizations because they could contain code that could either circumvent the security on my computer or even access private information in my files. The fact that these files are hosted on a Microsoft site is not a guarantee of safety. However, they are more likely to be safe than other visualizations I may find elsewhere on the Internet.

After I start the import, I receive one more dialog when the process is complete.

The Bubble Visualization tile now appears in the bottom row of the Visualizations section. I’ve been asked whether I can rearrange the visualization tiles, perhaps alphabetically, by type or by some other factor. At this time, I have not discovered a way to do that so if someone has, please let me know. Of course the next update of Power BI may include that functionality so who knows.

Next I drag the fields I want to use into a blank area of the Report page. In this case, I want both the Sales Amount field and the Product Category Name. Initially, these data elements may display as a table or column chart (depending on which field you dragged into the report first). In either case, I can then click on the Bubble visualization, and the character (text) field will be used to define the bubbles while the numeric field will be used to define the relative sizes of each bubble. The figure below shows the column chart version of the initial data added to the report page.

As I have shown before, to change the visualization, you only need to click on the tile in the Visualizations panel to change the currently selected visualization to another. Therefore, clicking on the new Bubble visualization results in the following:

If you hover over any of the bubbles with your mouse, Power BI shows the product category name value and the sales value for that category. Interestingly, hovering anywhere outside the bubbles but within the overall background (light grey here) displays the name and value of the largest bubble, not the total sales, which is what I expected.

Like many other visualizations, you can click on any of the bubbles and that automatically filters any other visualizations on the page by the selected product category. I could also add a second table on the report page such as Calendar Year from the dimDate table and then define it as a Slicer. Then selecting different years in the slicer changes the Bubble visualization to represent data only for the year(s) selected.

The Bubble visualization has a few formatting options which I can get to by clicking on the pen (pencil) in the Visualization panel. I probably would like to see more options to control the colors of the bubbles which sometimes appear all the same color and sometimes as a few different colors. Perhaps if you too would like to see more control over the formatting of this or any other visualization, you should remember that the opening dialog from the Power BI Visualization download page provides a link to the author where you can send your thoughts and suggestions.

That’s it for this week. Next weekend is Christmas and then the week after is New Year’s. I will try to find time between celebrations to cover two other visualizations before getting back to some hard-core data analysis next year.

C’ya next time.

Using a Word Cloud to Analyze Responses

Have you ever had to analyze the results of a survey? If your survey consists of multiple choice, true/false, or even Likert-scale questions, the results are fairly easy to tabulate. All you need to do is count the number of each unique response for each question and then display the data using a table, a column/bar chart, or even a pie chart (please, only if there are six or fewer choices).

But what if you have a question like: ‘Describe your boss with a single word’ and you allow respondents to enter any word, truncating the response at the first blank? I suppose you could count the number of times each word is used and then display the results in a table or a column/bar chart. However, because there are potentially dozens of words that could be used, a column/bar chart could get crowded pretty fast.

Perhaps even more difficult to analyze is asking the responder to describe what they find important in a web site, especially when you give them a free-form, multi-line text field. Now you have to deal with entire sentences like: ‘The most important thing to me is a strong search engine to make it easier to find the information I want.’ How do you parse that type of response into meaningful data? That is what I am going to look at this week.

To begin, I need to capture the survey data in a format that I can import into Power BI. Generally I will have two choices for this, a simple text file of the column(s) from the survey that I want to analyze or perhaps the survey software (like SharePoint) can export the data to an Excel spreadsheet. In either case, I can then import the data into Power BI.

In terms of how I might want to visualize the data in Power BI, column and bar charts are not an option, as I mentioned before, because there are simply too many words and thus columns to display anything meaningful. However, a common visualization for just this purpose is something called a Word Cloud. Word clouds have been around for some time, especially on blog sites, to help identify the most common words used in the blog. Fortunately, Power BI has such a visualization in its Visuals Gallery site, which can be found at: https://app.powerbi.com/visuals. This site contains visualizations that have been created and submitted by the Power BI community. There are instructions on the site on how to submit custom visualizations that you create using the Power BI Developer tool. The Microsoft team will then review the visualization and, if it appears stable and relevant, publish it to the site. Anyway, there is a visualization on this site called: WordCloud.

To use a visualization from this site, click on it. This displays a dialog, as shown in the following figure, which includes some information about the visualization: who created it, the license information, support, etc. However, note the Download Visualization button.

Clicking the download button may require agreeing to the terms of usage of the visualization and will then download a file with the extension .pbiviz. I recommend creating a separate folder on your machine where you can keep this and other visualizations from this site. Next, I find the Visualizations panel on my Power BI Report page and locate the button with the three dots (ellipsis).

Clicking on this button first displays a warning about importing custom visualizations as shown below.

Next I will use a standard Open dialog box to locate the file that I just downloaded.

Now with my WordCloud visualization available and my data loaded into Power BI, I can simply drag the column that contains my text onto the blank Report page (or a blank area on the page if I had other visualizations already started), and then, with the data selected, choose the WordCloud visualization.

Initially, the display is cluttered with common words like: the, and, an, a, is, had, have, other, my, I, and many other words that we use in everyday conversation. This noise detracts from the real information we want to gather from the word cloud. So we need to find a way to eliminate (or minimize) the occurrence of these words.
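The goal can be sketched outside Power BI: count how often each word appears while skipping the noise words. A minimal Python illustration (the stop-word list and sample responses are invented, and this is not what the visualization does internally):

```python
import re
from collections import Counter

# An illustrative (far from complete) list of noise words to skip.
STOP_WORDS = {"the", "and", "an", "a", "is", "to", "i", "my", "it", "for"}

def word_frequencies(responses):
    """Tally word counts across free-form responses, minus stop words."""
    counts = Counter()
    for text in responses:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts
```

Feeding in two sample responses such as “The search engine is important” and “a strong search engine” would count “search” and “engine” twice each while dropping “the”, “a”, and “is”.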

How can we do this? The brute force method (and for now the only method I know) is to return to the Data page, select the Edit Queries button on the Home tab and then use the replace values option in the Any Column group of the Transform ribbon to define values to replace.

Now is where I have to be a bit careful. If I simply try to replace “an” with the empty string: “”, I could alter other words like “change” to “chge” or “analysis” to “alysis”. Neither of these words would make sense. Therefore, I decided to always add a blank as a prefix and a suffix to whatever word I wanted to find and eliminate and replace it with a single blank. (Because if I don’t, a phrase like: “search for an employee phone number” would become “search foremployee phone number”.)

The problem is that if the word I am trying to eliminate begins a line or is followed by a period or comma, my replace will not work (unless I also do replaces with periods and commas).
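Outside the Query Editor, a regular expression with word boundaries sidesteps both pitfalls: it never clips the inside of a longer word like “analysis”, and it still matches at the start of a line or before punctuation. A small Python sketch (the helper name is mine):

```python
import re

def remove_word(text, word):
    # \b anchors the match to whole words only, so "an" inside
    # "analysis" or "change" is left alone; the trailing \s* absorbs
    # the following blank so no double spaces are left behind.
    return re.sub(r"\b" + re.escape(word) + r"\b\s*", "", text)
```

For example, removing “an” from “search for an employee phone number” leaves “employee” intact, and it works even when the target word starts the line.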

Anyway, my point is that this can be a time consuming activity of performing a few dozen replace statements and then switching to the Report page to see what the word cloud now includes. Let’s assume that I’ve already done several rounds of replace statements and my word cloud looks like the following:

Another thing I can do to focus on the most important words is to change some of the properties of the visualization. Click on the Pencil in the Visualizations column to display the properties and open the General property group as shown below:

Notice that you can change the maximum number of words displayed, the minimum and maximum font size and whether word breaking is on or off.

First the number of words displayed greatly affects the appearance of the cloud. The following two images show both 200 words and 50 words. (The previous image was 100 words.)

Obviously changing the minimum and maximum font size will help visually give me an idea of which words occurred the most often. The fourth option is the control that breaks the column into individual words. Otherwise, the entire phrase in the column is used which may be good for short predefined responses, but not so good for my case here.

You should have also noted that each time I change one of the properties, the word cloud regenerates. Another somewhat less obvious thing that occurs is that special characters are converted to blanks when the words are parsed. Therefore, a word like: “site’s URL” gets parsed into three words: “site”, “s” and “URL” which explains why the single letters “s” and “t” often appear even in a 50 word cloud. Of course, it might be possible to go back and replace “‘s ” with ” ” or “‘t ” with ” “. Again, I need to be very creative and very careful to come up with a final word cloud.

Anyway, I hope you found that interesting. In future weeks I will look at additional Power BI features and visualization. C’ya then.

Power BI – Measures and Slicers

Sorry about missing last week but it was a holiday here in the states where parents encourage their kids to go up to strangers and ask for candy. Also people of all ages, not just kids, get dressed up in costumes that range from the cute Elsa princesses to the DBA zombies and the slutty managers. Yeah, some of them were really scary especially since it wasn’t always a costume.

Anyway, last time I talked about creating custom columns and added the column TotalProfit to the FactSales table of my model. (If you missed that discussion, go back to my blog from two weeks ago.) Column expressions are easy to create because they largely resemble row context expressions that you might add to an Excel spreadsheet. In fact, most of the syntax and functions are exactly the same as in Excel.

But we can also add custom measures. The difference between a column and a measure is that most columns can be used as dimensions in a pivot or matrix table especially the columns that contain alphanumeric data. Typically the columns that have numeric data appear as aggregated values, summed, averaged, minimized or maximized depending on the goal and the dimensions used. In fact, the TotalProfit column, while calculated for each row in the FactSales table, is typically displayed as a summed value such as in the following image taken from where we left off last time.

In this matrix table, the TotalProfit is first calculated for each row in the FactSales table that will be included in the final matrix table, but the individual calculated values are then summed by one of the four channels that exist for sales. Thus with my sample data, the calculation for TotalProfit occurs a little over 3 million times, but these calculations are made only when the column is created, not for each visualization. Then those values are summed by channel for the above visualization. As this happens on my Surface computer in less than a second, that is still pretty impressive. However, there is another way we can calculate total profit for this table.

We can create a measure to calculate total profit. A measure is calculated each time it is used in a table. In a table like the one above, a total profit measure would only be calculated four times, once for each channel, but each calculation would include three sums over potentially hundreds of thousands of rows, and these calculations occur when the visualization is created and again in each visualization where the data is needed. To create a measure, find and click on the New Measure button in the Calculations group of the Modeling ribbon.

In the expression box, I can see the start of the expression. Note that, unlike PowerPivot, the measure name is followed by just the equal sign rather than a colon and equal sign. The rest of the expression is similar to what I used before, except that I have to aggregate (in this case, sum) each of the columns in the expression (you cannot sum an expression). To distinguish this measure from the prior column, I include an underscore between ‘Total’ and ‘Profit’.

After defining the measure, it appears in the field list on the right. While it is still selected, I go to the formatting section of the ribbon and adjust the data format to $ English (United States) which gives me comma separators and two decimal places by default. Note that I can and should do this for any other table field that I will use in a visualization.

Now I can add this measure to my matrix table from above, replacing the TotalProfit calculated column with the Total_Profit measure. To calculate the total profit for channel sales, Power BI has to sum the SalesAmount column for all channel sales and subtract from it the sum of channel total costs and the sum of channel discount amounts.
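That arithmetic, aggregating each column first and then combining the sums, gives the same answer as summing a row-by-row profit column. A small Python illustration with made-up numbers (standing in for the DAX, not reproducing it):

```python
# Three hypothetical fact rows for one channel.
rows = [
    {"sales": 100.0, "cost": 60.0, "discount": 5.0},
    {"sales": 250.0, "cost": 140.0, "discount": 10.0},
    {"sales": 80.0, "cost": 50.0, "discount": 0.0},
]

# Calculated-column style: compute profit per row, then sum the rows.
column_total = sum(r["sales"] - r["cost"] - r["discount"] for r in rows)

# Measure style: sum each column over the same rows, then combine.
measure_total = (sum(r["sales"] for r in rows)
                 - sum(r["cost"] for r in rows)
                 - sum(r["discount"] for r in rows))
```

Both approaches yield the same total because addition and subtraction distribute over the rows; the difference is only when and how often the work happens.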

Your first thought might be to ask how this is better than calculating the total profit for each row in the FactSales table and then simply summing the included rows into a single value. Well, like a lot of things in the real world, it depends. Typically, a calculated column increases data load times each time you load the model because only the expression is saved, not the individual values (or at least so I’ve been led to believe), but for many visualizations that use most if not all of the data, the cost of calculating the values for the visualization is no worse and possibly even less than using a calculated measure. On the other hand, for a calculation that only appears in a few visualizations, or when the visualization has been filtered to include a smaller subset of the total data, a calculated measure can improve both data load times and dashboard display times.

But perhaps more importantly, not all expressions can be written interchangeably as either a calculated column or a measure as I have done here. For example, if I wanted to calculate the percent that each channel’s sales represent of the total sales, I would have to use a measure because there is no way to aggregate percentages over multiple rows. I want you to think about that, and I may return with a detailed example in a future week. In the meantime, I want to explore one additional feature of this model.
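The percentage point is easy to see with a toy example in Python (invented numbers, not the Contoso data): the correct percent-of-total for a channel is a ratio of sums, and aggregating per-row percentages, say by averaging them, gives a different and wrong answer. That is why this calculation has to live in a measure rather than a column.

```python
rows = [
    {"channel": "Store", "sales": 400.0},
    {"channel": "Store", "sales": 100.0},
    {"channel": "Online", "sales": 500.0},
]
grand_total = sum(r["sales"] for r in rows)  # 1000.0

# Measure style: ratio of sums for the channel.
store_total = sum(r["sales"] for r in rows if r["channel"] == "Store")
store_pct = store_total / grand_total * 100  # 50.0

# Column style: per-row percentages cannot be averaged into that.
row_pcts = [r["sales"] / grand_total * 100
            for r in rows if r["channel"] == "Store"]
avg_of_row_pcts = sum(row_pcts) / len(row_pcts)  # 25.0, not 50.0
```

The Store channel really accounts for 50% of sales, but the average of its row-level percentages is 25%, so the aggregation has to happen inside the calculation, which is exactly what a measure does.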

Those of you who have followed me through my travels in using PowerPivot remember the concept of using slicers to allow the user to analyze different segments of the data by clicking on dimension values. For example, using the above table, I might want to see the profit numbers for different product categories. In PowerPivot, I could create a slicer and if I had more than one visualization on a page, I could associate each chart or table with that slicer.

In Power BI, I can also define slicers for the reports on the page. To do so, I click in a blank area and select the field from the dimension table that I want to use as the slicer/filter. In this example, I will use the ProductCategoryName field from the dimProductCategory table. This dimension only has eight values. The table is relatively short and appears initially as shown below.

To convert this table list into a slicer, I need to select the Slicer visualization as shown in the following figure while the above table has focus. This tells PowerBI to use the table as a slicer rather than just displaying the values of the field. (Yes, I could add a filter to the visualization directly, but a slicer can automatically apply to multiple visualizations on the page.)

After being defined as a slicer, each value of ProductCategoryName has a selection box to its immediate left and two additional entries have been added to the top of the list to select all values or to select only those records that have a blank for the product category.

If I click on any one of the categories, the data in the other visualizations on the same page automatically filters out all records from other categories as shown below.

I can also select multiple categories by clicking on several of the checkboxes to include in the matrix sales from all of the selected categories. Note that in a case like this, the measure I defined actually will perform fewer calculations because the sum function only acts on the filtered records of the slicer, not on all records in the channel.

I can return to displaying all categories either by clicking the Select All option, by clicking on each of the individual category names, or by using the Erase icon to the right of the table name.

C’ya next time.