Appending Data Sources Manually in PowerBI

Last week I looked at how I can load an entire library of source data files into PowerBI and let PowerBI determine how to append the resulting files together. This method works great if all the data source files are of the same type and have the same schema/structure. But what if some of the files come from an Access data source, some from CSV files, some from Excel, etc.? Furthermore, what if the column order or the column headers are different? These are cases in which you may need to import each data source individually and then edit the query for each source to align the columns, column names, data types, etc. before combining the data sets together. Let’s see how that looks.

For simplicity's sake, I am going to start from the same folder as last week, which holds 10 CSV files, each representing a different sales region.

Although I am not going to show images from each data load, I would proceed to load each CSV file into my PowerBI data model one file at a time rather than loading the folder like last time. This method creates a separate table within PowerBI for each data set as shown in the following figure.

I can view the data in each table by selecting the table name in the Fields panel as shown above. The first thing I would want to do is to ‘fix’ any inconsistencies in the data schema in these tables by selecting the table and then clicking on the Edit Queries button in the External Data group of the Home ribbon. I basically want to build a common table structure across all of the data sources, including a consistent column name for each column that appears in multiple tables.
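PowerBI handles the renaming through the query editor, but the idea is easy to see outside the tool. Here is a minimal Python sketch of it, not what PowerBI runs internally. The alias names and the sample rows are made up for illustration.

```python
# Hypothetical mapping from each source's header names to one common name.
COLUMN_ALIASES = {
    "Sales Amt": "SalesAmount",
    "Amount": "SalesAmount",
    "Region": "SalesTerritoryRegion",
}

def normalize_row(row):
    """Return a copy of row with aliased column names replaced by the common ones."""
    return {COLUMN_ALIASES.get(name, name): value for name, value in row.items()}

# Two made-up rows whose headers differ between sources.
csv_row = {"Region": "Northwest", "Sales Amt": "100.0"}
excel_row = {"SalesTerritoryRegion": "Canada", "Amount": "250.0"}

normalized = [normalize_row(csv_row), normalize_row(excel_row)]
```

After this step both rows use the same two column names, which is the condition I want before appending anything.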

You should note that if I have a data source with a unique column that does not appear in the other tables, I can keep it. Later when I append the individual tables into a single table, the unique column will be brought into the final data model, but the field will be blank for all the other tables that do not have that column.
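To picture how a unique column behaves in an append, here is a rough Python sketch under my own assumptions. The `Promo` column and the sample values are invented, and `None` stands in for the blank cells.

```python
def append_tables(first, second):
    """Append two tables (lists of dict rows), keeping the union of their columns."""
    columns = []
    for row in first + second:
        for name in row:
            if name not in columns:
                columns.append(name)
    # Rows missing a column get None, much like the blank cells described above.
    return [{name: row.get(name) for name in columns} for row in first + second]

# Only the first (made-up) table has the extra Promo column.
australia = [{"Region": "Australia", "SalesAmount": 500.0, "Promo": "Summer"}]
canada = [{"Region": "Canada", "SalesAmount": 250.0}]

combined = append_tables(australia, canada)
```

The Canada row ends up with a `Promo` entry of `None`, while the Australia row keeps its value.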

In my first example here, all of my data schemas from each data source are the same but they would not have to be. So after simply loading the data from each data source, I am ready to start combining the data into a single table for analysis. To do this, I click the Edit Queries button in the External Data group of the Home ribbon and then select one table to start the process.

While in Edit Query mode, there is a button called Append Queries in the Combine group on the far right of the Home tab. I can use this button to begin the process of combining the tables.

The Append Queries button opens the dialog shown below that lets me select which table to append. I can only append one table at a time so to append together all 10 tables, I need to do this step 9 times.
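The arithmetic here (10 tables, 9 append steps) is just a running accumulation. A small Python sketch with ten made-up single-row tables shows the pattern; it is an illustration only, not how PowerBI stores its steps.

```python
# Ten made-up single-row tables standing in for the ten regional CSV files.
tables = [[{"Region": f"Region {i}", "SalesAmount": 100.0 * i}] for i in range(1, 11)]

combined = list(tables[0])   # start from the first table
append_steps = 0
for table in tables[1:]:     # each pass plays the role of one Append Queries step
    combined = combined + table
    append_steps += 1
```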

As shown in the figure below, each appended query gets the generic name of: Appended Query followed by a number (after the first one which has no number).

This default name is not descriptive enough to help me identify which appended query refers to which data set. If I want to remove one of the data sets from the final table, I would have to click the settings button on the far right of the applied step (the gear icon) to reopen the Append dialog to see which table is being appended. Then if I want to remove that dataset, I could click the ‘X’ to the left of the applied step to delete that one step.

However, a better option is to right click on the default applied step name and select Rename from the popup menu that appears.

This option allows me to select the current applied step name and replace it with a more meaningful name.

The following figure shows a much clearer picture of which data set is being referenced in each step. It also shows that I have finished appending my 9 additional data sets.

Note that any steps applied to the individual data sets are still applied first prior to the data being appended to the final dataset.

Note: One thing that I did not do here but probably should have is to begin by making a duplicate copy of the table that I wanted to begin with so that I could preserve the original table with only its own custom transforms. Then using the duplicated table, I could append the rest of the data sets.

When all the data is appended into a single dataset, I can close and apply this transformation so that any data refreshes can repeat these steps.

I can now go to my Report tab and start to build the visualizations that I want. In the final figure for this session shown below, I create a table of Sales Amount by Sales Territory and include the Sales Territory Group, Region, and Country. I then click on the Sales Amount column to sort the table by this column in ascending order. You can see very quickly that most of the sales occur in Australia and the lowest sales are in the Central Region of the United States.

Beneath the table, I create another table with the Sales Territory Group and the Sales Amount. PowerBI automatically sums the sales for each Territory Group, leaving only three groups. After creating the table, I change the visualization to the Donut chart, which displays one segment per group, to create the appearance shown below.
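The automatic summing is a plain group-and-total operation. Here is a hedged Python sketch of the same idea. The rows and amounts are invented, and the group names are my assumption about how the data might be labeled.

```python
from collections import defaultdict

# Made-up rows; the real data has ten regions.
rows = [
    {"SalesTerritoryGroup": "North America", "SalesAmount": 100.0},
    {"SalesTerritoryGroup": "North America", "SalesAmount": 250.0},
    {"SalesTerritoryGroup": "Europe", "SalesAmount": 400.0},
    {"SalesTerritoryGroup": "Pacific", "SalesAmount": 500.0},
]

# Add each row's amount to the running total for its territory group.
totals = defaultdict(float)
for row in rows:
    totals[row["SalesTerritoryGroup"]] += row["SalesAmount"]
```

Each key in `totals` would correspond to one segment of the donut.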

As with all PowerBI visualizations, the data in the table is linked to the data in the donut chart. If I click on the North America segment of the donut, the table on top refreshes to display only the 6 rows representing sales in North America. Similarly, if I click on the Pacific segment of the donut chart, the table above immediately updates and displays only the one line for Australia.
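Conceptually, clicking a segment just applies a filter to the other visual. A tiny Python sketch with invented rows shows the effect; the real linkage inside PowerBI is of course more involved than this.

```python
# Made-up rows for the detail table.
rows = [
    {"SalesTerritoryGroup": "North America", "SalesTerritoryCountry": "Canada"},
    {"SalesTerritoryGroup": "North America", "SalesTerritoryCountry": "United States"},
    {"SalesTerritoryGroup": "Pacific", "SalesTerritoryCountry": "Australia"},
]

selected_group = "Pacific"   # the donut segment that was clicked

# Keep only the rows that belong to the selected segment.
visible_rows = [row for row in rows if row["SalesTerritoryGroup"] == selected_group]
```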

That’s it for this week. Come back next week for more PowerBI fun.

C’ya.

Loading and Combining Multiple CSV Files in Power BI

Suppose my job is to collect sales information for my company and I currently receive text files of that data from each of our major sales offices around the world. Before I can do any analysis using Power BI, I will need to load the data from each sales office and then combine the data into a single table. For today, let me assume that the format of the data from each sales office is exactly the same and the order of the fields is also exactly the same. As each sales office sends a copy of their data to me, I store their CSV file in a common folder called CSV_Sales as shown below, eventually getting data from all ten locations.

I am now ready to open Power BI and load my data to begin my analysis. I start by selecting the Get Data option after opening Power BI. This displays the following dialog which lets me specify the type of data I want to load. In the past, I showed several different ways to access individual files from different sources including CSV files. Indeed, I could again load each of the CSV files separately and then ‘somehow’ combine them into a single table for analysis purposes.

However, I notice an option that I had not selected before, Folder.

When I click the Folder option, Power BI prompts me for the path of the folder. If I am not sure of the path, I can click the Browse button and navigate to the folder and Power BI will figure out the path for me. Either way, I click on the OK button to continue.

Power BI then shows me the contents of the folder. For each file in the folder it provides metadata about the file such as its name, extension, date last accessed, date last modified and more. There is also a column at the far left of the grid named Content which in all cases has the word ‘Binary’ in it. This mysterious field in each row actually represents the data in the file. In fact, it is in most cases the only field that I care about.
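If it helps to demystify the Content column, here is a rough Python sketch of what such a folder listing amounts to: one record per file, with some metadata and the raw bytes of the file. The function name, the column subset, and the two sample files are my own inventions, written to a temporary folder so the sketch runs anywhere.

```python
import tempfile
from pathlib import Path

def list_folder(folder):
    """Return one record per file: some metadata plus the raw bytes ("Content")."""
    records = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            records.append({
                "Content": path.read_bytes(),          # the 'Binary' value
                "Name": path.name,
                "Extension": path.suffix,
                "Date modified": path.stat().st_mtime,
            })
    return records

# Build a throwaway folder with two made-up CSV files and list it.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "Australia.csv").write_text("Region,SalesAmount\nAustralia,500\n")
    Path(folder, "Canada.csv").write_text("Region,SalesAmount\nCanada,250\n")
    records = list_folder(folder)
```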

If I click the Load button, highlighted in the previous figure, Power BI loads the folder information into a table as shown in this figure. This is not what I want.

So instead, I click the Edit button on the previous screen, which loads the data directly into the Query editor. (Yes, I could just click the Edit Queries button in the Home ribbon to get to the same place, but why go through two steps when one will do?)

However, as I said previously, I don’t need all these other columns that provide information about the data files. I only care about the data inside the files. Therefore, I select the first column, the Content column and from the menu select the submenu under Remove Columns and click the option to Remove Other Columns. This is a faster way of getting rid of columns I do not want rather than selecting each column and then clicking the Remove Columns option.
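Remove Other Columns is a "keep only these" operation rather than a "drop those" one. A short Python sketch makes the difference visible; the helper name and the sample record are hypothetical.

```python
def remove_other_columns(rows, keep):
    """Keep only the named columns in every row and drop everything else."""
    return [{name: row[name] for name in keep} for row in rows]

# A made-up folder record with a few metadata columns.
records = [
    {"Content": b"Region,SalesAmount\n", "Name": "Australia.csv", "Extension": ".csv"},
]

content_only = remove_other_columns(records, ["Content"])
```

I name the one column I want instead of listing every column I do not want.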

Once I have only the Content column, I can focus on the button on the right side of the column header. Notice that it is a little different than the buttons on the right side of the other columns. Instead of just a single arrowhead pointing down, this button has two arrows pointing down to a line. Clicking this button tells Power BI to extract the actual data from within the binary files into a single table. Therefore I want to click on it.
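Here is a minimal Python sketch of what expanding the binaries amounts to, assuming every file is UTF-8 CSV with the same header row. The byte strings are made-up stand-ins for the Content values, and this is only my approximation of the idea, not Power BI's actual implementation.

```python
import csv
import io

# Made-up stand-ins for the Binary values in the Content column.
contents = [
    b"Region,SalesAmount\nAustralia,500\n",
    b"Region,SalesAmount\nCanada,250\nNorthwest,100\n",
]

def expand_contents(binaries):
    """Parse each binary as CSV text and stack all the data rows into one table."""
    rows = []
    for raw in binaries:
        reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
        rows.extend(reader)   # the header row is consumed, not repeated as data
    return rows

sales = expand_contents(contents)
```

Note that the values come back as text, so a type conversion step would still be needed before summing anything.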

As you can see in the following figure, I now have all of the columns from the sales table. Because Power BI only presents a preview of the data in this mode, it is not clear whether this table contains data from just the first file or from all of the files in the selected folder.

Because I am still in Edit Query mode, I need to close and apply my transformations to the folder by clicking the Close & Apply button in the Home ribbon.

Now when I return to the Data page of the Power BI Desktop, I can see my data and in the Fields dialog on the right side, I can open the CSV_Sales table definition to see all the fields in the table. But I still don’t truly know if I have the data from all the CSV files or not.

Next I open the Report page and create a simple table that displays several of the columns from the CSV_Sales table. I select the fields: SalesTerritoryGroup, SalesTerritoryCountry, SalesTerritoryRegion, and SalesAmount. I can quickly see now that the table does indeed include all my regional data from all 10 sales regions.

Just for fun, I can build a second table that includes only the fields SalesTerritoryRegion and SalesAmount and then convert the table into a TreeMap chart by clicking the TreeMap visualization. This visualization shows the contribution by sales territory region to the total sales by creating proportional rectangles within a larger rectangle that represents total sales. If I hover over any of the boxes, a popup displays the name of the sales region along with the sales amount for that region.

But wait a minute, I previously said that there were 10 regions, but I only see 7 colored rectangles representing only seven of the regions. What happened to the other three? Well, they are actually there, but they are so small compared to the total sales that they are nothing more than slivers at this scale. In fact, even as I expanded the size of the chart, it was difficult to see these last three sales regions, which even added together represent less than 0.1% of the total sales. However, they do exist along the right edge of the chart, as you can see when I zoom into the bottom right corner.
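The slivers make sense once the shares are worked out. Here is a quick Python sketch using invented numbers (not my actual sales figures) that finds the regions whose share of the total falls under 0.1%.

```python
# Invented totals per region, chosen only to illustrate the arithmetic.
sales_by_region = {
    "Region A": 600000.0,
    "Region B": 399500.0,
    "Region C": 300.0,
    "Region D": 200.0,
}

total = sum(sales_by_region.values())
shares = {region: amount / total for region, amount in sales_by_region.items()}

# Regions whose rectangle would cover less than 0.1% of the TreeMap area.
tiny_regions = [region for region, share in shares.items() if share < 0.001]
tiny_share = sum(shares[region] for region in tiny_regions)
```

A rectangle that covers a fraction of a tenth of a percent of the chart is only a few pixels wide at best, which matches what I saw.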

Well, I hope you found that interesting. I’ll look at some more Power BI goodies next time.

C’ya.