Today I am going to take a look at some interesting issues I encountered when pulling data from a survey into Power BI to perform some analysis. First, some background. The survey existed on our SharePoint site and all submissions were stored within SharePoint. However, because the survey is stored as a list, it is always possible to download the details from the survey to an Excel file. When I open the resulting Excel file, I see the following data.
Most of the columns are straightforward text columns. But, looking closely at the rightmost column, I see that the data structure is a bit more complex than usual. This is the result of having a survey question in which the respondent is allowed to select one or more items from the list of possible careers that they have an interest in. While some may respond with only a single career that they are focusing on, others may not be so sure yet and so have selected multiple careers in which they may have an interest.
You can see that the format of the data consists of responses along with the corresponding ID values of the selected responses separated by semi-colons.
Obviously, I cannot perform analysis on a column that holds multiple values in each cell. I need to get this data into a single vertical column with one row for each career choice selected by the respondent. The first step to achieve this result is to split the individual career values into separate columns. I can do this in Excel by selecting the column with the multiple values and then selecting the Text to Columns button on the Data ribbon. This function allows me to split the text in each cell of the column into multiple columns each time a specific delimiter is found. In this case, I can split the original text at each semi-colon to create a new column.
After splitting the data into columns, there are some columns that I no longer need and can delete. These are the columns that contain the ID values of the selected career values. I have no need for the ID values. I can also apply a series of Replace operations to clean up the rest of the career values by removing extra characters that are not part of the career name itself. After a few transformations, I’m left with the following set of spreadsheet columns, which identify the respondent by their ID and then list the careers in which they have an interest.
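For readers who prefer to see the split-and-clean steps spelled out, here is a minimal Python sketch of the same idea. It assumes a cell looks something like `Doctor;#3;#Teacher;#7`; that exact layout, the `#` characters, and the sample career names are my assumptions for illustration, not a copy of the real export. The function `split_careers` is a hypothetical helper, not part of Excel or Power BI.

```python
def split_careers(cell):
    """Split one multi-value survey cell into a list of career names.

    Assumes a hypothetical layout such as 'Doctor;#3;#Teacher;#7',
    where numeric pieces are the ID values to be discarded.
    """
    careers = []
    for piece in cell.split(";"):
        # Remove stray '#' characters and surrounding whitespace.
        cleaned = piece.replace("#", "").strip()
        # Skip empty pieces and pieces that are only an ID number.
        if cleaned and not cleaned.isdigit():
            careers.append(cleaned)
    return careers


example = split_careers("Doctor;#3;#Teacher;#7")
print(example)
```

Each name in the returned list corresponds to one of the career columns left over after Text to Columns, deleting the ID columns, and the Replace clean-up.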
This is still not exactly what I want because I really need to normalize this structure to get multiple rows for each responder with one career choice per row. However, as I will show in a moment, I can perform that transformation much more easily in Power BI than in Excel. So let me proceed to open Power BI and select Get Data.
When I open Power BI and choose Get Data, I select the Excel data type and click the Connect button. In the next screen (not shown), I either enter the path to the Excel file or browse to the file using the Browse button. Once I select the Excel file I want to work with, Power BI opens the file and displays all the tables found in that file. In the figure below, you can see that there are several tables. Some of these tables consist of the original raw data and the transformation steps I used to create the datasets that I want to work with. Each table corresponds to a different tab in the Excel workbook, otherwise known as an Excel worksheet or simply a sheet.
After selecting the tables I want to work with, I click the Load button to import the data into my Power BI model. Some transformations might be needed to ‘convert’ some of the columns to user-friendly names. I could also have done this in Excel by changing the column header text. However, the transformation I want to focus on is in the career choice table, which I call PreCareers. A portion of this table appears below. This table consists of the ID column used to identify the responder along with 18 columns representing their potential career choices. While most respondents entered three or fewer career choices, some entered significantly more. By default, the data is sorted with all respondents who had only a single career choice displayed first. Those are followed by the respondents who selected two career choices and so on.
In order to create a table that has the ID of the responder and a separate row for each career choice they selected, I need to unpivot the 18 career choice columns. To do that I need to select the Edit Queries button in the Power BI Home ribbon and then select the PreCareers table. Next I have to select all 18 career choice columns. Then opening the Transform ribbon, I look for and select the Unpivot Columns command as shown in the following figure.
After selecting this option, Power BI performs the transformation to create the new table shown below. It now has a column named Attribute that holds the previous column headers and a column named Value that contains the career choices. Of course I can rename these columns and probably will, but let me save my work as a safety precaution before continuing.
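To make the unpivot operation concrete, here is a small Python sketch of what it does to the data. This is only an illustration of the idea, not how Power BI implements it. The column names `Career1` through `Career3` and the sample rows are made up, and skipping blank cells is a choice I made in this sketch so that respondents with fewer choices do not produce empty rows.

```python
def unpivot(rows, id_field, value_fields):
    """Turn one wide row per respondent into one long row per career choice."""
    long_rows = []
    for row in rows:
        for field in value_fields:
            value = row.get(field)
            # Skip blank cells so that no row is produced for them.
            if value is None or value == "":
                continue
            long_rows.append(
                {id_field: row[id_field], "Attribute": field, "Value": value}
            )
    return long_rows


# Hypothetical sample data: the column names are invented for illustration.
wide = [
    {"ID": 1, "Career1": "Doctor", "Career2": None, "Career3": None},
    {"ID": 2, "Career1": "Teacher", "Career2": "Engineer", "Career3": None},
]

long_rows = unpivot(wide, "ID", ["Career1", "Career2", "Career3"])
for r in long_rows:
    print(r)
```

Notice that respondent 2 now appears on two rows, one per career choice. That detail matters as soon as the table takes part in a relationship keyed on ID.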
To save my changes, I select the Close and Apply button from the Home ribbon. Remember that I can save my transformations multiple times and return to Edit Queries to insert additional transformations at a later time.
However, when I attempt to close and apply my changes, I get the following error message.
Reading this error message, I see that there is a problem with the ID column in my transformed PreCareers table. If I open the tables in diagram view as shown below, the problem is evident.
You can see that the PreCareers table is linked to the K8_Survey table using a 1-to-1 relationship. If I were to look at the relationship, I would see that it connects the ID column from one table to the ID column of the other table. The problem is that now that I have unpivoted the selected careers in the PreCareers table, there are multiple records with the same ID value, not just one record, so the relationship can no longer be 1-to-1.
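The condition behind this error is easy to check by hand: a 1-to-1 relationship needs every key value to appear at most once on each side. The following Python sketch shows that check on some hypothetical unpivoted rows. The helper name `duplicated_keys` and the sample data are my own, used only to illustrate the idea.

```python
from collections import Counter


def duplicated_keys(rows, key):
    """Return the key values that appear in more than one row, sorted."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Hypothetical unpivoted rows: respondent 2 now appears twice.
pre_careers = [
    {"ID": 1, "Value": "Doctor"},
    {"ID": 2, "Value": "Teacher"},
    {"ID": 2, "Value": "Engineer"},
]

dupes = duplicated_keys(pre_careers, "ID")
# Any non-empty result rules out a 1-to-1 relationship on this column.
print(dupes)
```

Once any ID repeats, the table can only sit on the "many" side of a relationship with the survey table.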
To solve this problem, I must remove the current relationship by right-clicking on the relationship and selecting Delete as shown in the following figure.
You may have also noticed a yellow bar across the top of the page, shown below, that says that there are pending changes to your queries that have not been applied. With the relationship deleted, I can try to apply the changes again by clicking on the Apply Changes button.
As the changes apply, I see that the field names in the PreCareers table are updated, as are the field names in the GradeLevel table.
I can now use drag and drop to create new relationships between the tables. For example, I can click and drag the ID field from the PreCareers table to the K8_Survey table as shown below. I can also connect the School field in GradeLevel with School in K8_Survey to relate these two tables.
With the relationships on my restructured tables back in place, I can go to the Report page and begin to create a report table that displays the possible career choices along with a count of the number of times that career choice was selected.
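The count behind that report table is nothing more than a tally of rows per career in the unpivoted data. Here is a minimal Python sketch of the same tally. The `(respondent ID, career)` pairs are hypothetical sample data, not the survey results.

```python
from collections import Counter

# Hypothetical (respondent ID, career) pairs, one per unpivoted row.
selections = [
    (1, "Doctor"),
    (2, "Teacher"),
    (2, "Engineer"),
    (3, "Doctor"),
]

# Count how many times each career was selected.
counts = Counter(career for _, career in selections)

# List careers from most to least often selected.
for career, n in counts.most_common():
    print(career, n)
```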
With a simple change of the visualization from a table to a column chart, I can visually show which careers were selected the most by the respondents. Note that because each respondent can select any number of careers and can select them in any order, there is no way to reliably rank the career preferences to say that one career or another is truly the top career choice. The chart can only say which career was selected most often out of all the careers in which the respondents had an interest.
Another important point to remember here is that the sum of the number of times each career choice was selected is NOT equal to the number of respondents, because any respondent could select any number of career choices. Therefore, if you want to know the percent of respondents who selected computers as a potential career, you need to divide by the total number of respondents, not the total number of responses.
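A quick worked example shows how far apart the two denominators can be. This Python sketch uses hypothetical data: three respondents who made four selections between them. The function name `percent_of_respondents` is my own, and the respondent total is passed in separately on the assumption that it comes from the survey table rather than from the unpivoted rows.

```python
def percent_of_respondents(selections, career, total_respondents):
    """Percent of all respondents who selected the given career."""
    if total_respondents <= 0:
        raise ValueError("total_respondents must be positive")
    # Use a set so each respondent is counted once per career.
    chose = {rid for rid, c in selections if c == career}
    return 100.0 * len(chose) / total_respondents


# Hypothetical data: 3 respondents made 4 selections in total.
selections = [
    (1, "Doctor"),
    (2, "Teacher"),
    (2, "Engineer"),
    (3, "Doctor"),
]

# Correct denominator: the number of respondents.
pct_of_respondents = percent_of_respondents(selections, "Doctor", total_respondents=3)

# Misleading denominator: the number of responses.
doctor_rows = sum(1 for _, c in selections if c == "Doctor")
pct_of_responses = 100.0 * doctor_rows / len(selections)

print(round(pct_of_respondents, 1), round(pct_of_responses, 1))
```

In this made-up case two of the three respondents chose Doctor, which is about two thirds of respondents, while Doctor accounts for only half of all responses.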
I’ll delve more into the analysis of this data next time, so save this file.
BTW, this weekend I am at the Orlando Code Camp held at Seminole State College in Sanford, FL. If you happen to attend, please stop by to say hello. I’ll be conducting two BI related sessions, one for Power Pivot and one for Power BI.