Using a Word Cloud to Analyze Responses

Have you ever had to analyze the results of a survey? If your survey consists of multiple choice, true/false or even Likert Scale type questions, the results are fairly easy to tabulate. All you need to do is to count the number of each unique response for each question and then display the data using a table, column/bar chart, or even a pie chart (please only if there are less than or equal to 6 choices).

But what if you have a question like: ‘Describe with a single word your boss’ and you allow them to enter any word truncating the response at the first blank. I suppose you could count the number of times each word is used and then display the results in a table or a column/bar chart. However, because there are potentially dozens of words that could be used, a column/bar chart could get crowded pretty fast.

Perhaps even more difficult to analyze is asking the responder to describe what they find important in a web site especially when you give then a free-form multi-line text field. Now you have to deal with entire sentences like: ‘The most important thing to me is a strong search engine to make it easier to find the information I want.’ How do you parse that type of response into meaningful data? That is what I am going to look at this week.

To begin, I need to capture the survey data in a format that I can import into Power BI. Generally I will have two choices for this, a simple text file of the column(s) from the survey that I want to analyze or perhaps the survey software (like SharePoint) can export the data to an Excel spreadsheet. In either case, I can then import the data into Power BI.

In terms of how I might want to visualize the data in Power BI, column and bar charts are not an option as I mentioned before because there are simply too many words and thus columns to display anything meaningful. However, a common visualization for just this purpose is something called a Word Cloud. Word clouds have been around for some time especially in blog sites to help identify the most common words used in the blog. Fortunately, Power BI has such a visualization in their Visuals Gallery site which can be found at: https://app.powerbi.com/visuals. This site contains visualizations that have been created and submitted by the Power BI community. There are instructions on the site on how to submit custom visualization that you create using the Power BI Developer tool. The Microsoft team will then review the visualization and if it appears stable and relevant, they will publish it to the site. Anyway, there is a visualization on this site called: WordCloud.

To use a visualization from this site, click on it. This displays a dialog as shown in the following figure which includes some information about the visualization, who created it, the license information, support, etc. However note the Download Visualization button

Clicking the download button may include an agreement to the terms of usage of the visualization and will then download a file with the extension .pbiviz. I recommend creating a separate folder on your machine where you can download this and other visualizations from this site. Next I find the box in my Power BI visualizations Report page and locate the button with the 3 dots (ellipsis).

Clicking on this button first displays a warning about importing custom visualizations as shown below.

Next I will use a standard Open dialog box to locate the file that I just downloaded.

Now with my WordCloud visualization available and my data loaded into Power BI, I could simply drag the column that contains my text onto the blank Reports page (or a blank area on the page if I had other visualizations already started, and then with the data selected, choose the WordCloud visualization.

Initially, the display is cluttered with common words like: the, and, an, a, is, had, have, other, my, I, and many other words that we use in everyday conversation. This noise detracts from the real information we want to gather from the word cloud. So we need to find a way to eliminate (or minimize) the occurrence of these words.

How can we do this? The brute force method (and for now the only method I know) is to return to the Data page, select the Edit Queries button on the Home tab and then use the replace values option in the Any Column group of the Transform ribbon to define values to replace.

Now is where I have to be a bit careful. If I simply try to replace “an” with the empty string: “”, I could alter other words like “change” to “chge” or “analysis” to “alysis”. Neither of these words would make sense. Therefore, I decided to always add a blank as a prefix and a suffix to whatever word I wanted to find and eliminate and replace it with a single blank. (Because if I don’t, a phrase like: “search for an employee phone number” would become “search foremployee phone number”.)

The problem is that if the words I am trying to eliminate begin a line or are followed by a period or comma, my replace will not work (unless I also do replaces with periods and commas.)

Anyway, my point is that this can be a time consuming activity of performing a few dozen replace statements and then switching to the Report page to see what the word cloud now includes. Let’s assume that I’ve already done several rounds of replace statements and my word cloud looks like the following:

Another thing I can do to focus on the most important words is to change some of the properties of the visualization. Click on the Pencil in the Visualizations column to display the properties and open the General property group as shown below:

Notice that you can change the maximum number of words displayed, the minimum and maximum font size and whether word breaking is on or off.

First the number of words displayed greatly affects the appearance of the cloud. The following two images show both 200 words and 50 words. (The previous image was 100 words.)

Obviously changing the minimum and maximum font size will help visually give me an idea of which words occurred the most often. The fourth option is the control that breaks the column into individual words. Otherwise, the entire phrase in the column is used which may be good for short predefined responses, but not so good for my case here.

You should have also noted that each time I change one of the properties, the word cloud regenerates. Another somewhat less obvious thing that occurs is that special characters are converted to blanks when the words are parsed. Therefore, a word like: “site’s URL” gets parsed into three words: “site”, “s” and “URL” which explains why the single letters “s” and “t” often appear even in a 50 word cloud. Of course, it might be possible to go back and replace “‘s ” with ” ” or “‘t ” with ” “. Again, I need to be very creative and very careful to come up with a final word cloud.

Anyway, I hope you found that interesting. In future weeks I will look at additional Power BI features and visualization. C’ya then.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s