To celebrate our fourth week of Power Query month, we are going to use Power Query to analyze text data. Inspired by Chris Webb’s blog post, we are going to analyze word usage in Mary Wollstonecraft Shelley’s Frankenstein; Or, The Modern Prometheus.
The Power Query techniques used in this week’s challenge are a bit above beginner level. You can find ways to accomplish the same result using just the Power Query Editor GUI and formula bar, if the Advanced Editor doesn’t interest you right now. If you get stuck you can copy Chris’s code and remove the last two steps.
In addition to grouping and counting words in the text, we’re going to remove the stop words to make our analysis a bit more useful. Stop words are the most common words in a language that get filtered out during processing of natural language (text) data. Examples of stop words in English include “a”, “the”, and “an”.
Although we are doing text analysis, I’m asking you to avoid word clouds in this challenge. For a good explanation of word clouds and their limitations, check out “Word Clouds are Lame“. I chose a bar chart in my solution, but feel free use another alternative to a word cloud.
- Retrieve the text of Frankenstein from https://www.gutenberg.org/files/84/84-0.txt.
- Make all words lowercase.
- Remove punctuation from the text.
- Split the text into one word per row and be sure to remove blanks.
- Group the table by word and add a column that shows the count of occurrences of each word.
- Retrieve the list of stop words from https://data.world/mlongoria/pbiwowwk30/workspace/file?filename=ExcludeWords.csv.
- Merge the query containing the list of stop words with the query containing the book text in a way that filters out the words from the stop words list.
- Ensure that the query containing the stop words does not appear as a table in the fields list in Report View.
- Visualize the top 30 most used words using any visual of your choosing other than a word cloud.
- Add a smart narrative visual that includes the following:
- the most used word
- the number of occurrences of the most used word
- the number of distinct words analyzed after removal of stop words
- the number of words used only once
- Add an image of your choosing to the page (not the page background, although you can do that as well if you’d like) and set the alt text. The visual must be related to the subject of the data. Note: The visuals used in the solution above are free to use. You can get the URL on the second page of the report.
This week’s dataset consists of two sources:
- The text of Mary Wollstonecraft Shelley’s Frankenstein; Or, The Modern Prometheus from Project Gutenberg: https://www.gutenberg.org/files/84/84-0.txt
- A list of stop words to remove from the text analysis: https://data.world/mlongoria/pbiwowwk30/workspace/file?filename=ExcludeWords.csv (Data.World requires a login, but it is free)
The stop words are the same as the default stop words used in Microsoft’s Word Cloud custom visual.
After you finish your workout, share on Twitter using the hashtags #WOW2021 and #PowerBI, and tag @JSBaucke, @MMarie, @shan_gsd and @dataveld. Also make sure to fill out the Submission Tracker so that we can count you as a participant this week in order to track our participation throughout the year.