Introduction to Text Analysis
Text analysis and other digital text mining tools can offer new insight into a text, clarifying word usage, repetition, themes, and more. These tools also let us engage effectively with large and unwieldy data sets, such as an author's full corpus or a complete discography. Today we'll be working with the Google Books NGram Viewer and Voyant Tools, two easy-to-use introductory text analysis tools.
Google Books NGram Viewer
Starting with Google NGrams
To begin, we'll look at the Google Books NGram Viewer. This tool searches Google Books for the frequencies of words or phrases, plotting the results on a graph. Results come from a user-selected Google Books corpus, with corpora available in English, English Fiction, Chinese, French, German, and more. Multiple terms can be searched alongside each other, allowing for comparison between phrases. Searches with the NGram Viewer will pull results from over 5 million books published between the 16th century and 2008.
After navigating to the NGram Viewer, we'll familiarize ourselves with its interface. For our sample search, we'll look at three common terms for World War I: 'Great War', 'World War I', and 'First World War', as shown below.
Each of our terms or phrases is separated by commas, which tells the NGram Viewer to search for each separately, then plot them together on the same graph. Terms and phrases are case sensitive. If a term is commonly capitalized or you're seeing unusual results, consider checking the 'case-insensitive' box.
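To see why commas and capitalization matter, here is a minimal Python sketch that splits a comma-separated query into separate terms and counts each one in a short sample sentence, with and without case sensitivity. It only illustrates the idea and is not how the NGram Viewer itself is implemented. The function name `count_phrase` and the sample sentence are made up for this example.

```python
import re

def count_phrase(text, phrase, case_insensitive=False):
    """Count whole-word occurrences of a phrase in a text."""
    flags = re.IGNORECASE if case_insensitive else 0
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags))

# The comma tells us where one search term ends and the next begins.
query = "Great War,World War I,First World War"
terms = [term.strip() for term in query.split(",")]

# Made-up sample sentence, used only for illustration.
sample = ("The Great War ended in 1918. Some called it a great war; "
          "later, the First World War.")

for term in terms:
    exact = count_phrase(sample, term)
    relaxed = count_phrase(sample, term, case_insensitive=True)
    print(f"{term}: case-sensitive={exact}, case-insensitive={relaxed}")
```

Notice that 'Great War' is found more often once case is ignored, because the lowercase 'great war' now counts as a match.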
In addition to entering our phrases, we have the option to choose a date range, smoothing, and corpus. The date range sets the span of years shown on the X-axis. If you're searching for something in a specific era, consider narrowing this time frame to better visualize your results. Smoothing averages each year's value with those of neighboring years, flattening spikes from individual publication years and giving a cleaner graph. When working with very narrow date ranges, reducing the smoothing will likely provide more accurate results.
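As a rough illustration of what smoothing does, the Python sketch below applies a simple centered moving average, where each year's value is averaged with up to `window` years on either side. This approximates the idea rather than reproducing the NGram Viewer's own code. The function name `smooth` and the yearly values are invented for the example.

```python
def smooth(values, window):
    """Average each value with up to `window` neighbors on each side."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window)
        end = min(len(values), i + window + 1)
        neighborhood = values[start:end]
        smoothed.append(sum(neighborhood) / len(neighborhood))
    return smoothed

# Invented yearly frequencies with a single one-year spike.
raw = [1.0, 1.0, 7.0, 1.0, 1.0]

print(smooth(raw, 0))  # a smoothing of 0 leaves the raw values unchanged
print(smooth(raw, 1))  # the spike is spread across its neighboring years
```

With a larger window the spike flattens further, which is why heavy smoothing can hide real year-to-year changes in a narrow date range.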
The corpus selection is where we'll choose what collection of Google Books materials we will work with. In addition to a variety of language corpora, English has a variety of selections including English Fiction, American English, British English, and English, which primarily consists of academic English materials. When searching, consider the history of the words and phrases you're using. Are there alternative spellings, such as 'color' and 'colour'? Were there previously used slang terms? Are the results different in varying corpora?
For our search on World War I, we'll start our search with: "Great War,World War I,First World War" between 1900 and 2000 from the corpus English with smoothing of 3.
Let's compare our results to see if they stay consistent across different corpora. Let's switch to British English. Notice the difference below?
Now that you've tried out this example, test out some more on your own. Do you find anything surprising in your results?
Adding Texts to Voyant
To begin, navigate to Voyant Tools. Once we're on the Voyant main page, we'll need to add texts to analyze. Texts can be entered in one of three ways:
- Directly copy and paste texts into the text area
- Enter a URL into the text area
- Upload a file to analyze
If entering a URL directly, be aware that Voyant will often bring extra text from the page, such as headers, advertisements, etc. When uploading a file, Voyant supports plain text, HTML, XML, MS Word, RTF, and PDF file formats. Once you've added your text(s), go ahead and click 'Reveal'.
To help declutter results, Voyant maintains a list of stopwords: words that will be excluded from results. This list is editable, and depending on the text you're using, you may want to edit it. To edit Voyant's stopwords list, open the options menu, which is shown by hovering on the right side of the program's top bar (shown below).
As an example, let's take a look at the lyrics from Michael Jackson's "Billie Jean". Because the chorus of this song is repeated over and over, some of its words show up as the most frequent in the corpus. Based on the results shown below, let's add 'billie', 'jean', 'lover', 'kid', 'son', and 'says' to our stopwords list. Notice how the most frequent words and visualizations change.
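The effect of a stopwords list can be sketched in a few lines of Python: count every word, skip those on the list, and report the most frequent of what remains. This is a simplified illustration under our own assumptions, not Voyant's implementation. The function name `top_terms`, the tokenizing rule, and the placeholder text are all made up for the example.

```python
import re
from collections import Counter

def top_terms(text, stopwords, n=5):
    """Return the n most frequent words that are not in the stopwords set."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in stopwords)
    return counts.most_common(n)

# Placeholder text for illustration (not actual song lyrics).
text = "the sun rises and the sun sets and the moon rises"

base_stopwords = {"the", "and"}
print(top_terms(text, base_stopwords, 3))

# Adding a word to the list removes it from the results entirely.
print(top_terms(text, base_stopwords | {"sun"}, 3))
```

Just as in Voyant, adding a term to the list changes which words rise to the top.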
Cirrus Word Cloud
One of the more visually engaging and frequently used features of Voyant is the Cirrus word cloud. After uploading your text, the word cloud will automatically generate. This cloud will vary the size of words based on their frequency, generally also grouping the most frequently used words towards the center. The 'terms' bar on the lower-left of the visualization will adjust the number of words shown. Additionally, hovering over words in the word cloud will show their frequency in the corpus.
The word cloud has a separate options menu (similarly located in the top right of the word cloud) that will allow for various revisions. In addition to adding stopwords, this options list allows you to change the font family and color palette, create word categories, and white list words.
Word categories allow you to group words into named lists and assign a color and font to each group. White listing words tells Cirrus to only include those specific, listed words on the graphic. Both categories and white listing can be handy tools for visualizing the frequency of specific groups of words.
Trends
In addition to the Cirrus word cloud, Voyant displays word trends on the right side of the screen. These word trends display the frequency of words and where in the text they occur most often. This can be quite handy for displaying sentiment over time or changes in viewpoint and terminology.
By default, Trends will plot the five most frequently used words in the corpus. Selecting other words will display them instead, or users can enter words into the search box below the trends visualization. In addition to the default line graph, selecting the 'Display' icon lets you change the visualization to a column, area, or stacked bar chart. Shown below are the five most frequently appearing words in 'Billie Jean' as both a line chart and a bar chart.
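The idea behind a trend line can be sketched by cutting a text into equal-sized segments and counting a term in each one. The Python below is a simplified illustration that uses raw counts. The function name `term_trend`, the default of 10 segments, and the sample text are assumptions made for this example rather than a description of Voyant's internals.

```python
import re

def term_trend(text, term, segments=10):
    """Count how often a term appears in each equal-sized segment of a text."""
    words = re.findall(r"[a-z']+", text.lower())
    # Ceiling division so every word falls into one of the segments.
    size = max(1, -(-len(words) // segments))
    counts = []
    for start in range(0, size * segments, size):
        chunk = words[start:start + size]
        counts.append(chunk.count(term.lower()))
    return counts

# Made-up sample text for illustration.
sample = "love hate love calm calm calm love hate"

print(term_trend(sample, "love", segments=4))
print(term_trend(sample, "calm", segments=4))
```

Plotting each list as a line would show where in the text a word clusters, which is the pattern Trends lets you read at a glance.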
Contexts
The Contexts tool appears by default in the lower right corner of Voyant Tools. This tool displays words in context, with the phrases occurring immediately before and after them in the text. Contexts can be a great tool for sentiment analysis as well as for context comparisons. Each instance of a term is shown on a separate row, which can be expanded using the control on its far left side. Contexts is organized into four columns: Document | Left | Term | Right.
Document shows the text the term is located in. Left and Right show the phrases immediately to the left and right of the term, respectively. Term lists the specific term searched. Upon opening, Contexts will show the term with the highest usage. Additional terms can be searched for using the search box in the lower left corner of the Contexts tool.
As with the other tools, the Contexts results can be adjusted further through the search box and bottom sliders. At the bottom center of the tool are two sliders: Context and Expand. Sliding Context will change the amount of context shown in the Left and Right columns. The Expand slider will change how much context is shown when rows are expanded. Additionally, a Scale drop-down allows you to select which texts you would like displayed in the results.
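A keyword-in-context view like this one can be approximated in a few lines of Python: find each occurrence of a term and keep a fixed number of words on either side. This is a minimal sketch, not Voyant's implementation. The function name `contexts`, the `width` parameter (loosely analogous to the Context slider), and the sample sentence are assumptions made for illustration.

```python
def contexts(text, term, width=5):
    """Return (left, term, right) rows for each occurrence of a term."""
    words = text.split()
    rows = []
    for i, word in enumerate(words):
        # Ignore surrounding punctuation and case when matching.
        if word.strip(".,;:!?\"'()").lower() == term.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            rows.append((left, word, right))
    return rows

# Made-up sample text for illustration.
sample = "The war began in 1914. After the war, many things changed."

for left, term, right in contexts(sample, "war", width=2):
    print(f"{left:>12} | {term} | {right}")
```

Raising `width` shows more of the surrounding text in each row, much as moving the Context slider does.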
Document Terms
The last Voyant tool we'll examine is Document Terms. This tool presents a list of all terms in the document, along with their frequencies and patterns of occurrence. The Document Terms tool displays information in five columns: Text Number (#) | Term | Count | Relative | Trend.
If multiple texts have been uploaded, Text Number shows which document the displayed frequency counts refer to. Term is the word in question. Count displays the number of times the term appears within the text. Relative displays the relative frequency per 10 million words in the document. Trend plots the occurrences of a term throughout a text, presenting a one-word plot similar to the Trends function.
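To make the difference between Count and Relative concrete, the Python sketch below scales a raw count by document length. The per-10-million scale follows the description above. The function name `relative_frequency` and the numbers are hypothetical, chosen only to show the arithmetic.

```python
def relative_frequency(count, total_words, per=10_000_000):
    """Scale a raw count to a rate per `per` words of text."""
    return count * per / total_words

# Hypothetical figures: a term appearing 12 times in a 300-word text.
count = 12
total_words = 300

print(relative_frequency(count, total_words))
print(relative_frequency(count, total_words, per=100))  # same rate as a percentage
```

Because it accounts for length, the relative figure lets you compare how prominent a word is in a short text versus a long one, which raw counts cannot do.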
Now that you've tested out four of the main Voyant tools, try visualizing your own data. In addition to the four tools we've covered today, Voyant provides various other tools for more specialized functions. Feel free to reference the Voyant documentation below for guidance on how to use these tools.
Because both tools have wide-ranging user bases, there are a considerable number of web resources for Google NGram Viewer and Voyant Tools. For Google NGrams, Annie Swafford's various projects with NGrams are an engaging primer, and MakeTechEasier's guide for Ngram Viewer is a great resource for advanced searching.
McGill University's Voyant Tools documentation is a fantastic resource for getting the most out of your Voyant text analysis project. Medium's guide on the alternative functionalities of Voyant is also well worth a read.
For examples similar to our in-class activity analyzing song lyrics, be sure to visit The Pudding's The Language of Hip Hop, Abigail Joffe's Comparison of Popular Song Lyrics Using Voyant, and Kyle Brynteson's Distant Reading Analysis of the Red Hot Chili Peppers: Examining Lyrical Themes.