Text Analysis In R On Dr. Seuss – Fox In Socks

Hello there. This page focuses on text analysis (text mining) of the Dr. Seuss book Fox In Socks using the R programming language. Read-aloud videos of the book are available on YouTube.

 

Image source: http://ecx.images-amazon.com/images/I/51KGN0dwloL._SL600_.jpg

 

 


Topics

  1. References
  2. Getting Started
  3. Wordcounts In Fox In Socks
  4. Bigram Counts In Fox In Socks
  5. A Look At Positive and Negative Words With Sentiment Analysis

 


References

  • R Graphics Cookbook By Winston Chang
  • http://ai.eecs.umich.edu/people/dreeves/Fox-In-Socks.txt
  • http://lightonphiri.org/blog/ggplot2-multiple-plots-in-one-graph-using-gridextra
  • Text Mining with R – A Tidy Approach by Julia Silge and David Robinson
  • https://stackoverflow.com/questions/3472980/ggplot-how-to-change-facet-labels

 


Getting Started

In R, load the following packages:

  • dplyr for data manipulation
  • ggplot2 for data visualization
  • tidyr for data cleaning
  • tidytext for obtaining word counts and doing sentiment analysis

 

To load libraries into R, use the library() function. To install packages, use install.packages("pkg_name").
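As a minimal sketch, loading the four packages listed above looks like this. The install line is commented out and only needs to be run once; it assumes the packages come from CRAN.

```r
# Install once if needed (uncomment to run):
# install.packages(c("dplyr", "ggplot2", "tidyr", "tidytext"))

library(dplyr)     # data manipulation
library(ggplot2)   # plots
library(tidyr)     # data cleaning (separate(), unite())
library(tidytext)  # tokenizing text, stop words, sentiment lexicons
```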

 

 

Note that the text of the Fox In Socks book is obtained from http://ai.eecs.umich.edu/people/dreeves/Fox-In-Socks.txt.

 


Wordcounts In Fox In Socks

With text mining/analysis, it is possible to obtain word counts from books or any piece of text. Knowing word counts from a book gives an idea of what the book is about and which words are emphasized.

The text file can be read straight from its web address, so there is no need to set a working directory or to copy and paste the text.
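Here is a rough sketch of reading the book from the web. The helper read_book() and the variable name foxSocks_book are my own names for illustration, and the download line is commented out since it needs an internet connection and assumes the link is still live.

```r
fox_url <- "http://ai.eecs.umich.edu/people/dreeves/Fox-In-Socks.txt"

# read_book() is a hypothetical helper. readLines() accepts either a
# local file path or a URL and returns one string per line of text.
read_book <- function(src) {
  readLines(src, warn = FALSE)
}

# Uncomment to download the book (needs an internet connection):
# foxSocks_book <- read_book(fox_url)
```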

 

 

Notice that the text file (see the website link) starts with the title and some dashed lines. These are not needed for the analysis and can be removed in R by keeping only the fourth line onwards.
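A small sketch of dropping the header lines is below. The raw_lines vector is a made-up stand-in for the downloaded file (the exact header lines in the real file may differ), and book_lines is a name I picked for the result.

```r
# Stand-in for the downloaded file: three header lines, then the text
raw_lines <- c("Fox in Socks", "------------", "",
               "Fox", "Socks", "Box", "Knox")

# Keep everything from the fourth line onwards
book_lines <- raw_lines[4:length(raw_lines)]
book_lines
```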

 

 

From the tidytext package in R, the unnest_tokens() function is the first step to obtaining word counts from the Fox In Socks book.
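As an illustration, here is unnest_tokens() applied to the first few lines of the book. The data frame name foxSocks_df and the column names line and text are assumptions of mine; the function splits each row of text into one lowercase word per row.

```r
library(dplyr)
library(tidytext)

# First few lines of the book, one row per line
book_lines <- c("Fox", "Socks", "Box", "Knox", "Knox in box.", "Fox in socks.")
foxSocks_df <- tibble(line = seq_along(book_lines), text = book_lines)

# One word per row; punctuation is dropped and words are lowercased
foxSocks_tokens <- foxSocks_df %>%
  unnest_tokens(output = word, input = text)

foxSocks_tokens
```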

 

English words such as for, the, and, me, myself carry very little meaning on their own. These words are called stop words. An anti join can be used to keep only the words in Fox In Socks that are not stop words.
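A sketch of the anti join is below. It assumes the stop word list is the stop_words data set that ships with tidytext, and the small foxSocks_tokens table is a stand-in for the tokenized book.

```r
library(dplyr)
library(tidytext)

# Stand-in for the tokenized book (one word per row)
foxSocks_tokens <- tibble(
  word = c("knox", "in", "socks", "and", "the", "fox")
)

# Keep only the rows whose word does NOT appear in stop_words
foxSocks_words <- foxSocks_tokens %>%
  anti_join(stop_words, by = "word")

foxSocks_words
```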

 

 

From foxSocks_words, word counts can be obtained with the use of the count() function from R’s dplyr package.
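For example, with a tiny stand-in for foxSocks_words, count() gives one row per word along with its count n. The sort = TRUE argument puts the most frequent words first. The name foxSocks_wordcounts matches the one used later in this page.

```r
library(dplyr)

# Stand-in for the stop-word-free words
foxSocks_words <- tibble(
  word = c("knox", "fox", "socks", "fox", "sir", "sir", "sir")
)

# One row per distinct word, most frequent first
foxSocks_wordcounts <- foxSocks_words %>%
  count(word, sort = TRUE)

foxSocks_wordcounts
```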

 

 

The word count results can be displayed as a horizontal bar graph with ggplot2 graphics in R. The plot covers the top twenty-five words in Fox In Socks (after filtering out the stop words).
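One way to build such a plot is sketched below. The counts here are made up for illustration, and the plot title, axis labels and the top_words_plot name are my own choices rather than the original ones.

```r
library(dplyr)
library(ggplot2)

# Made-up counts standing in for the real foxSocks_wordcounts
foxSocks_wordcounts <- tibble(
  word = c("sir", "socks", "knox", "fox"),
  n    = c(4L, 3L, 2L, 1L)
)

top_words_plot <- foxSocks_wordcounts %>%
  arrange(desc(n)) %>%
  head(25) %>%                          # top 25 (fewer if less data)
  mutate(word = reorder(word, n)) %>%   # order the bars by count
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +                        # horizontal bars
  labs(x = "Word", y = "Count", title = "Top Words In Fox In Socks")

# print(top_words_plot) draws the plot
```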

 

 

Top words include sir, socks, knox, fox, tweetle, and battle. Among the top 25 words, some show up in both their singular and plural forms.

 


Bigram Counts In Fox In Socks

Instead of counts of single words, counts of two-word phrases (bigrams) can be obtained.
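As a sketch, bigrams come from the same unnest_tokens() function with token = "ngrams" and n = 2. The two lines of text below are from the start of the book, and the names foxSocks_df and foxSocks_bigrams are assumptions of mine.

```r
library(dplyr)
library(tidytext)

foxSocks_df <- tibble(text = c("Knox in box.", "Fox in socks."))

# Each row of text is split into overlapping two-word phrases
foxSocks_bigrams <- foxSocks_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

foxSocks_bigrams
```

Since each row is tokenized on its own, bigrams do not run across line breaks in this setup.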

 

 

Removing stop words from the bigrams requires a bit more work. In this case, the tidyr and dplyr packages are used together in R. First, the separate() function from tidyr splits each bigram into its two words. Any bigram that contains a stop word is then removed with two filter() calls, one for each word. After filtering, counts are obtained.
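The steps above can be sketched as follows. The bigrams are a small made-up sample, the column names word1 and word2 are my choice, and the stop word list is assumed to be tidytext's stop_words.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Small made-up sample of bigrams
foxSocks_bigrams <- tibble(
  bigram = c("knox in", "in box", "tweetle beetle",
             "tweetle beetle", "slow joe")
)

bigram_counts <- foxSocks_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%   # first word is not a stop word
  filter(!word2 %in% stop_words$word) %>%   # second word is not a stop word
  count(word1, word2, sort = TRUE)

bigram_counts
```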

 

 

The separated words can be joined back together with tidyr’s unite() function.
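A quick sketch of unite() on a made-up counts table is below. The word1 and word2 column names and the bigrams_united name are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up counts of separated bigrams
bigram_counts <- tibble(
  word1 = c("tweetle", "slow"),
  word2 = c("beetle", "joe"),
  n     = c(2L, 1L)
)

# Paste word1 and word2 back into a single bigram column
bigrams_united <- bigram_counts %>%
  unite(bigram, word1, word2, sep = " ")

bigrams_united
```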

 

After unification, the results can be displayed with ggplot2 graphics.

 

 

 

The bigrams tied for first place at a count of 5 are:

  • tweetle beetles
  • tweetle beetle
  • slow joe
  • knox sir

 

Notice how many of these bigrams rhyme. Getting kids used to rhymes helps with their reading and listening skills.

 


A Look At Positive and Negative Words With Sentiment Analysis

A big part of sentiment analysis involves the analysis of negative and positive words in text. Three main lexicons which (subjectively) score and/or categorize words are nrc, bing and AFINN.

In my other text mining posts, I have separate plots for sentiment analysis for each of the three lexicons. This time around, the three sentiment plots will be displayed all in one graph.

 

The word_labels_nrc variable stores plot labels which are used later. An inner_join along with get_sentiments("nrc") is used to select words from foxSocks_wordcounts that are also in get_sentiments("nrc"). Since the nrc lexicon has additional sentiments such as trust, fear and anger, the filter() function is used to select only words that are categorized as negative or positive.
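Here is a rough sketch of the join and filter. To keep it runnable without downloading anything, a tiny made-up table (nrc_like) stands in for get_sentiments("nrc"), which in current tidytext versions needs the textdata package and a one-time download. The rows and counts are illustrative only, and the shape of word_labels_nrc is my guess at a named vector of facet labels.

```r
library(dplyr)

# Made-up stand-in for get_sentiments("nrc"): a word can have several rows
nrc_like <- tibble(
  word      = c("sir", "sir", "battle", "battle", "luck"),
  sentiment = c("positive", "trust", "negative", "anger", "positive")
)

# Made-up word counts
foxSocks_wordcounts <- tibble(
  word = c("sir", "battle", "luck", "knox"),
  n    = c(5L, 4L, 2L, 3L)
)

# Assumed shape of the plot labels (sentiment -> facet title)
word_labels_nrc <- c(negative = "Negative Words", positive = "Positive Words")

foxSocks_nrc <- foxSocks_wordcounts %>%
  inner_join(nrc_like, by = "word") %>%               # keep words in the lexicon
  filter(sentiment %in% c("positive", "negative"))    # drop trust, anger, etc.

foxSocks_nrc
```

With the real lexicon, nrc_like would be replaced by get_sentiments("nrc").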

 

 

Instead of displaying the nrc sentiment plot right away, the plot is saved into the variable nrc_plot. The intent is to save the three sentiment plots into three separate variables and then use them to display all three bar plots in one graph.

 

 

With the bing lexicon, words are categorized as either negative or positive. The code for the bing lexicon is very similar to the code for the nrc case. Instead of the filter() function, there is the ungroup() function.
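A sketch of the bing version is below. The bing lexicon ships with tidytext, so get_sentiments("bing") works without a download. The word counts are made up and the foxSocks_bing name is an assumption.

```r
library(dplyr)
library(tidytext)

# Made-up word counts
foxSocks_wordcounts <- tibble(
  word = c("slow", "luck", "knox"),
  n    = c(3L, 2L, 5L)
)

foxSocks_bing <- foxSocks_wordcounts %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  ungroup()   # no effect on ungrouped data; clears grouping if there is any

foxSocks_bing
```

No filter() is needed here because bing only has the two categories.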

 

 

The bing sentiment plot in R is saved into the bing_plot variable for later use.

 

 

Words under the AFINN lexicon have a numeric score from -5 to +5. I have included a new column which categorizes the word as negative if the score is below 0 and positive if the score is above 0.
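The new column can be sketched like this. A small made-up table (afinn_like) stands in for get_sentiments("afinn"), which needs the textdata package and a one-time download; the scores shown are illustrative. I am assuming the score column is called value, as in recent tidytext versions, and the sentiment column name is my own choice.

```r
library(dplyr)

# Made-up stand-in for get_sentiments("afinn")
afinn_like <- tibble(
  word  = c("battle", "luck", "fun"),
  value = c(-1, 3, 4)
)

# Made-up word counts
foxSocks_wordcounts <- tibble(
  word = c("battle", "luck", "fun", "knox"),
  n    = c(4L, 2L, 1L, 5L)
)

foxSocks_afinn <- foxSocks_wordcounts %>%
  inner_join(afinn_like, by = "word") %>%
  mutate(sentiment = case_when(
    value < 0 ~ "negative",   # score below 0
    value > 0 ~ "positive"    # score above 0
  ))

foxSocks_afinn
```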

 

 

Multiple Plots In One Graph

Putting multiple plots into one graph is actually quite simple. The grid.arrange() function from the gridExtra R package takes in plot objects and an argument for the number of columns.
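As a sketch, three saved plots can be placed side by side as shown below. The make_plot() helper and its data are made up so the example runs on its own, and afinn_plot is an assumed name that follows the nrc_plot and bing_plot naming.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical helper that builds a small placeholder bar plot
make_plot <- function(title) {
  ggplot(data.frame(word = c("luck", "slow"), n = c(2, 1)),
         aes(x = word, y = n)) +
    geom_col() +
    coord_flip() +
    ggtitle(title)
}

nrc_plot   <- make_plot("nrc")
bing_plot  <- make_plot("bing")
afinn_plot <- make_plot("AFINN")

# Draw all three plots in one graph, arranged in three columns
combined <- grid.arrange(nrc_plot, bing_plot, afinn_plot, ncol = 3)
```

Using ncol = 1 instead would stack the three plots on top of each other.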

 

 

 

From the nrc lexicon sentiment analysis results, the top positive words are sir, paddle and luck. I think sir is considered positive as it is used as a sign of respect. I am not sure why paddle is positive. Negative words from the nrc lexicon include battle, goo, trick and sue. Note that lexicons are not great with context. Sue can be a verb or a name. In this case, nrc treats sue as a negative verb.

The bing lexicon does not recognize battle as a negative word but it does recognize the words slow, poor and freeze as negative. Positive words from the bing lexicon include luck, likes, easy, free and breeze. Sir is not included here.

In the AFINN lexicon results, negative words include battle, blocks, stop, sick, poor and fight. The positive words do not include sir but they do include luck, likes, easy, free, slick and fun.

According to bing and AFINN, Fox In Socks is more negative than positive if you are looking at word counts (after filtering out stop words). Since nrc considers sir as positive, nrc rates the book as more positive than negative. These lexicons are subjective and not perfect. Some information is better than none in this case.
