Wordcounts & Wordclouds In R On The Dr. Seuss – Green Eggs & Ham Book

Hi there. On this page, I use the R programming language for text analysis and text mining to obtain wordcounts and wordclouds from the Dr. Seuss – Green Eggs & Ham book. Bigrams (two-word phrases) are not covered this time around.

 

Source: http://mommyneedsabottle.com/wp-content/uploads/2015/08/GreenEggs_Ad.png

 


Sections

  • Introduction & Getting Started
  • Wordcounts & Wordclouds In Green Eggs & Ham
  • References & Resources

 


Introduction & Getting Started

One of the first children’s books I was introduced to was Dr. Seuss – Green Eggs & Ham. I would read this book a lot at the doctor’s office when I was young.

A .txt version of the book can be found online (the link is listed under References & Resources). Since the file contains no title or stray characters, there is no need for data cleaning in R.

Wordcounts and wordclouds are generated in the tidy way, as described in the (online) book Text Mining With R: A Tidy Approach by Julia Silge and David Robinson.

 

Loading Libraries In R

The R packages of interest are dplyr, tidyr, ggplot2, tidytext, wordcloud and gridExtra.
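As a minimal sketch, the packages listed above can be loaded as follows. This assumes they are already installed; the install.packages() line in the comment is only a reminder.

```r
# Load the packages used in this post.
# If one is missing, install it first, e.g. install.packages("tidytext")
library(dplyr)      # data manipulation (count, anti_join, the %>% pipe)
library(tidyr)      # tidy data helpers
library(ggplot2)    # bar plots
library(tidytext)   # unnest_tokens() and the stop_words dataset
library(wordcloud)  # wordcloud()
library(gridExtra)  # grid.arrange()
```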

 


Wordcounts & Wordclouds In Green Eggs & Ham

With the tidytext package in R, you can obtain wordcounts from pieces of text. To generate wordclouds, you also need the wordcloud package. My other text mining posts create wordclouds with the tm package, but in this case I am using the tidytext and wordcloud packages.

The text version of the Green Eggs & Ham book linked under References & Resources contains only the book itself, so there is no need for data cleaning. To read in the file, use the readLines() function in R.
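Here is a rough sketch of the reading step. The helper name read_book() and the variable greenEggs_df are my own assumptions, not names from the original code. A few short stand-in lines are written to a temporary file only so that the example runs without a network connection.

```r
# Hypothetical helper: read a text file (local path or URL) into a
# data frame with one line of text per row.
read_book <- function(path) {
  book_lines <- readLines(path, warn = FALSE)
  data.frame(text = book_lines, stringsAsFactors = FALSE)
}

# Stand-in lines saved to a temporary file so the sketch runs offline.
sample_path <- tempfile(fileext = ".txt")
writeLines(c("I am Sam", "Sam I am", "I do not like green eggs and ham"),
           sample_path)

greenEggs_df <- read_book(sample_path)
print(greenEggs_df)
```

For the real book, the .txt URL from References & Resources can be passed to read_book() in place of sample_path, since readLines() accepts URLs as well as file paths.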

 

 

From the tidytext package, the unnest_tokens() function splits the text so that each row holds a single word. By default it also converts the words to lowercase.
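A small sketch of the tokenizing step is below. The data frame greenEggs_df and the result name greenEggs_words are assumed names, and the three lines of text are a stand-in for the full book.

```r
library(dplyr)
library(tidytext)

# Stand-in for the book: one line of text per row (assumed structure).
greenEggs_df <- data.frame(
  text = c("I am Sam", "Sam I am", "I do not like green eggs and ham"),
  stringsAsFactors = FALSE
)

# One word per row; words are lowercased by default.
greenEggs_words <- greenEggs_df %>%
  unnest_tokens(output = word, input = text)

print(greenEggs_words)
```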

 

Normally, I remove stopwords from the text as they carry very little meaning on their own. This time around, I obtain word counts for Green Eggs & Ham both with the stopwords filtered out and for the original text. To filter out the stopwords, the anti_join() function from R’s dplyr package is used. The filtered text is stored in the variable greenEggs_words_filt.
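The filtering step might look like the sketch below. It uses the stop_words dataset that ships with tidytext. The input name greenEggs_words and the sample words are assumptions for illustration.

```r
library(dplyr)
library(tidytext)

# Stand-in for the tokenized book: one lowercase word per row.
greenEggs_words <- data.frame(
  word = c("i", "am", "sam", "sam", "i", "am", "i", "do", "not",
           "like", "green", "eggs", "and", "ham"),
  stringsAsFactors = FALSE
)

# stop_words is a data frame with the columns word and lexicon.
data("stop_words")

# Keep only the rows whose word does not appear in stop_words.
greenEggs_words_filt <- greenEggs_words %>%
  anti_join(stop_words, by = "word")

print(greenEggs_words_filt)
```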

 

 

With the pipe operator (%>%), which dplyr re-exports from the magrittr package, and dplyr’s count() function, counts for each word can be obtained for the filtered case and the non-filtered case.
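A sketch of the counting step follows. The names greenEggs_wordcounts and greenEggs_wordcounts_filt are assumed, and the word list is a small stand-in. Setting sort = TRUE puts the most frequent words first.

```r
library(dplyr)
library(tidytext)

# Stand-in for the tokenized book (assumed name and contents).
greenEggs_words <- data.frame(
  word = c("i", "am", "sam", "sam", "i", "am", "i", "do", "not",
           "like", "green", "eggs", "and", "ham"),
  stringsAsFactors = FALSE
)
greenEggs_words_filt <- greenEggs_words %>%
  anti_join(stop_words, by = "word")

# Word counts with stopwords kept.
greenEggs_wordcounts <- greenEggs_words %>%
  count(word, sort = TRUE)

# Word counts with stopwords removed.
greenEggs_wordcounts_filt <- greenEggs_words_filt %>%
  count(word, sort = TRUE)

print(head(greenEggs_wordcounts))
print(head(greenEggs_wordcounts_filt))
```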

 

 

Generating The Plots

Case One: Wordcounts Plot and Wordcloud With Stopwords

Plots are generated with R’s ggplot2 data visualization package. The plots are saved into variables, which are passed to the grid.arrange() function later to show multiple plots together.

From the unfiltered version, I take the top 15 most common words in the Green Eggs & Ham book. Apart from the name sam, the results in the plot are not very informative.
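One way to build such a plot is sketched below. The counts are a made-up stand-in, and the variable name greenEggs_plot, the fill colour and the labels are my own assumptions rather than the original settings.

```r
library(dplyr)
library(ggplot2)

# Stand-in word counts with stopwords kept (assumed values).
greenEggs_wordcounts <- data.frame(
  word = c("i", "am", "sam", "not", "like"),
  n = c(3L, 2L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)

greenEggs_plot <- greenEggs_wordcounts %>%
  arrange(desc(n)) %>%
  head(15) %>%                        # keep at most the top 15 words
  mutate(word = reorder(word, n)) %>% # order the bars by count
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +                      # horizontal bars
  labs(x = "Word", y = "Count",
       title = "Most Common Words (Stopwords Included)")

print(greenEggs_plot)
```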

 

 

Most of the preprocessing has already been done with the dplyr functions. Generating the wordcloud does not take much extra code.
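A possible wordcloud call is sketched here. The counts are a stand-in and the argument values (min.freq, max.words, colours) are assumptions. set.seed() is used because the word placement is random.

```r
library(wordcloud)

# Stand-in word counts (assumed values).
greenEggs_wordcounts <- data.frame(
  word = c("i", "am", "sam", "not", "like"),
  n = c(3L, 2L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)

set.seed(123)  # makes the layout reproducible
with(greenEggs_wordcounts,
     wordcloud(words = word, freq = n,
               min.freq = 1,          # show words that appear at least once
               max.words = 100,
               random.order = FALSE,  # most frequent words in the centre
               colors = c("orange", "darkgreen")))
```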

 

 

 

Case Two: Wordcounts Plot and Wordcloud Without Stopwords

The code is not much different from case one. In this case, the filtered version of the word counts is used.

 

 

 

From the results, top words include:

  • eat
  • sam
  • green
  • eggs
  • ham
  • mouse
  • house
  • fox

 

These top words indicate that the book has something to do with sam, eggs, ham, eating and the colour green.

 

Generating the wordcloud in R with the wordcloud package is not much different from the first case.

 

 

 

 

Combining The Bar Plots Into One Graph With grid.arrange()

The horizontal bar graphs from earlier were saved into variables. With the gridExtra package in R, the two variables containing the plots can be passed to the grid.arrange() function to draw both graphs in a single figure.
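The combining step can be sketched as below. The helper make_count_plot(), the plot variable names and the stand-in counts are all assumptions. The two plots are placed side by side with ncol = 2.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical helper: horizontal bar plot of word counts.
make_count_plot <- function(counts, plot_title) {
  counts$word <- reorder(counts$word, counts$n)
  ggplot(counts, aes(x = word, y = n)) +
    geom_col(fill = "darkgreen") +
    coord_flip() +
    labs(x = "Word", y = "Count", title = plot_title)
}

# Stand-in counts for the two cases (assumed values).
counts_all <- data.frame(word = c("i", "am", "sam"), n = c(3L, 2L, 2L))
counts_filt <- data.frame(word = c("sam", "green", "eggs"), n = c(2L, 1L, 1L))

greenEggs_plot <- make_count_plot(counts_all, "With Stopwords")
greenEggs_plot_filt <- make_count_plot(counts_filt, "Without Stopwords")

# Draw both plots in one figure, side by side.
combined <- grid.arrange(greenEggs_plot, greenEggs_plot_filt, ncol = 2)
```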

 

 

 

There is a clear difference between the graphs when English stopwords such as I, the, of, will and with are removed. The results carry more meaning.

 


References & Resources

  • R Graphics Cookbook By Winston Chang
  • Text Mining With R: A Tidy Approach By Julia Silge and David Robinson
  • https://www.clear.rice.edu/comp200/resources/texts/Green%20Eggs%20and%20Ham.txt
  • http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know
