Using R Programming & Text Analysis On The Dr. Seuss – The Cat In The Hat Kids Book

Hi there. On this page, I share some experimental work in the programming language R. I use R and text analysis to analyze the words in the Dr. Seuss kids book The Cat In The Hat.

 

Source: http://wooderice.com/wp-content/uploads/2014/04/catinthehat.jpg

 



Introduction

A text version of the book can be found at https://github.com/robertsdionne/rwet/blob/master/hw2/drseuss.txt. I copied and pasted the contents into a separate .txt file for offline use.

The R packages that are loaded in are:

  1. dplyr
  2. tidyr
  3. ggplot2
  4. tidytext
  5. wordcloud
  6. tm
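As a small sketch, the six packages listed above can be loaded as follows. This assumes each one has already been installed with install.packages().

```r
# Load the packages used on this page (install them once beforehand).
library(dplyr)      # data manipulation
library(tidyr)      # data reshaping
library(ggplot2)    # plotting
library(tidytext)   # tidy text mining and sentiment lexicons
library(wordcloud)  # wordclouds
library(tm)         # corpus handling and text cleaning
```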

 

 

Source: http://steve-lovelace.com/wordpress/wp-content/uploads/2013/08/cat-in-the-hat-in-comic-sans.png


Wordclouds Of The Most Common Words In The Cat In The Hat Book

To start, I load The Cat In The Hat book from the offline text file with the readLines() function. Afterwards, the readLines() output is put into a VectorSource and then into a Corpus.

Once you have the Corpus object, the tm_map() function can be used to clean up the text. Options include removing punctuation, converting text to lowercase, removing numbers, removing extra whitespace and removing stopwords (words like the, and, or, for, me).
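The loading and cleaning steps above could look roughly like the sketch below. It is not the exact code behind this page: the three text lines are made up and stand in for the readLines() output, the object names are placeholders, and VCorpus() is used as one way of building a Corpus.

```r
library(tm)

# Made-up lines standing in for readLines() on the offline text file
book_lines <- c("The cat sat on the mat.",
                "The fish did not like the cat!",
                "The cat said: I like fish, fish, fish.")

# Character vector -> VectorSource -> Corpus
book_corpus <- VCorpus(VectorSource(book_lines))

# Clean-up steps; content_transformer() wraps base functions such as tolower()
book_clean <- tm_map(book_corpus, content_transformer(tolower))
book_clean <- tm_map(book_clean, removePunctuation)
book_clean <- tm_map(book_clean, removeNumbers)
book_clean <- tm_map(book_clean, removeWords, stopwords("english"))
book_clean <- tm_map(book_clean, stripWhitespace)

# Inspect the first cleaned line
first_line <- content(book_clean[[1]])
print(first_line)
```

Lowercasing comes before stopword removal here because the stopwords("english") list is lowercase, so a capitalized "The" would otherwise be left in.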

 

 

The next step is to convert the cleaned tm_map() output into a Term Document Matrix and then into a data frame. Once a data frame is obtained, wordclouds along with bar graphs can be generated.
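Here is a hedged sketch of that conversion. The cleaned lines are made up, and the column names word and freq are my own choice for illustration; only the data_text name comes from this page.

```r
library(tm)

# Made-up lines, already lowercase and without punctuation,
# standing in for the cleaned book
clean_lines <- c("cat sat mat",
                 "fish like cat",
                 "cat said like fish fish fish")
book_clean <- VCorpus(VectorSource(clean_lines))

# Rows are terms, columns are documents (here, one document per line)
tdm <- TermDocumentMatrix(book_clean)
term_matrix <- as.matrix(tdm)

# Sum across documents for one count per word, then sort from high to low
word_freqs <- sort(rowSums(term_matrix), decreasing = TRUE)

data_text <- data.frame(word = names(word_freqs),
                        freq = as.numeric(word_freqs),
                        stringsAsFactors = FALSE)

head(data_text)
```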

 

 

The wordcloud() function from the wordcloud package allows for the generation of a colourful wordcloud as shown below.

 

 

To make the wordcloud smaller, you can raise the minimum frequency requirement for words by changing the value of the min.freq argument in wordcloud().
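A minimal sketch of the wordcloud() call and the effect of min.freq is given here. The word counts are toy values, and the colour palette and seed are arbitrary choices rather than the settings used for the plots on this page.

```r
library(wordcloud)

# Toy word counts standing in for the book's data frame (values are made up)
data_text <- data.frame(
  word = c("fish", "cat", "like", "said", "sat", "mat"),
  freq = c(4, 3, 2, 1, 1, 1),
  stringsAsFactors = FALSE
)

set.seed(123)  # word placement is random; a seed makes the layout repeatable

# Every word that appears at least once
wordcloud(words = data_text$word, freq = data_text$freq,
          min.freq = 1, random.order = FALSE,
          colors = RColorBrewer::brewer.pal(8, "Dark2"))

# A smaller cloud: only words that appear at least 3 times
wordcloud(words = data_text$word, freq = data_text$freq,
          min.freq = 3, random.order = FALSE,
          colors = RColorBrewer::brewer.pal(8, "Dark2"))

# Number of words that pass the stricter threshold
words_kept <- sum(data_text$freq >= 3)
```

Setting random.order = FALSE places the most frequent words near the centre, which tends to make the cloud easier to read.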

 

 

It appears that the word like is the most common along with the words will, sir, fish, things and grinch.

 


The Most Common Words In The Cat In The Hat Book

In my other R text mining/analysis pages, I use the tidytext approach with the tidytext package and the unnest_tokens() function to obtain the most common words. On this page, however, I reuse the code from the previous section for The Cat In The Hat book. The data_text object is already preprocessed with the tm_map() function and is ready for plotting with ggplot2.

I take the top 25 most common words from The Cat In The Hat book. To obtain the bars, you need the geom_col() function. Sideways bars can be obtained with the coord_flip() add-on function. Labels and text can be added with the labs() function and the geom_text() function respectively. The theme() function allows for adjustment of aesthetics such as text colours, text sizes and so forth.
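The plotting steps described above can be sketched like this. The data are toy counts, and the colours, sizes and titles are placeholder choices, not the ones used for the plot on this page.

```r
library(dplyr)
library(ggplot2)

# Toy word counts standing in for data_text (values are made up)
data_text <- data.frame(
  word = c("fish", "cat", "like", "said", "sat", "mat"),
  freq = c(4, 3, 2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Keep at most the 25 most common words
top_words <- data_text %>%
  arrange(desc(freq)) %>%
  head(25)

word_plot <- ggplot(top_words, aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue") +                        # the bars
  geom_text(aes(label = freq), hjust = -0.3, size = 3.5) +  # counts next to bars
  coord_flip() +                                        # sideways bars
  labs(x = "Word", y = "Count",
       title = "Most common words (toy data)") +
  theme(plot.title = element_text(hjust = 0.5, colour = "darkblue", size = 14),
        axis.text = element_text(size = 11))

print(word_plot)
```

The reorder(word, freq) call sorts the bars by count instead of alphabetically.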

 

 

In the wordclouds, you cannot tell the counts associated with each word. In the bar graph with numeric labels, you can clearly see the count for each word.

The most common words in The Cat In The Hat include like, will, said, sir, one, fish and say.

 


Sentiment Analysis

Sentiment analysis looks at a piece of text and determines whether the text is positive or negative (depending on the lexicon). Three lexicons are used here for analyzing words.

Do keep in mind that each lexicon has its own way of scoring words in terms of positive/negative sentiment. In addition, some words appear in certain lexicons and not in others. These lexicons are not perfect because their scoring is subjective.

I read the book into R (again) and convert it into a tibble (a neater data frame). The head() function is used to preview/check the start of the book.
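A small sketch of this step follows. The lines are made up and stand in for the readLines() output, and the column names line and text are assumptions for illustration.

```r
library(dplyr)

# Made-up lines standing in for readLines() on the offline text file
book_lines <- c("The cat sat on the mat.",
                "The fish did not like the cat!",
                "The cat said: I like fish, fish, fish.")

# One row per line of text, with a line number for reference
book_tbl <- tibble(line = seq_along(book_lines), text = book_lines)

# Preview the start of the book
head(book_tbl)
```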

 

 

The unnest_tokens() function is then applied to the tibble so that each word in The Cat In The Hat has its own row. An anti_join() is used to remove English stop words such as the, and, for, my, myself. The count() function with the sort = TRUE argument is used to obtain the counts for each word, sorted from most to least common.
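The tokenizing and counting steps can be sketched as below. The input lines are made up, and book_tbl and book_words are placeholder names. The stop_words data set ships with tidytext.

```r
library(dplyr)
library(tidytext)

# Made-up lines standing in for the book
book_tbl <- tibble(
  line = 1:3,
  text = c("The cat sat on the mat.",
           "The fish did not like the cat!",
           "The cat said: I like fish, fish, fish.")
)

book_words <- book_tbl %>%
  unnest_tokens(word, text) %>%           # one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # drop English stop words
  count(word, sort = TRUE)                # word counts, most common first

book_words
```

Note that the tidytext stop_words list is broader than the stopwords("english") list from tm, so the two approaches can leave different words in the results.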

 

 

nrc Lexicon

The nrc lexicon tags words with one or more of the sentiments anger, anticipation, disgust, fear, joy, sadness, surprise and trust, as well as negative or positive. Here, the sentiments of interest from the nrc lexicon are negative and positive.

 

 

Here is the code and output for the word counts influenced by the nrc Lexicon for The Cat In The Hat book. There is a lot of code in the section below as I wanted to make the plot look nicer than usual.

 

bing Lexicon

The bing lexicon categorizes words as either positive or negative. In the bar plot below, you will see that the selected top words are different from the ones from the nrc lexicon. (These lexicons are subjective.)
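Matching word counts against the bing lexicon can be sketched with an inner_join(). The counts are toy values and book_words is a placeholder name. get_sentiments("bing") comes from tidytext.

```r
library(dplyr)
library(tidytext)

# Toy word counts standing in for the book's word counts (values are made up)
book_words <- tibble(
  word = c("fun", "bad", "bump", "fish", "cat"),
  n    = c(5L, 4L, 3L, 9L, 7L)
)

# Keep only the words that appear in the bing lexicon,
# along with their positive/negative label
bing_counts <- book_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  arrange(sentiment, desc(n))

bing_counts
```

Words that are not in the lexicon simply drop out of the join, which is why the top words can differ from lexicon to lexicon.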

 

 

 

The top negative word according to bing is bump. Other intriguing negative words include sue, funny, trick and noise. The word sue could be a verb, as in to sue someone, or it could be a name. I am not sure I agree with funny being a negative word. The word trick can be used as a verb, as in to trick someone, or as a noun, as in a magic trick. I presume bing interprets trick more as a verb.

 

AFINN Lexicon

Words from the AFINN lexicon are given a score from -5 to +5 (whole numbers only). Scores below zero are for negative words and scores above zero are for positive words. I have used the mutate() function from R’s dplyr package to add a new column which indicates whether a word is positive or negative. This extra column helps in showing the positive and negative words as separate panels within one ggplot2 plot.
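The mutate() step can be sketched as follows. To keep the example self-contained, a tiny hand-made table stands in for the AFINN lexicon. Its scores are illustrative only and should not be read as the real AFINN values, and the value column name is an assumption about the lexicon's layout.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the AFINN lexicon: scores are illustrative, not real AFINN values
toy_afinn <- tibble(
  word  = c("fun", "bad", "battle", "good"),
  value = c(4L, -3L, -1L, 3L)
)

# Toy word counts standing in for the book's word counts (values are made up)
book_words <- tibble(
  word = c("fun", "bad", "battle", "fish"),
  n    = c(5L, 4L, 3L, 9L)
)

afinn_counts <- book_words %>%
  inner_join(toy_afinn, by = "word") %>%
  # New column: label each word by the sign of its score
  mutate(sentiment = ifelse(value > 0, "positive", "negative"))

# One panel per sentiment label within a single plot
afinn_plot <- ggplot(afinn_counts, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  coord_flip() +
  labs(x = "Word", y = "Count")

print(afinn_plot)
```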

 

 

 

Under AFINN, the most negative word is battle and the most positive word is fun. The word fun is featured in all three lexicons and the “negative” word bad is featured in all three as well. As different as these lexicons are in terms of categorization, there are a few common words between the three lexicons.

The nrc lexicon scores The Cat In The Hat book more positively than bing and AFINN do. bing gives the book a more negative score overall, and the AFINN results are fairly balanced.


References

  • R Graphics Cookbook By Winston Chang
  • Text Mining With R: A Tidy Approach By Julia Silge & David Robinson
  • http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know
  • https://www.youtube.com/watch?v=JoArGkOpeU0
  • https://github.com/robertsdionne/rwet/blob/master/hw2/drseuss.txt
  • https://stackoverflow.com/questions/3472980/ggplot-how-to-change-facet-labels
