Text Mining And Analysis On A Children’s Book With R

Hi there. I have been playing around with text mining in the R programming language by analyzing a children’s book. Here is the work along with my findings.

 


Sections

  • A Walk Among Trees (Free Children’s Ebook)
  • Importing The Book Text
  • Preprocessing & A Wordcloud
  • Common Words
  • Sentiment Analysis


A Walk Among Trees (Free Children’s Ebook)

I wanted a short children’s ebook that was freely accessible, and what I found was a book called A Walk Among Trees. After some quick skimming, the book appears to be educational in nature and is about fruit trees. The book can be found at the link here.

 


Importing The Book Text

Base R does not read .pdf files directly, so the text from the .pdf file was copied and pasted into a .txt file. Titles were not copied into the text file.

Note that when you are reading in a local text or .csv file, you need to set your working directory. If the file you are reading in is in your Downloads folder (as an example), you need to set your working directory to that folder in R/RStudio.

In the code below, I load in some R packages and read in the text version of the book.

The head() function in R is a preview function which allows you to view the first part of a data object.
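Here is a minimal sketch of this step. I am assuming the pasted text was saved under the hypothetical file name a_walk_among_trees.txt in the working directory; the packages loaded are the ones used throughout this post.

# Packages used in this post.
library(tm)          # text cleaning (Corpus, tm_map)
library(wordcloud)   # wordcloud() plots
library(tidytext)    # unnest_tokens(), stop_words, get_sentiments()
library(dplyr)       # count(), filter(), joins
library(ggplot2)     # bar plots

# Read the book text in line by line (hypothetical file name).
trees_book <- readLines("a_walk_among_trees.txt")

# Preview the first few lines of the data object.
head(trees_book)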

 


Preprocessing & A Wordcloud

This section is about cleaning up the data so that we can produce wordclouds. Wordclouds give an artsy way of seeing what types of words are in a text.

The trees_book object is turned into a VectorSource and then into a Corpus. Cleaning the text involves the use of the tm_map() function with its associated arguments. Arguments such as removePunctuation, content_transformer(tolower), removeNumbers, and stripWhitespace are used to extract the lowercase words without punctuation, numbers, or extra whitespace.

There are words in the English language that do not have much meaning on their own but are used to make sentences flow. Words such as I, me, my, we, our, yourself, he, him, and the are considered stopwords and can be removed before building the wordcloud and word counts.
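A sketch of these cleaning steps, assuming the tm package and using trees_corpus as a hypothetical name for the cleaned corpus:

trees_source <- VectorSource(trees_book)   # vector source from the text
trees_corpus <- Corpus(trees_source)       # corpus for the tm functions

# Clean the text: drop punctuation and numbers, lowercase, trim whitespace.
trees_corpus <- tm_map(trees_corpus, removePunctuation)
trees_corpus <- tm_map(trees_corpus, content_transformer(tolower))
trees_corpus <- tm_map(trees_corpus, removeNumbers)
trees_corpus <- tm_map(trees_corpus, stripWhitespace)

# Remove English stopwords such as "i", "me", "my", "the".
trees_corpus <- tm_map(trees_corpus, removeWords, stopwords("english"))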

After cleaning up the text, wordclouds can be produced with the wordcloud() function. You can specify the colours and the minimum count a word needs in order to appear in the cloud.
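One way to call it (a sketch; wordcloud() can take the cleaned corpus directly and compute the word frequencies itself):

library(RColorBrewer)   # for the colour palette

wordcloud(trees_corpus, min.freq = 5,
          colors = brewer.pal(8, "Dark2"), random.order = FALSE)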

 

Fewer words appear in the wordcloud if the minimum frequency for word counts is increased. Here, min.freq has been raised from 5 to 8.
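The same call as above, assuming the same corpus, with the higher cutoff:

wordcloud(trees_corpus, min.freq = 8,
          colors = brewer.pal(8, "Dark2"), random.order = FALSE)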

 


Common Words

Wordclouds are neat visuals for getting a rough idea of which words are common. If we want to know the actual counts, we need something other than wordclouds.

The copied and pasted text file is loaded into R with the readLines() function. A preview of the data is done with the head() function.

 

(Note: There are issues with the punctuation in words like won’t and can’t, likely due to the text encoding. This needs further investigation.)

Next, I convert the data object into a neater data frame called a tibble. The unnest_tokens() function is used to split the text into individual words.
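A sketch of the conversion, reusing the trees_book object read in earlier (trees_df and trees_words are hypothetical names):

trees_df <- tibble(text = trees_book)               # one row per line of text
trees_words <- unnest_tokens(trees_df, word, text)  # one row per word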

Words such as and, the, of, and so on do not carry much meaning on their own and are referred to as stop words. An anti_join() with tidytext’s stop_words dataset will remove the stop words which are in trees_words.
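A sketch, using the stop_words data frame that ships with tidytext:

trees_words <- anti_join(trees_words, stop_words, by = "word")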

To obtain the word counts, the count() function from R’s dplyr package is used. Adding the sort = TRUE argument sorts the counts in descending order.
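A sketch, with tree_wordcounts as the name picked up again in the sentiment analysis section:

tree_wordcounts <- count(trees_words, word, sort = TRUE)
head(tree_wordcounts)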

We can now make a plot of the word counts.
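A minimal ggplot2 sketch; the frequency cutoff of 5 is a hypothetical choice to keep the plot readable:

tree_wordcounts %>%
  filter(n >= 5) %>%                    # hypothetical cutoff
  mutate(word = reorder(word, n)) %>%   # order bars by count
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Count", title = "Most Common Words")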

Note: A stray â character appears in the output, likely another text-encoding artifact. Again, this needs further investigation.


Sentiment Analysis

For the most part, sentiment analysis determines whether a piece of text is positive or negative. Keep in mind that different people will interpret certain words in different ways. This sort of analysis is not perfect and does not describe universal views, but it is better than nothing.

There are three main lexicons which assign scores or positive/negative labels to single words. These three lexicons are:

  • AFINN
  • bing
  • nrc

For this analysis, the AFINN lexicon will be used. This lexicon assigns a score to each word: negative numbers refer to negative words, positive numbers refer to positive words, and scores of zero refer to neutral words. (Do note that context is not taken into account.)

You can get a preview of the AFINN lexicon and its scoring with the get_sentiments() function.
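For example:

# Preview the AFINN lexicon; recent tidytext versions name the
# score column "value" (older versions used "score").
get_sentiments("afinn")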

We use the tree_wordcounts data object from the previous section to assign AFINN lexicon scores to the words. The sentiment_score is the word’s count multiplied by its AFINN lexicon score. A new column is added to indicate whether the sentiment score is positive or negative; this will help with filtering.

The trees_AFINN tibble (data frame) can be built and previewed like this.
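This is only a sketch: I am assuming the AFINN score column is named value, and sign is a hypothetical name for the positive/negative label column.

trees_AFINN <- tree_wordcounts %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  mutate(sentiment_score = n * value,   # count times AFINN score
         sign = if_else(sentiment_score >= 0, "Positive", "Negative"))

head(trees_AFINN)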

The filter() function from R’s dplyr package is used to make subsets: one for the positive-scoring words and one for the negative-scoring words.
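A sketch of the two subsets (hypothetical names):

trees_positive <- filter(trees_AFINN, sign == "Positive")
trees_negative <- filter(trees_AFINN, sign == "Negative")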

 

We can now plot the results with the ggplot2 package. The first plot features the sentiment scores of all (or most) of the words in the text.
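A minimal sketch of such a plot:

trees_AFINN %>%
  mutate(word = reorder(word, sentiment_score)) %>%   # order bars by score
  ggplot(aes(x = word, y = sentiment_score)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Sentiment Score", title = "AFINN Sentiment Scores")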

From the plot above, the word loved is the most positive while the word hard is the most negative.

This next plot features the positive words and their counts. The bars this time around are lime green. Love is the most common positive word, followed by vitamin.
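A sketch of the code behind this plot:

trees_positive %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "limegreen") +
  coord_flip() +
  labs(x = "Word", y = "Count", title = "Positive Words")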

 

With the negative words plot, the bars are red and the word hard is the most common. The word hard can mean either something that is difficult or something that is tough on the outside. In the context of fruit, hard should not really be a negative word. This is one example where a lexicon is unable to factor in context, so be aware of this.
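And a sketch of the code behind the negative words plot:

trees_negative %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "red") +
  coord_flip() +
  labs(x = "Word", y = "Count", title = "Negative Words")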

 

 
