Finding The Most Frequent Words In Text With R

Hi there. This page is about using the statistical programming language R for obtaining the most frequent words in text.

One approach is with a wordcloud. The second approach is to obtain counts for the words and present them in a bar graph.

(It is assumed that the reader is familiar with the dplyr package in R and its %>% pipe operator.)

 



 


The Peter Pan Ebook Text

For this example, I analyze a text file version of the book Peter Pan (1904). The text comes from http://www.textfiles.com/etext/FICTION/barrie-peter-277.txt.

Before reading in the text, I load the wordcloud and tm libraries into R.
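A minimal sketch of this step could look like the block below. It assumes that both packages are already installed and that the text is read line by line with readLines(); the variable name peter_pan matches the one used later on this page.

```r
# Load the text mining and wordcloud packages
# (run install.packages(c("tm", "wordcloud")) first if they are missing).
library(tm)
library(wordcloud)

# Read the ebook line by line; each element of peter_pan is one line of the file.
peter_pan <- readLines("http://www.textfiles.com/etext/FICTION/barrie-peter-277.txt")
```

This needs an internet connection. Saving the file locally and passing its path to readLines() works the same way.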

 

The head() function in R is used to preview the text.
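As a small illustration of head(), the sketch below uses a few invented stand-in lines so that it runs without downloading anything. With the real data, peter_pan would hold the lines of the ebook.

```r
# Stand-in lines invented for illustration; they are not from the ebook.
peter_pan <- c("first line of text",
               "second line of text",
               "third line of text",
               "fourth line of text")

# Preview the first n elements (head() shows 6 by default).
preview <- head(peter_pan, n = 2)
print(preview)
```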

 


Data Cleaning & Wordclouds

From the tm package in R, I insert the peter_pan variable into the VectorSource() function which then goes inside the Corpus() function. A corpus is a collection of text documents.

The tm_map() function is then used to clean the text so that individual words can be extracted. This is done by stripping extra whitespace, removing punctuation and numbers, and converting letters to lowercase.
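The cleaning steps described above can be sketched as follows. The two input lines are invented stand-ins so the block runs on its own, and the name peter_clean is an assumption of mine. Depending on the tm version, the tolower step may print a warning about dropped documents, and the cleaned text is still usable.

```r
library(tm)

# Stand-in lines invented for illustration; the post uses the full ebook text.
peter_pan <- c("Peter   and Wendy flew past 2 stars!",
               "Captain Hook had 1 hook, not 2.")

# Wrap the character vector in a source, then build a corpus from it.
peter_clean <- Corpus(VectorSource(peter_pan))

# Apply one cleaning transformation at a time.
peter_clean <- tm_map(peter_clean, stripWhitespace)
peter_clean <- tm_map(peter_clean, removePunctuation)
peter_clean <- tm_map(peter_clean, removeNumbers)
peter_clean <- tm_map(peter_clean, content_transformer(tolower))

# Inspect the cleaned text of the first document.
print(as.character(peter_clean[[1]]))
```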

 

Stopwords In English

There are a bunch of words in the English language that help sentences flow but do not carry much meaning on their own. These words include: the, and, but, through, over, under, a, an, he, she, him, her and so on.

The tm_map() function is used again to remove the stopwords from the text.
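A sketch of the stopword removal, again on invented stand-in lines that are already lowercase. It assumes the English stopword list that ships with tm, accessed through stopwords("english").

```r
library(tm)

# Stand-in lines invented for illustration (already lowercase, no punctuation).
peter_pan <- c("peter and wendy flew over the island",
               "the pirates and hook hid under the ship")
peter_clean <- Corpus(VectorSource(peter_pan))

# Peek at a few of the stopwords that tm knows about.
print(head(stopwords("english"), 10))

# Drop the stopwords, then tidy up the gaps they leave behind.
peter_clean <- tm_map(peter_clean, removeWords, stopwords("english"))
peter_clean <- tm_map(peter_clean, stripWhitespace)

print(as.character(peter_clean[[1]]))
```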

 

Creating Wordclouds

Once the text is all “clean” (reformatted), you can create the wordcloud. Making the wordcloud is not too difficult as it requires just the wordcloud() function.
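The call could look something like the sketch below. The argument values (min.freq, max.words, the colour palette) are my own picks for illustration, and the input is a tiny invented corpus rather than the full book. When only a corpus is passed, wordcloud() works out the word frequencies itself.

```r
library(tm)
library(wordcloud)  # also attaches RColorBrewer, which provides brewer.pal()

# Stand-in lines invented for illustration; assume they are already cleaned.
peter_pan <- c("peter wendy hook peter",
               "wendy peter pirates hook",
               "peter tinker bell wendy")
peter_clean <- Corpus(VectorSource(peter_pan))

# Fix the random seed so the layout is the same on every run.
set.seed(2018)

wordcloud(peter_clean,
          min.freq = 1,          # keep words that appear at least once
          max.words = 100,       # cap on the number of words drawn
          random.order = FALSE,  # most frequent words go in the centre
          colors = brewer.pal(8, "Dark2"))
```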

From the wordcloud above you can see that peter and wendy stick out. The word said stands out too and could have been considered a stop word that would be removed. The bottom right contains the word hook as in Captain Hook and not so much a regular hook.

To reduce the size of the wordcloud, I can raise the number in the min.freq argument in wordcloud(). In this case, I raise it to 70 such that the words in this wordcloud appear at least 70 times in the Peter Pan text.
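To show what min.freq does, here is a sketch with made-up word counts (these are not the real counts from Peter Pan). Words whose count is below min.freq are left out of the plot.

```r
library(wordcloud)

# Hypothetical words and counts, invented for illustration only.
words  <- c("peter", "wendy", "said", "hook", "john", "michael")
counts <- c(400, 350, 360, 170, 60, 40)

set.seed(2018)

# Only words with a count of 70 or more are drawn.
wordcloud(words = words, freq = counts, min.freq = 70, random.order = FALSE)

# The same filter written out by hand, to see which words survive.
kept <- words[counts >= 70]
print(kept)
```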


Most Frequent Words In Peter Pan

Wordclouds help the viewer determine popular words in text. They are also fun and entertaining to look at. The problem with wordclouds though is that you do not really see the counts for each word.

This second approach consists of tidying the data and displaying the word counts in a bar graph. The dplyr, ggplot2 and tidytext packages are used here.

(Reference: Text Mining With R [Online Book])

The first couple of lines of code consist of loading in the appropriate packages and reading the Peter Pan text.

The R programming language keeps growing with new packages, topics and concepts. One of them is the tibble, which is just a neater data frame. (I only heard of tibbles recently.)

Instead of using data.frame(), it would be data_frame(). (Newer versions of the tibble package deprecate data_frame() in favor of tibble().)
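A sketch of the setup for this second approach is below. It uses tibble() rather than data_frame(), and the column names line and text as well as the name peter_df are assumptions of mine. A few invented lines stand in for the ebook so the block runs offline.

```r
library(dplyr)     # also makes tibble() and %>% available
library(ggplot2)
library(tidytext)

# Stand-in lines invented for illustration; the post reads the full ebook.
peter_pan <- c("Peter flew to the window.",
               "Wendy saw Peter!")

# One row per line of text, with a line number alongside it.
peter_df <- tibble(line = seq_along(peter_pan), text = peter_pan)
print(peter_df)
```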

 

The unnest_tokens() function from the tidytext package picks out the individual words and places them as rows.
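Here is a small sketch of unnest_tokens() on two invented lines. The output column is named word and the input column text; both names are my own choices. By default the function also lowercases the words and strips punctuation.

```r
library(dplyr)
library(tidytext)

# Stand-in tibble invented for illustration.
peter_df <- tibble(line = 1:2,
                   text = c("Peter flew to the window.",
                            "Wendy saw Peter!"))

# Split the text column into one word per row.
peter_words <- peter_df %>%
  unnest_tokens(output = word, input = text)

print(peter_words)
```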

An anti_join() is used to remove stopwords from peter_words.
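A sketch of that anti_join(), assuming the stop_words data set that comes with tidytext and a word column named word. The input words are invented stand-ins.

```r
library(dplyr)
library(tidytext)

# Stand-in words invented for illustration, one word per row.
peter_words <- tibble(word = c("peter", "flew", "to", "the",
                               "window", "wendy", "saw", "peter"))

# Keep only the rows whose word is NOT found in the stop_words table.
peter_words <- peter_words %>%
  anti_join(stop_words, by = "word")

print(peter_words)
```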

The count() function with the %>% pipe operator from the dplyr package is used to obtain counts of the words.
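The counting step can be sketched like this, on a handful of invented words. Setting sort = TRUE puts the most frequent words at the top, and the counts land in a column named n.

```r
library(dplyr)

# Stand-in words invented for illustration.
peter_words <- tibble(word = c("peter", "wendy", "peter",
                               "hook", "peter", "wendy"))

# Count how often each word occurs, most frequent first.
peter_counts <- peter_words %>%
  count(word, sort = TRUE)

print(peter_counts)
```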

The data now has a column for words and a second column for the word counts. A bar graph can be prepared with the ggplot2 function ggplot().
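One way to draw such a bar graph is sketched below. The counts are made up for illustration and are not the real Peter Pan counts; the colours, labels and horizontal layout are my own choices as well.

```r
library(dplyr)
library(ggplot2)

# Hypothetical word counts, invented for illustration only.
peter_counts <- tibble(word = c("peter", "wendy", "hook", "john", "time"),
                       n    = c(400, 350, 170, 130, 120))

peter_plot <- peter_counts %>%
  mutate(word = reorder(word, n)) %>%        # order the bars by count
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "steelblue") +             # one bar per word
  geom_text(aes(label = n), hjust = -0.2) +  # print the count next to each bar
  coord_flip() +                             # horizontal bars are easier to read
  labs(x = "Word", y = "Count",
       title = "Most Frequent Words In Peter Pan")

print(peter_plot)
```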

 

The geom_text() part of the code is key for displaying the counts on the bars. This eliminates the guesswork for the viewer. The word said was featured in the wordcloud but it does not appear here.

 


References

  • https://stackoverflow.com/questions/8175912/load-multiple-packages-at-once
  • http://www.textfiles.com/etext/FICTION/barrie-peter-277.txt
  • Text Mining with R: A Tidy Approach by Julia Silge & David Robinson [Online Book]
  • R Graphics Cookbook by Winston Chang
