Making Word Clouds In R Using Twitter Data

Hi there. I stumbled upon two great YouTube videos showing how to make word clouds in R using Twitter data. My guide will be very similar to the videos, but instead of looking for tweets about Liverpool Football Club, I will be finding math tweets.

The reference videos are here: Part One, Part Two.


Topics

  1. Setup
  2. Access To Twitter Data In R
  3. Extracting Tweets
  4. Formatting and Cleaning the Tweets Data
  5. Making The Wordcloud


Setup

Before getting into the R/RStudio software, we need to set up access to Twitter data.

Here are the steps to set up Twitter access and retrieve the necessary codes for access through R.

  1. Go to https://apps.twitter.com
  2. Log in with a Twitter account. (Create a new Twitter account if needed.)
  3. Create a new app if you do not have one.
  4. Enter a name and description for the app.
  5. Use http://test.de/ for the website.
  6. Go to the Keys and Access Tokens tab (third from the left).
  7. Create an access token.

For R, you will need your:

  1. Consumer Key (API Key)
  2. Consumer Secret (API Secret)
  3. Access Token
  4. Access Token Secret


Access To Twitter Data In R

After you know your four codes, start R and load the libraries RCurl, twitteR, wordcloud and tm. If you need to install a package, you can use install.packages("package").

Next, set up your code similar to this:
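Here is a minimal sketch of that setup; the quoted strings are placeholders that you replace with your own four codes.

  # Load the required packages (use install.packages("package") first if needed)
  library(RCurl)
  library(twitteR)
  library(tm)
  library(wordcloud)

  # Placeholder values, replace these with your own keys and tokens
  consumer_key    <- "YOUR_CONSUMER_KEY"
  consumer_secret <- "YOUR_CONSUMER_SECRET"
  access_token    <- "YOUR_ACCESS_TOKEN"
  access_secret   <- "YOUR_ACCESS_TOKEN_SECRET"

  # Authenticate with Twitter; this is the "login" step from R
  setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)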

The main function above is the setup_twitter_oauth() function. This is needed to connect to Twitter and extract tweets into R. Think of it like a login to Twitter from R.


Extracting Tweets

To extract tweets into R, the searchTwitter() function will do the trick. For full details and documentation, type ?searchTwitter.

I will be searching for 1000 recent tweets in English containing the word math. The more tweets you search for, the longer it takes, but you can (theoretically) create a bigger word cloud.
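Something along these lines does the search; the variable name math_tweets is my own choice.

  # Grab 1000 recent English-language tweets containing the word "math"
  math_tweets <- searchTwitter("math", n = 1000, lang = "en", resultType = "recent")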

We can also verify that the searched tweets are stored as a list.

We can also take a look at some of the tweets we retrieved. Content does vary, and I am not responsible for other people’s opinions and “language”.
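Assuming the tweets were saved as math_tweets, we can check the type and peek at a few:

  # Confirm that searchTwitter() returned a list
  class(math_tweets)

  # Look at the first few tweets (content will vary)
  head(math_tweets, 3)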


Formatting and Cleaning the Tweets Data

We now convert the tweets into character strings using the sapply() function. The sapply() function takes a list or vector and a user-specified function, and applies that function to every element of the list/vector. (It is a more condensed version of a loop.)
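Assuming the tweets are stored in math_tweets, the text of each status object can be pulled out like this:

  # Apply getText() to every status object, giving a character vector of tweet text
  math_text <- sapply(math_tweets, function(x) x$getText())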

The next step is to create a collection of documents/tweets. This collection is called a corpus. Once we have a corpus, we need to clean up the tweets so that the word cloud can pick out the key words.
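With the tm package loaded, building the corpus looks like this (math_text is the character vector from the previous step):

  # Wrap the character vector in a source object and build the corpus
  math_corpus <- Corpus(VectorSource(math_text))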

Now that we have a corpus object, we use the tm_map() functions to clean the tweets. We will remove punctuation, convert words to lowercase, remove filler English words (stopwords) such as the, a, an and me, and remove numbers, slang words and extra white space.
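A sketch of the cleaning steps with tm_map(); the slang/filler word list in the second-to-last step is just a guess at what you might want to drop.

  # Lowercase, then strip punctuation, numbers, stopwords and extra whitespace
  math_clean <- tm_map(math_corpus, content_transformer(tolower))
  math_clean <- tm_map(math_clean, removePunctuation)
  math_clean <- tm_map(math_clean, removeNumbers)
  math_clean <- tm_map(math_clean, removeWords, stopwords("english"))
  math_clean <- tm_map(math_clean, removeWords, c("rt", "amp", "httpstco"))  # assumed slang/filler list
  math_clean <- tm_map(math_clean, stripWhitespace)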


Making The Wordcloud

Making a word cloud in R is quite easy with the wordcloud library. You just need the wordcloud() function. When in doubt, it is best to check the documentation; in this case, type ?wordcloud to figure out the arguments/parameters needed and their default values.

I am making a colourful word cloud instead of one with just a single colour. I use rainbow(50) for the colours along with random.color = TRUE; having random.color = FALSE does not really give the desired rainbow effect. I use min.freq = 20 so that words with frequencies less than 20 are not included in the word cloud.
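Putting it together, the call looks roughly like this (wordcloud() accepts the cleaned corpus directly when no frequencies are supplied):

  # Rainbow palette, random colours per word, drop words appearing fewer than 20 times
  wordcloud(math_clean, min.freq = 20, colors = rainbow(50), random.color = TRUE)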

[Word cloud generated from the math tweets]

Arguments to wordcloud() such as min.freq, scale, colors and so on can be altered. My word cloud will differ from yours, as the gathered tweets may not be the same.

The word cloud above does not seem to have that many math-related terms. That is not something we can control, as we go with the information we have.


References

The featured image is from http://thinktostart.com/wp-content/uploads/2013/05/twitter_auth.png.
