Text Mining/Analysis On A Few Rap Lyrics With R Programming

Hi there. On this page, I share some work in the R programming language where I analyze rap lyrics with text mining. Results are shared through code output and plots.

References include the R Graphics Cookbook by Winston Chang and Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. Lyrics were taken from a lyrics website (azlyrics.com I think).



  • Introduction
  • Fort Minor – Remember The Name Lyrics Analysis
  • Eminem – Lose Yourself Lyrics Analysis



Rap music is one option when it comes to analyzing text. Rap lyrics generally contain more words than lyrics from other musical genres.

On this page, I share some experimental work in the R programming language where I analyze rap lyrics from the tracks Fort Minor – Remember The Name and Eminem – Lose Yourself.

The R packages that I use are dplyr for data wrangling and data manipulation, ggplot2 for plotting results, tidytext for text analysis, and tidyr for data cleaning and data formatting.



Fort Minor – Remember The Name Lyrics Analysis

Fort Minor is a side project of Linkin Park member Mike Shinoda, who provides rap vocals on some Linkin Park tracks. Fort Minor’s Remember The Name single was released in 2005 and was featured in the video game NBA Live 06, the 2006 & 2007 NBA Playoffs and the 2008 NBA draft.

Source: https://www.rarerecords.com.au/wp-content/uploads/2016/05/FORT-MINOR-Remember-The-Name.jpg




The key function for obtaining word counts is unnest_tokens() from the tidytext package. It puts each word from the song lyrics in its own row.
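
Here is a minimal sketch of that tokenizing step. It assumes the lyrics sit in a data frame with one lyric line per row; the lyrics table and its column names (line, text) are my own placeholders, and the text is just the chorus quoted further down this page.

```r
library(dplyr)
library(tidytext)

# Assumed input: one lyric line per row (only the chorus quoted in this post).
lyrics <- tibble(
  line = 1:4,
  text = c(
    "This is ten percent luck, twenty percent skill,",
    "Fifteen percent concentrated power of will",
    "Five percent pleasure, fifty percent pain",
    "And a hundred percent reason to remember the name!"
  )
)

# unnest_tokens() splits the text column into one word per row.
# By default it lowercases the words and strips punctuation.
lyrics_words <- lyrics %>%
  unnest_tokens(output = word, input = text)

head(lyrics_words)
```

The result keeps the line column, so each word can still be traced back to the lyric line it came from.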


Word Counts In Fort Minor – Remember The Name

There are words in the English language that do not offer much meaning on their own but help sentences flow. These words are called stop words. An anti_join() from the dplyr package in R is used to remove stop words from the Remember The Name lyrics.
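
A small sketch of the stop word removal, assuming the words are already one per row. The lyrics_words table here is a hand-typed placeholder; stop_words is the data set that ships with tidytext.

```r
library(dplyr)
library(tidytext)

# Assumed input: one word per row, as produced by unnest_tokens().
lyrics_words <- tibble(
  word = c("this", "is", "ten", "percent", "luck",
           "twenty", "percent", "skill")
)

# anti_join() keeps only the rows of lyrics_words whose word
# has no match in the stop_words table.
lyrics_clean <- lyrics_words %>%
  anti_join(stop_words, by = "word")

lyrics_clean
```

Words such as this and is drop out, while percent and skill stay.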


The count() function is used to obtain the word counts. These results are plotted as a sideways bar graph with the ggplot2 package functions.
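
The counting and plotting step could look roughly like this. The lyrics_clean table is a hypothetical stand-in for the stop-word-filtered words (chorus only), and the axis labels and title are my own choices.

```r
library(dplyr)
library(ggplot2)

# Hypothetical input: filtered words from the chorus, one per row.
lyrics_clean <- tibble(
  word = c("percent", "luck", "percent", "skill", "percent", "power",
           "percent", "pleasure", "percent", "pain", "percent", "reason")
)

# count() tallies each word; sort = TRUE puts the most frequent first.
word_counts <- lyrics_clean %>%
  count(word, sort = TRUE)

# reorder() sorts the bars by count and coord_flip() turns the
# bar graph sideways so the words are easy to read.
word_plot <- word_counts %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Count", title = "Word counts (chorus only)")

# Calling print(word_plot) draws the chart.
```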



In Fort Minor – Remember The Name, the word percent is used a lot. The word skill comes in at second place with a count of 6. Other common words include number words, reason, power, pleasure, and pain. These common words pretty much come from the chorus.

This is ten percent luck, twenty percent skill,

Fifteen percent concentrated power of will

Five percent pleasure, fifty percent pain

And a hundred percent reason to remember the name!


Bigrams: Two Word Phrases


Just like with the single words, we want to remove stop words from the bigrams. We can’t easily remove the stop words while the two word phrases are in one column. The separate() function from R’s tidyr package is used to split each bigram into two separate words. From R’s dplyr package, the filter() function is then used to remove bigrams that contain a stop word.
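
A sketch of the bigram steps described above, under the assumption that the lyrics are stored one line per row. The table and column names (lyrics, word1, word2) are placeholders of mine, and only two chorus lines are used.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Assumed input: two chorus lines, one per row.
lyrics <- tibble(
  line = 1:2,
  text = c(
    "This is ten percent luck, twenty percent skill,",
    "Fifteen percent concentrated power of will"
  )
)

# token = "ngrams" with n = 2 returns overlapping two word phrases.
lyrics_bigrams <- lyrics %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Split each bigram into two columns, then drop any row where
# either word is a stop word.
bigrams_filtered <- lyrics_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)

bigrams_filtered
```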


The filtered words are put back together with the unite() function, along with their counts. A bar graph can then be generated with ggplot2.
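
As a rough illustration of the uniting and counting step, here is a sketch that starts from a hand-made table of filtered word pairs. The pairs and their frequencies are hypothetical and are not the actual results from the song.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical filtered word pairs (stand-in for the filter() output).
bigrams_filtered <- tibble(
  word1 = c("twenty", "percent", "percent", "twenty", "percent"),
  word2 = c("percent", "skill", "pain", "percent", "skill")
)

# Count each pair, then glue the two words back into one column.
bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE) %>%
  unite(bigram, word1, word2, sep = " ")

# Sideways bar graph of the bigram counts.
bigram_plot <- bigram_counts %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(x = bigram, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Bigram", y = "Count")
```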



The bigram results end up being not as interesting. The common bigrams are pretty much from the chorus.


Trigrams: Three Word Phrases

Out of curiosity and experimentation, I wanted to look at three word phrases, which are called trigrams.

To get trigrams from the unnest_tokens() function, you need to change the n = argument to 3.
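
For example, a trigram version of the tokenizing call might look like this. The one-row lyrics table is a placeholder built from the first chorus line.

```r
library(dplyr)
library(tidytext)

# Assumed input: the first chorus line as a one-row table.
lyrics <- tibble(
  line = 1,
  text = "This is ten percent luck, twenty percent skill,"
)

# Same call as for bigrams, but with n = 3 for three word phrases.
lyrics_trigrams <- lyrics %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)

lyrics_trigrams
```

An eight word line gives six overlapping trigrams, the last one being twenty percent skill.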



The code for this step is very similar to the code for the bigrams. The separate() function from tidyr is used to split each trigram into three words. The filter() function is used to remove the trigrams that contain stop words. Lastly, the unite() function is used to put the filtered words back into trigrams.
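
A sketch of those three steps on a few hand-picked trigrams from the chorus. The lyrics_trigrams table is a placeholder, and the column names word1 to word3 are my own.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical input: a handful of trigrams, one per row.
lyrics_trigrams <- tibble(
  trigram = c("twenty percent skill", "percent concentrated power",
              "concentrated power of", "power of will",
              "reason to remember")
)

trigrams_filtered <- lyrics_trigrams %>%
  # Split into three word columns.
  separate(trigram, into = c("word1", "word2", "word3"), sep = " ") %>%
  # Drop any trigram that contains a stop word.
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  # Glue the remaining words back into a single trigram column.
  unite(trigram, word1, word2, word3, sep = " ")

trigram_counts <- trigrams_filtered %>%
  count(trigram, sort = TRUE)
```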





Filtered trigrams can be plotted with the help of the ggplot2 graphics.

Like in the results from the bigrams, the common trigrams feature words from the chorus. The top trigram is twenty percent skill. (What the heck is fuckin nihilist porcupine?)


Sentiment Analysis

The sentiment analysis done here looks at whether the words in the lyrics are positive or negative. Three commonly used lexicons are nrc, AFINN and bing. Here, the nrc lexicon and bing lexicon results are presented.
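
A minimal sketch of how words can be matched to a lexicon, using the bing lexicon because it is bundled with tidytext. The lyrics_words table is a placeholder with a few chorus words. As far as I know, get_sentiments("nrc") works the same way but needs the textdata package and a one-time download, so it is left out of this sketch.

```r
library(dplyr)
library(tidytext)

# Hypothetical input: a few chorus words, one per row.
lyrics_words <- tibble(
  word = c("percent", "luck", "skill", "power",
           "pleasure", "pain", "reason")
)

# inner_join() keeps only the words found in the lexicon and
# attaches their sentiment label (positive or negative).
lyrics_sentiment <- lyrics_words %>%
  inner_join(get_sentiments("bing"), by = "word")

lyrics_sentiment
```

Words that are not in the lexicon, such as percent, simply drop out of the result.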


nrc Lexicon



With the use of the nrc lexicon, the sentiment analysis results show that there is a near 50-50 balance of negative to positive words. This is somewhat misleading as the rapping in the track is quite aggressive in tone and there are quite a few swear words.


bing Lexicon






The results under the bing lexicon are much different from those under the nrc lexicon. Under the bing lexicon, there are more negative scoring words, although the top positive word has a higher count than the top negative word.
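
The two comparisons above (how many distinct words fall on each side, and which word tops each side) can be sketched as follows. The scored table is made up for illustration, including its sentiment labels, and is not the actual output for the song.

```r
library(dplyr)
library(ggplot2)

# Hypothetical scored words (labels and frequencies are made up).
scored <- tibble(
  word = c("skill", "skill", "skill", "pain", "pain", "hate", "lose"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative", "negative")
)

# Count of each word within its sentiment.
word_sentiment_counts <- scored %>%
  count(word, sentiment, sort = TRUE)

# Number of distinct words per sentiment.
distinct_per_sentiment <- word_sentiment_counts %>%
  count(sentiment, name = "n_words")

# Most frequent word within each sentiment.
top_per_sentiment <- word_sentiment_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()

# One sideways bar graph panel per sentiment.
sentiment_plot <- word_sentiment_counts %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  coord_flip() +
  labs(x = "Word", y = "Count")
```

In this toy table there are more distinct negative words, yet the top positive word has the higher count, which is the same kind of pattern described for the bing results.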


Eminem – Lose Yourself Lyrics Analysis

For the second rap song, I have chosen Lose Yourself by Eminem. This track was featured in the movie 8 Mile (2002).

Source: http://mimo.recordingconnection.com/wp-content/uploads/2013/09/eminem-lose-yourself-628x628.jpg

The R code here is not much different from the code for Fort Minor – Remember The Name. (There are no trigrams though.)



Word Counts In Eminem – Lose Yourself



Word Counts Bar Graph Plot


The most frequent word from Lose Yourself is shot. In addition, you know you have a rap song when you have words such as yo and da. Some other high frequency words in the track include miss, lose, opportunity, moment, lifetime and chance.


Bigrams In Lose Yourself




Bigrams Plot

The most frequent bigram is lifetime yo, followed by da da. These bigram counts ended up being not too interesting.


Sentiment Analysis On Eminem – Lose Yourself

The nrc and bing lexicons are used again.


nrc Lexicon


Under the nrc lexicon, there is a near 50-50 split on positive and negative words in Eminem’s Lose Yourself. The top positive word is opportunity and the top negative word is shot. I am skeptical of the word music being a positive word.


bing Lexicon



With the bing lexicon, the results are more skewed to the negative side than with the nrc lexicon.
