R Programming – Text Mining/Analysis On Paul Van Dyk feat. Plumb – I Don’t Deserve You Music Track

Hi there. Out of my own curiosity, I wanted to do some text mining/analysis on the (trance/dance) music track I Don’t Deserve You by Paul Van Dyk featuring the vocals of Plumb. There is another version by Plumb herself named Don’t Deserve You.

I first heard this track around the end of 2013 on Armin Van Buuren’s A State Of Trance radio show. Back then, I did not think much of the track, but I got into it recently in 2018. My favourite version is the trance remix by Giuseppe Ottaviani.

 

 

The lyrics for the two tracks are very similar except for one part. On this page, I analyze the lyrics from the Paul Van Dyk version.

Good resources/references include the R Graphics Cookbook by Winston Chang and Tidy Text Mining With R by Julia Silge and David Robinson.

 


Sections

Loading The Lyrics Into R
Word Counts In Paul Van Dyk feat. Plumb – I Don’t Deserve You
Sentiment Analysis
Bigrams Count

Loading The Lyrics Into R

The lyrics for Paul Van Dyk feat. Plumb – I Don’t Deserve You were taken from azlyrics.com. I have copied and pasted the lyrics into a .txt file (with Notepad++).

I first load the dplyr, ggplot2, tidytext and tidyr packages into R. The dplyr and tidyr packages are for data manipulation, the ggplot2 package is for data visualization, and the tidytext package is for word counts and text mining.
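A minimal sketch of this setup step (assuming the four packages are already installed):

# Load the packages for data manipulation, plotting and text mining.
library(dplyr)
library(tidyr)
library(ggplot2)
library(tidytext)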

 

 

With the readLines() function in R, you can read in .txt files. I read the lyrics .txt file into R and convert it into a data frame. The head() function in R allows for partial printing of an object.
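A sketch of this loading step; the file name IDontDeserveYou_lyrics.txt is a placeholder for the actual .txt file, and lyrics_df is just an illustrative name for the data frame:

# Read the lyrics file and place each line into a one-column data frame.
lyrics <- readLines("IDontDeserveYou_lyrics.txt")

lyrics_df <- tibble(text = lyrics)

head(lyrics_df)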

 

 

The key function needed for text analysis is unnest_tokens(). It splits the text into individual words so that each row of the resulting data frame contains one word.
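A sketch of the tokenization, continuing from the lyrics_df data frame above:

# Split the text column so that each row holds a single word.
lyrics_words <- lyrics_df %>%
  unnest_tokens(word, text)

head(lyrics_words)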

 

 


Word Counts In Paul Van Dyk feat. Plumb – I Don’t Deserve You

There are words in the English language that are not useful in terms of meaning. Words such as the, and, me, you, myself, and of allow for sentence flow but carry little meaning on their own. These are known as stop words, and they are removed before counting the remaining words.
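A sketch of removing the stop words and counting what is left; stop_words is the stop word list that ships with the tidytext package:

# Remove stop words, then count the remaining words.
word_counts <- lyrics_words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

head(word_counts)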

 

From the head() output, the words deserve and love each have a count of 6. To better display the results, it is preferable to use a bar graph from R’s ggplot2 package.
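A sketch of the bar graph; keeping only words that appear at least twice is my own assumption to keep the plot readable:

# Horizontal bar graph of the most frequent non-stop words.
word_counts %>%
  filter(n >= 2) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Count",
       title = "Word Counts In I Don't Deserve You")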

 

 


Sentiment Analysis

In R’s tidytext package, there are three main lexicons for analyzing the sentiment/feelings of words: nrc, AFINN and bing. The code below applies each of the three lexicons to the words from the lyrics.

 

nrc Lexicon
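A sketch of scoring the lyrics against the nrc lexicon (in recent tidytext releases the nrc lexicon is downloaded through the textdata package):

# Keep only the words that appear in the nrc lexicon and count their sentiment categories.
lyrics_words %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(word, sentiment, sort = TRUE)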

 

 

 

bing Lexicon
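A sketch with the bing lexicon, which labels each matched word as positive or negative:

# Keep only the words that appear in the bing lexicon and count them by sentiment.
lyrics_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)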

 

 

AFINN Lexicon
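A sketch with the AFINN lexicon, which assigns each matched word an integer score from -5 to 5; the score column is called value in recent tidytext releases (older releases call it score):

# Keep only the words that appear in the AFINN lexicon and count them by their scores.
lyrics_words %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  count(word, value, sort = TRUE)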

 

 


Bigrams Count

Analyzing single words may not be enough. This section deals with analyzing two-word phrases, or bigrams. In the unnest_tokens() call, the key arguments are token = “ngrams” and n = 2.
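A sketch of the bigram tokenization, starting again from the lyrics_df data frame:

# Tokenize the lyrics into two-word phrases (bigrams).
lyrics_bigrams <- lyrics_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

head(lyrics_bigrams)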

 

For this case, I have decided not to remove stop words from the bigrams, as doing so would remove too many of them. The separate() function from R’s tidyr package takes a value and splits it into parts based on a separator argument. I split each bigram into its two separate words. After the separation, I obtain counts with the count() function.
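A sketch of the separate and count steps; word1 and word2 are the column names I am assuming for the two halves of each bigram:

# Split each bigram into its two words, then count the word pairs.
bigram_counts <- lyrics_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)

head(bigram_counts)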

Now that the counts are obtained, the unite() function is used to combine the two words into bigrams along with their counts.
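A sketch of the unite step:

# Recombine the two words into a single bigram column, keeping the counts.
bigrams_united <- bigram_counts %>%
  unite(bigram, word1, word2, sep = " ")

head(bigrams_united)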

 

A plot of the bigram counts can be made with ggplot2 graphics in R.
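A sketch of the plot, again keeping only bigrams that appear at least twice (my own cut-off):

# Horizontal bar graph of the most frequent bigrams.
bigrams_united %>%
  filter(n >= 2) %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(x = bigram, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Bigram", y = "Count",
       title = "Bigram Counts In I Don't Deserve You")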

 

 

It is expected that the bigram don’t deserve has a high count, as it is in the song title. Since stop words are not removed from the bigrams, the results in the plot are not very interesting or informative. (We also have a small sample of text from the lyrics.)
