Using R Programming For Text Mining/Analysis on Cancer Survivor Stories

Hi there. On this page, I use the R programming language to conduct text mining and analysis on cancer survivor stories. I want to find the most common words in the stories. In addition, sentiment analysis is used to gauge whether the words in the stories are positive or negative.

 




Introduction

In R, I load the packages needed for data cleaning, data manipulation, text mining, text analysis and plotting.

The cancer survivor stories are taken from Cancer Treatment Centers Of America (CTCA) at the website cancercenter.com. The stories are told by the cancer survivors themselves. The pages contain some subheadings and titles, which were not copied. Only the stories themselves were copied and pasted into a text file called cancer_survivor_stories.txt.

This text file is read into R with the file() and readLines() functions. Do remember to set your working directory to the folder where the text file is.
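As a rough sketch of the reading step, the code below writes two made-up lines to a temporary file and reads them back with file() and readLines(). The temporary file and its contents are stand-ins I made up so the example runs anywhere; with the real data you would point the path at cancer_survivor_stories.txt in your working directory.

```r
# Stand-in for cancer_survivor_stories.txt (made-up text, assumption).
stories_path <- tempfile(fileext = ".txt")
writeLines(c("I was diagnosed with cancer in 2010.",
             "My care team at CTCA helped me."),
           stories_path)

con <- file(stories_path, open = "r")  # open a read-only connection
survivor_stories <- readLines(con)     # one vector element per line of text
close(con)                             # release the connection when done

head(survivor_stories)
```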

Do note that some of the pages contain more than one cancer survivor story. The types of cancers from the survivor stories do vary. Cancer types include:

  • Leukemia
  • Breast Cancer
  • Prostate Cancer
  • Lung Cancer
  • Stomach Cancer
  • Brain Cancer
  • Ovarian Cancer
  • Bladder Cancer
  • Melanoma
  • Colorectal Cancer

 

 


Wordclouds On Most Common Words

In this section, I use R to generate wordclouds of the most common words. The survivor_stories object is passed to the VectorSource() function, and the result is passed to the Corpus() function.

Once you have the Corpus object, the tm_map() functions can be used to clean up the text. In the code below, I convert the text to lowercase, remove numbers and strip whitespace.

 

In tm_map(), you can also remove English stopwords such as and, the, of, much, me, myself, you, etc. Pass removeWords and stopwords("english") as arguments.
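The cleaning steps described above could look something like the following. This is a minimal sketch and not necessarily the exact code used for the results on this page. The two sentences and the name stories_corpus are made up for illustration.

```r
library(tm)

# Made-up stand-in for the real survivor_stories object (assumption).
survivor_stories <- c("I was diagnosed with Cancer in 2010.",
                      "The care   team at CTCA helped me and my family.")

# Build the corpus from the character vector.
stories_corpus <- Corpus(VectorSource(survivor_stories))

# Clean up the text step by step.
stories_corpus <- tm_map(stories_corpus, content_transformer(tolower))       # lowercase
stories_corpus <- tm_map(stories_corpus, removeNumbers)                      # drop digits
stories_corpus <- tm_map(stories_corpus, removeWords, stopwords("english"))  # drop stopwords
stories_corpus <- tm_map(stories_corpus, stripWhitespace)                    # collapse spaces

# Inspect the first cleaned document.
content(stories_corpus[[1]])
```

Stripping whitespace last is a small design choice: removing numbers and stopwords leaves gaps behind, and a final stripWhitespace pass collapses them.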

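From investigation of the texts, there is this weird â€™ thing which appears instead of the apostrophe in words like don’t, they’re and won’t. (This is typically what a curly apostrophe stored as UTF-8 looks like when the text is read with a different encoding.) This strange thing is removed in the code further below.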

Now, the stories are converted into a term document matrix and then into a data frame.
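One common way to go from a corpus to a data frame of word frequencies is sketched below. The data frame name d and its word column follow the text; the freq column name, the toy documents and the intermediate names are my own assumptions.

```r
library(tm)

# Toy corpus standing in for the cleaned stories (assumption).
docs <- Corpus(VectorSource(c("cancer treatment care", "cancer care team")))

tdm <- TermDocumentMatrix(docs)                    # rows = terms, columns = documents
m <- as.matrix(tdm)                                # dense matrix of counts
word_freqs <- sort(rowSums(m), decreasing = TRUE)  # total count per term, largest first

# One row per word with its overall frequency.
d <- data.frame(word = names(word_freqs), freq = word_freqs)
head(d)
```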

This next section of code is focused on removing this â€™ thing. After some research online and a lot of trial and error, I have come up with this solution. (I don’t think it is perfect but it does the job.)

I first check the structure of the data frame d and find out that the word column is a column of factors. I want this column to be a column of characters or strings. (The output for str(d) is omitted to save space.)

 

With the use of a for loop and the gsub() function in R, I replace the â€™ thing wherever it appears in the word column of the data frame d. The replacement for â€™ is the apostrophe ' as intended.
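Here is a small sketch of that clean-up on a made-up version of d. The rows and the freq column are invented for illustration. The garbled sequence is written with \u escapes so the code does not depend on how the script file itself is encoded.

```r
# The three garbled characters (â, €, ™) written as Unicode escapes.
bad_apostrophe <- "\u00e2\u20ac\u2122"

# Made-up data frame with a factor word column, as described above (assumption).
d <- data.frame(word = c(paste0("don", bad_apostrophe, "t"),
                         "cancer",
                         paste0("they", bad_apostrophe, "re")),
                freq = c(5, 10, 3),
                stringsAsFactors = TRUE)

# Convert the factor column to plain character strings.
d$word <- as.character(d$word)

# Replace the garbled sequence with a plain apostrophe, row by row.
for (i in seq_len(nrow(d))) {
  d$word[i] <- gsub(bad_apostrophe, "'", d$word[i], fixed = TRUE)
}

d$word
```

Since gsub() is vectorized, the loop could also be written as a single call, d$word <- gsub(bad_apostrophe, "'", d$word, fixed = TRUE), with the same result.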

(There is one row with â€ but I have decided not to remove this row as this â€ thing won’t show up in the bar graph results.)

Now that the data is cleaned up, wordclouds can be generated with the wordcloud() functions.

 

This particular wordcloud is quite big and contains a good variety of words. The most common words include cancer, treatment, ctca, care as those words are the largest in the wordcloud. (There is this â€ thing in the wordcloud which was not removed.)

To achieve a smaller wordcloud, you can raise the minimum frequency and lower the maximum words inside the wordcloud() function.
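A hedged sketch of a smaller wordcloud is shown below. The words and counts are made up, and the particular values for min.freq and max.words are only examples, not the settings used for the wordclouds on this page.

```r
library(wordcloud)
library(RColorBrewer)

# Made-up word frequencies standing in for the real data frame d (assumption).
d <- data.frame(word = c("cancer", "treatment", "ctca", "care", "team", "hope"),
                freq = c(40, 25, 22, 18, 6, 3))

set.seed(1234)  # make the word placement reproducible

wordcloud(words = d$word, freq = d$freq,
          min.freq = 5,          # raise this to drop rarer words
          max.words = 4,         # lower this to show fewer words
          random.order = FALSE,  # most frequent words go in the centre
          colors = brewer.pal(8, "Dark2"))
```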

 

After all the tm_map() preprocessing and obtaining the data frame from the term document matrix, we can display the most common words with the ggplot2 package. The mutate() function is used to order the words from most common to least common. Aesthetic add-on functions in ggplot2 include labs(), theme() and geom_text().
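The plotting idea can be sketched as follows, with made-up counts. The names plot_data and p, the colours and the labels are my own choices for illustration. The key step is reorder() inside mutate(), which orders the bars by frequency.

```r
library(dplyr)
library(ggplot2)

# Made-up word counts (assumption).
d <- data.frame(word = c("care", "cancer", "ctca", "treatment"),
                freq = c(18, 40, 22, 25))

# Order the word factor by frequency so the bars come out sorted.
plot_data <- d %>%
  mutate(word = reorder(word, freq))

p <- ggplot(plot_data, aes(x = word, y = freq)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = freq), hjust = -0.2) +      # print the count next to each bar
  coord_flip() +                                    # sideways bars
  labs(x = "Word", y = "Count", title = "Most Common Words") +
  theme(plot.title = element_text(hjust = 0.5))     # centre the title

print(p)
```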

 

The top word from the stories is cancer. The second most common word is ctca, which refers to the Cancer Treatment Centers Of America (CTCA). Other notable common words which are related to the topic of cancer include:

  • treatment
  • care
  • chemotherapy
  • surgery
  • doctor
  • hospital
  • team
  • radiation
  • oncologist

 


Word Counts In Cancer Survivor Stories – A tidytext approach

The previous section had a sideways bar graph of the most common words with the use of tm_map(). This section looks at word counts from the tidytext approach. The main reference book is Text Mining With R – A Tidy Approach by Julia Silge and David Robinson.

 

Next, the unnest_tokens() function from the tidytext R package is used. This will convert the words in the text file in a way such that each row has one word.
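As an illustration, here is a minimal sketch of the tokenizing step on two made-up sentences. The names stories_df and survivor_words are assumptions. unnest_tokens() lowercases the text and strips punctuation by default.

```r
library(dplyr)
library(tidytext)

# Made-up stand-in for the lines read from the text file (assumption).
survivor_stories <- c("I was diagnosed with Cancer.",
                      "My care team helped me.")

# Put the text into a data frame, one row per line.
stories_df <- tibble(line = seq_along(survivor_stories),
                     text = survivor_stories)

# One row per word.
survivor_words <- stories_df %>%
  unnest_tokens(word, text)

survivor_words
```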

 

 

In any piece of text, there are words in the English language which make sentences flow but carry little or no meaning on their own. These words are called stop words. Examples include the, and, me, you, that, this.

R’s dplyr package provides the count() function for obtaining the count of each word.

Like in the previous section, there needs to be a way of removing strange characters such as the â character. From the dplyr and stringr packages the filter() and str_detect() functions are used to select rows that do not have the â character.
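Put together, these steps might look like the sketch below. I am assuming the stop words are dropped with anti_join() against the stop_words dataset that ships with tidytext, which is the approach in Text Mining With R. The toy words are made up, and the â character is written as \u00e2.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up one-word-per-row data, including one garbled word (assumption).
survivor_words <- tibble(word = c("the", "cancer", "cancer", "treatment",
                                  "don\u00e2\u20ac\u2122t", "and"))

survivor_wordcounts <- survivor_words %>%
  anti_join(stop_words, by = "word") %>%   # drop stop words
  count(word, sort = TRUE) %>%             # count each word, largest first
  filter(!str_detect(word, "\u00e2"))      # drop rows containing the â character

survivor_wordcounts
```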

 

Results of the word counts can be plotted with the use of ggplot2 graphics and functions.

Although the top two words are still cancer and ctca, this bar graph of the most common words is slightly different from the one in the previous section. Notable common words that are related to cancer include:

  • treatment
  • care
  • chemotherapy
  • surgery
  • doctor
  • hospital
  • team
  • oncologist
  • doctors
  • diagnosis
  • breast
  • medical
  • pain

This tidytext approach bar graph on common words includes the words breast, diagnosis, medical and pain.

 


Bigrams In The Cancer Survivor Stories

In the previous section, we looked at the most common words. This section looks at the most common bigrams in the cancer survivor stories. A bigram is a phrase which consists of two words.

 

As with single words, the stop words need to be removed from the bigrams. Removing stop words from bigrams takes a little more work. The separate() function from R’s tidyr package is used here. It splits each bigram into two separate words, the filter() function keeps only the bigrams in which neither word is a stop word, and the count() function gives the counts.
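The bigram steps can be sketched like this on two made-up sentences. The names stories_df, bigram_counts, word1 and word2 are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up text (assumption).
stories_df <- tibble(text = c("My care team chose a cancer treatment plan.",
                              "The cancer treatment helped and my care team was kind."))

bigram_counts <- stories_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # one bigram per row
  separate(bigram, c("word1", "word2"), sep = " ") %>%      # split into two columns
  filter(!word1 %in% stop_words$word,                       # keep a bigram only if
         !word2 %in% stop_words$word) %>%                   # neither word is a stop word
  count(word1, word2, sort = TRUE)                          # count each word pair

bigram_counts
```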

 

Now that the counts are obtained, the separated words can be gathered together with the use of the unite() function from R’s tidyr package.

In the bigrams, there is americaã‚ ctca with the weird character ã and i㢠ve. These weird characters need to be removed.
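A possible version of the unite and clean-up steps is sketched below on made-up counts. Dropping the affected rows with filter() and str_detect() is my assumption about how the weird characters are handled. The ã character is written as \u00e3.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up bigram counts, including one garbled row (assumption).
bigram_counts <- tibble(word1 = c("cancer", "america\u00e3\u201a", "care"),
                        word2 = c("treatment", "ctca", "team"),
                        n = c(10, 4, 7))

bigrams_united <- bigram_counts %>%
  unite(bigram, word1, word2, sep = " ") %>%  # glue the two words back together
  filter(!str_detect(bigram, "\u00e3"))       # drop bigrams containing ã

bigrams_united
```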

 

With the use of R’s ggplot2 package, a bar graph of the bigrams can be produced.

 

 

From the results, top bigrams include:

  • cancer treatment
  • care team
  • treatment centers
  • breast cancer
  • medical oncologist
  • lymph nodes
  • treatment plan
  • cancer diagnosis
  • ct scan
  • radiation treatment
  • radiation therapy

 


Sentiment Analysis

Sentiment analysis looks at a piece of text and determines whether the text is positive or negative. The lexicons determine the positivity or negativity of a piece of text. Three main lexicons are used in this analysis. The three lexicons are:

  • nrc
  • bing
  • AFINN

Keep in mind that these lexicons are subjective and are not perfect. For more information on these lexicons, please refer to section 2.1 of Text Mining With R by Julia Silge and David Robinson.

 

nrc Lexicon

The nrc lexicon tags words with one or more of the following sentiments:

  • positive
  • negative
  • anger
  • anticipation
  • disgust
  • fear
  • joy
  • sadness
  • surprise
  • trust

 

From the nrc lexicon, the sentiments of interest are positive and negative. Starting from the survivor_wordcounts object, inner_join() is used to match the words with the nrc sentiment words. A filter() call then extracts the words with a positive or negative sentiment.

Note that the word_labels_nrc part is for the labels in the facet_wrap() add-on function. This yields the labels Negative Words and Positive Words in the plot.

 

A horizontal bar graph is produced for words with a count over 70 from the cancer survivor stories.
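The nrc steps can be sketched as follows. In practice the lexicon would come from get_sentiments("nrc"), which needs the textdata package and a one-time download, so here I use a tiny hand-made stand-in called nrc_standin. The stand-in, the word counts and the object names other than survivor_wordcounts and word_labels_nrc are all made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up word counts (assumption).
survivor_wordcounts <- tibble(word = c("cancer", "hope", "doctor", "pain", "team", "biopsy"),
                              n = c(120, 90, 200, 150, 300, 40))

# Tiny stand-in for get_sentiments("nrc"); a word can carry several sentiments.
nrc_standin <- tibble(word = c("cancer", "hope", "hope", "doctor", "pain", "pain", "biopsy"),
                      sentiment = c("negative", "positive", "joy", "positive",
                                    "negative", "fear", "negative"))

# Facet labels shown in place of the raw sentiment names.
word_labels_nrc <- c(negative = "Negative Words", positive = "Positive Words")

nrc_words <- survivor_wordcounts %>%
  inner_join(nrc_standin, by = "word") %>%               # keep words found in the lexicon
  filter(sentiment %in% c("positive", "negative"),       # only the two sentiments of interest
         n > 70)                                         # only words with a count over 70

p <- ggplot(nrc_words, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y",
             labeller = as_labeller(word_labels_nrc)) +
  coord_flip() +
  labs(x = "Word", y = "Count")

print(p)
```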

 

 

The top negative words include cancer, radiation, diagnosis, pain, biopsy, feeling, tumor and disease. It is somewhat weird that the nrc lexicon puts the words feeling, spoke and immediately as negative (even if the lexicons are subjective).

Positive words include doctor, received, medical, found, information, feeling, hope and journey. The words found, received and feeling are more neutral words than positive words. Lexicons do not consider context and are subjective.

 

bing Lexicon

With the bing lexicon, words are categorized as either positive or negative.

 

With the use of ggplot2 graphics, a bar graph can be generated. A filter for word counts over 50 is applied.
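For the bing case, a minimal sketch of the join and the count filter is given below. The word counts are made up and the name bing_words is an assumption. The bing lexicon ships with tidytext, so get_sentiments("bing") should work without any extra download.

```r
library(dplyr)
library(tidytext)

# Made-up word counts (assumption).
survivor_wordcounts <- tibble(word = c("cancer", "support", "love", "pain", "team"),
                              n = c(120, 80, 60, 30, 300))

bing_words <- survivor_wordcounts %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # adds a sentiment column
  filter(n > 50)                                       # keep word counts over 50

bing_words
```

The result can then be passed to ggplot() in the same way as for the other bar graphs.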

 

 

 

There may be fewer negative words, but the top negative word is still cancer with a count of 1408. Positive words include helped, support, recommended, patient, love, strong and positive. The word patient is probably treated as the positive adjective (as in being patient) rather than the noun (as in a hospital patient).

 

AFINN Lexicon

In the AFINN lexicon, words are given a score from -5 to +5 (inclusive). Negative scores indicate negative words while positive scores indicate positive words. I have used the mutate() function after the inner_join() part to categorize words that have either a positive or negative score. This new column from mutate() is called is_positive and will help with the plots.
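The is_positive step can be sketched like this. The real lexicon would come from get_sentiments("afinn"), which needs the textdata package, so I use a small hand-made stand-in instead. The scores, the counts and the column name value are assumptions (older tidytext versions called the column score).

```r
library(dplyr)

# Made-up word counts (assumption).
survivor_wordcounts <- tibble(word = c("cancer", "pain", "care", "hope", "team"),
                              n = c(120, 60, 90, 70, 300))

# Tiny stand-in for get_sentiments("afinn"); scores are illustrative only.
afinn_standin <- tibble(word = c("cancer", "pain", "care", "hope"),
                        value = c(-1, -2, 2, 2))

afinn_words <- survivor_wordcounts %>%
  inner_join(afinn_standin, by = "word") %>%  # keep words that have a score
  mutate(is_positive = value > 0)             # TRUE for positive scores, FALSE otherwise

afinn_words
```

The is_positive column can then be used for the fill colour or the facets in the plot.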

 

 

Plotting the bar graph for the AFINN lexicon is not much different from the bing case.

 

 

Featured negative words from the AFINN lexicon are cancer, pain, tumor, difficult, hard, fight and lost. Top positive words include care, support, recommended, feeling, hope, strong, positive, healthy, god and faith. I like the AFINN lexicon results as the word care is included. Care plays a big part for those going through cancer treatment.

 


References/Resources

  • https://www.cancercenter.com/community/survivors/
  • R Graphics Cookbook By Winston Chang
  • https://stackoverflow.com/questions/3472980/ggplot-how-to-change-facet-labels
  • Text Mining With R by Julia Silge and David Robinson
  • https://stackoverflow.com/questions/13043928/selecting-rows-where-a-column-has-a-string-like-hsa-partial-string-match
  • https://stackoverflow.com/questions/24576075/gsub-apostrophe-in-data-frame-r
  • http://stat545.com/block022_regular-expression.html
  • https://rstudio-pubs-static.s3.amazonaws.com/265713_cbef910aee7642dc8b62996e38d2825d.html

 
