Analyzing Yoga Google Searches Data In R

Hi. There is a website that stores data from Google Trends. Topics include the Olympics, Eurovision, the Panama Papers and the major tennis tournaments. The site is at http://googletrends.github.io/data/.

The dataset I chose to explore is the one on yearly yoga searches by state in the United States.


Topics

Importing the Yoga Dataset

Cleaning the Yoga Data

Finding Some Information in the Yoga Data


Importing the Yoga Dataset

The link above has a large collection of datasets, and each of them has a Download link on the right where you can download the .csv file. You can download the csv into a folder and load the data from that folder into R. I’m taking a slightly faster route and importing the data directly from the web.

01-yogagoogler

We pass the web link to the url() function and store the result in the variable url. The read.csv() function is then called on url, with header = TRUE and sep = "," to indicate that the values are separated by commas.

We find the dimensions of the dataset using dim() and preview the first six rows (the default) using head(). The output shows that this dataset is “messy”.
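Here is a minimal sketch of the import step. So that it runs without an internet connection, it writes a tiny made-up CSV with the same layout as the yoga file to a temporary file and reads that instead; the variable names csv_lines, csv_path and yoga, the three states and all the numbers are my own assumptions for illustration, not the real data.

```r
# Hypothetical stand-in for the downloaded file: a blank first header cell,
# one description row, then one row per year (all values are invented).
csv_lines <- c(
  ',Alabama (us-al),New Jersey (us-nj),Vermont (us-vt)',
  '"Values show search interest per year in yoga and have been indexed to 100, where 100 is the maximum value",,,',
  '2014,20,35,60',
  '2015,22,38,66',
  '2016,25,40,70'
)
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_path)

# For the real data, the first argument would be url("<link to the .csv>").
yoga <- read.csv(csv_path, header = TRUE, sep = ",")

# read.csv() turns the blank header into X and replaces spaces, brackets
# and hyphens in the other headers with dots.
names(yoga)
```

Reading from a local path and reading from url() go through the same read.csv() call, so the rest of the workflow does not change.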

02-yogagoogler

Note: There is more to the head() output. I left most of it out to spare you the boredom.

The dimensions of the yoga dataset are 14 rows by 52 columns. This is a wide dataset. The column names are states.

We can rearrange the dataset so that there is a single column of states with their corresponding values. Also, the first column holds the years, except that its first row contains the text “Values show search interest per year in yoga and have been indexed to 100, where 100 is the maximum value”. I am not sure exactly what this search interest score means. I would think that the higher the score, the more frequent the searches. In that first row, every column from the second one onward contains NA. We can fix this.

We can also find more information about our column variables using the str() function in R. The full output is not shown as it is long.
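The three inspection functions mentioned so far can be sketched on a small stand-in. The data below is a hypothetical miniature of the yoga file (shortened description row, three invented states and values), so the printed numbers will not match the real 14 by 52 dataset.

```r
# Hypothetical miniature of the yoga file, read from an inline string.
csv_text <- paste(
  ",Alabama (us-al),New Jersey (us-nj),Vermont (us-vt)",
  "\"Values show search interest per year in yoga\",,,",
  "2014,20,35,60",
  "2015,22,38,66",
  "2016,25,40,70",
  sep = "\n"
)
yoga <- read.csv(text = csv_text, header = TRUE, sep = ",")

dim(yoga)    # number of rows, then number of columns
head(yoga)   # first six rows by default; head(yoga, 3) would show three
str(yoga)    # type and first few values of every column
```

In the str() output, the first column is not numeric because of the description text in its first row, while the state columns are integers with an NA at the top.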

03-yogagoogler


Cleaning the Yoga Data

In my statistics courses, I was mostly taught to analyze data in R and fit an appropriate model to it. My focus was mostly on statistical modeling and model diagnostics. The data we were given was ready for analysis.

Here, the data is not ready for analysis from the start. We have to format it so that it is ready for analysis and statistical work.

We start by removing the first row of data, which holds the “Values show search interest per year in yoga and have been indexed to 100, where 100 is the maximum value.” text and the NA values. (No output is shown here.)

You can also check to see if there are any NA/missing values left.

There are no missing values in the data, so that is good. We can proceed.
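A minimal sketch of these two steps in base R, on a hypothetical two-state stand-in for the imported data (the name yoga and the values are assumptions):

```r
# Hypothetical stand-in: row 1 holds the description text and NA values.
yoga <- data.frame(
  X = c("Values show search interest per year in yoga", "2014", "2015", "2016"),
  Alabama..us.al. = c(NA, 20L, 22L, 25L),
  Vermont..us.vt. = c(NA, 60L, 66L, 70L),
  stringsAsFactors = FALSE
)

yoga <- yoga[-1, ]        # negative index: keep every row except the first
rownames(yoga) <- NULL    # renumber the remaining rows from 1

sum(is.na(yoga))          # total number of missing values left
```

A result of zero from sum(is.na(yoga)) means there is nothing left to fill in or drop.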

We now fix the columns using the tidyr package in R. If tidyr is not installed on your system, run install.packages("tidyr"). (You may need to update to a newer version of R.)

In tidyr, the gather() function is used to “gather” the columns into rows. The column names (states) will go under a new column called State, and the values from those columns will go in a new column called Yoga.Index.Search.

The dimensions of this dataset are 663 rows (observations) by 3 columns (variables). It is now a long-format dataset instead of a wide one. The first column name, X, needs to be changed to Year.
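The reshaping can be sketched as follows, assuming tidyr is installed. The names yoga_wide and yoga_long and the values are hypothetical; only State, Yoga.Index.Search, X and Year come from the text above.

```r
library(tidyr)

# Hypothetical wide stand-in: one row per year, one column per state.
yoga_wide <- data.frame(
  X = c("2014", "2015", "2016"),
  Alabama..us.al. = c(20L, 22L, 25L),
  New.Jersey..us.nj. = c(35L, 38L, 40L),
  Vermont..us.vt. = c(60L, 66L, 70L),
  stringsAsFactors = FALSE
)

# Gather every column except X into State / Yoga.Index.Search pairs.
yoga_long <- gather(yoga_wide, key = "State", value = "Yoga.Index.Search", -X)

# Rename the first column from X to Year.
names(yoga_long)[1] <- "Year"

dim(yoga_long)   # 3 years times 3 states gives 9 rows, with 3 columns
```

The same arithmetic explains the real dataset: 13 yearly rows times 51 state columns gives 663 rows. Newer tidyr versions recommend pivot_longer() for this job, but gather() still works.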

04-yogagoogler

String Manipulation in R

The data looks better thus far, but we need to format the values under the State column. The ..us.xx. part at the end of each state name is not appealing.

The substr() command will be used to extract a substring from each value in State. The starting index is 1 (not 0 like in most other programming languages). The stop index is the number of characters minus 8. (I counted 8 characters in ..us.xx., where xx is a wildcard.)
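As a small illustration of the substr() step, here it is on three state labels in the ..us.xx. format (state_raw and state_trim are names I made up for the sketch):

```r
# Labels as they look after the import.
state_raw <- c("Alabama..us.al.", "New.Jersey..us.nj.", "Vermont..us.vt.")

# Keep characters 1 through (length - 8), dropping the 8-character suffix.
state_trim <- substr(state_raw, start = 1, stop = nchar(state_raw) - 8)

state_trim
```

Because nchar() is vectorized, the stop index is computed separately for each label, so names of different lengths are handled in one call.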

05-yogagoogler

It may not be shown here, but some state names have a dot between words. Examples include New.Jersey and North.Dakota. These dots will be replaced by a space using the gsub() function on the second column.

For backup purposes, save this clean dataset into a new variable.
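A sketch of the dot replacement and the backup copy, on a hypothetical three-row stand-in (yoga_long and the values are assumptions; yoga_clean is the name used later in this post). One thing to watch: in a regular expression a bare dot matches any character, so the sketch passes fixed = TRUE to treat the dot literally.

```r
# Hypothetical long-format stand-in with dotted state names.
yoga_long <- data.frame(
  Year = c("2016", "2016", "2016"),
  State = c("Alabama", "New.Jersey", "North.Dakota"),
  Yoga.Index.Search = c(25L, 40L, 30L),
  stringsAsFactors = FALSE
)

# Replace every literal dot in the second column with a space.
yoga_long[, 2] <- gsub(".", " ", yoga_long[, 2], fixed = TRUE)

# Keep a copy of the cleaned data under a new name.
yoga_clean <- yoga_long
```

Writing the pattern as "\\." without fixed = TRUE would give the same result.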


Finding Some Information in the Yoga Data

Now that the data is all “cleaned” up and is in a format ready for data analysis, we can find some information in our dataset.

To help us with data manipulation, the dplyr package in R will be loaded. (Use install.packages("dplyr") to install dplyr in R.)

Suppose we want to find the five states with the highest search index number in each year (the maximum index score is 100). We can do this in dplyr using this code.

06-yogagoogler

07a-yogagoogler

07b-yogagoogler

In the code, the yoga_clean dataset is used and it is grouped by Year. We arrange the rows by Year in ascending order (the default) and by Yoga.Index.Search from highest to lowest. Getting the top five states per year requires filter(row_number() <= 5), which I had to Google.
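The pipeline described above can be sketched like this, assuming dplyr is installed. The seven states and their values are invented, and top_five is a name I chose for the result; the original code may differ in its details.

```r
library(dplyr)

# Hypothetical cleaned data: seven states over two years.
states <- c("Alabama", "California", "Colorado", "Hawaii",
            "New Jersey", "Oregon", "Vermont")
yoga_clean <- data.frame(
  Year = rep(c(2015L, 2016L), each = length(states)),
  State = rep(states, times = 2),
  Yoga.Index.Search = c(20, 55, 60, 58, 38, 62, 66,
                        25, 57, 63, 61, 40, 65, 70),
  stringsAsFactors = FALSE
)

top_five <- yoga_clean %>%
  group_by(Year) %>%                           # work within each year
  arrange(Year, desc(Yoga.Index.Search)) %>%   # highest index first
  filter(row_number() <= 5)                    # first five rows per year

top_five
```

Since the data is grouped, row_number() restarts at 1 for every year, which is why the filter keeps five rows per year rather than five rows overall.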


The state of Vermont shows up often in the output above. We can select the Vermont rows from the dataset as follows.
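One way to pick out a single state is dplyr's filter(), sketched here on invented values (the name vermont is an assumption):

```r
library(dplyr)

# Hypothetical cleaned data: three states over two years.
yoga_clean <- data.frame(
  Year = rep(c(2015L, 2016L), each = 3),
  State = rep(c("Alabama", "Oregon", "Vermont"), times = 2),
  Yoga.Index.Search = c(20, 62, 66, 25, 65, 70),
  stringsAsFactors = FALSE
)

# Keep only the rows whose State is exactly "Vermont".
vermont <- yoga_clean %>% filter(State == "Vermont")

vermont
```

The comparison is case sensitive, so the state name must be spelled exactly as it appears in the State column.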

08-yogagoogler

The 2016 data is not the most reliable as the year of 2016 has not been completed yet.


Here are the states and their years with a Yoga.Index.Search number over 50.

09-yogagoogler

There are 21 cases where a state in a given year has a Yoga.Index.Search of over 50.
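A threshold filter like this can be sketched with filter() and nrow(). The data is again a hypothetical stand-in, so the count here is not the 21 from the real dataset.

```r
library(dplyr)

# Hypothetical cleaned data: three states over two years.
yoga_clean <- data.frame(
  Year = rep(c(2015L, 2016L), each = 3),
  State = rep(c("Alabama", "Oregon", "Vermont"), times = 2),
  Yoga.Index.Search = c(20, 62, 66, 25, 65, 70),
  stringsAsFactors = FALSE
)

# Keep the state-year rows with an index strictly above 50.
over_50 <- yoga_clean %>% filter(Yoga.Index.Search > 50)

nrow(over_50)   # number of state-year cases over the threshold
```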

Again, the state of Vermont shows up many times. One could investigate further why the state of Vermont likes to Google search yoga.

Plotting

Finding information in a dataset is nice and all but remember that a picture is worth a thousand words. Through some experimenting, this is the best visual of the data I could come up with.

yogaplot

Pdf Link of Plot For Clearer Image: yogaplot

The states are on the horizontal axis and the Yoga.Index.Search values are on the vertical axis. The points are coloured by Year to distinguish the different years from 2005 to 2016.

It does look messy on the bottom with the 51 states. But if I used x = Year instead of x = State and col = State instead of col = Year, I would get the years on the bottom. Years on the bottom would be less messy, but we would have 51 colours for the 51 states. This would be harder to read.

The theme(axis.text.x = element_text(angle = 90, hjust = 1)) part rotates the axis labels by 90 degrees. If this were not used, the state names would overlap on the bottom and would not be legible.
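A plot along these lines can be sketched with ggplot2, which I am assuming is the plotting package since theme() and element_text() come from it. The data is a hypothetical stand-in, yoga_plot is my own name, and wrapping Year in factor() is an assumption so that each year gets its own discrete colour.

```r
library(ggplot2)

# Hypothetical cleaned data: three states over two years.
yoga_clean <- data.frame(
  Year = rep(c(2015L, 2016L), each = 3),
  State = rep(c("Alabama", "Oregon", "Vermont"), times = 2),
  Yoga.Index.Search = c(20, 62, 66, 25, 65, 70),
  stringsAsFactors = FALSE
)

yoga_plot <- ggplot(yoga_clean,
                    aes(x = State, y = Yoga.Index.Search, col = factor(Year))) +
  geom_point() +                          # one point per state and year
  labs(col = "Year") +                    # tidy the legend title
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate labels
```

Calling print(yoga_plot), or typing yoga_plot at the console, draws the figure. Swapping State and Year in aes() gives the alternative layout discussed above.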


We can also look at just the 2016 portion of the data. Here is the code for subsetting the data and plotting it.

10-yogagoogler

2016_yoga

This plot looks less scary, as the points are all in one colour since only 2016 is shown.
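The subset-then-plot step can be sketched with base R's subset() and ggplot2. As before, the data is invented and the names yoga_2016 and plot_2016 are assumptions.

```r
library(ggplot2)

# Hypothetical cleaned data: three states over two years.
yoga_clean <- data.frame(
  Year = rep(c(2015L, 2016L), each = 3),
  State = rep(c("Alabama", "Oregon", "Vermont"), times = 2),
  Yoga.Index.Search = c(20, 62, 66, 25, 65, 70),
  stringsAsFactors = FALSE
)

# Keep only the 2016 rows.
yoga_2016 <- subset(yoga_clean, Year == 2016)

# Single year, so no colour mapping is needed.
plot_2016 <- ggplot(yoga_2016, aes(x = State, y = Yoga.Index.Search)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

As with the full plot, print(plot_2016) draws it.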

Note

I will probably never know what is meant by “Values show search interest per year in yoga and have been indexed to 100, where 100 is the maximum value.” The search index is not a count or anything. It is some sort of ranking score set by Google.

 


References

The featured image is from http://paperpencilwriteup.com/wp-content/uploads/2015/12/yoga-153436_960_720.png.

The data can be found at http://googletrends.github.io/data/.
