Correlation Plots Using The corrplot and ggplot2 Packages In R

Hi there. Here is some work I have done on correlation plots in R. Most of my findings came through trial and error, along with a few references.


Sections

A Look At The Data

Correlation Plots Using The corrplot Package

Using ggplot2 To Create Correlation Plots

References


A Look At The Data

Before looking at the data, I first load the faraway and corrplot packages into R. (The faraway package is a dataset package.)

The faraway package includes a dataset called teengamb, which is about teen gambling. More information on this dataset can be found by typing ?teengamb. I save a copy of the teengamb data in a new variable called gamb_data.
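A minimal sketch of this setup step, assuming the faraway and corrplot packages are already installed:

```r
# Assumes both packages are installed, e.g. via
# install.packages(c("faraway", "corrplot")).
library(faraway)   # supplies the teengamb dataset
library(corrplot)  # used later for the correlation plots

# Work on a copy so the packaged dataset stays untouched.
gamb_data <- teengamb
```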

Using the head() and tail() functions, I can preview the data by looking at the first 6 rows and the last 6 rows of the data.
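As a quick sketch (again assuming faraway is installed), the preview looks like this:

```r
library(faraway)
gamb_data <- teengamb

head(gamb_data)  # first 6 rows (6 is the default)
tail(gamb_data)  # last 6 rows
```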

One could further examine the data by using the summary() and str() functions.
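A short sketch of that closer look, under the same assumption that faraway is installed:

```r
library(faraway)
gamb_data <- teengamb

summary(gamb_data)  # min, quartiles, mean and max for each column
str(gamb_data)      # column types and the first few values
```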

The column names could be fixed by capitalizing them. This can be done by using the colnames() function in R.
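One possible way to do this is sketched below. It capitalizes the first letter of every column name, so it does not depend on what the names are. The old_names variable is just a helper name I made up for the example.

```r
library(faraway)
gamb_data <- teengamb

# Capitalize the first letter of each column name.
old_names <- colnames(gamb_data)  # hypothetical helper variable
colnames(gamb_data) <- paste0(toupper(substring(old_names, 1, 1)),
                              substring(old_names, 2))

colnames(gamb_data)
```

Typing the new names out by hand in a character vector and assigning that to colnames(gamb_data) would work just as well.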

In base R, a correlation table can be created by using the cor() function.
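A minimal sketch of that step, assuming all columns of the data are numeric (which cor() needs):

```r
library(faraway)
gamb_data <- teengamb

# Pearson correlations between every pair of columns.
corr_gamb <- cor(gamb_data)

round(corr_gamb, 3)  # rounded only for easier reading
```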

In a correlation matrix, the entries along the main diagonal (from top left to bottom right) are all ones. One could show by hand that the correlation of a random variable with itself is one: the covariance of a variable with itself is its variance, and dividing that by the product of its two (identical) standard deviations gives one. (E.g. the correlation of status and status is one.)

Notice that the correlation matrix is a symmetric matrix, meaning it is equal to its own transpose. As an example, the correlation of status and income (row 2, column 3) is -0.2750340 which is the same as the correlation of income and status (row 3, column 2) which is also -0.2750340.
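Both properties can be checked in R. This is only a sketch, and it assumes the columns still have their lowercase names (status and income). If the names were capitalized earlier, the row and column labels below need to match.

```r
library(faraway)
corr_gamb <- cor(teengamb)

# Diagonal entries should all be one (up to floating point error).
isTRUE(all.equal(unname(diag(corr_gamb)), rep(1, ncol(corr_gamb))))

# The matrix should match its own transpose.
isSymmetric(corr_gamb)

# One specific pair, looked up both ways round.
corr_gamb["status", "income"]
corr_gamb["income", "status"]
```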

 


Correlation Plots Using The corrplot Package

This section will deal with creating correlation table plots using the corrplot package. Making simple correlation plots using corrplot is not very difficult.

At the end of the previous section, the correlation table is saved into a variable called corr_gamb. This corr_gamb variable is what gets passed to the corrplot() function in the corrplot package.

I present five different correlation plots which I have come up with in R. Other variations do exist, as you can change the arguments for titles, fonts, colours and so on. (The title is somewhat messed up and the image that is produced is too zoomed in. I would have to look into it for a fix.)
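A first version could look something like the sketch below. The circle method and the title text are my own placeholder choices here. Leaving some room at the top through the mar argument may help keep the title from getting cut off, though I have not checked that it fixes every case.

```r
library(faraway)
library(corrplot)
corr_gamb <- cor(teengamb)

# Full correlation matrix drawn with circles.
corrplot(corr_gamb,
         method = "circle",
         title = "Teen Gambling Correlations",  # placeholder title
         mar = c(0, 0, 2, 0))                   # extra top margin for the title
```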

The plot looks okay but it could use labels. Also, it is not necessary to show the full matrix. Since the correlation matrix is symmetric, the lower or upper triangular form of the full matrix is enough.

The second version is a lower triangular version of the correlation matrix.
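A sketch of the lower triangular version, using the type argument of corrplot():

```r
library(faraway)
library(corrplot)
corr_gamb <- cor(teengamb)

# Show only the lower triangle (use type = "upper" for the other half).
corrplot(corr_gamb, method = "circle", type = "lower")
```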

 

Labels are added in version three.
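One way to get labels is sketched below. I am assuming here that the labels are the correlation values printed on the plot, which is what the addCoef.col argument does. The colour choices are arbitrary.

```r
library(faraway)
library(corrplot)
corr_gamb <- cor(teengamb)

corrplot(corr_gamb,
         method = "circle",
         type = "lower",
         addCoef.col = "black",  # print each correlation value on the plot
         tl.col = "black")       # colour of the variable name labels
```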

 

Adding labels does help in assessing the correlation strength of each variable pair. Version four shows how to change the colours.
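The colours can be changed through the col argument. In this sketch the palette (dark red to white to dark blue) and the name my_colours are just example choices.

```r
library(faraway)
library(corrplot)
corr_gamb <- cor(teengamb)

# Example palette: negative values in red, positive values in blue.
my_colours <- colorRampPalette(c("darkred", "white", "darkblue"))(200)

corrplot(corr_gamb,
         method = "circle",
         type = "lower",
         addCoef.col = "black",
         tl.col = "black",
         col = my_colours)
```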

 

In version five, I change the background colour from white to gray.
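The background colour is set with the bg argument. A sketch, with the other arguments being example choices:

```r
library(faraway)
library(corrplot)
corr_gamb <- cor(teengamb)

corrplot(corr_gamb,
         method = "circle",
         type = "lower",
         addCoef.col = "black",
         tl.col = "black",
         bg = "gray")  # background colour (the default is white)
```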

 

 


Using ggplot2 To Create Correlation Plots

The ggplot2 package is a very useful package for data visualization in R. Making correlation plots with ggplot2 takes a bit more work than with corrplot. The results, though, are worth it. To prepare the data for plotting, the melt() function from the reshape2 package is used.

Load the ggplot2 and reshape2 packages first.

The melt() function is used to break up the correlation table into a long format table. This table will consist of 25 rows (5 variables times 5 variables) with three columns. The first two columns consist of variable pairs and the third column contains the correlation for each variable pair.
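A sketch of the loading and melting steps, assuming ggplot2 and reshape2 are installed. When a matrix is melted, reshape2 names the columns Var1, Var2 and value by default. The name melted_corr is my own.

```r
library(faraway)
library(ggplot2)
library(reshape2)

corr_gamb <- cor(teengamb)

# Long format: one row per (row variable, column variable) pair.
melted_corr <- melt(corr_gamb)

head(melted_corr)
dim(melted_corr)
```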

 

A Full Correlation Plot Using ggplot2

In this plot, red tiles represent negative correlations between the two associated variables and the blue tiles represent positive correlations between two variables. The correlations along the main diagonal are ones.
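A plot along these lines can be built with geom_tile(). This is a sketch, and the title, legend name and theme are my own example choices. The colour scale is set so that red is negative and blue is positive, as described above.

```r
library(faraway)
library(ggplot2)
library(reshape2)

corr_gamb <- cor(teengamb)
melted_corr <- melt(corr_gamb)

p <- ggplot(melted_corr, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(colour = "white") +                    # white borders between tiles
  scale_fill_gradient2(low = "red", mid = "white", high = "blue",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Correlation") +
  labs(x = NULL, y = NULL,
       title = "Correlation Plot (Full Matrix)") + # placeholder title
  theme_minimal()

print(p)
```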

 

Version Two: Upper Triangular Correlation Plot Using ggplot2

The full correlation matrix provides more than enough information. An upper triangular version of the correlation matrix has less clutter and there is no loss of information. (Recall that the correlation matrix is a symmetric matrix so we can afford to drop the duplicate entries.)

For upper and lower triangular matrices, there is some additional data manipulation work needed to have the data prepared for plotting.
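One way to do this extra step (similar to the approach in the STHDA reference) is to blank out the unwanted half with NA before melting. The sketch below does that for the upper triangular case. The names upper_corr and melted_upper are my own.

```r
library(faraway)
library(ggplot2)
library(reshape2)

corr_gamb <- cor(teengamb)

# Keep the upper triangle (and diagonal); set the lower triangle to NA.
upper_corr <- corr_gamb
upper_corr[lower.tri(upper_corr)] <- NA

# na.rm = TRUE drops the NA entries while melting.
melted_upper <- melt(upper_corr, na.rm = TRUE)

p_upper <- ggplot(melted_upper, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(colour = "white") +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Correlation") +
  labs(x = NULL, y = NULL) +
  theme_minimal()

print(p_upper)
```

With 5 variables this leaves 15 rows: the 10 distinct pairs plus the 5 diagonal entries.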

Having an upper triangular version of the correlation matrix is less intimidating and it is easier to read. You may have to explain, for example, that the correlation of income and status and the correlation of status and income are the same (due to symmetry).

(This upper triangular matrix form is not exactly like the one from linear algebra but I think the above image is good enough for display purposes.)

 

Version Three: Lower Triangular Correlation Plot Using ggplot2

Here is the code and output for the lower triangular part of the correlation matrix. The code is similar to the upper triangular case.
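A sketch of the lower triangular case, where the only real change is that upper.tri() is used to blank out the other half. The names lower_corr and melted_lower are my own.

```r
library(faraway)
library(ggplot2)
library(reshape2)

corr_gamb <- cor(teengamb)

# Keep the lower triangle (and diagonal); set the upper triangle to NA.
lower_corr <- corr_gamb
lower_corr[upper.tri(lower_corr)] <- NA

melted_lower <- melt(lower_corr, na.rm = TRUE)

p_lower <- ggplot(melted_lower, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(colour = "white") +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Correlation") +
  labs(x = NULL, y = NULL) +
  theme_minimal()

print(p_lower)
```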

 


References

R Graphics Cookbook by Winston Chang (2012)

http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization
