Plotting Normal Distributions In R Using ggplot2

Hi. I have been experimenting with creating (standard) normal distributions (Gaussians) using R and ggplot2. Here is what I have come up with. It is assumed that the reader is familiar with the normal distribution, Z-scores, and standard deviations.


Table Of Contents

Part One

Simple Normal Distribution Plots

Plotting Two Normal Distributions In The Same Plot

 

Part Two – Plotting Standard Normal Distributions With Areas

Filling The Area Within One Standard Deviation Of The Mean

Filling The Area Within Two Standard Deviations Of The Mean

Filling The Area Within Three Standard Deviations Of The Mean

 

Part Three – Creating Normal Distribution Summary Plots

This section presents three versions of a normal distribution summary plot, each with percentage labels for the areas under the normal curve. The code and outputs are provided.

Notes

References


Part One

 

Simple Normal Distribution Plots

In the ggplot2 package in R, plotting the normal distribution is not very difficult. In the code below, I specify a domain for the x-values. After the ggplot() call, it is important to add a stat_function() layer with fun = dnorm. The dnorm function is R's normal density function; with its default arguments (mean = 0 and sd = 1) it gives the standard normal density. (Do load the ggplot2 package first by typing library(ggplot2).)
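A minimal sketch of such a plot follows. The domain of -4 to 4, the object names and the axis labels are assumptions chosen for illustration.

```r
library(ggplot2)

# Domain for the x-values; the range -4 to 4 is an assumed choice
x_domain <- data.frame(x = c(-4, 4))

p <- ggplot(x_domain, aes(x = x)) +
  stat_function(fun = dnorm) +  # dnorm defaults to mean = 0, sd = 1
  labs(x = "z", y = "Density", title = "Standard Normal Distribution")

print(p)
```

stat_function() evaluates dnorm over the range of x in the data, so only the two endpoints of the domain need to be supplied.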

Adding Labels

Labels can be added by adding an annotate() layer to the ggplot() call. Here is an example, shown by the code and output.
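As a sketch, a text label can be placed with annotate("text", ...). The label wording and its position here are assumptions for illustration only.

```r
library(ggplot2)

p <- ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm) +
  # Assumed label text and coordinates; adjust to taste
  annotate("text", x = 0, y = 0.2, label = "Mean = 0", size = 5) +
  labs(x = "z", y = "Density")

print(p)
```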

 

 

Plotting Two Normal Distributions In The Same Plot

Multiple probability distributions and functions can be plotted together. In this example, I plot a standard normal distribution and a normal distribution with a mean of 2 and a standard deviation of 3. The standard normal distribution is represented by dnorm and the other normal distribution by a custom function, each in its own stat_function() layer.

Note that the vertical dashed lines represent the means of their respective normal distributions. The red dashed line is for the normal distribution centered at the mean of 0 and the brown dashed line is for the normal distribution centered at the mean of 2.
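The following sketch uses the mean of 2 and standard deviation of 3 stated in the first paragraph of this section. The function name dnorm_2_3, the x-domain and the curve colours are assumptions.

```r
library(ggplot2)

# Custom density: normal with mean 2 and standard deviation 3
# (dnorm_2_3 is a hypothetical name)
dnorm_2_3 <- function(x) dnorm(x, mean = 2, sd = 3)

p <- ggplot(data.frame(x = c(-10, 12)), aes(x = x)) +
  stat_function(fun = dnorm, colour = "red") +
  stat_function(fun = dnorm_2_3, colour = "brown") +
  # Dashed vertical lines at the two means
  geom_vline(xintercept = 0, linetype = "dashed", colour = "red") +
  geom_vline(xintercept = 2, linetype = "dashed", colour = "brown") +
  labs(x = "x", y = "Density")

print(p)
```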

 


Part Two – Plotting Standard Normal Distributions With Areas

In part two, areas under the standard normal distribution curve will be filled in. I fill in the areas within one, two and three standard deviations of the mean.

As a side note, recall that a key formula for standardizing a normal random variable is Z = \dfrac{X - \mu}{\sigma} or X = \mu + \sigma Z. If \mu = 0 and \sigma = 1 then Z = X which means that the standard normal variable Z is equal to the normal random variable X.
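A small numeric check of the standardization formula, using assumed example values (mu = 2, sigma = 3, x = 5):

```r
mu <- 2      # assumed example mean
sigma <- 3   # assumed example standard deviation
x <- 5

z <- (x - mu) / sigma
z               # 1
mu + sigma * z  # 5, recovering x

# P(X <= x) for the original variable equals P(Z <= z) for the standard normal
same_prob <- isTRUE(all.equal(pnorm(x, mean = mu, sd = sigma), pnorm(z)))
same_prob
```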

 

Filling The Area Within One Standard Deviation Of The Mean

Filling in the area within one standard deviation of the mean means filling in the area underneath the curve over the interval from -1 to 1. (In other words, I fill in the area underneath the standard normal curve between the z-scores of -1 and 1.) With this in mind, I create a custom function in R describing this.
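One way to write such a function, following the pattern in the R Graphics Cookbook, is sketched below. The name dnorm_one_sd appears later in this post; the function body is an assumption.

```r
# Standard normal density inside [-1, 1] and NA elsewhere,
# so that only that interval is drawn when the function is plotted
dnorm_one_sd <- function(x) {
  y <- dnorm(x)
  y[x < -1 | x > 1] <- NA
  y
}

# NA outside the interval, the usual density inside it
dnorm_one_sd(c(-2, 0, 2))
```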

This next line of code is for labelling and for educational purposes. If I want the percentage area under the standard normal distribution within one standard deviation, I compute the area going to x = 1 (horizontally) minus the area going to x = -1. This can be achieved in R using the pnorm() function.

(Note that the pnorm function is R's version of the cumulative distribution function/CDF; it computes the probability that the random variable is less than or equal to a specified amount. In math notation we have P(Z \leq z) for a standard normal random variable Z and a fixed known quantity z.)

We see that the area underneath the standard normal curve within one standard deviation is 0.6827. Remember that the entire area (over all x values, or x \in \mathbb{R}) under the standard normal curve is 1 (or 100%). The round() function is used to round the answer to 4 decimal places.
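The calculation described above can be sketched as follows. The variable name area_one_sd is an assumption.

```r
# Area to the left of 1 minus the area to the left of -1
area_one_sd <- round(pnorm(1) - pnorm(-1), 4)
area_one_sd
# 0.6827
```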

These next lines of code and their output combine the above parts to create the plot. This plot has the standard normal curve with a label and a filled-in area within one standard deviation of the mean, which is at 0.

 

For the shaded area effect it is key to put your custom function (dnorm_one_sd), geom = "area" and the fill colour in stat_function().

Notice that two stat_function() layers were used. The first creates the normal curve, shown as the black outline, and the second creates the yellow fill within one standard deviation of the mean of 0. The alpha argument sets the transparency of the fill (0 is fully transparent, 1 is fully opaque).
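Putting these pieces together, a sketch of the plot might look like this. The alpha value, the label position and the body of dnorm_one_sd are assumptions.

```r
library(ggplot2)

# Density restricted to [-1, 1]; NA elsewhere (assumed implementation)
dnorm_one_sd <- function(x) {
  y <- dnorm(x)
  y[x < -1 | x > 1] <- NA
  y
}

# Percentage of area within one standard deviation, for the label
pct_one_sd <- round(100 * (pnorm(1) - pnorm(-1)), 2)

p <- ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm) +                         # black curve outline
  stat_function(fun = dnorm_one_sd, geom = "area",
                fill = "yellow", alpha = 0.3) +        # shaded region
  annotate("text", x = 0, y = 0.1,
           label = paste0(pct_one_sd, "%")) +
  labs(x = "z", y = "Density")

print(p)
```

ggplot2 may warn about removed rows for the NA values outside the interval; that is expected with this approach.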

 

Filling The Area Within Two Standard Deviations Of The Mean

The thought process and code are similar to the case above. Another custom function is made, with its own label, for the code and plot.
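A sketch for the two standard deviation case is below. The names dnorm_two_sd and area_two_sd, the fill colour and the label position are assumptions.

```r
library(ggplot2)

# Density restricted to [-2, 2]; NA elsewhere (hypothetical name)
dnorm_two_sd <- function(x) {
  y <- dnorm(x)
  y[x < -2 | x > 2] <- NA
  y
}

area_two_sd <- round(pnorm(2) - pnorm(-2), 4)
area_two_sd
# 0.9545

p <- ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm) +
  stat_function(fun = dnorm_two_sd, geom = "area",
                fill = "orange", alpha = 0.3) +
  annotate("text", x = 0, y = 0.1,
           label = paste0(round(100 * area_two_sd, 2), "%")) +
  labs(x = "z", y = "Density")

print(p)
```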

 

Filling The Area Within Three Standard Deviations Of The Mean

Here is the code and output for the case of the area under the normal curve within three standard deviations of the mean.
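A sketch for this case follows the same pattern. The names dnorm_three_sd and area_three_sd, the fill colour and the label position are assumptions.

```r
library(ggplot2)

# Density restricted to [-3, 3]; NA elsewhere (hypothetical name)
dnorm_three_sd <- function(x) {
  y <- dnorm(x)
  y[x < -3 | x > 3] <- NA
  y
}

area_three_sd <- round(pnorm(3) - pnorm(-3), 4)
area_three_sd
# 0.9973

p <- ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm) +
  stat_function(fun = dnorm_three_sd, geom = "area",
                fill = "green", alpha = 0.3) +
  annotate("text", x = 0, y = 0.1,
           label = paste0(round(100 * area_three_sd, 2), "%")) +
  labs(x = "z", y = "Density")

print(p)
```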


Part Three – Creating Normal Distribution Summary Plots

In this section, I will present code and output for creating informative normal distribution plots with useful labels, such as the one below:

Source: http://www.muelaner.com/wp-content/uploads/2013/07/Standard_deviation_diagram.png

 

 

Version One

The output comes out really nicely and can be used for informative purposes. The three percentage labels come from the three geom_text() calls.

Do keep in mind that the code here refers to custom functions and variables which were made in the previous section.

The paste0() function combines/concatenates strings and variables (outputs). I used paste0() to achieve the labels with arrows.
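A sketch of a plot along these lines is given below. So that it runs on its own, the functions and percentages from Part Two are rebuilt here through a hypothetical helper, sd_band(). The colours, label positions and arrow strings are assumptions.

```r
library(ggplot2)

# Hypothetical helper: density within k standard deviations, NA elsewhere
sd_band <- function(k) function(x) ifelse(abs(x) <= k, dnorm(x), NA)
dnorm_one_sd   <- sd_band(1)
dnorm_two_sd   <- sd_band(2)
dnorm_three_sd <- sd_band(3)

# Percentages within 1, 2 and 3 standard deviations of the mean
pct <- round(100 * (pnorm(1:3) - pnorm(-(1:3))), 2)

p <- ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm_three_sd, geom = "area", fill = "green", alpha = 0.3) +
  stat_function(fun = dnorm_two_sd, geom = "area", fill = "orange", alpha = 0.3) +
  stat_function(fun = dnorm_one_sd, geom = "area", fill = "yellow", alpha = 0.3) +
  stat_function(fun = dnorm) +
  # paste0() joins the arrow strings and the computed percentages
  geom_text(x = 0, y = 0.25, label = paste0("<-- ", pct[1], "% -->")) +
  geom_text(x = 0, y = 0.10, label = paste0("<------ ", pct[2], "% ------>")) +
  geom_text(x = 0, y = 0.02, label = paste0("<---------- ", pct[3], "% ---------->")) +
  labs(x = "z", y = "Density")

print(p)
```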

 

Version Two

This version is not much different than the first one above. It is just that the labels are different. I use the 68%, 95% and 99.7% labels, which appeal to a lot of introductory statistics students.
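A compact sketch with the rounded labels is below. The helper dnorm_within(), the colours and the label positions are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical helper: density within k standard deviations, NA elsewhere
dnorm_within <- function(k) {
  function(x) {
    y <- dnorm(x)
    y[abs(x) > k] <- NA
    y
  }
}

p <- ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm_within(3), geom = "area", fill = "green", alpha = 0.3) +
  stat_function(fun = dnorm_within(2), geom = "area", fill = "orange", alpha = 0.3) +
  stat_function(fun = dnorm_within(1), geom = "area", fill = "yellow", alpha = 0.3) +
  stat_function(fun = dnorm) +
  # Rounded labels from the 68-95-99.7 rule
  geom_text(x = 0, y = 0.25, label = "<-- 68% -->") +
  geom_text(x = 0, y = 0.10, label = "<------ 95% ------>") +
  geom_text(x = 0, y = 0.02, label = "<---------- 99.7% ---------->") +
  labs(x = "z", y = "Density")

print(p)
```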

 

Version Three

The third version of the summary plot is different from the first two. The two plots above label the areas within 1, 2 and 3 standard deviations of the mean.

This third plot provides a more comprehensive breakdown. As an example, a percentage label of 13.59% appears for the region from 1 standard deviation above (right of) the mean to 2 standard deviations above the mean. (The mean of a standard normal is 0.)

To show the symmetry of the standard normal distribution, I have inserted a vertical dashed line at the mean = median = mode = 0.
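A sketch of this breakdown is below. The percentage for each band between consecutive z-scores is computed with diff(pnorm(...)). The helper dnorm_between(), the colours and the label positions are assumptions.

```r
library(ggplot2)

# Hypothetical helper: density restricted to [lower, upper], NA elsewhere
dnorm_between <- function(lower, upper) {
  function(x) {
    y <- dnorm(x)
    y[x < lower | x > upper] <- NA
    y
  }
}

# Percentage of area in each band between consecutive z-scores from -3 to 3
z_breaks <- -3:3
pct <- round(100 * diff(pnorm(z_breaks)), 2)

labels_df <- data.frame(
  x = z_breaks[-1] - 0.5,                      # midpoint of each band
  y = c(0.02, 0.05, 0.15, 0.15, 0.05, 0.02),   # assumed label heights
  label = paste0(pct, "%")
)

p <- ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm_between(-3, 3), geom = "area", fill = "green", alpha = 0.3) +
  stat_function(fun = dnorm_between(-2, 2), geom = "area", fill = "orange", alpha = 0.3) +
  stat_function(fun = dnorm_between(-1, 1), geom = "area", fill = "yellow", alpha = 0.3) +
  stat_function(fun = dnorm) +
  geom_vline(xintercept = 0, linetype = "dashed") +  # mean = median = mode = 0
  geom_text(data = labels_df, aes(x = x, y = y, label = label)) +
  labs(x = "z", y = "Density")

print(p)
```

The band from 1 to 2 standard deviations above the mean comes out as 13.59%, matching the label mentioned above, and by symmetry the same value appears on the left side.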


Notes

The ggplot2 data visualization package in R allows users to create some appealing and creative graphs, plots and visuals to display data. These colourful visuals can be very useful aids in communicating findings from data.

Understanding the code and output does require a bit of knowledge in mathematics, probability, statistics and programming.


References

http://www.sthda.com/english/wiki/ggplot2-add-straight-lines-to-a-plot-horizontal-vertical-and-regression-lines

R Graphics Cookbook by Winston Chang (2012)

R Documentation (for the pnorm, dnorm functions).
