Hi there. I have been playing around with this dataset which contains marks from a statistics class at the University of Michigan.

**Table Of Contents**

A Look At The Dataset

Preparing The Data Before Plotting

A Boxplot Visual

A Histogram Visual

References

**A Look At The Dataset**

This dataset from the faraway package contains marks from a statistics class. I have included a screenshot below with the details of the data.

We first load the faraway, ggplot2 and tidyr packages into R.

```r
# Data Analysis In R: Marks In A Statistics Class Dataset
# Marks from Statistics 500 one year at the University of Michigan
# R Graphics Cookbook by Winston Chang (2012)

library(faraway) # Dataset library
library(ggplot2) # Data visualization.
library(tidyr)   # For data cleaning/formatting
```

This data is called stat500. I have saved it into a variable called stats_marks. I then use the head() and tail() functions to preview the data.

```r
> # Stats Marks:
>
> stats_marks <- data.frame(stat500)
>
> # Preview data:
>
> head(stats_marks)
  midterm final   hw total
1    24.5  26.0 28.5  79.0
2    22.5  24.5 28.2  75.2
3    23.5  26.5 28.3  78.3
4    23.5  34.5 29.2  87.2
5    22.5  30.5 27.3  80.3
6    16.0  31.0 27.5  74.5
>
> tail(stats_marks)
   midterm final   hw total
50    18.0  30.0 24.0  72.0
51    22.5  27.0 27.5  77.0
52    15.0  26.5 27.5  69.0
53    22.5  23.0 29.0  74.5
54    26.5  33.0 27.5  87.0
55    23.5  28.0 24.3  75.8
```

The structure of this data can be examined by using str().

```r
> # Check Structure:
>
> str(stats_marks)
'data.frame':   55 obs. of  4 variables:
 $ midterm: num  24.5 22.5 23.5 23.5 22.5 16 27.5 22.5 25 30 ...
 $ final  : num  26 24.5 26.5 34.5 30.5 31 33.5 31 29.5 37.5 ...
 $ hw     : num  28.5 28.2 28.3 29.2 27.3 27.5 29.7 29 27.3 27.2 ...
 $ total  : num  79 75.2 78.3 87.2 80.3 74.5 90.7 82.5 81.8 94.7 ...
```

The column names can be renamed using colnames(). I then use the summary() function to get some basic statistics on the columns/variables.

```r
> # Rename columns:
>
> colnames(stats_marks) <- c("Midterm_Test", "Final_Exam", "Homework", "Course_Mark")
>
> # Want to investigate what the midterms, homework and finals are worth.
> # Summary function:
>
> summary(stats_marks)
  Midterm_Test     Final_Exam       Homework      Course_Mark
 Min.   : 8.50   Min.   :13.00   Min.   : 9.20   Min.   :48.30
 1st Qu.:17.25   1st Qu.:23.00   1st Qu.:25.85   1st Qu.:68.15
 Median :20.50   Median :27.00   Median :27.50   Median :74.50
 Mean   :20.32   Mean   :26.49   Mean   :26.24   Mean   :73.05
 3rd Qu.:23.50   3rd Qu.:30.25   3rd Qu.:28.50   3rd Qu.:78.75
 Max.   :30.00   Max.   :37.50   Max.   :29.70   Max.   :94.70
```

**Preparing The Data Before Plotting**

The numbers given under Midterm_Test, Final_Exam, and Homework are scores which contribute to the course mark. The course mark is the final mark out of 100. One could speculate that the midterm test is worth 30% of the course mark, the final exam 40% and the homework 30%. The maxima in the summary above (30.00, 37.50 and 29.70) are consistent with this guess.
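One way to sanity-check this guess, as a minimal sketch in base R: if the three components are already on a 30/40/30 scale, each row should add up to the course mark. The snippet uses only the six rows printed by head() earlier, and the name first_rows is made up for illustration.

```r
# The six rows shown by head(stats_marks) above, with the renamed columns.
first_rows <- data.frame(
  Midterm_Test = c(24.5, 22.5, 23.5, 23.5, 22.5, 16.0),
  Final_Exam   = c(26.0, 24.5, 26.5, 34.5, 30.5, 31.0),
  Homework     = c(28.5, 28.2, 28.3, 29.2, 27.3, 27.5),
  Course_Mark  = c(79.0, 75.2, 78.3, 87.2, 80.3, 74.5)
)

# Sum the three components for each student.
component_sum <- rowSums(first_rows[, c("Midterm_Test", "Final_Exam", "Homework")])

# Compare with the reported course mark, allowing for floating-point error.
sums_match <- isTRUE(all.equal(unname(component_sum), first_rows$Course_Mark))
print(sums_match)
```

For these six rows the sums match the course mark, which supports the idea that the raw scores are already weighted. It does not prove the 30/40/30 split for the whole class.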

With that in mind, I convert the Midterm_Test, Final_Exam, and Homework scores to percentages (out of 100%) of each assessment's assumed maximum in R.

```r
# Could speculate that midterm is worth 30%, final exam is 40% and homework is 30%.

# Convert the midterm test, final exam scores and homework as a percentage.

stats_marks[, 1] <- round((stats_marks[, 1] / 30)*100, 2)

stats_marks[, 2] <- round((stats_marks[, 2] / 40)*100, 2)

stats_marks[, 3] <- round((stats_marks[, 3] / 30)*100, 2)
```
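The same conversion can also be written by column name rather than by position, which keeps working if the column order ever changes. This is a minimal sketch in base R using the first three rows from head() above. The names marks and max_scores are introduced here for illustration, and max_scores holds the speculated maxima, not confirmed ones.

```r
# First three rows of the renamed data, as printed by head() earlier.
marks <- data.frame(
  Midterm_Test = c(24.5, 22.5, 23.5),
  Final_Exam   = c(26.0, 24.5, 26.5),
  Homework     = c(28.5, 28.2, 28.3),
  Course_Mark  = c(79.0, 75.2, 78.3)
)

# Speculated maximum score per assessment (an assumption, see the text).
max_scores <- c(Midterm_Test = 30, Final_Exam = 40, Homework = 30)

# Convert each listed column to a percentage of its assumed maximum.
for (col in names(max_scores)) {
  marks[[col]] <- round(marks[[col]] / max_scores[[col]] * 100, 2)
}

print(marks)
```

Course_Mark is left untouched because it is already out of 100.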

I preview the changed data using head().

```r
> # Look at data again:
>
> head(stats_marks)
  Midterm_Test Final_Exam Homework Course_Mark
1        81.67      65.00    95.00        79.0
2        75.00      61.25    94.00        75.2
3        78.33      66.25    94.33        78.3
4        78.33      86.25    97.33        87.2
5        75.00      76.25    91.00        80.3
6        53.33      77.50    91.67        74.5
```

Notice how the data is in a wide format, with one column per assessment. For plotting purposes, it is preferable to convert the data from a wide format to a long format. The long-format data will have two columns, Type and Score. Score will hold the percentage grade out of 100, and Type will be a column of factors containing either Midterm_Test, Final_Exam, Homework or Course_Mark.

To achieve this, the gather() function from the tidyr package is used.

```r
> # Convert the dataset from wide format to long format for graphing purposes:
>
> gathered_data <- gather(data = stats_marks, "Type", "Score", 1:4)
>
> # Preview the data:
>
> head(gathered_data); tail(gathered_data)
          Type Score
1 Midterm_Test 81.67
2 Midterm_Test 75.00
3 Midterm_Test 78.33
4 Midterm_Test 78.33
5 Midterm_Test 75.00
6 Midterm_Test 53.33
           Type Score
215 Course_Mark  72.0
216 Course_Mark  77.0
217 Course_Mark  69.0
218 Course_Mark  74.5
219 Course_Mark  87.0
220 Course_Mark  75.8
>
> # Structure of new data:
>
> str(gathered_data)
'data.frame':   220 obs. of  2 variables:
 $ Type : chr  "Midterm_Test" "Midterm_Test" "Midterm_Test" "Midterm_Test" ...
 $ Score: num  81.7 75 78.3 78.3 75 ...
>
> # Convert first column Type into a column of factors:
>
> gathered_data[, 1] <- factor(gathered_data[, 1],
+                              levels = c("Midterm_Test", "Final_Exam", "Homework", "Course_Mark"))
```
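In newer tidyr releases (1.0.0 onward), gather() is marked as superseded in favour of pivot_longer(). Here is a hedged sketch of the same reshaping, using two rows of the converted data from head() above. The names marks and long_marks are made up for illustration.

```r
library(tidyr)

# Two rows of the percentage data, as printed by head() earlier.
marks <- data.frame(
  Midterm_Test = c(81.67, 75.00),
  Final_Exam   = c(65.00, 61.25),
  Homework     = c(95.00, 94.00),
  Course_Mark  = c(79.0, 75.2)
)

# Stack all four columns into Type/Score pairs.
long_marks <- pivot_longer(marks,
                           cols = everything(),
                           names_to = "Type",
                           values_to = "Score")

# Fix the factor order so plots keep the original column order.
long_marks$Type <- factor(long_marks$Type, levels = names(marks))

print(long_marks)
```

Unlike gather(), pivot_longer() orders the result student by student rather than column by column. That should not matter for the plots that follow, since ggplot2 groups by Type.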

The data is now ready for plotting. Plots can be created for each of the four assessments (Midterm_Test, Final_Exam, Homework, Course_Mark).

**A Boxplot Visual**

Creating a boxplot in R with ggplot2 is quite simple. Here is the full code. Note that the red square represents the mean.

```r
### Plotting The Data Using ggplot2:

# Box Plot (With Means)

ggplot(gathered_data, aes(x = Type, y = Score)) +
  geom_boxplot() +
  labs(x = "\n Assessment Type", y = "Percentage Score \n",
       title = "Student Percentage Scores \n In A Statistics Course \n") +
  scale_x_discrete(labels = c("Midterm Test", "Final Exam", "Homework", "Course Mark")) +
  theme(plot.title = element_text(hjust = 0.5),
        axis.title.x = element_text(face="bold", colour="blue", size = 12),
        axis.title.y = element_text(face="bold", colour="blue", size = 12),
        legend.title = element_text(face="bold", size = 10)) +
  # fun replaces the fun.y argument, which is deprecated since ggplot2 3.3.0
  stat_summary(fun = "mean", geom="point", shape=22, size=3, fill = "red", color = "red")
```

The boxplot above allows us to visually see the spread of the percentage scores amongst the assessment types.

In the midterm test scores, the median and mean (red square) are quite close, and the spread is quite large, from around 30 to 100 percent.

The final exam scores boxplot is similar to the midterm test boxplot but the range is smaller.

In Homework, the boxplot is positioned higher, which indicates that many students scored high on their homework. The red square sits below the black line, which means that the mean percentage score for homework is lower than the median. The five black points represent outliers (extreme points/cases) where students did not really do their homework (or did the homework poorly).

With the course mark scores, the median and mean scores are around 75 percent. The two black dots below the boxplot are outliers and represent extreme values on the low end. The single black dot above the boxplot is the extreme case where that student did the best in the class.
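The black dots follow a fixed rule. According to the ggplot2 documentation, geom_boxplot() draws whiskers out to the most extreme points within 1.5 times the interquartile range of the box edges and plots anything beyond that individually. As a rough worked example, the Homework quartiles from the summary() output earlier can be converted to percentages to approximate where that cut-off falls. The maximum of 30 is the speculated value from the text, and the variable names are made up for illustration.

```r
hw_max <- 30  # speculated maximum homework score (an assumption from the text)

# Homework quartiles and minimum from summary(), converted to percentages.
q1_pct  <- 25.85 / hw_max * 100
q3_pct  <- 28.50 / hw_max * 100
min_pct <-  9.20 / hw_max * 100

# Interquartile range and the 1.5 * IQR cut-offs for outliers.
iqr_pct     <- q3_pct - q1_pct
lower_fence <- q1_pct - 1.5 * iqr_pct
upper_fence <- q3_pct + 1.5 * iqr_pct

print(round(c(lower = lower_fence, upper = upper_fence), 2))

# Is the lowest homework score below the lower cut-off?
print(min_pct < lower_fence)
```

Under these assumptions, homework percentages below roughly 73% are drawn as outliers. The upper cut-off lands above 100%, so no high outliers are possible for Homework.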

**A Histogram Visual**

This data can also be represented with a histogram.

```r
# Histogram:

ggplot(gathered_data, aes(x = Score)) +
  geom_histogram(binwidth = 10) +
  facet_grid(. ~ Type) +
  labs(x = "\n Percentage Score", y = "Count \n",
       title = "Student Percentage Scores \n In A Statistics Course \n") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.title.x = element_text(face="bold", colour="blue", size = 12),
        axis.title.y = element_text(face="bold", colour="blue", size = 12),
        legend.title = element_text(face="bold", size = 10))
```

An advantage that the histogram has over a boxplot is that you can see counts/frequencies associated with the scores here.

The midterm test scores appear approximately normal (with the mean and median both at about 68%).

The final exam scores have somewhat of a left skew, with more values on the right side of the mean and a somewhat longer left tail. The mode score is about 68% to 72%.

Since a lot of students scored high in their homework, more scores will be on the right. This creates a left skew where the left tail is long and the “hump” or mode is on the right.

For the course marks, it is hard to determine the skewness of the distribution. A good portion of the scores fall between 65% and 85%. (I used a binwidth of 10, i.e. 10 percentage points, so two bars cover a width of 20%.)
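To see how binwidth = 10 turns scores into bar heights, the counting can be reproduced by hand with base R's cut() and table(). This is only a sketch. It uses the six Course_Mark values printed by head() earlier, and the break points (5, 15, ..., 95) are my assumption about how ggplot2 aligns bins of width 10 by default, so they may not match the plot exactly.

```r
# Course marks for the six students shown by head() earlier.
course_marks <- c(79.0, 75.2, 78.3, 87.2, 80.3, 74.5)

# Assumed bin edges: width 10, edges at ..., 65, 75, 85, ...
bin_edges <- seq(-5, 105, by = 10)

# Assign each score to a bin (closed on the right) and count per bin.
bins   <- cut(course_marks, breaks = bin_edges)
counts <- table(bins)

# Show only the bins that contain at least one score.
print(counts[counts > 0])
```

With the full 55 students, the same counting over the 65 to 75 and 75 to 85 bins gives the two tallest bars that the paragraph above refers to.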

**References**

R Graphics Cookbook by Winston Chang (2012)

https://www.rstudio.com/wp-content/uploads/2015/04/ggplot2-cheatsheet.pdf