Data Analysis In R: Marks In A Statistics Class Dataset

Hi there. I have been playing around with a dataset that contains marks from a statistics class at the University of Michigan.


Table Of Contents

A Look At The Dataset

Preparing The Data Before Plotting

A Boxplot Visual

A Histogram Visual

References


A Look At The Dataset

This dataset, from the faraway package, contains marks from a statistics class. I have included a screenshot below providing the details of the data.

We first load the faraway, ggplot2 and tidyr packages into R.

The dataset is called stat500. I have saved it into a variable called stats_mark. I then use the head() and tail() functions to preview the data.
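A minimal sketch of the loading and preview steps described above. The explicit data() call is my own precaution; the dataset may already be available once the package is attached.

```r
# Load the three packages used in this post
library(faraway)  # provides the stat500 dataset
library(ggplot2)  # plotting
library(tidyr)    # reshaping from wide to long

# Make sure stat500 is available, then copy it into stats_mark
data(stat500, package = "faraway")
stats_mark <- stat500

# Preview the first and last six rows
head(stats_mark)
tail(stats_mark)
```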

The structure of this data can be examined by using str().
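As a short sketch, the call looks like this (the setup lines are repeated so the snippet runs on its own):

```r
library(faraway)

data(stat500, package = "faraway")
stats_mark <- stat500

# Show the number of observations and the type of each column
str(stats_mark)
```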

The column names can be renamed using colnames(). I then use the summary() function to get some basic statistics on the columns/variables.
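A sketch of the renaming step, using the column names that appear later in this post. The assignment is positional, so it assumes the original columns come in the order midterm, final, homework, total; it is worth checking colnames(stat500) before overwriting them.

```r
library(faraway)

data(stat500, package = "faraway")
stats_mark <- stat500

# Check the original names and their order before renaming
colnames(stats_mark)

# Positional rename (assumes order: midterm, final, homework, total)
colnames(stats_mark) <- c("Midterm_Test", "Final_Exam", "Homework", "Course_Mark")

# Minimum, quartiles, median, mean and maximum for each column
summary(stats_mark)
```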


Preparing The Data Before Plotting

The numbers under Midterm_Test, Final_Exam, and Homework are scores that contribute to the course mark. The course mark is the final mark and is out of 100. One could speculate that the midterm test is worth 30% of the course mark, the final exam 40%, and the homework 30%.

With that being said, I convert the Midterm_Test, Final_Exam, and Homework scores to percentages out of 100 for each assessment in R.

I preview the changed data using head().
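Here is one way the conversion could be written. The divisors 30, 40 and 30 are the speculated weights from above, not values documented with the dataset, so treat them as an assumption.

```r
library(faraway)

data(stat500, package = "faraway")
stats_mark <- stat500
colnames(stats_mark) <- c("Midterm_Test", "Final_Exam", "Homework", "Course_Mark")

# Convert raw scores to percentages (assumed maxima: 30, 40 and 30)
stats_mark$Midterm_Test <- stats_mark$Midterm_Test / 30 * 100
stats_mark$Final_Exam   <- stats_mark$Final_Exam / 40 * 100
stats_mark$Homework     <- stats_mark$Homework / 30 * 100

# Course_Mark is already out of 100, so it is left unchanged
head(stats_mark)
```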

Notice how the data is in a wide format, with one column per assessment. For plotting purposes, it is preferable to convert the data from a wide format to a long format. This long formatted data will have two columns, Type and Score. The values under Score will be percentage grades out of 100, and Type will be a column of factors containing either Midterm_Test, Final_Exam, Homework or Course_Mark.

To achieve this, the gather() function from the tidyr package is used.
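A sketch of the reshaping step. The name stats_long is hypothetical (my own choice for illustration), and the percentage conversion again uses the assumed 30/40/30 maxima.

```r
library(faraway)
library(tidyr)

data(stat500, package = "faraway")
stats_mark <- stat500
colnames(stats_mark) <- c("Midterm_Test", "Final_Exam", "Homework", "Course_Mark")

# Percentages under the assumed maxima of 30, 40 and 30
stats_mark$Midterm_Test <- stats_mark$Midterm_Test / 30 * 100
stats_mark$Final_Exam   <- stats_mark$Final_Exam / 40 * 100
stats_mark$Homework     <- stats_mark$Homework / 30 * 100

# Stack all four columns into Type/Score pairs;
# factor_key = TRUE stores Type as a factor in the original column order
stats_long <- gather(stats_mark, key = "Type", value = "Score", factor_key = TRUE)

head(stats_long)
```

In recent versions of tidyr, gather() still works but pivot_longer() is the recommended replacement, so that may be worth considering for new code.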

The data is now ready for plotting. Plots can be created for each of the four assessments (Midterm_Test, Final_Exam, Homework, Course_Mark).


A Boxplot Visual

Creating a boxplot in R with ggplot2 is quite simple. Here is the full code and output. Note that the red square represents the mean.
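The following is a sketch of how such a boxplot could be built, not necessarily the exact code behind the plot discussed here. The object names stats_long and p, the axis labels and the title are my own choices, and the 30/40/30 maxima are assumed. The fun argument of stat_summary() needs ggplot2 3.3.0 or later.

```r
library(faraway)
library(ggplot2)
library(tidyr)

# Rebuild the long-format data (assumed maxima: 30, 40 and 30)
data(stat500, package = "faraway")
stats_mark <- stat500
colnames(stats_mark) <- c("Midterm_Test", "Final_Exam", "Homework", "Course_Mark")
stats_mark$Midterm_Test <- stats_mark$Midterm_Test / 30 * 100
stats_mark$Final_Exam   <- stats_mark$Final_Exam / 40 * 100
stats_mark$Homework     <- stats_mark$Homework / 30 * 100
stats_long <- gather(stats_mark, key = "Type", value = "Score", factor_key = TRUE)

# One box per assessment type, with the mean overlaid as a red square
p <- ggplot(stats_long, aes(x = Type, y = Score)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 15, size = 3, colour = "red") +
  labs(x = "Assessment Type", y = "Score (%)",
       title = "Percentage Scores By Assessment Type")

print(p)
```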

 

The boxplot above allows us to visually see the spread of the percentage scores amongst the assessment types.

In the midterm test scores, the median and mean (red square) are quite close, and the spread is quite large, ranging from around 30 to 100 percent.

The final exam scores boxplot is similar to the midterm test boxplot but the range is smaller.

In Homework, the boxplot is positioned higher, which indicates that many students scored high on their homework. The red square sits below the black line, which means that the mean percentage score for homework is lower than the median. The five black points represent outliers (extreme points/cases) where students did not really do their homework (or did the homework poorly).

With the course mark scores, the median and mean are both around 75 percent. The two black dots below the boxplot are outliers and represent extreme values on the low end. The remaining single black dot is the extreme case of the student who did the best in the class.


A Histogram Visual

This data can also be represented with a histogram.
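Below is a sketch of one way to draw the histograms, with one panel per assessment type and a bin width of 10 percentage points. The faceting, colours, labels and the names stats_long and h are my own assumptions for illustration, as are the 30/40/30 maxima.

```r
library(faraway)
library(ggplot2)
library(tidyr)

# Rebuild the long-format data (assumed maxima: 30, 40 and 30)
data(stat500, package = "faraway")
stats_mark <- stat500
colnames(stats_mark) <- c("Midterm_Test", "Final_Exam", "Homework", "Course_Mark")
stats_mark$Midterm_Test <- stats_mark$Midterm_Test / 30 * 100
stats_mark$Final_Exam   <- stats_mark$Final_Exam / 40 * 100
stats_mark$Homework     <- stats_mark$Homework / 30 * 100
stats_long <- gather(stats_mark, key = "Type", value = "Score", factor_key = TRUE)

# Histogram of scores in 10-point bins, one panel per assessment type
h <- ggplot(stats_long, aes(x = Score)) +
  geom_histogram(binwidth = 10, colour = "black", fill = "grey70") +
  facet_wrap(~ Type) +
  labs(x = "Score (%)", y = "Count",
       title = "Distribution Of Percentage Scores")

print(h)
```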

An advantage that the histogram has over a boxplot is that you can see counts/frequencies associated with the scores here.

The midterm test scores appear approximately normal (with the mean and median both at about 68%).

The final exam scores have somewhat of a left skew, with more values on the right side of the mean and a somewhat longer left tail. The mode score is about 68% to 72%.

Since a lot of students scored high on their homework, most of the scores are on the right. This creates a left skew where the left tail is long and the “hump” or mode is on the right.

For the course marks, it is hard to determine the skewness of the distribution. A good portion of the scores fall between 65% and 85%. (I used a binwidth of 10, meaning 10 percentage points, so two bars cover a width of 20%.)


References

R Graphics Cookbook by Winston Chang (2012)

https://www.rstudio.com/wp-content/uploads/2015/04/ggplot2-cheatsheet.pdf

 
