Data Manipulation In R Using The dplyr Package

Hello. On this page, I experiment with manipulating data and extracting information from it using the dplyr package in the statistical programming language R. Here is my guide to dplyr and R.



What Is The dplyr Package in R?

The dplyr package in R lets the user transform and summarize data held in data frames. In a sense, dplyr is similar to SQL: SQL queries extract information from database tables, while dplyr functions extract information from data frames.

dplyr can be used on large datasets as well.


Five Main Functions In The dplyr R Package

In the dplyr package there are five main functions for transforming the data and for extracting information from the data.

  1. filter() for extracting rows which meet a criterion
  2. select() for selecting/subsetting columns
  3. summarise() for extracting information about the dataset (e.g. average)
  4. mutate() for making new variables/columns in the dataset
  5. group_by() for grouping rows by the values of one or more variables

There are more functions in the dplyr package, but these five are the most important ones.


An Example of dplyr In R – The Dataset

The poisons dataset is used here. Information about it is available at https://vincentarelbundock.github.io/Rdatasets/doc/boot/poisons.html.

The data come from an experiment on animals involving three poisons and four treatments.

Variables

time: The survival time of the animal in units of 10 hours

poison: A factor with levels of 1, 2, 3 depending on the poison type being used

treat: A factor with levels of A, B, C and D based on the treatment type.


Importing the Dataset

If the dplyr package is not installed, one can install it in R using the command install.packages("dplyr").

To load the dplyr package and make its functions (such as filter() and summarise()) available, type in library(dplyr).

Next, the web address of the dataset is stored in the url variable, and the comma-separated values (.csv) file at that address is read into R. The imported data is stored as a data frame in the animal_surv variable.
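The import steps can be sketched as follows. This is a minimal sketch and not necessarily the exact code behind this page: the CSV address is an assumption inferred from the documentation URL above, and the helper load_poisons(), with its offline fallback to the copy of poisons shipped in the boot package, is hypothetical.

```r
library(dplyr)

# Assumed CSV address, inferred from the documentation URL (not stated on this page).
url <- "https://vincentarelbundock.github.io/Rdatasets/csv/boot/poisons.csv"

# Hypothetical helper: read the CSV from the web and, if that fails
# (for example with no internet connection), rebuild an equivalent
# CSV file from boot::poisons and read that instead.
load_poisons <- function(url) {
  tryCatch(
    read.csv(url),
    error = function(e) {
      tmp <- tempfile(fileext = ".csv")
      write.csv(boot::poisons, tmp)  # row names are written as an extra first column
      read.csv(tmp)
    }
  )
}

animal_surv <- load_poisons(url)
class(animal_surv)  # read.csv() returns a data frame
```

Depending on when and how the file is obtained, the extra first column of row numbers may carry a different name (X here, possibly rownames in newer copies of the CSV).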

After importing the data, we take a look at the data using the head(), tail() and summary() functions in R.


The head() function shows the first 6 rows (by default) of the dataset. The tail() function is like head() but shows the last 6 rows (by default). The summary() function outputs the minimum, first quartile, median, mean, third quartile and maximum for each numeric variable (column).
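A short sketch of these three calls is given here. It assumes that the copy of poisons shipped in the boot package can stand in for the imported data frame, so it has no X column.

```r
# Assumption: boot::poisons stands in for the data frame imported from the CSV.
animal_surv <- boot::poisons

head(animal_surv)         # first 6 rows by default
tail(animal_surv, n = 3)  # last 3 rows; n overrides the default of 6
summary(animal_surv)      # one summary per column
```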

Notice that there is an X column, which only holds the row numbers and is not needed. We can fix this by removing the first column of the dataset as follows.
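One way to drop that column is sketched here. To stay runnable offline, the sketch assumes that a CSV file written from boot::poisons mimics the downloaded file, since reading it back produces the same kind of extra first column.

```r
# Assumption: a CSV written from boot::poisons mimics the downloaded file.
tmp <- tempfile(fileext = ".csv")
write.csv(boot::poisons, tmp)   # row names are written as an unnamed first column
animal_surv <- read.csv(tmp)    # read.csv() names that column X
names(animal_surv)

# Drop the first column by position (base R).
animal_surv <- animal_surv[, -1]
names(animal_surv)
```

With dplyr loaded, select(animal_surv, -X) would be an equivalent way to drop the column by name.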



Using the dplyr Functions In R

We now have the data ready for analysis. Some of the dplyr functions will be used here as reference and as examples.

Instead of looking at the first or last rows of data, we can use the sample_n() function to look at a random sample of rows.
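A minimal sketch of sample_n() is given here, assuming boot::poisons as a stand-in for the imported data. The sample size of 5 and the seed are arbitrary choices for illustration.

```r
library(dplyr)

# Assumption: boot::poisons stands in for the imported data frame.
animal_surv <- boot::poisons

set.seed(1)  # optional: makes the random draw reproducible
random_rows <- sample_n(animal_surv, 5)  # 5 rows drawn at random
random_rows
```

In recent dplyr versions sample_n() is superseded by slice_sample(n = 5), which does the same job.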



select()

 

Suppose we want just the poison and time columns and not treat. There are two ways to do this.


For illustration purposes, the head() function was used to shorten this output. For the full select() output, remove the head() call.
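The two ways are presumably naming the columns to keep and excluding the column that is not wanted. Both are sketched here, again assuming boot::poisons as a stand-in for the imported data. The names keep_cols and drop_col are only for illustration.

```r
library(dplyr)

# Assumption: boot::poisons stands in for the imported data frame.
animal_surv <- boot::poisons

# Way 1: name the columns to keep.
keep_cols <- select(animal_surv, poison, time)

# Way 2: exclude the column that is not wanted.
drop_col <- select(animal_surv, -treat)

head(keep_cols)
head(drop_col)
```

Both results hold the same two columns. The first follows the order given in the call, while the second keeps the original column order.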


filter()

The filter() function extracts rows which meet a logical criterion. Here is an example.
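A sketch of such a filter is given here. The condition (treatment A with a survival time above 0.4, that is 4 hours) is an arbitrary choice for illustration, and boot::poisons is assumed as a stand-in for the imported data.

```r
library(dplyr)

# Assumption: boot::poisons stands in for the imported data frame.
animal_surv <- boot::poisons

# Keep rows where the treatment is "A" and the survival time exceeds 0.4.
# Conditions separated by commas must all hold (logical AND).
treat_a_long <- filter(animal_surv, treat == "A", time > 0.4)
treat_a_long
```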



summarise()

The summarise() function in dplyr summarises the data into a single row of values. There are two examples below.
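A sketch of the ungrouped case is given here, assuming boot::poisons as a stand-in for the imported data. The column names mean_time and n_animals are chosen for illustration.

```r
library(dplyr)

# Assumption: boot::poisons stands in for the imported data frame.
animal_surv <- boot::poisons

# Collapse the whole dataset into one row of summary values.
overall <- summarise(animal_surv,
                     mean_time = mean(time),  # average survival time
                     n_animals = n())         # number of rows
overall
```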


group_by()

The second example uses the summarise() function along with the group_by() dplyr function. In this second example, the mean survival time is computed for each poison type.
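A sketch of the grouped summary is given here, assuming boot::poisons as a stand-in for the imported data. The result has one row per poison type.

```r
library(dplyr)

# Assumption: boot::poisons stands in for the imported data frame.
animal_surv <- boot::poisons

# Group the rows by poison type, then summarise within each group.
by_poison <- summarise(group_by(animal_surv, poison),
                       mean_time = mean(time))
by_poison
```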


The %>% Pipe Operator In dplyr

Notice how in each command, the first argument was the dataset animal_surv. There is a way to avoid typing the dataset name every time.

The %>% pipe operator allows us to condense the amount of code and acts as an “and then” operator. Here are a few examples.

In the first example, the animal_surv variable is passed as the first argument in filter() using the %>% pipe operator.

With the second example, the animal_surv variable is passed as the first argument in group_by(), and then the result of group_by(poison) is passed as the first argument in summarise(). The dataset animal_surv is implied in summarise() too.
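The two piped commands could look like the sketch here. The filter condition is an arbitrary choice for illustration, and boot::poisons is assumed as a stand-in for the imported data.

```r
library(dplyr)

# Assumption: boot::poisons stands in for the imported data frame.
animal_surv <- boot::poisons

# First example: animal_surv becomes the first argument of filter().
long_survivors <- animal_surv %>%
  filter(time > 0.5)

# Second example: animal_surv goes into group_by(), and the grouped
# result goes into summarise().
by_poison <- animal_surv %>%
  group_by(poison) %>%
  summarise(mean_time = mean(time))
```

Each piped command gives the same result as the nested form, for example filter(animal_surv, time > 0.5).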

The output of the above commands is omitted to save space here.

Another example of the pipe operator and the filter() function is this one below.
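A sketch of a longer pipeline is given here, assuming boot::poisons as a stand-in for the imported data. The conditions are arbitrary choices for illustration.

```r
library(dplyr)

# Assumption: boot::poisons stands in for the imported data frame.
animal_surv <- boot::poisons

# Keep poison type 3 under treatments B or D, and then keep two columns.
poison3_bd <- animal_surv %>%
  filter(poison == "3", treat %in% c("B", "D")) %>%
  select(treat, time)

head(poison3_bd)
```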



mutate()

Suppose we want to add a new column/variable to our dataset. This new variable answers the question “Does the case have more than 1 hour of survival time?”

We use the mutate() function to add a new column defined by a formula or a condition.

We can then extract information from the new dataset with the extra column.
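A sketch of this step is given here, assuming boot::poisons as a stand-in for the imported data. The names threshold_hours, over_threshold and animal_surv2 are hypothetical. Because time is recorded in units of 10 hours, the sketch converts it to hours before comparing. If the intended cut-off is time > 1 in the dataset's own units, set threshold_hours to 10 instead.

```r
library(dplyr)

# Assumption: boot::poisons stands in for the imported data frame.
animal_surv <- boot::poisons

# time is in units of 10 hours, so time * 10 is the survival time in hours.
threshold_hours <- 1

# Add a logical column answering "more than threshold_hours of survival time?".
animal_surv2 <- animal_surv %>%
  mutate(over_threshold = time * 10 > threshold_hours)

# Extract information from the new column: number of cases per answer.
animal_surv2 %>%
  count(over_threshold)
```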



References

The featured image is from https://www.rstudio.com/wp-content/uploads/2014/06/RStudio-Ball.png.

This introductory tutorial on the dplyr package is helpful as well.

The Datacamp course on using dplyr is helpful too.

 
