Hello. I have been playing around with data cleaning and the tidyr package in the statistics programming language R. Here is a guide on it.
The Importance of Clean Data
As someone who took numerous statistics courses, the datasets given us to us students were often not very large and they were formatted in a way that was ready for statistical modeling and for statistical data analyses.
What was not mentioned in my experiences was data handling and data cleaning. The clean data was taken for granted.
In the event of non-organized data, data cleaning is needed in order for the data to be ready for tasks such as data manipulation, data extraction, statistical modeling and so on.
The guide below will be a brief guide to the tidyr package in R and its functions.
Assuming that tidyr is installed into R, load the package using the library function.
In a data set, we have variables as column names and observations in each row.
With a tidy data set, each variable has its own column and each observation is in its own row.
Column names which have X1, X2, X3 and so on are not very good as they are not descriptive enough. More appropriate column variables would be Price, Count and Gender for example.
Values such as 1 and 2 for Gender are not ideal. They can be replaced by Male and Female.
Years of 2000, 2001, 2002 and so on are not good as column names as we could have a Year column with 2000, 2001, 2000 inside the column corresponding to the observation/row.
The next sections will look into various functions in R’s tidyr package for data cleaning, data manipulation and formatting.
The gather() function in tidyr is for when you have column names which are not variables such as years 1998, 1999, 2000 for example. Gather() gathers columns into rows.
This example in R shows how the gather() function is applied to the barley (named as immer) data from the library(MASS) dataset package in R.
From the R documentation, here is information about the variables in the immer dataset.
- Loc: Location
- Var: The variety of barley (“manchuria”, “svansota”, “velvet”, “trebi” and “peatland”).
- Y1: Yield in 1931.
- Y2: Yield in 1932.
In the above code, a peek of the dataset was done along with looking at the variables. The column names were renamed before the gather() function was used.
In the gather function, barley was used as the dataset. New columns “Year” and “Yield” were created from columns three and four from barley as indicated by c(3,4).
From gather(), the years 1931 and 1932 are in the new Year column. The values from the old columns of 1931 and 1932 are in the values column with the values corresponding with the year 1931 or 1932.
The opposite of gather is spread(). Spread() takes rows into columns.
Here is an example using the suicide dataset in the faraway data package.
The R documentation for the suicide dataset in the faraway package is as follows:
- y: number of people
- cause: method used – a factor with levels drug (suicide by solid or liquid matter), gas, gun (guns, knives or explosives) hang (hanging, strangling, suffocating or drowning, jump other
- age: a factor with levels m (middle-aged) o (old) y (young)
- sex: a factor with levels f m
After a quick look at the suicide dataset, the column names are not descriptive enough. It is best to change y into something like Counts.
The spread() function here uses the suicide_data dataset, separates the y counts by sex (male and female). The colnames part of the code renames the columns as specified by the user. The new columns are Cause, Age_Group, Female_Count, Male_Count.
In the Age_Group column, the factors of m, o and y are vague and could use some edits. The following code will take these m, o and y values as numerics, change them to “Middle_Aged”, “Old” and “Young” respectively and convert them as factors.
Reversing spread() with gather()
We can undo the spread function with tidyr’s gather() function as follows:
Several columns can be combines into one column using the unite() command. I was not able to find a suitable dataset for an example. Instead, randomly generated hours, minutes and days were used here. The unite() function will be used to combine hours, minutes and seconds into the HH:MM:SS time format.
# Fake data, generate hours, minutes and seconds
hours <- round(runif(50, min = 0, max = 5), 0)
minutes <- round(runif(50, min = 0, max = 60), 0)
seconds <- round(runif(50, min = 0, max = 60), 0)
# Put hours, minutes and seconds together as a data frame:
sample <- data.frame(Hours = hours, Minutes = minutes, Seconds = seconds)
# A peek at the data:
Hours Minutes Seconds
1 0 3 15
2 1 35 59
3 3 60 47
4 1 24 31
5 5 34 27
6 4 38 44
# Put a zero as first digit if value is less than ten to get 0X format.
sample$Hours <- ifelse(sample$Hours < 10, paste0(0, sample$Hours), as.character(sample$Hours))
sample$Minutes <- ifelse(sample$Minutes < 10, paste0(0, sample$Minutes), as.character(sample$Minutes))
sample$Seconds <- ifelse(sample$Seconds < 10, paste0(0, sample$Seconds), as.character(sample$Seconds))
# Another peek to check:
Hours Minutes Seconds
1 00 03 15
2 01 35 59
3 03 60 47
4 01 24 31
5 05 34 27
6 04 38 44
# Unite the Hours, Minutes, Seconds columns, this combined column is Time:
united <- unite(sample, Time , Hours, Minutes, Seconds, sep = ":")
# One last look:
With the unite() function, the dataset sample is specified. Time is the new column which combines Hours, Minutes, Seconds. The “:” in the sep argument gives the HH:MM:SS time format.
Before the unite function paste0() was used. paste0() concatenates/combines strings together. I used ifelse() for the condition where the hour/minute/second is less than 10. If it was less than 10, a 0 was put in front. If not, then no 0 is needed in front as it satisfies the two digits in HH, MM or SS.
Here is an example of paste0 and the brother function paste():
# paste0 (mini tutorial)
paste("a", "1", sep = ".")
The separate() function allows for splitting one column into multiple columns. I was not able to find a suitable dataset for an example. A made up sample dataset was created as an example.
Suppose we have treatments A and B with numbers 1 and 2 as Male and Female (respectively). But the column treat_gender has values A_1, A_2, B_1 and B_2. The function separate() will be used to split into letters and numbers.
# separate into multiple columns example:
# Creating sample data:
# 1 is male, 2 is female
treat_gender <- c("A_1", "A_2", "B_1", "B_2")
Count <- c(3, 8 , 10, 6)
sample <- data.frame(treat_gender, Count)
1 A_1 3
2 A_2 8
3 B_1 10
4 B_2 6
# Seprate column treat_gender into Treatment and Gender columns:
data <- separate(sample, treat_gender, c("Treatment", "Gender"), sep = "_")
# Change the 1 to Male and the 2 cases to Female
data$Gender[data$Gender == 1] <- "Male"
data$Gender[data$Gender == 2] <- "Female"
Treatment Gender Count
1 A Male 3
2 A Female 8
3 B Male 10
4 B Female 6
The tidyr package R makes data cleaning and data formatting much easier. The concepts here are not abstractly difficult but it does take a lot of practice.
The tidyr package itself is not enough for data cleaning. Other techniques and functions may be needed.
Some useful resources/references include:
- Datacamp Courses have been helpful.
- Various tidyr tutorials online (Can be easily google searched)
The featured image is from http://cdn.skilledup.com/wp-content/uploads/2015/01/paid-R-Program-courses-Feature_1290x688_MS.jpg.