Hello. I have recently been learning and experimenting with the data.table package in the statistical program R. This data.table package is similar to dplyr with data manipulation techniques. Both dplyr and data.table have features which are similar to SQL queries (SELECT, WHERE keywords and so on).
This post will highlight some features from data.table. The dataset being used is an interesting babynames dataset. (Warning: This post is quite long.)
Table Of Contents
- The babynames Dataset
- Selecting Data Using data.table
- Selecting Rows Using i
- Selecting Columns in j
- The by Keyword
- Selecting Data in Rows i and Columns j (And The Most Popular Baby Names)
- Indexing and Keys
- Changing Column Names
- Changing Column Order
- Removing Columns
The babynames Dataset
The large dataset that we will be working with is from the babynames dataset in R. These baby names are from the United States. Information on this dataset can be found here. The babynames package has datasets named applicants, births, babynames and lifetables.
We can import the library and dataset in R as well as import the data.table library into R.
# We will work with the babynames dataset:
# To install use: install.packages("data.table")
In the last line, we converted the babynames dataset into a data table and set it to the variable baby_table.
We check if there is any missing data in the dataset.
## Check for any missing values:
The output shows that there are no NAs which represent missing values. The babynames dataset is already ready for data analysis.
# Dimensions of dataset: (1.82 million rows with 5 columns)
 1825433 5
# Column names of the baby_data:
 "year" "sex" "name" "n" "prop"
Running dim(baby_table) returns the dimension of the dataset which is about 1.825 million rows with 5 columns.
The names() function returns the names of the 5 columns. These 5 columns are year, sex, name, n (the counts for the name), prop (proportion or percentage of total population).
The str() function returns the structure of the dataset and information about the column variables.
From the summary() function, we retrieve information such as minimum, maximum, mean, median, etc. for the column variables.
The head() and tail() functions for a dataset returns the first 6 and last 6 (respectively) rows of data. One can use the n = argument to change the number of rows to view. I used n = 10 here for both head() and tail() functions.
We see that the names collected were from the years 1880 to 2014.
Selecting Data Using data.table
The general form for subseting a data table is DT[i, j, by]. This means that we take the data table DT, subset/select rows using “i”, then calculate “j” grouped by “by”.
If the above sounds confusing, please refer to the examples below.
Selecting Rows Using i
To start off, here is an example where we select rows 25 to 29 in two ways.
Here is an example where we select a row with a condition. We choose rows with the name Marissa.
I did not show the output from above as it would take too much space.
We can also select rows with multiple names as the condition.
Selecting Columns in j
We now select columns from the data table.
In this example, we select the year column from baby_table. (No output shown)
# Selecting columns in j:
# Years column:
years <- baby_table[, year];
What if we want to select more than one column. We can certainly do that by using the .() in the j column.
Instead of using .(), we could use the list() function as follows.
Note that if we do not use .() or list() in j we would not return a data.table object. We would instead get a vector and an undesirable output.
We can also run a few calculations to retrieve counts, means (average), standard deviations and such.
The by Keyword
In the previous example, we found the sum of n as the counts and the standard deviations of the counts. The results we found for both male and female genders.
What if we want to find counts, averages and such grouped by gender for example? We use the by keyword.
The example below counts the number of rows for each gender. Note that .N counts the number of rows.
The counts above were the number of rows for each gender. But each row has an n value of the counts per row. We have to sum these n values for each gender to retrieve the true counts for each gender.
We now check if the female total counts and the male total counts match the total counts from the last example in the previous section.
Selecting Data in Rows i and Columns j
We can now select data from rows i and columns j in the data table. The examples below are neat examples as we investigate the popular baby names.
The 15 Most Popular Baby Names
We find out the 15 most popular baby names (Male and female) from the dataset.
We see that the name James is the most popular baby name from the dataset
Notice that the by keyword was used and that a the sum function was used to create a new variable Name.Count. In addition, we used chaining with the order command. Instead of having
you can use chaining to do in one line as illustrated in the image above.
The 15 Most Popular Male Baby Names
Now we look at the 15 most popular male baby names.
We see that the 5 most popular male baby names are James, John, Robert, Michael and William.
The 15 Most Popular Female Baby Names
We now present the 15 most popular female baby names.
The top five female baby names from the dataset are Mary, Elizabeth, Patricia, Jennifer and Linda.
Indexing and Keys
We can also subset and select data using keys.
To start off, we create a new copy of the babynames dataset and name it key_table. The setkey() function takes a data table and a column(s) as reference. In the code below, we put the name column in setkey(). When I input key_table[“Mario”], “Mario” would refer to the “Mario” in the name column of key_table and all rows containing the name “Mario” would be shown.
# Indexing and keys:
key_table <- data.table(babynames)
# Set key on our data table with column name:
The next example is about selecting multiple names in setkey().
The nomatch argument
The nomatch argument is for cases where a specified value in the row
does not match in the data table rows. If I search for the name “What” alongside with “April” without the nomatch argument I will get NAs (missing values).
By default, the nomatch argument is set to NA as follows.
Setting nomatch to 0 will omit the row(s) which are not a match.
Using the keys we can also get counts. The two examples show counting where one is aggregated and the other one is separated.
Selecting Multiple Columns in Keys
Earlier, we set the name column in the setkey() function. We can also put multiple columns into setkey(). Here are some examples.
Changing Column Names
There are times when column names could use improvements. For example, the column n is understood by mostly math, stats and computer science type people but it is better to replace n with Count (or something similar).
Changing Column Order
In the original babynames dataset we had “year”, “sex”, “name”, “n”, “prop”. Here, we want to but the name before sex. We can also change the order of the columns as follows.
In data.table we can also remove columns. One should be careful when removing columns. It is like chopping a carrot; once it is cut you can’t put it back together.
Here is the case with removing one column.
We can also remove multiple columns. Here, we remove the year, sex and prop columns leaving us with the name and n columns.
Well, that was a lot to digest and take in. In short, the data.table package in R allows for data selection, data extraction and data manipulation of (large) data sets. Since computing power has increased tremendously over the years, it is easier and faster to analyze big data.
We have also learned some neat things from the babynames dataset which is close to 2 million rows and 5 columns. From 1880 to 2014, the most popular male name is James and the most popular female name is Mary. Overall the most popular names (1880 to 2014) are James, John, Robert, Michael and Mary.
In the US, there are more babies named Maria (Count: 538815) than Mario (Count: 141494) from 1880 to 2014.
It is unknown whether name variations such as Rob for Robert and Bill for William are counted as the same or not. One could use these baby names statistics to predict the most popular baby name (male and female) for 2016, 2017 , and so forth. Predictions can be made using this entire dataset or the most recent years (2004 to 2014) in the dataset for example.
More can done with this dataset such as look at the years from 2000 to 2014 for example.
The babynames dataset documentation: https://cran.r-project.org/web/packages/babynames/babynames.pdf
This cheat sheet for data.table has been really useful.
The interesting featured image is from http://parentingatoz.com/wp-content/uploads/2015/09/baby-name.jpg