Analyzing Big Data and Baby Names In R Using data.table

 

Hello. I have recently been learning and experimenting with the data.table package in the statistical program R. This data.table package is similar to dplyr with data manipulation techniques. Both dplyr and data.table have features which are similar to SQL queries (SELECT, WHERE keywords and so on).

This post will highlight some features from data.table. The dataset being used is an interesting babynames dataset. (Warning: This post is quite long.)


Table Of Contents


The babynames Dataset

The large dataset that we will be working with is from the babynames dataset in R. These baby names are from the United States. Information on this dataset can be found here. The babynames package has datasets named applicants, births, babynames and lifetables.

We can import the library and dataset in R as well as import the data.table library into R.

In the last line, we converted the babynames dataset into a data table and set it to the variable baby_table.


We check if there is any missing data in the dataset.

The output shows that there are no NAs which represent missing values. The babynames dataset is already ready for data analysis.

Running dim(baby_table) returns the dimension of the dataset which is about 1.825 million rows with 5 columns.

The names() function returns the names of the 5 columns. These 5 columns are year, sex, name, n (the counts for the name), prop (proportion or percentage of total population).

babynames01

The str() function returns the structure of the dataset and information about the column variables.

From the summary() function, we retrieve information such as minimum, maximum, mean, median, etc. for the column variables.

babynames02

The head() and tail() functions for a dataset returns the first 6 and last 6 (respectively) rows of data. One can use the n = argument to change the number of rows to view. I used n = 10 here for both head() and tail() functions.

We see that the names collected were from the years 1880 to 2014.


Selecting Data Using data.table

The general form for subseting a data table is DT[i, j, by]. This means that we take the data table DT, subset/select rows using “i”, then calculate “j” grouped by “by”.

If the above sounds confusing, please refer to the examples below.


Selecting Rows Using i 

To start off, here is an example where we select rows 25 to 29 in two ways.

babynames03

Here is an example where we select a row with a condition. We choose rows with the name Marissa.

babynames03b

I did not show the output from above as it would take too much space.

We can also select rows with multiple names as the condition.

babynames04


Selecting Columns in j

We now select columns from the data table.

In this example, we select the year column from baby_table. (No output shown)

What if we want to select more than one column. We can certainly do that by using the .() in the j column.

babynames05

Instead of using .(), we could use the list() function as follows.

babynames06

Note that if we do not use .() or list() in j we would not return a data.table object. We would instead get a vector and an undesirable output.

We can also run a few calculations to retrieve counts, means (average), standard deviations and such.

babynames07


The by Keyword

In the previous example, we found the sum of n as the counts and the standard deviations of the counts. The results we found for both male and female genders.

What if we want to find counts, averages and such grouped by gender for example? We use the by keyword.

The example below counts the number of rows for each gender. Note that .N counts the number of rows.

by01

The counts above were the number of rows for each gender. But each row has an n value of the counts per row. We have to sum these n values for each gender to retrieve the true counts for each gender.

by02

We now check if the female total counts and the male total counts match the total counts from the last example in the previous section.

by03


Selecting Data in Rows i and Columns j

We can now select data from rows i and columns j in the data table. The examples below are neat examples as we investigate the popular baby names.

The 15 Most Popular Baby Names

We find out the 15 most popular baby names (Male and female) from the dataset.

babynames08

We see that the name James is the most popular baby name from the dataset

Notice that the by keyword was used and that a the sum function was used to create a new variable Name.Count. In addition, we used chaining with the order command. Instead of having

chain01

you can use chaining to do in one line as illustrated in the image above.

chain02


The 15 Most Popular Male Baby Names

Now we look at the 15 most popular male baby names.

babynames09

We see that the 5 most popular male baby names are James, John, Robert, Michael and William.


The 15 Most Popular Female Baby Names

We now present the 15 most popular female baby names.

babynames10

The top five female baby names from the dataset are Mary, Elizabeth, Patricia, Jennifer and Linda.


Indexing and Keys

We can also subset and select data using keys.

To start off, we create a new copy of the babynames dataset and name it key_table. The setkey() function takes a data table and a column(s) as reference. In the code below, we put the name column in setkey(). When I input key_table[“Mario”], “Mario” would refer to the “Mario” in the name column of key_table and all rows containing the name “Mario” would be shown.

babynames11

The next example is about selecting multiple names in setkey().

babynames12


The nomatch argument

The nomatch argument is for cases where a specified value in the row
does not match in the data table rows. If I search for the name “What” alongside with “April” without the nomatch argument I will get NAs (missing values).

babynames13

By default, the nomatch argument is set to NA as follows.

babynames14

Setting nomatch to 0 will omit the row(s) which are not a match.

babynames15


Counting Methods

Using the keys we can also get counts. The two examples show counting where one is aggregated and the other one is separated.

babynames16


Selecting Multiple Columns in Keys

Earlier, we set the name column in the setkey() function. We can also put multiple columns into setkey(). Here are some examples.

babynames17


Changing Column Names

There are times when column names could use improvements. For example, the column n is understood by mostly math, stats and computer science type people but it is better to replace n with Count (or something similar).

babynames18


Changing Column Order

In the original babynames dataset we had “year”, “sex”, “name”, “n”, “prop”. Here, we want to but the name before sex. We can also change the order of the columns as follows.

babynames19


Removing Columns

In data.table we can also remove columns. One should be careful when removing columns. It is like chopping a carrot; once it is cut you can’t put it back together.

Here is the case with removing one column.

babynames20

We can also remove multiple columns. Here, we remove the year, sex and prop columns leaving us with the name and n columns.

babynames21


Summary

Well, that was a lot to digest and take in. In short, the data.table package in R allows for data selection, data extraction and data manipulation of (large) data sets. Since computing power has increased tremendously over the years, it is easier and faster to analyze big data.

We have also learned some neat things from the babynames dataset which is close to 2 million rows and 5 columns. From 1880 to 2014, the most popular male name is James and the most popular female name is Mary. Overall the most popular names (1880 to 2014) are James, John, Robert, Michael and Mary.

In the US, there are more babies named Maria (Count: 538815) than Mario (Count: 141494) from 1880 to 2014.

It is unknown whether name variations such as Rob for Robert and Bill for William are counted as the same or not. One could use these baby names statistics to predict the most popular baby name (male and female) for 2016, 2017 , and so forth. Predictions can be made using this entire dataset or the most recent years (2004 to 2014) in the dataset for example.

More can done with this dataset such as look at the years from 2000 to 2014 for example.


References

The babynames dataset documentation: https://cran.r-project.org/web/packages/babynames/babynames.pdf

This cheat sheet for data.table has been really useful.

https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf

The interesting featured image is from http://parentingatoz.com/wp-content/uploads/2015/09/baby-name.jpg

Leave a Reply