Analyzing A Baby Names Dataset In R Part 2

Hi there. Back in 2016, I made a page on analyzing a baby names dataset in R with the data.table package. This second part looks at the most popular baby names again. Part one did not have any graphs or plots, so this second part adds bar graphs.


Table Of Contents

The babynames Dataset

Preparing The Data For Graphing

Finding The Top 20 Baby Names

The 20 Most Popular Male Baby Names

The 20 Most Popular Female Baby Names

Notes And References


The babynames Dataset

There is a large dataset which contains baby names and their counts. It is called babynames and it comes with the babynames package in R. The R documentation for the dataset gives some details about the data. (SSA stands for the Social Security Administration in the United States, which provides the data.)

In R, load the babynames, ggplot2 and data.table packages using the library() function.
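A minimal sketch of the loading step, assuming all three packages are already installed (for example with install.packages()):

```r
# Load the dataset package, the plotting package and data.table
library(babynames)
library(ggplot2)
library(data.table)
```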

 

In the babynames library, the dataset babynames contains the most popular baby names.

It appears that the dataset has data on baby names from 1880 to 2014. (It is uncertain if this data is 100% accurate. We go with what we have.)

The structure of the dataset can be examined using str().

The output from str(baby_data) tells us that this large dataset has 1,825,433 rows and 5 columns.
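Here is a small sketch of this first look at the data. The name baby_data comes from the text above, but creating it with as.data.table() is an assumption on my part. The row count and the year range that you see depend on the installed version of the babynames package.

```r
library(babynames)
library(data.table)

# Assumption: baby_data is the babynames dataset converted to a data.table
baby_data <- as.data.table(babynames)

head(baby_data)          # first six rows
range(baby_data$year)    # earliest and latest year in the installed version
str(baby_data)           # column types plus the number of rows and columns
```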


Preparing The Data For Graphing

The current column names are not the best. The column names can be changed by using colnames() in R.
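A minimal sketch of the renaming step. The original column names are year, sex, name, n and prop. The new names below are assumed for illustration, since only Name and the counts are mentioned in the text.

```r
library(babynames)
library(data.table)

baby_data <- as.data.table(babynames)

# Assumed new names, in the same order as the original columns
colnames(baby_data) <- c("Year", "Sex", "Name", "Count", "Proportion")

colnames(baby_data)  # check that the names were changed
```

The data.table package also offers setnames(), which renames columns by reference instead of making a copy.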

 

Using features from the data.table package, the rows in the dataset are grouped by Name and the Counts are summed within each name.

The line sorted_names[order(-Name.Count)] then orders the names by their total count in descending order (most popular to least popular).

Since I want the 20 most popular baby names, I take the first 20 rows from sorted_names and store them in a new variable.
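The steps above can be sketched as follows. The names sorted_names and Name.Count come from the text. The column names Year, Sex, Count and Proportion and the variable top_twenty_names are assumptions for illustration.

```r
library(babynames)
library(data.table)

baby_data <- as.data.table(babynames)
colnames(baby_data) <- c("Year", "Sex", "Name", "Count", "Proportion")  # assumed names

# Group by Name and add up the counts over all years and both sexes
sorted_names <- baby_data[, .(Name.Count = sum(Count)), by = Name]

# Order from most popular to least popular
sorted_names <- sorted_names[order(-Name.Count)]

# Keep the first 20 rows (hypothetical variable name)
top_twenty_names <- sorted_names[1:20]
top_twenty_names
```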

 


Finding The Top 20 Baby Names

Some Initial Bar Graphs

The data is now formatted for creating bar graphs using ggplot2. Here is a first attempt at a bar graph.
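A minimal sketch of what such a first attempt might look like. The exact code is not shown here, so the vertical axis labels (angle = 90) are my guess, and the column and variable names other than Name and Name.Count are assumed.

```r
library(babynames)
library(data.table)
library(ggplot2)

baby_data <- as.data.table(babynames)
colnames(baby_data) <- c("Year", "Sex", "Name", "Count", "Proportion")  # assumed names
top_twenty_names <- baby_data[, .(Name.Count = sum(Count)), by = Name][order(-Name.Count)][1:20]

# Basic bar graph: one bar per name, bar height equal to the total count
p <- ggplot(top_twenty_names, aes(x = Name, y = Name.Count)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # assumed rotation
print(p)
```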

This bar graph looks okay for the most part. However, the counts are in scientific notation and the names are somewhat hard to read as you may need to tilt your head.

 

The next attempt has the bars sideways.
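One way to turn the bars sideways in ggplot2 is coord_flip(). This is a sketch under the same assumed column and variable names, and it may not match the original code exactly.

```r
library(babynames)
library(data.table)
library(ggplot2)

baby_data <- as.data.table(babynames)
colnames(baby_data) <- c("Year", "Sex", "Name", "Count", "Proportion")  # assumed names
top_twenty_names <- baby_data[, .(Name.Count = sum(Count)), by = Name][order(-Name.Count)][1:20]

# Same bars as before, but flipped so that the names read horizontally
p <- ggplot(top_twenty_names, aes(x = Name, y = Name.Count)) +
  geom_col() +
  coord_flip()
print(p)
```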

This bar graph looks a little bit better, but the counts on the axis are still hard to read and could use some scaling.

An Updated Version Of The Bar Graph

This updated version of the bar graph has the bars sorted from the most popular baby name to the 20th most popular baby name. The bars also have labeled counts, which eliminates the guesswork. Here is the code and output for it.
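A sketch of a sorted and labeled bar graph along these lines. The colours, label positions and titles are illustrative choices, and the column and variable names other than Name and Name.Count are assumed.

```r
library(babynames)
library(data.table)
library(ggplot2)

baby_data <- as.data.table(babynames)
colnames(baby_data) <- c("Year", "Sex", "Name", "Count", "Proportion")  # assumed names
top_twenty_names <- baby_data[, .(Name.Count = sum(Count)), by = Name][order(-Name.Count)][1:20]

# reorder() sorts the bars by count; after coord_flip() the largest bar is on top
p <- ggplot(top_twenty_names, aes(x = reorder(Name, Name.Count), y = Name.Count)) +
  geom_col(fill = "steelblue") +
  # Print each count inside its bar, with commas as thousands separators
  geom_text(aes(label = formatC(Name.Count, format = "d", big.mark = ",")),
            hjust = 1.1, colour = "white", size = 3) +
  coord_flip() +
  # Show plain numbers on the count axis instead of scientific notation
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE, trim = TRUE)) +
  labs(x = "Name", y = "Count", title = "The 20 Most Popular Baby Names")
print(p)
```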


The 20 Most Popular Male Baby Names

We can also look at the most popular male baby names. The code uses features from data.table and it is quite similar to the code above. Here is the code and output.
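A sketch of the male version, assuming the sex column is coded as "M" and "F" (as it is in the babynames package) and using the same assumed column names. The variable top_twenty_male is hypothetical.

```r
library(babynames)
library(data.table)
library(ggplot2)

baby_data <- as.data.table(babynames)
colnames(baby_data) <- c("Year", "Sex", "Name", "Count", "Proportion")  # assumed names

# Keep male rows only, sum the counts by name, sort and take the first 20
top_twenty_male <- baby_data[Sex == "M", .(Name.Count = sum(Count)), by = Name][order(-Name.Count)][1:20]

p <- ggplot(top_twenty_male, aes(x = reorder(Name, Name.Count), y = Name.Count)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = formatC(Name.Count, format = "d", big.mark = ",")),
            hjust = 1.1, colour = "white", size = 3) +
  coord_flip() +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE, trim = TRUE)) +
  labs(x = "Name", y = "Count", title = "The 20 Most Popular Male Baby Names")
print(p)
```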

From 1880 to 2014, James is the most popular male baby name (and the most popular baby name overall in this time frame).


The 20 Most Popular Female Baby Names

Here is the code and output for the 20 most popular female baby names from 1880 to 2014.
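A sketch of the female version, which only changes the filter on the sex column. As before, the column names and the variable top_twenty_female are assumptions for illustration.

```r
library(babynames)
library(data.table)
library(ggplot2)

baby_data <- as.data.table(babynames)
colnames(baby_data) <- c("Year", "Sex", "Name", "Count", "Proportion")  # assumed names

# Keep female rows only, sum the counts by name, sort and take the first 20
top_twenty_female <- baby_data[Sex == "F", .(Name.Count = sum(Count)), by = Name][order(-Name.Count)][1:20]

p <- ggplot(top_twenty_female, aes(x = reorder(Name, Name.Count), y = Name.Count)) +
  geom_col(fill = "darkorange") +
  geom_text(aes(label = formatC(Name.Count, format = "d", big.mark = ",")),
            hjust = 1.1, colour = "white", size = 3) +
  coord_flip() +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE, trim = TRUE)) +
  labs(x = "Name", y = "Count", title = "The 20 Most Popular Female Baby Names")
print(p)
```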

Mary is overwhelmingly the most popular female baby name from 1880 to 2014.


Notes And References

As large as this dataset is, one should be wary of biases in the data. According to the package documentation, the data is provided by the SSA (the U.S. Social Security Administration) and includes names with at least 5 uses, so it reflects U.S. records rather than a worldwide sample.

It is unknown whether cases such as Ana and Anna should be considered the same name. Likewise with cases such as Liz, Elizabeth and Liza.

The time frame is from 1880 to 2014. One could look into a more recent time window such as 2000 to 2014.
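As a rough sketch, a more recent window could be selected by filtering on the year before summing the counts. The column names and the variable recent_top_twenty are assumptions for illustration.

```r
library(babynames)
library(data.table)

baby_data <- as.data.table(babynames)
colnames(baby_data) <- c("Year", "Sex", "Name", "Count", "Proportion")  # assumed names

# Keep only the years 2000 to 2014, then sum by name, sort and take the first 20
recent_top_twenty <- baby_data[Year >= 2000 & Year <= 2014,
                               .(Name.Count = sum(Count)),
                               by = Name][order(-Name.Count)][1:20]
recent_top_twenty
```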

  • R Graphics Cookbook by Winston Chang (2012)
  • http://rstudio-pubs-static.s3.amazonaws.com/7433_4537ea5073dc4162950abb715f513469.html
  • http://www.statmethods.net/management/sorting.html
  • https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf
