Analyzing A Babynames Dataset In R With The dplyr Package

Hi there. This page is about using R’s dplyr package to analyze a babynames dataset. I have used the data.table package in R with this babynames dataset (Part 1 and Part 2) in the past, but I have forgotten some things. Also, I am much more used to dplyr than data.table.

 

 


Getting Started & The babynames Dataset

In R, we load the associated libraries.
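A minimal sketch of the loading step. I am assuming the three packages used on this page (babynames, dplyr and ggplot2) are already installed.

```r
# Load the packages used throughout this page.
# Assumes they were installed beforehand, e.g. with
# install.packages(c("babynames", "dplyr", "ggplot2")).
library(babynames)  # the baby names data
library(dplyr)      # data wrangling
library(ggplot2)    # plotting
```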

The babynames package in R contains the babynames dataset, which is where all the baby names are located.

 

The data is provided by the SSA, which is the United States Social Security Administration. Baby names from this data do not represent global baby names.

I save the babynames dataset into a variable called baby_data. Then I use the head(), tail() and str() functions to take a look at the data.
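Here is a short sketch of that step. The year range you see depends on which version of the babynames package is installed.

```r
library(babynames)

# Save the dataset into a variable
baby_data <- babynames

# Preview the first and last rows, then the structure
head(baby_data)
tail(baby_data)
str(baby_data)
```

The columns in the package are named year, sex, name, n and prop.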

 

From the head() and tail() functions we can see that the babynames data contains names from the years 1880 to 2015. This is a lot of data and baby names! Remember though that these names come from the SSA and are limited to the United States.

Naming Considerations

There are cases where certain names can be either Male or Female. Here are some examples:

  • Ashley is typically a female name but there are male Ashleys out there.
  • Jordan is usually a guy’s name but I have met a female Jordan.
  • Lindsey is usually a female name but there are male Lindseys.
  • Brett is usually a guy’s name. It can also be a woman’s name.
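A quick way to see this in the data is to total the counts for these four names by sex. This is only a sketch of one possible check; the name unisex_check is my own.

```r
library(babynames)
library(dplyr)

# Total counts by name and sex for the example names above
unisex_check <- babynames %>%
  filter(name %in% c("Ashley", "Jordan", "Lindsey", "Brett")) %>%
  group_by(name, sex) %>%
  summarise(total = sum(n)) %>%
  ungroup() %>%
  arrange(name, sex)

print(unisex_check)
```

A name with rows for both F and M has been recorded for both sexes.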

I am not sure if names such as James, Jim and Jimmy are recorded together as one or are recorded separately. Other examples include Dave & David, Sara & Sarah, Ashley & Ashlee, Mary & Marie, Greg & Gregory, William & Will & Bill, John & Johnny & Jon and so on.

My guess is that there are not many of these cases in the data.
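Rather than guessing, one could look at whether the variants show up as separate names. This is a sketch of one possible check using James, Jim and Jimmy; the name variant_check is my own.

```r
library(babynames)
library(dplyr)

# If the variants are recorded separately, each gets its own row here
variant_check <- babynames %>%
  filter(name %in% c("James", "Jim", "Jimmy")) %>%
  group_by(name) %>%
  summarise(total = sum(n)) %>%
  ungroup()

print(variant_check)
```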

 


Finding The Top 20 Baby Names

The dplyr package in R comes in handy when it comes to data wrangling and manipulation. The data needs to be in a format that is ready for plotting with ggplot2.

I create a variable called sorted_names which groups the data by name and Sex (across all years) and gives the total count for each group. The arrange(desc(Total)) part rearranges the rows from the highest counts to the lowest counts.
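A sketch of how sorted_names could be built. The package's own columns are lowercase (year, sex, name, n), so the capitalised names (Name, Sex, Count) are my assumption to match the Sex and Total labels used on this page.

```r
library(babynames)
library(dplyr)

# Assumption: rename the columns to capitalised labels
baby_data <- babynames %>%
  rename(Year = year, Sex = sex, Name = name, Count = n)

# Total count per name and sex across all years, highest first
sorted_names <- baby_data %>%
  group_by(Name, Sex) %>%
  summarise(Total = sum(Count)) %>%
  ungroup() %>%
  arrange(desc(Total))

head(sorted_names)
```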

The sorted_names data contains many rows but I just want the top twenty.
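One simple way to keep the top twenty is head(). This is a sketch; the variable name top_twenty and the capitalised column names are my own choices.

```r
library(babynames)
library(dplyr)

# Rebuild the sorted totals (capitalised column names are an assumption)
sorted_names <- babynames %>%
  rename(Sex = sex, Name = name, Count = n) %>%
  group_by(Name, Sex) %>%
  summarise(Total = sum(Count)) %>%
  ungroup() %>%
  arrange(desc(Total))

# Rows are already sorted, so the first 20 are the top twenty
top_twenty <- head(sorted_names, 20)
top_twenty
```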

The next lines of code are for making sure that the bars in the upcoming bar graphs are sorted in order (highest to lowest).
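ggplot2 orders bars by factor levels, so one way to control the order is to turn Name into a factor whose levels follow the totals. This is a sketch of that idea and not necessarily the exact code I used. The levels are reversed so that the largest total ends up at the top once the bars are flipped to horizontal.

```r
library(babynames)
library(dplyr)

# Top twenty names by total count (capitalised column names are an assumption)
top_twenty <- babynames %>%
  rename(Sex = sex, Name = name, Count = n) %>%
  group_by(Name, Sex) %>%
  summarise(Total = sum(Count)) %>%
  ungroup() %>%
  arrange(desc(Total)) %>%
  head(20)

# Factor levels in reverse order of Total: the last level is the top name
top_twenty$Name <- factor(top_twenty$Name,
                          levels = rev(unique(top_twenty$Name)))

levels(top_twenty$Name)
```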

A bar graph within ggplot2 can be created now. Here is the code and output.
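A sketch of what such a bar graph could look like. The titles, label sizes and positions are placeholders of mine, and the capitalised column names are an assumption.

```r
library(babynames)
library(dplyr)
library(ggplot2)

# Top twenty names by total count
top_twenty <- babynames %>%
  rename(Sex = sex, Name = name, Count = n) %>%
  group_by(Name, Sex) %>%
  summarise(Total = sum(Count)) %>%
  ungroup() %>%
  arrange(desc(Total)) %>%
  head(20)

# Order the bars by total (largest at the top after coord_flip)
top_twenty$Name <- factor(top_twenty$Name,
                          levels = rev(unique(top_twenty$Name)))

# Bar graph with counts in millions, coloured by sex
p <- ggplot(top_twenty, aes(x = Name, y = Total / 1e6, fill = Sex)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(Total / 1e6, 2)), hjust = 1.1, size = 3) +
  coord_flip() +
  labs(x = "Name", y = "Count (millions)",
       title = "Top 20 Baby Names") +
  theme(plot.title = element_text(hjust = 0.5))

print(p)
```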

Notes

  • I put the counts in millions and have these counts displayed with the geom_text() add on function.
  • coord_flip() converts the bar graph from vertical bars to horizontal bars.
  • Setting fill = Sex gives the bars different colours depending on gender. This makes it easy for the viewer to tell male and female names apart.
  • The labs() and theme() functions are for dealing with the labels and title.
  • There are more male names than female names in the top 20 most popular baby names from 1880 to 2015 (in the USA).
  • One could investigate even further why some of these names are so popular over time (historically).
  • These names are from 1880 to 2015 but these names do not necessarily match the popular baby names from the year of 2016. Link: https://www.ssa.gov/oact/babynames/

 


Finding The Top 20 Female Baby Names

The code for the popular female baby names is not much different from the code in the previous section. I add a filter(Sex == "F") part in the first few lines of code.
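A sketch of the filtering step. Note the double equals sign: filter() needs a comparison, and a single = would be treated as a named argument and give an error. The variable name top_female and the capitalised column names are my own.

```r
library(babynames)
library(dplyr)

# Top twenty female names by total count
top_female <- babynames %>%
  rename(Sex = sex, Name = name, Count = n) %>%
  filter(Sex == "F") %>%
  group_by(Name, Sex) %>%
  summarise(Total = sum(Count)) %>%
  ungroup() %>%
  arrange(desc(Total)) %>%
  head(20)

top_female
```

The factor ordering and the ggplot2 code then work the same way as for the overall top twenty.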

 

Mary is the biggest winner from the data and from the bar graph. There must be something about the name Mary and its popularity which would require investigation. The name Elizabeth is in second place. As mentioned earlier, it is unknown if Liz, Liza or Beth would be counted as Elizabeth.

 


Finding The Top 20 Male Baby Names

Here are the top twenty male baby names in the form of a bar graph and its associated code.
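The data preparation is the female version with "M" in place of "F". A sketch, with top_male and the capitalised column names as my own choices:

```r
library(babynames)
library(dplyr)

# Top twenty male names by total count
top_male <- babynames %>%
  rename(Sex = sex, Name = name, Count = n) %>%
  filter(Sex == "M") %>%
  group_by(Name, Sex) %>%
  summarise(Total = sum(Count)) %>%
  ungroup() %>%
  arrange(desc(Total)) %>%
  head(20)

top_male
```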

 


Popular Baby Names By Letter

Some may find this metric useful and some may find it somewhat useless. I wanted to find the popular baby names by letter as an exercise in R. For me, this turned out to be somewhat technical.

grepl() Function

The grepl() function in R asks the user for a pattern (or regular expression) and for a string. If the pattern exists in the string, grepl() returns TRUE. If the pattern is not there, it returns FALSE.
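A few small examples in base R. The ^ symbol anchors the pattern to the start of the string.

```r
# Does the string start with a capital J?
starts_with_j <- grepl("^J", "John")    # TRUE
mary_check    <- grepl("^J", "Mary")    # FALSE

# grepl() is vectorised, so it returns one TRUE/FALSE per string
vec_check <- grepl("^J", c("James", "Mary", "Jessica"))  # TRUE FALSE TRUE

print(starts_with_j)
print(mary_check)
print(vec_check)
```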

The grepl() function can be combined with the filter() function from the dplyr package. Here are some examples with the babynames dataset.
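As a sketch, here is how names starting with the letter A could be picked out and counted. The variable names are my own and the package's lowercase column names are used.

```r
library(babynames)
library(dplyr)

# Keep only the rows where the name starts with A
a_names <- babynames %>%
  filter(grepl("^A", name))

# Total number of babies with a name starting with A
# (as.numeric() guards against integer overflow on large sums)
a_total <- sum(as.numeric(a_names$n))
a_total
```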

Obtaining the counts for each letter would require a loop instead of doing the above procedure 26 times. The LETTERS variable contains the 26 (capitalized) letters and the loop will go through these letters one at a time.

In the for loop below I include the paste0() function with the filter() and grepl() functions. The paste0() function will combine the characters "^[", the element in LETTERS without the quotes, and "]". For the letter J this gives the pattern "^[J]". The counts for each letter are appended to a counts vector.
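A sketch of such a loop. The names counts, pattern, letter_rows and letter_counts are my own, and the package's lowercase column names are used.

```r
library(babynames)
library(dplyr)

counts <- numeric(0)

for (letter in LETTERS) {
  # Build a pattern such as "^[J]" (names starting with J)
  pattern <- paste0("^[", letter, "]")

  # Rows whose name starts with this letter
  letter_rows <- babynames %>%
    filter(grepl(pattern, name))

  # Append the total count for this letter
  counts <- c(counts, sum(as.numeric(letter_rows$n)))
}

# One row per letter, ready for plotting
letter_counts <- data.frame(Letter = LETTERS, Count = counts)
letter_counts
```

Growing a vector inside a loop is fine for 26 letters. For larger jobs it is better to create the vector at its full length first.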

The letter J takes first place due to popular baby names such as John, James, Joseph, Jennifer and Jessica. Second place goes to the letter M. Notable names starting with M include Mary, Margaret, Michael and Matthew.

Note that I have removed some of the background lines in the plot with the use of commands such as panel.grid.minor = element_blank() and panel.grid.major = element_blank().
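A sketch of a letter plot with the grid lines removed. To keep the example short, the counts are computed with substr() and group_by() instead of the loop, which is a shortcut of mine. The labels are placeholders.

```r
library(babynames)
library(dplyr)
library(ggplot2)

# Total count per first letter (shortcut instead of the for loop)
letter_counts <- babynames %>%
  group_by(Letter = substr(name, 1, 1)) %>%
  summarise(Count = sum(as.numeric(n))) %>%
  ungroup()

# Bar graph with major and minor grid lines removed
p_letters <- ggplot(letter_counts, aes(x = Letter, y = Count / 1e6)) +
  geom_bar(stat = "identity") +
  labs(x = "First Letter", y = "Count (millions)",
       title = "Baby Name Counts By First Letter") +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        plot.title = element_text(hjust = 0.5))

print(p_letters)
```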

 

 


