Finding The Most Common Baby Names In A Dataset With Python

Hi there. This page focuses on finding the most common baby names from a dataset with the Python programming language. The Python packages pandas and matplotlib are used heavily.

 

Source: http://latesthdwallpapers1.com/wp-content/uploads/2014/12/Babies-Images-1.jpg

 

 


Sections

 

  • References
  • Importing Data
  • Most Common Baby Names Of 2016
  • The Full Baby Names Dataset
  • Popular Baby Names With matplotlib Graphs

 

 


References

  • https://matplotlib.org/gallery/lines_bars_and_markers/barh.html
  • https://stackoverflow.com/questions/28931224/adding-value-labels-on-a-matplotlib-bar-chart
  • Python For Data Analysis By Wes McKinney
  • https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html
  • https://matplotlib.org/users/legend_guide.html
  • https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html
  • https://stackoverflow.com/questions/16096627/selecting-a-row-of-pandas-series-dataframe-by-integer-index

 

 


Importing Data

From the website source link https://www.ssa.gov/oact/babynames/limits.html, you can download the zip file. You have to unzip the files into a folder on your computer. Do make sure that you know the directory of the folder where the files are. The os module in Python will allow for importing and reading data (offline).

pandas, matplotlib and numpy are imported into Python as well.

 

 

Before reading in files offline, set the working directory where the files are unzipped to. (** = censored)

 

 


Most Common Baby Names Of 2016

One of the unzipped files has the file name of yob2016.txt. With pandas’ read_csv() function, this .txt file can be read into Python. The .head() and .tail() methods are used to preview the data (output not shown).

 

 

Obtaining the number of 2016 births for each gender can be done with the use of the .groupby() method. The column Sex is in .groupby() and the .sum() method is used afterwards.

 

 

Filtering out rows in the pandas environment can be done by using the format of df[ condition ] where df is a Pandas dataframe. For finding the male subset, I select the rows where there is an M under the Sex column of babynames_2016.

 

 

Getting the females is not much different in terms of the code. (Harper is a female name??!!!)

 

Bar Graphs For Each Gender

 

For an audience, it is more appropriate to display the counts in the form of bar plots instead of showing outputs. I want the top 15 baby names of 2016 for each gender. The first one is for the males.

 

 

The plots in this page are all done with Python’s matplotlib module. In the code below, the setup starts with fig, ax = plt.subplots() and having names and respective counts as variables.

 

 

This next block of code starts with ax.barh() for generating a horizontal bar graph. The functions after are for adding in labels, titles and adjusting the viewing range for x. For the colours, you can use common ones such as blue, green, and red. Alternatively, you can use HTML colour codes to get the colour and shade you want in your plots.

In the for loop below with enumerate(), ax.text() is used for adding in text labels with the bars.

 

 

Once the plot is ready, you can display it with plt.show().

 

 

 

With the female baby names, the code is not much different. I have used a few different colours.

 

 


The Full Baby Names Dataset

The previous section only looked at babynames from 2016. What if we want all of the baby names from 1880 to 2016? From the zip file there is a .txt file for each year from 1880 to 2016. The format of the file is of yob####.txt where # is a numeric digit.

After the initialization of years, pieces, and columns, a for loop is used to read in all the unzipped files and place them into the pieces list. The pandas concat function is used to put everything into one pandas dataframe.

 

 

For checking purposes, the .head() and .tail() methods can be used.

 

Total Births By Gender

To obtain the total number of births for each gender, the .groupby method on the Sex column is used. After that, I take the sum and extract the Count column.

 

 

 

Total Births Per Year With Graph

The groupby method can also be used to obtain the number of births per year.

 

 

Seeing the numbers from the output is not the best for seeing trends. A simple plot with .plot allows us to see the upward trend of the number of births.

 

 

 

From the data, the number of births has been increasing from the years 1880 to 1960. Then it decreases from 1960 to about 1975 and increases to the 2010 level of about 3.6 million. To really understand why birth numbers have increased, one may need to consider factors such as immigration, economic cycles, financial situations and such.

The above plot is a quick and simple one. This next plot is a more refined version where the labels look cleaner. There is no legend though.

 

 

 

Total Births Per Year By Gender

To obtain the total births per year for each gender, pandas’ pivot_table function helps with this.  You need to specify the pandas dataframe, the values, the index column, the (new) columns and function is applied.

 

 

From the .head() output you can see the total birth counts for each year for each gender. A simple plot can be made for visualizing these results.

 

 

 

A more refined plot of the above requires a bit more work.

 

 


Popular Baby Names With matplotlib Bar Graphs

This section looks at obtaining the most popular baby names from the dataset in Python. The two subsections focus on the most popular female names and the most popular male names.

 

 

 

Popular Female Baby Names

From the names Pandas dataframe, the female names can be obtained through filtering. To obtain, the female names counts throughout all years the .groupby(), .sum() and sort_values() methods are used.

 

 

For the horizontal bar plot, the baby names, the counts are obtained to help with the plot setup. The main line after fig, ax = plt.subplots() is ax.barh(). After ax.barh(), I insert the ticks, labels, title and set limits for the x variable. The enumerate for loop part is for adding the number labels with the bars.

 

 

 

Mary comes in as the clear number one female baby name followed by Elizabeth, Patricia and Jennifer. It is uncertain whether related names such as Marie and Maria are counted the same as Mary. The name Sarah has the variation Sara but it is not known if those are the same during the counting.

 

Popular Male Baby Names

The code for the popular male baby names is not much different.

 

 

 

 

From the dataset, the top names are James, John, Robert, Michael and William. Names like James have variations such as Jim and William is related to Will and Bill. It is not known if related names are counted separately or counted together.

Leave a Reply