Hello. I have been playing around with a dataset I found at https://catalog.data.gov/dataset/city-of-seattle-wages-comparison-by-gender-average-hourly-wage-by-age-353b2. The dataset covers City of Seattle wages by gender. It was first published on July 16, 2013 and last modified on September 27, 2016.
The analysis of this dataset is done in the statistical programming language R. Other tools that can import and analyze data, such as Python (with its associated modules), can be used too.
A First Look At The Dataset
We first import the data into R using the dataset’s download link.
In R, a dataset can be imported directly from the web as long as there is a download link to a non-zipped file. In this case the file is a .csv, so I use the read.csv() function.
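As a rough sketch of the import step, the snippet below writes a tiny made-up CSV to a temporary file and reads it back with read.csv(). The header names and values are placeholders of my own, not the real file, and csv_path stands in for the dataset's download link, which read.csv() would accept in the same argument.

```r
# Made-up stand-in for the downloaded file: placeholder header and values.
csv_text <- c(
  "AGE RANGE,Female Avg Hrly Rate,Male Avg Hrly Rate",
  "20-25,20.5,22.0",
  "26-30,27.0,28.0",
  "Grand Total,35.0,39.0"
)
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_text, csv_path)

# read.csv() takes a local path or a URL string in the same argument.
# With the real data, csv_path would be replaced by the download link.
wages_data <- read.csv(csv_path, stringsAsFactors = FALSE)

# The default check.names = TRUE converts a header such as "AGE RANGE"
# into the syntactic column name AGE.RANGE.
print(names(wages_data))
```

If the real header contains spaces, this default renaming would explain column names such as AGE.RANGE, though that is an assumption about the file.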
After importing the data, we take a look at it. By default, the head() function in R shows the first 6 rows (observations) of a data frame along with all of its columns (variables). The tail() function works the same way but shows the last 6 rows.
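To illustrate the defaults, here is a small sketch on a toy 13-row data frame. The data frame toy and its columns are made up for this example and are not the wage data.

```r
# Toy data frame with 13 rows (placeholder values, not the Seattle data).
toy <- data.frame(row_id = 1:13, value = seq(10, 130, by = 10))

first_rows  <- head(toy)         # first 6 rows by default
last_rows   <- tail(toy)         # last 6 rows by default
first_three <- head(toy, n = 3)  # the n argument changes the row count

print(first_rows)
print(last_rows)
```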
In the Grand Total row, we have 3600 employed females and 6285 employed males for a combined 9885 employed people in this study. In practical terms a sample size of 9885 is large, as it is not easy to survey that many people. From a mathematical and statistical perspective, however, 9885 is not that large, as the population of Seattle is much larger. Remember that statistics is based on partial information.
The str() command in R gives details about the variables in the columns. We find out that there are 8 variables and 13 rows (observations).
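A minimal sketch of this kind of structure check, again on a made-up data frame (the rate column and its values are placeholders):

```r
# Small placeholder data frame; only AGE.RANGE is a name taken from the post.
toy <- data.frame(
  AGE.RANGE = c("20-25", "26-30", "Grand Total"),
  rate = c(20.5, 27.0, 35.0),
  stringsAsFactors = FALSE
)

str(toy)          # prints each column's name, type and first few values
dims <- dim(toy)  # c(number of rows, number of columns)
print(dims)
```

On the real data, the same calls would report the 13 observations and 8 variables mentioned above.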
The variables we have include the age ranges, the average hourly rates for males and females, the counts of females and males employed per age range, the overall average hourly rate per age range (regardless of gender), the total count per age range and the female to male rate.
In this dataset, I wanted to look at the average female hourly wage rate versus the average male hourly wage rate to see if there are pay gaps. If there are pay gaps between genders, how much of a gap is there?
Plotting The Data
Before plotting the data, the dataset can be trimmed a little. The Grand Total row and the under 20 row in the AGE.RANGE column can be removed. (After playing around with the data and plotting, I found the under 20 age range to be a bit of a nuisance in the graphs, as it did not appear before 20-25 and 26-30 on the axis. Also, the counts associated with the under 20 age range are not very high.)
In R, we omit the last two rows which are the Grand Total row and the under 20s row. One can also view this as selecting the first 11 rows out of 13.
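Both views of this step can be sketched as follows. The data frame here is a placeholder with 13 rows whose last two rows are labelled like the ones being dropped. The names trimmed and trimmed_alt, the rate column and the range_1 to range_11 labels are mine.

```r
# Placeholder 13-row data frame; the last two rows mimic the ones to drop.
wages_data <- data.frame(
  AGE.RANGE = c(paste0("range_", 1:11), "under 20", "Grand Total"),
  rate = 1:13,
  stringsAsFactors = FALSE
)

# View 1: select the first 11 rows out of 13.
trimmed <- wages_data[1:11, ]

# View 2: omit the last two rows; a negative n drops rows from the end.
trimmed_alt <- head(wages_data, -2)

print(tail(trimmed, 3))
```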
We take a look at the data again using head(), tail() and str().
With the help of the ggplot2 library in R, a plot is created displaying age ranges and their corresponding average hourly pay by gender. The blue dots represent the males and the red dots represent the females.
In R, the ggplot() function is used along with geom_smooth() and geom_point(), plus ggtitle() for the title and xlab() and ylab() to label the axes.
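A plot along these lines could look like the sketch below. It is not the original plotting code: the values are placeholders, the column names other than AGE.RANGE are hypothetical, the title and axis labels are my own, and geom_smooth() is left out to keep the example short.

```r
library(ggplot2)

# Placeholder values, not the Seattle figures.
# Column names other than AGE.RANGE are hypothetical.
wages_data <- data.frame(
  AGE.RANGE = c("20-25", "26-30", "61-65"),
  Female.Avg.Hrly.Rate = c(20.5, 27.0, 31.0),
  Male.Avg.Hrly.Rate = c(22.0, 28.0, 34.0)
)

# One geom_point() layer per gender: blue for males, red for females.
wage_plot <- ggplot(wages_data, aes(x = AGE.RANGE)) +
  geom_point(aes(y = Male.Avg.Hrly.Rate), colour = "blue", size = 3) +
  geom_point(aes(y = Female.Avg.Hrly.Rate), colour = "red", size = 3) +
  ggtitle("Average Hourly Wage by Age Range and Gender") +
  xlab("Age Range") +
  ylab("Average Hourly Wage")

print(wage_plot)
```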
For this Seattle data, it appears that for all age ranges listed here the average hourly wage for males is higher than the average hourly wage for females. We may want to investigate why this is the case (given our data). It is dangerous to make snap judgments and conclude that men make more money than women, as we do not have full information and this data is for Seattle only.
Factors that this dataset does not tell us about include:
- The type of jobs
- The job sectors
- Hours worked
- Whether bonuses are involved
- Type of pay (Commission vs Contract vs Salaried)
- Time frame of the study
- Locations of Jobs
Do keep in mind that the number of people surveyed for each age range by gender is not the same (e.g. 26 females for age range 20-25 versus a count of 576 for age range 46-50).
Plotting The Wage Gaps
In the graph above, we noticed that there is a wage gap across age ranges. We can plot these gaps by taking the (absolute) differences and then plotting them using ggplot.
We create a new dataset by combining the columns of wages_data with the new wage_gap column.
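The gap calculation and the column binding could be sketched like this. The values are placeholders, and the rate column names and the name wages_with_gap are hypothetical. Only wages_data, wage_gap and AGE.RANGE come from the post.

```r
# Placeholder values, not the Seattle figures.
wages_data <- data.frame(
  AGE.RANGE = c("20-25", "26-30", "61-65"),
  Female.Avg.Hrly.Rate = c(20.5, 27.0, 31.0),
  Male.Avg.Hrly.Rate = c(22.0, 28.0, 34.0)
)

# Absolute difference between the male and female average hourly rates.
wage_gap <- abs(wages_data$Male.Avg.Hrly.Rate - wages_data$Female.Avg.Hrly.Rate)

# cbind() appends wage_gap as a new column next to the existing ones.
wages_with_gap <- cbind(wages_data, wage_gap)

print(wages_with_gap)
```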
Taking a look at the updated data.
We now plot the wage gaps by age range and gender using ggplot. Here is the code and the corresponding output.
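As a hedged sketch of such a plot, the snippet below draws one bar per age range with geom_col(). The choice of a bar chart, the placeholder values, the labels and the name gap_plot are my assumptions, not the original code or output.

```r
library(ggplot2)

# Placeholder gaps, not the Seattle figures.
gap_data <- data.frame(
  AGE.RANGE = c("20-25", "26-30", "61-65"),
  wage_gap = c(1.5, 1.0, 3.0)
)

# One bar per age range; bar height is the male minus female wage gap.
gap_plot <- ggplot(gap_data, aes(x = AGE.RANGE, y = wage_gap)) +
  geom_col(fill = "steelblue") +
  ggtitle("Wage Gap by Age Range") +
  xlab("Age Range") +
  ylab("Wage Gap (Hourly)")

print(gap_plot)
```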
This plot gives us a clearer picture of the wage gap of males versus females by age range (given our data). The gap is smallest at the age range of 26-30. Why is this so? One would need to look at factors such as job types, job sectors, locations, skills, education and so on. The largest gaps are in the age ranges of 61-65 and 66-70. These large gaps could be a result of older male employees holding senior positions such as CEO or board member. Once again, this would need further investigation.
This data does suggest that there is a wage gap where males make more money per hour on average than females across all age ranges. However, our data only represents a subset of Seattle’s population. We do not know whether the wage gap exists in all of Seattle. Assuming that there are available funds and time, it is recommended to investigate factors such as job types, job sectors, skill sets, hours worked, etc., which would impact average hourly pay.