Fitting A Linear Model In R Using ggplot2

Hi. In this post, I will discuss fitting a linear model in the statistical program R and plotting it with the ggplot2 data visualization package.


Table Of Contents

  1. The Dataset on Teen Gambling
  2. Data Preparation
  3. Producing the Linear Model
  4. Regression Through The Origin
  5. Notes & Thoughts
  6. Full Code
  7. References


The Dataset on Teen Gambling

The dataset chosen here is teengamb from the R package faraway. It comes from a survey conducted to study teenage gambling in Britain. Documentation for the dataset is in the faraway package manual listed in the References.

In the teengamb dataset, the sex variable is 0 for males and 1 for females, status is a socioeconomic status score based on the parents’ occupation, income is in pounds per week, verbal is a verbal score (the number of words out of 12 correctly defined), and gamble is the annual gambling expenditure in pounds.

The variables of interest are income and gamble. We want to see the relationship between annual income and annual gambling expenditure.

The first step is to load the libraries and take a peek at the data.

The gambling data set has 47 observations (rows) and 5 variables (columns). Getting the summary of the dataset will show minimums, maximums, mean, median and more for each variable.
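A minimal sketch of that first look, assuming the faraway and ggplot2 packages are already installed:

```r
# Load the packages used throughout the post.
library(faraway)   # provides the teengamb dataset
library(ggplot2)   # used later for plotting

# Peek at the first few rows.
head(teengamb)

# Number of rows (observations) and columns (variables).
dim(teengamb)

# Minimum, quartiles, median, mean and maximum for each variable.
summary(teengamb)
```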


Data Preparation

The teengamb dataset could use a few tune-ups. The income is in pounds per week and the gambling spending is in pounds per year. Both variables should be on the same time scale, so weekly income is converted to annual income. Also, we would like to convert the 0s and 1s in the sex variable to Male and Female respectively.

The code is as follows:
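One way these two steps might look is sketched here. The copy name gamb is hypothetical, and the factor of 52 weeks per year is an assumption about how the weekly income was annualized.

```r
library(faraway)

# Work on a copy so the original teengamb data stays untouched.
# 'gamb' is a hypothetical name chosen for this sketch.
gamb <- teengamb

# Assumption: 52 weeks per year, turning pounds per week into pounds per year.
gamb$income <- gamb$income * 52

# Recode sex from 0/1 to labelled factor levels: 0 -> "Male", 1 -> "Female".
gamb$sex <- factor(gamb$sex, levels = c(0, 1), labels = c("Male", "Female"))

# Check the result.
str(gamb)
```

Working on a copy that is rebuilt from teengamb each time means that re-running the snippet does not scale the income twice.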


Producing the Linear Model

After fixing up the data, we can produce linear models.

We can fit one linear model that combines the males and females, and we can also fit two separate linear models, one for males and the other for females.

Before the linear models, we start with a scatterplot of our data:
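A scatterplot along these lines could be produced as sketched here. The data preparation is repeated so the snippet runs on its own; the name gamb, the factor of 52, and the axis labels and title are assumptions, while the blue and red colors follow the male and female colors used in the post.

```r
library(faraway)
library(ggplot2)

# Prepared data (hypothetical copy name; income annualized with an assumed 52 weeks).
gamb <- teengamb
gamb$income <- gamb$income * 52
gamb$sex <- factor(gamb$sex, levels = c(0, 1), labels = c("Male", "Female"))

# Scatterplot of annual gambling expenditure against annual income, colored by sex.
p_scatter <- ggplot(gamb, aes(x = income, y = gamble, colour = sex)) +
  geom_point() +
  scale_colour_manual(values = c(Male = "blue", Female = "red")) +
  labs(
    title  = "Teen Gambling: Expenditure vs. Income",
    x      = "Annual income (pounds)",
    y      = "Annual gambling expenditure (pounds)",
    colour = "Sex"
  )

print(p_scatter)
```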

Based on our sample and this plot, we can see that females overall have lower annual gambling expenditures than males. Many of the extreme values or outliers are from males. It is likely that a linear model fitted to this data would have a positive slope. This would mean that as annual income increases, the annual gambling expenditure tends to increase as well.

We now build the linear models and extract model coefficients such as the slope and intercept and use them for plotting in ggplot2.

The lm(dep_var ~ indep_var) function is used to fit a linear model, while the coef() function extracts the intercept and slope of the fitted model.
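A sketch of this step, under the same assumptions as before (hypothetical copy name gamb, income annualized with 52 weeks). The object names fit_all, fit_male, fit_female and lines_df are made up for illustration, and drawing the lines with geom_abline() is one option among several.

```r
library(faraway)
library(ggplot2)

# Prepared data (hypothetical copy name; assumed 52 weeks per year).
gamb <- teengamb
gamb$income <- gamb$income * 52
gamb$sex <- factor(gamb$sex, levels = c(0, 1), labels = c("Male", "Female"))

# One overall model, plus one model per sex.
fit_all    <- lm(gamble ~ income, data = gamb)
fit_male   <- lm(gamble ~ income, data = subset(gamb, sex == "Male"))
fit_female <- lm(gamble ~ income, data = subset(gamb, sex == "Female"))

# coef() returns a named vector with "(Intercept)" and "income" (the slope).
coef(fit_all)

# Collect the per-sex intercepts and slopes, one row per fitted line.
lines_df <- data.frame(
  sex       = c("Male", "Female"),
  intercept = c(coef(fit_male)[["(Intercept)"]], coef(fit_female)[["(Intercept)"]]),
  slope     = c(coef(fit_male)[["income"]], coef(fit_female)[["income"]])
)

# Points colored by sex, with the matching fitted line for each group.
p_fit <- ggplot(gamb, aes(x = income, y = gamble, colour = sex)) +
  geom_point() +
  geom_abline(data = lines_df,
              aes(intercept = intercept, slope = slope, colour = sex)) +
  scale_colour_manual(values = c(Male = "blue", Female = "red"))

print(p_fit)
```

The intercepts do not depend on how income is scaled, but the slopes do, so the slope values depend on the annualization factor that is assumed.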

 

We can easily see from the plot with blue and red dots that gambling expenditure rises faster with annual income for males than for females.

The plot with just black dots shows the overall model, where males and females are combined. Again, as annual income increases, the gambling spending increases.

Formally, we can express the model in a more mathematical way.

For the male case in blue the fitted linear model can be expressed as:

Annual Gambling Exp. = -2.6546 + 0.0024 * MaleIncome

For the female case in red, the fitted linear model can be expressed as:

Annual Gambling Exp. = 3.1400 + 0.0001 * FemaleIncome

(6.468844e-05 is rounded to 0.0001 at 4 decimal places.)

For the overall linear model, the fitted linear model is:

Annual Gambling Exp. = -6.3246 + 0.0020 * AnnualIncome 


Regression Through The Origin

Given the context of our data, income cannot be negative, and 0 income should correspond to 0 gambling spending. (One could argue that someone with 0 income can still gamble with other people’s money, but that is not assumed here.) We should therefore fit a linear model through the origin point (0, 0).

To run a regression through the origin, we add 0 + in front of the independent variable in the formula, as in lm(dep_var ~ 0 + indep_var).

The code is similar to before, but the output differs.
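A sketch of the through-origin fits, again assuming the hypothetical copy gamb with income annualized by 52 weeks. The object names are illustrative; the black dots and green line follow the description of the overall plot.

```r
library(faraway)
library(ggplot2)

# Prepared data (hypothetical copy name; assumed 52 weeks per year).
gamb <- teengamb
gamb$income <- gamb$income * 52
gamb$sex <- factor(gamb$sex, levels = c(0, 1), labels = c("Male", "Female"))

# "0 +" in the formula removes the intercept, forcing the line through (0, 0).
fit0_all    <- lm(gamble ~ 0 + income, data = gamb)
fit0_male   <- lm(gamble ~ 0 + income, data = subset(gamb, sex == "Male"))
fit0_female <- lm(gamble ~ 0 + income, data = subset(gamb, sex == "Female"))

# Each fit now has a single coefficient: the slope on income.
coef(fit0_all)
coef(fit0_male)
coef(fit0_female)

# Overall plot: black dots with a green fitted line through the origin.
p_origin <- ggplot(gamb, aes(x = income, y = gamble)) +
  geom_point(colour = "black") +
  geom_abline(intercept = 0,
              slope = coef(fit0_all)[["income"]],
              colour = "green")

print(p_origin)
```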

For the overall model (black dots and green line), the linear model is:

Annual Gambling Exp. = 0 + 0.0895 * AnnualIncome 

The fitted linear model for the male case is:

Annual Gambling Exp. = 0 + 0.1191 * MaleIncome 

For the female case, the linear model is:

Annual Gambling Exp. = 0 + 0.0140 * FemaleIncome 

(Numeric values are rounded to 4 decimal places.)

The linear models through the origin are not much different than the linear models before. It does make more sense to fit a model through the origin based on the context of the data.


Notes & Thoughts

Based on the sample of 47 males and females in the survey study, females have lower gambling expenses compared to males. Overall, the more income one has, the higher the gambling expenses tend to be (more so for males than for females).

Remember that statistics is based on partial information. It would be dangerous, and most likely untrue, to conclude that gambling spending increases with income for every British teenager. If one wants more information, then a larger sample would be needed, which comes at a cost.

More variables could have been used for the linear models such as the socioeconomic status variable. That would be a multiple linear regression case.
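As a rough sketch of what that might look like, status is simply added here as a second predictor. This model is not fitted in the post, and the names gamb and fit_multi are hypothetical.

```r
library(faraway)

# Prepared data (hypothetical copy name; assumed 52 weeks per year).
gamb <- teengamb
gamb$income <- gamb$income * 52

# Multiple linear regression: gambling expenditure on income and status.
fit_multi <- lm(gamble ~ income + status, data = gamb)

# Three coefficients: the intercept, plus one slope per predictor.
coef(fit_multi)
```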

The preferred graph would be the one with the male and female cases. It provides a good visual between males and females in terms of gambling spending.

Update on April 21, 2017: Yes, I know the titles are not centered. To center titles, use hjust = 0.5 in element_text for plot.title inside of theme().
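A small sketch of the centering fix on a throwaway scatterplot of the raw data; the title text is made up for illustration.

```r
library(faraway)
library(ggplot2)

# hjust = 0.5 centers the title; 0 is left-aligned and 1 is right-aligned.
p_centered <- ggplot(teengamb, aes(x = income, y = gamble)) +
  geom_point() +
  labs(title = "Gambling Expenditure vs. Income") +
  theme(plot.title = element_text(hjust = 0.5))

print(p_centered)
```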


Full Code

Here is the full code for reference.


References

  • https://cran.r-project.org/web/packages/faraway/faraway.pdf
  • Ide-Smith & Lea, 1988, Journal of Gambling Behavior, 4, 110-118
  • Datacamp courses were really helpful too.

 
