Decision Trees In R

Hi there. In this post, I share some experimental work that I have done with decision trees in R.

Decision Trees Overview

Decision trees are a tool for making predictions on categorical variables. They also show the available decision options and the criteria behind each split, laid out in a flow-chart style that is easy for the viewer to follow.

An example of a decision tree is shown below.

 

Source: https://databricks.com/wp-content/uploads/2014/09/decision-tree-example.png

 


Example One – A Small Grades Dataset

In this example, I work with a small dataset from R’s faraway library. This dataset is called spector, and it looks at whether students’ grades in an economics class improved under a new teaching method. Here is an image of the documentation.

 

In R, I load in the faraway package and take a look at the data with the head(), tail() and str() functions.
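The original code chunk was an image that did not survive, so here is a minimal sketch of what this loading and inspection step might look like (it assumes the faraway package is installed):

```r
# Load the faraway package, which contains the spector dataset
library(faraway)

data(spector)

head(spector)  # first six rows
tail(spector)  # last six rows
str(spector)   # dimensions and variable types
```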

I first try building a decision tree with the ctree() function.

This output does not look great, nor is it very informative. I would not recommend the ctree() function here.
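For reference, the ctree() attempt probably looked something like the following sketch. The formula is my assumption, since the original code is not shown; ctree() comes from the party package:

```r
library(party)    # provides ctree()
library(faraway)  # provides the spector data

data(spector)
spector$grade <- factor(spector$grade)  # a classification tree needs a factor response

# Assumed formula: predict grade improvement from the other variables
grade_ctree <- ctree(grade ~ psi + tuce + gpa, data = spector)
plot(grade_ctree)
```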

I then try making a decision tree with the rpart package in R. Here is what I got.
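A sketch of the rpart version — again, the exact formula, control settings, and plotting calls are assumptions on my part; plot() and text() are rpart's built-in base-graphics plotting tools:

```r
library(rpart)
library(faraway)

data(spector)
spector$grade <- factor(spector$grade)

# The dataset is small, so relax rpart's default stopping rules
grade_rpart <- rpart(grade ~ psi + tuce + gpa, data = spector,
                     method = "class",
                     control = rpart.control(minsplit = 5, cp = 0.001))

plot(grade_rpart, uniform = TRUE, margin = 0.1)  # draw the branches
text(grade_rpart, use.n = TRUE)                  # add split labels and counts
```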

For some reason, the output got cut off in my RStudio window. A different tool is needed.

What worked for me was using the rpart and rattle packages together to create a good decision tree with colour. The fancyRpartPlot() function from the rattle package does the trick.
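Something along these lines — a sketch, with the formula assumed as before; fancyRpartPlot() draws an rpart fit with coloured nodes and itself relies on the rpart.plot and RColorBrewer packages:

```r
library(rpart)
library(rattle)   # provides fancyRpartPlot()
library(faraway)

data(spector)
spector$grade <- factor(spector$grade)

fit <- rpart(grade ~ psi + tuce + gpa, data = spector, method = "class",
             control = rpart.control(minsplit = 5, cp = 0.001))

fancyRpartPlot(fit)  # coloured, flow-chart style decision tree
```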


Example Two – Wine Quality Data

This second example deals with wine quality data. I load the data directly from a URL.

Previewing the data with the head() and tail() functions gives:

You can examine the data a bit further with the summary() and str() functions in R.
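The loading and preview code might look like the following. The URL is an assumption: I am guessing the UCI Machine Learning Repository's white-wine file, since the sample size of 4898 mentioned below matches that dataset.

```r
# White wine quality data (assumed source: UCI Machine Learning Repository)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
wine <- read.csv(url, sep = ";")  # the UCI file is semicolon-delimited

head(wine)     # first six rows
tail(wine)     # last six rows
summary(wine)  # per-column numeric summaries
str(wine)      # observations, variables, and their types
```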

One decision tree I want to try out predicts quality based on the wine’s pH, alcohol content, sulphates and chlorides.
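A sketch of that tree using rpart and rpart.plot (the data-loading line repeats the assumed UCI URL; the column names follow read.csv's dot-separated conventions):

```r
library(rpart)
library(rpart.plot)

wine <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
                 sep = ";")
wine$quality <- factor(wine$quality)  # treat quality scores as classes

wine_tree <- rpart(quality ~ pH + alcohol + sulphates + chlorides,
                   data = wine, method = "class")

# rpart.plot's default labels for a class tree show the predicted class,
# the class probabilities, and the percentage of observations in each node
rpart.plot(wine_tree)
```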

 

This decision tree comes out nicely, and you can read it easily from top to bottom. Note that the sample size is 4898. The number at the top of each node is the predicted wine quality score; quality scores run from 3 to 9. The percentage at the bottom of each node is the share of the sample that falls into that node. As an example, the 63% in the left node of the second row is 63% of 4898. In that node, a quality score of 6 is predicted with probability .44 (or 44%). The predicted score in each node is simply the class with the highest proportion of observations in that node.

 

An alternative version of the tree includes counts instead of percentages for each wine quality class.
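With rpart.plot, switching from percentages to counts is just a change to the extra argument. A sketch, reusing the same assumed data source and formula:

```r
library(rpart)
library(rpart.plot)

wine <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
                 sep = ";")
wine$quality <- factor(wine$quality)

wine_tree <- rpart(quality ~ pH + alcohol + sulphates + chlorides,
                   data = wine, method = "class")

rpart.plot(wine_tree, extra = 1)  # extra = 1: per-class observation counts in each node
```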

 

 

Using the rattle and RColorBrewer packages in R, there is yet another alternative decision tree you can create. This tree has percentages rather than counts in the nodes.
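A sketch of this coloured version (same assumed data source and formula as before; fancyRpartPlot() comes from rattle and uses RColorBrewer palettes):

```r
library(rpart)
library(rattle)        # fancyRpartPlot()
library(RColorBrewer)  # colour palettes used by fancyRpartPlot()

wine <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
                 sep = ";")
wine$quality <- factor(wine$quality)

wine_tree <- rpart(quality ~ pH + alcohol + sulphates + chlorides,
                   data = wine, method = "class")

fancyRpartPlot(wine_tree)  # coloured nodes with class, probabilities and percentages
```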

 

A Full Decision Tree

The decision trees above predicted wine quality from only some of the other variables. This tree predicts wine quality from all of the other variables, so it has more nodes and branches.
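A sketch of the full tree. The formula quality ~ . uses every other column as a candidate predictor; lowering the complexity parameter cp is my assumption for how the tree was allowed to grow more splits than the defaults would give:

```r
library(rpart)
library(rattle)

wine <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
                 sep = ";")
wine$quality <- factor(wine$quality)

# quality ~ . means "quality against all remaining columns";
# cp = 0.005 (an assumed value) permits a deeper tree than the default cp = 0.01
full_tree <- rpart(quality ~ ., data = wine, method = "class",
                   control = rpart.control(cp = 0.005))

fancyRpartPlot(full_tree)
```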

 

This fuller tree adds more detail by bringing volatile.acidity and free.sulfur.dioxide into the decisions. I am not sure whether these additional factors actually affect the quality and taste of the wine; it would be best to consult wine experts to see whether these results are valid. (Wine quality seems like a subjective measure as well.)

 


