# Plotting Kaplan-Meier Survival Curves In R With ggplot2

This page shows how to plot Kaplan-Meier survival curves in R with the ggplot2 data visualization package. Comparing survival times between two groups falls under the statistical field of survival analysis, which deals with time-to-event data. Events can include a patient becoming ill, a bankruptcy, an employee leaving a company, a person exiting a clinical trial and more.

Sections

References

The addicts Dataset

Importing The Data

Taking A Look At The Data

Plotting Survival Curves Using Base R Graphics

Plotting Survival Curves Using ggplot2 and ggfortify

References

R Graphics Cookbook by Winston Chang (2012)

The link http://rpubs.com/sinhrks/plot_surv is useful for understanding ggfortify.

The dataset is from http://web1.sph.emory.edu/dkleinb/surv3.htm

The book that I use for understanding survival analysis is Survival Analysis – A Self Learning Text (3rd Edition, 2012) by David G. Kleinbaum & Mitchel Klein. It teaches the subject in an applied manner and is suitable for non-statisticians who wish to study the subject. One slight drawback is that the R coding section in this book uses base R graphics and does not mention ggplot2.

The addicts Dataset

The addicts dataset can be downloaded from http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/addicts.dta. This is a .dta (Stata) file, so the haven package in R is needed to read it.

The following description is from Survival Analysis – A Self Learning Text (3rd Edition, 2012). A 1991 Australian study by Caplehorn et al. compared two methadone clinics for heroin addicts. A patient's survival time (in days) is the amount of time the patient spent at the clinic before dropping out.

In the addicts dataset, the variables are defined as:

ID – Patient ID

SURVT – The time in days until the patient dropped out of the clinic or was censored (the patient had not dropped out when observation ended, so the actual dropout time is unknown).

STATUS – 1 if the patient dropped out of the clinic; 0 if the observation was censored

CLINIC – Methadone Treatment Clinic Number 1 or 2

PRISON – An indicator whether the patient had a prison record. 1 for yes, 0 for no

DOSE – Patient’s maximum methadone dose (mg/day, continuous variable)

Importing The Data

In the book Survival Analysis – A Self Learning Text (3rd Edition), the addicts dataset is loaded from the C:\ drive of your computer. Instead, the dataset can be loaded directly from the web; it is linked from http://web1.sph.emory.edu/dkleinb/surv3.htm. The only slight issue is that the file is a .dta file (for Stata users), which base R does not read directly. The haven package in R handles .dta files.

If the haven package is not installed into R, you can install haven by typing in:
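A minimal sketch of the install step. The `repos` URL is an assumption (any CRAN mirror works), and the `requireNamespace()` guard simply skips the install when haven is already present.

```r
# Install haven from CRAN only if it is not already available.
# The mirror URL below is an assumption; substitute your preferred mirror.
if (!requireNamespace("haven", quietly = TRUE)) {
  install.packages("haven", repos = "https://cloud.r-project.org")
}
```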

Here is the code for importing the data.

The read_dta() function from haven is needed to read the .dta file. I then convert the result into a data.frame and save it to the variable addicts.
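The import described above might look like the sketch below. It assumes the file URL quoted earlier on this page is still reachable; the variable name `addicts_url` is only illustrative.

```r
library(haven)

# URL quoted in the text above; it may have moved since this page was written
addicts_url <- "http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/addicts.dta"

# read_dta() parses the Stata file and returns a tibble;
# as.data.frame() converts that into a plain data.frame
addicts <- as.data.frame(read_dta(addicts_url))
```

If the URL is no longer available, the same call works on a local copy: pass the file path to `read_dta()` instead.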

Taking A Look At The Data

It is usually a good idea to preview the data to have an idea of what the data looks like and the type of information you are dealing with. The head() and tail() functions are used here to preview the data.

It may seem that the id column is redundant at first but if you look at the output from tail(addicts) you see that a few id numbers were skipped. We have 238 rows but the last id number is 266. Keep the id column and work with what we have.

For more information on the variables, the summary() and str() functions can be used.
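The preview calls can be sketched as follows. So that the snippet runs on its own, it uses a few hypothetical stand-in rows with the same column names (not the real study data); with the imported data the same four calls apply unchanged, and only then will the skipped id numbers mentioned above be visible.

```r
# Hypothetical stand-in rows (NOT the real study data), using the same
# column names, so that this snippet runs on its own.
addicts <- data.frame(
  id     = 1:8,
  clinic = c(1, 1, 1, 1, 2, 2, 2, 2),
  status = c(1, 1, 1, 0, 1, 0, 1, 0),   # 1 = dropped out, 0 = censored
  survt  = c(50, 120, 200, 310, 90, 400, 520, 700),
  prison = c(0, 1, 0, 1, 0, 0, 1, 0),
  dose   = c(50, 55, 60, 40, 70, 80, 65, 60)
)

head(addicts)     # first six rows
tail(addicts)     # last six rows
summary(addicts)  # per-column minimum, quartiles, mean and maximum
str(addicts)      # column types and the first few values of each column
```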

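In the str() output, every variable is reported as a generic atomic vector (carrying attributes from the Stata file) rather than as a factor or a plain numeric column. The variable clinic should be a factor and the rest of the variables should be plain numeric vectors.

The variable types can be verified by running str() again.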
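One way to do the conversion and the check is sketched below. The stand-in rows are hypothetical (they are already numeric, so only the change of clinic to a factor is visible here), and the helper name `num_cols` is an illustrative choice.

```r
# Hypothetical stand-in rows (NOT the real study data), using the same
# column names, so that this snippet runs on its own.
addicts <- data.frame(
  id     = 1:8,
  clinic = c(1, 1, 1, 1, 2, 2, 2, 2),
  status = c(1, 1, 1, 0, 1, 0, 1, 0),   # 1 = dropped out, 0 = censored
  survt  = c(50, 120, 200, 310, 90, 400, 520, 700),
  prison = c(0, 1, 0, 1, 0, 0, 1, 0),
  dose   = c(50, 55, 60, 40, 70, 80, 65, 60)
)

# clinic identifies a group, so store it as a factor
addicts$clinic <- as.factor(addicts$clinic)

# every other column becomes a plain numeric vector;
# as.numeric() also drops attributes carried over from the Stata file
num_cols <- setdiff(names(addicts), "clinic")
addicts[num_cols] <- lapply(addicts[num_cols], as.numeric)

# check the result
str(addicts)
```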

The Surv() function from the survival package pairs each time (in days) with its status, and printing the result lists the times until each patient dropped out of the methadone clinic. Times marked with a plus sign are censored observations rather than dropout events.

An optional step is to look at summary statistics of this Surv object by using summary().
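A short sketch of both steps, again on hypothetical stand-in rows rather than the real data (so the printed times will not match the figures quoted on this page). The name `surv_obj` is illustrative.

```r
library(survival)

# Hypothetical stand-in rows (NOT the real study data), using the same
# column names, so that this snippet runs on its own.
addicts <- data.frame(
  id     = 1:8,
  clinic = c(1, 1, 1, 1, 2, 2, 2, 2),
  status = c(1, 1, 1, 0, 1, 0, 1, 0),   # 1 = dropped out, 0 = censored
  survt  = c(50, 120, 200, 310, 90, 400, 520, 700),
  prison = c(0, 1, 0, 1, 0, 0, 1, 0),
  dose   = c(50, 55, 60, 40, 70, 80, 65, 60)
)

# Pair each time with its event indicator
surv_obj <- Surv(time = addicts$survt, event = addicts$status)

print(surv_obj)    # censored times are printed with a trailing "+"
summary(surv_obj)  # optional summary of the survival object
```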

The shortest stay at a methadone clinic is 2 days and the longest is 1076 days.

Plotting Survival Curves Using Base R Graphics

To start, a variable Y is created as the survival object in R. This Surv object is the response for survfit(), which will be used later. (It is built with the same Surv() call as in the previous section.)

The survfit() function produces Kaplan-Meier survival estimates. It takes the Surv object Y as its response. We stratify by clinic as we are comparing the two methadone clinics.

Calling summary() on kmfit gives, for each clinic, a table of times (in days), the number of patients still at risk, the number of patients who dropped out at each time point, the survival estimates, the associated standard errors, and the lower and upper limits of the 95% confidence intervals for the survival estimates.
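The three steps above can be sketched as follows. The stand-in rows are hypothetical, so the table printed here is not the one from the real data; the names Y and kmfit follow the text.

```r
library(survival)

# Hypothetical stand-in rows (NOT the real study data), using the same
# column names, so that this snippet runs on its own.
addicts <- data.frame(
  id     = 1:8,
  clinic = c(1, 1, 1, 1, 2, 2, 2, 2),
  status = c(1, 1, 1, 0, 1, 0, 1, 0),   # 1 = dropped out, 0 = censored
  survt  = c(50, 120, 200, 310, 90, 400, 520, 700),
  prison = c(0, 1, 0, 1, 0, 0, 1, 0),
  dose   = c(50, 55, 60, 40, 70, 80, 65, 60)
)

# Survival object: time in days, event = patient dropped out
Y <- Surv(addicts$survt, addicts$status == 1)

# Kaplan-Meier estimates, one curve per clinic
kmfit <- survfit(Y ~ clinic, data = addicts)

# Table of times, numbers at risk, events, survival estimates,
# standard errors and 95% confidence limits for each clinic
summary(kmfit)
```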

Here is the code and output for the Kaplan-Meier curves in base R graphics.
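A minimal base R sketch, assuming hypothetical stand-in rows in place of the real data. The line types, colours, axis labels and legend text are illustrative choices, not necessarily those of the original plot.

```r
library(survival)

# Hypothetical stand-in rows (NOT the real study data), using the same
# column names, so that this snippet runs on its own.
addicts <- data.frame(
  id     = 1:8,
  clinic = c(1, 1, 1, 1, 2, 2, 2, 2),
  status = c(1, 1, 1, 0, 1, 0, 1, 0),   # 1 = dropped out, 0 = censored
  survt  = c(50, 120, 200, 310, 90, 400, 520, 700),
  prison = c(0, 1, 0, 1, 0, 0, 1, 0),
  dose   = c(50, 55, 60, 40, 70, 80, 65, 60)
)

Y <- Surv(addicts$survt, addicts$status == 1)
kmfit <- survfit(Y ~ clinic, data = addicts)

# One step curve per clinic, told apart by line type and colour
plot(kmfit,
     lty  = c("solid", "dashed"),
     col  = c("black", "grey"),
     xlab = "Survival time in days",
     ylab = "Survival probability")

legend("topright",
       legend = c("Clinic 1", "Clinic 2"),
       lty    = c("solid", "dashed"),
       col    = c("black", "grey"))
```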

Plotting Survival Curves Using ggplot2 and ggfortify

The base R graphics version of the Kaplan-Meier survival curves is not visually appealing. With the help of the ggplot2 and ggfortify packages, nicer plots can be produced.

Here is the code and output for the Kaplan-Meier curves with ggplot2 and ggfortify.
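A sketch of the ggplot2 version, relying on the autoplot() method that ggfortify provides for survfit objects. The stand-in rows are hypothetical, and the title, axis labels and legend title are illustrative choices.

```r
library(survival)
library(ggplot2)
library(ggfortify)

# Hypothetical stand-in rows (NOT the real study data), using the same
# column names, so that this snippet runs on its own.
addicts <- data.frame(
  id     = 1:8,
  clinic = c(1, 1, 1, 1, 2, 2, 2, 2),
  status = c(1, 1, 1, 0, 1, 0, 1, 0),   # 1 = dropped out, 0 = censored
  survt  = c(50, 120, 200, 310, 90, 400, 520, 700),
  prison = c(0, 1, 0, 1, 0, 0, 1, 0),
  dose   = c(50, 55, 60, 40, 70, 80, 65, 60)
)

kmfit <- survfit(Surv(survt, status) ~ clinic, data = addicts)

# autoplot() turns the survfit object into a ggplot:
# one coloured curve per clinic, shaded confidence bands, censoring marks
p <- autoplot(kmfit, conf.int = TRUE, censor = TRUE) +
  labs(x      = "Survival time in days",
       y      = "Survival probability",
       title  = "Kaplan-Meier curves by methadone clinic",
       colour = "Clinic",
       fill   = "Clinic")

print(p)
```

Because the result is an ordinary ggplot object, further layers, themes or scales can be added with `+` in the usual way.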

In this plot, the colours help the reader identify which curve goes with which clinic. The shaded bands represent the confidence intervals at each time point. The plus signs represent the censored cases at a given time point.

Patients in clinic 2 tend to stay longer than patients in clinic 1, since the survival curve for clinic 2 lies above the curve for clinic 1. An investigation into why so many of the patients in clinic 1 leave is recommended. It could be the clinic, it could be the selection of patients or something else not explained by the data.