The k-Means Algorithm In R For Clustering Data

Hi. I have been doing some self-learning on R and this topic of the K-Means Algorithm. The contents below is based on me using the K-Means Algorithm on a small dataset as practice. Since I am a student of this topic, the information presented here may not be exact.

The K-Means Algorithm is an unsupervised machine learning method which takes a collection of data points and partitions them into clusters (or subsets of the original data). This clustering method is useful for analyzing patterns within data and for labeling certain groups based on the data.

 


Sections

References

The Dataset

Using The kmeans() Function In R

The K-means Component Outputs In R

Choosing The Optimal Number Of Clusters (k) & The Scree Plot

Alternate Plots For Choosing The Number Of Clusters

 


References

Here is some references that I have used here.

R Graphics Cookbook By Winston Chang

R Documentation (on kmeans).

https://www.r-bloggers.com/k-means-clustering-in-r/

http://stackoverflow.com/questions/14524818/results-of-k-means-used-in-r

Datacamp videos and notes have been useful. One video that I have used is this:

 


The Dataset

The dataset I am using is from the faraway library in R. This data is called star and I renamed this as star_data. The R documentation for this dataset is also included below the code.

 

The index column is not really needed and it can be removed.

I can further examine the data by using the functions summary() and str().

From the str() output, we see that the dataset has 47 observations (rows) and 2 variables (columns). It is not a large dataset but I am using this for illustration purposes.

Here is a scatterplot of the data (before clustering).

 

 


Using The kmeans() Function In R

The k-means algorithm takes a set of points (data) and splits/partitions them into k clusters. The user has the choice in determining how many clusters (k) he/she wants from the data.

As I am learning the k-means algorithm myself, my explanation on this may be off but I will give it a try.

Before the first iteration, the data is split into k clusters. The number of k clusters is user-defined. For each cluster a centroid/midpoint (average point) is associated with the cluster. There would be k centroids.

The iterative steps are as follows:

  1. Assign points to the nearest centroid.  Points to the closest centroid are in that centroid’s cluster.
  2. Relocate the centroid’s location to the cluster’s midpoint.
  3. Repeat steps 2 and 3 for the next iteration. Note that the centroid/midpoint of a cluster can change between iterations. The algorithm stops if there are no more interations or if there is convergence (the centroids and clusters don’t change).

For an alternate explanation, please refer to the Youtube video placed in the References section above.

The k-means algorithm is not difficult to use in R. In R, it can be achieved by using the kmeans() function. Type in ?kmeans to load up the R documentation for the kmeans() function in R.

I first place a random seed in R. This is for reproducibility purposes. If you use this same random seed, you would get the same output and plot as me. If this random seed is not there, then outputs will vary even though you may run the code repeatedly.

The above code is an examples of kmeans() in R. I use 2 clusters as indicated by centers = 2, and nstart = 25. The R documentation states the nstart refers to the number of random sets chosen. I don’t know too much about nstart so I would have to look more into it. By default, iter.max is 10 which is the amount of iterations.

The outputs also gives information on how many points are in each cluster, which points go with which cluster, and the available components from kmeans().

In the next lines of code, I convert the cluster component to factors in order to enable the colour fills in ggplot2. The ggplot2 code and plot can be found below.


The K-means Component Outputs In R

One natural question is what does the k-means algorithm output? What does it all mean? I try to figure it out using the R documentation. I have included a screenshot of the R documentation for the output of kmeans.

The cluster output component outputs a vector of integers from 1 to the number of clusters inputted into kmeans() which is k. This vector shows which points go with each cluster. In this case the first point goes to cluster 1, the second point goes to cluster two and so on until observation/point 47 in the star data.

Centroid positions from each cluster can be determined from the centers output component from kmeans(). The first row represents the centroid for the first cluster and the second row represents the centroid position for the second cluster.

Using the size component gives the number of points assigned to each cluster. Note that the number of points in each cluster do not have to be equal. The first cluster has 27 points while the second cluster has 20 points assigned to it from the 47 in the star data.

Here is the number of iterations. I am not sure what the documentation means by outer iteration. I would have to look at it in more detail.

There is an ifault component which I am not familiar with. The documentation says it’s for experts.

I left these remaining output components last as there are linked together and they are important. I think these measures use Euclidean distance by default.

The withinss output component creates a vector of within-cluster sum of squares. The within-cluster sum of squares takes the sum of squares within each cluster. That is, we take the sum of the squares of each point minus In math notation, it would be something like:

    \[\sum_{i = 1}^{p} (\mathbf{x_i} - \mathbf{\bar{x_{k}}})^2\]

where there are p points in the k-th cluster and  \mathbf{\bar{x_{k}}} is the centroid of the k-th cluster. Note the boldface notation as each \mathbf{x_i} point has a x-coordinate and a y-coordinate in two dimensions. In three dimensions, points are in (x, y, z) format and so on.

The tot.withinss output component gives the total within cluster sum of squares. This is the sum of the above components of the vector.

Between clusters sum of squares are given by betweenss. I am not exactly sure if it measures from centroid to centroid between clusters or from the nearest points between clusters. That would require some further investigation.

The total sum of squares is the sum of the between clusters sum of squares and the total within cluster sum of squares. (i.e. TSS = BSS + TWSS) This formula is used in the later section where there are alternate plots for choosing k clusters.

 


Choosing The Optimal Number Of Clusters (k) & The Scree Plot

When it comes this k-means algorithm, one may want to use some sort of metric or aid to help in choosing the number of clusters. The scree plot with the elbow method is one way to help in determining the optimal number of clusters.

Note that this approach is from a statistics and optimization perspective. The chosen number of clusters could vary depending on the context of the data.

In the previous section, I did the k-means algorithm for 2 clusters. I want to compare the algorithm from 1 cluster to 8 clusters. (You could use more than 8 clusters if you want to be more thorough.) A for loop is used to help build the scree plot.

From the above scree plot, the “elbow” of the plot is at k = 3 (or you could say k = 4).

The ggplot2 Version Of The Scree Plot

If you want to get fancy and add colour to the scree plot. Here is the ggplot2 version of the scree plot.

The “elbow” of the scree plot is k = 3 (or k = 4).

Choosing the number of clusters as three, here is the code and output for the K-Means algorithm on the star data.


Alternate Plots For Choosing The Number Of Clusters

First Alternate Plot

The scree plot above used the total within sum of squares versus the k number of clusters. This scree plot uses the number of clusters (k) vs the ratio of total within sum of squares over the total sum of squares.

Here is the code and output.

 

Second Alternate Plot

Recall that after viewing the scree plot (elbow method) the chosen number of clusters was k = 3. The kmeans output included this part:

This ratio of between clusters sum of squares over the total sum of squares is the total variance. This total variance is the total variance in the data explained by the clusters. I would like to plot this total variance versus the number of clusters (k).

 

 

Even though the number of clusters chosen from the scree plot was 3, it seems that having 4 clusters is a good option as well. In general, increasing the number of clusters increases the total variance. In this case, going past 4 clusters does not increase the total variance that much.

 

Third Alternate Plot

This third plot would create the same output plot as the second plot. The formula used here would be in a slightly different form.

Recall that the total sum of squares is the sum of the between-clusters sum of squares and the total within-cluster sum of squares. (i.e. TSS = BSS + TWSS)

kmeans() with k = 3 in R gave an output of

The formula in that output is BSS / TSS. Using the rearranged equation BSS = TSS – TWSS, the ratio becomes

(TSS – TWSS) / TSS = 1 – (TWSS / TSS). This quantity looks somewhat familiar. It looks similar to to the general formula for the coefficient of determination R^2 (from linear regression). R^2 can be used as a measure of explained variance so this BSS / TSS = 1 – (TWSS / TSS) is the total variance in the data explained by the clusters.

Here is the code and output here. Note that TWSS / TSS is represented by ratio in R and was used in the first alternate plot.

 

Leave a Reply