Scatterplots In R Using The ggvis Package

Hello. On this page, I will talk about drawing scatterplots with the statistical program R and the ggvis package for data visualization.


Table of Contents

  1. The Dataset
  2. Looking At The Data
  3. The Scatterplots and Linear Models
  4. Notes and Thoughts
  5. References


The Dataset

This dataset is about time spent on the internet. The variables of interest are the population of each country, the number of internet users and the number of Facebook users.


Looking At The Data

In R, we load the ggvis and dplyr libraries. Loading ggvis gives us access to the plots, graphs and other visual tools in the library. The dplyr package allows us to use the %>% syntax with ggvis. More info on %>% can be found here. The dataset is from here.

The code and output can be found below:

ggvis_01

The output looks messy, but we can still extract key information. There are 32 observations (rows) and 11 variables (columns). The 32 observations are countries, each with its own population size, number of internet users, number of Facebook users, GDP and the like.

As mentioned earlier, the variables of interest in this dataset are the population of the country, the number of internet users and the number of Facebook users.

The population figures and the other counts are very large numbers, which would make the plots crowded and hard to read.

We therefore express the population and the numbers of users in units of 100,000 (each value is divided by 100,000). Here is the code and output:

ggvis_02

We create a new dataset from the original dataset with just the population size, the number of internet users and the number of Facebook users.
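Since the code above survives only as a screenshot, here is a minimal sketch of how the rescaling and column selection might be done with dplyr. The data frame `internet_raw`, its column names and its values are all hypothetical stand-ins, not the real dataset, and the country column is kept here only so rows can be identified later.

```r
library(dplyr)

# Hypothetical toy data: names and values are made up for illustration.
internet_raw <- data.frame(
  country        = c("Country A", "Country B", "Country C"),
  population     = c(5.0e6, 1.2e8, 1.3e9),
  internet_users = c(4.0e6, 6.0e7, 4.0e8),
  facebook_users = c(2.0e6, 2.0e7, 1.0e6),
  gdp            = c(1.0e11, 2.0e12, 5.0e12)
)

internet_small <- internet_raw %>%
  # Express all three counts in units of 100,000.
  mutate(
    population     = population / 1e5,
    internet_users = internet_users / 1e5,
    facebook_users = facebook_users / 1e5
  ) %>%
  # Keep only the columns needed for the plots.
  select(country, population, internet_users, facebook_users)

internet_small
```

After this step a population of 5,000,000 appears as 50, which keeps the axis values short.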


The Scatterplots and Linear Models

After looking at our data and cleaning it up a bit, we plot the data.

We will draw two scatterplots. The first compares population size with the number of internet users. The second compares the number of internet users with the number of Facebook users.

Model 1

Here is the code and output of the first model.

ggvis_03
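As a rough text version of a plot like this, the sketch below draws a scatterplot with a fitted line using ggvis. The data frame `toy` and its values are hypothetical and are not the post's dataset; `layer_points()`, `layer_model_predictions()` and `add_axis()` are ggvis functions.

```r
library(ggvis)

# Hypothetical toy data in units of 100,000; not the real dataset.
toy <- data.frame(
  population     = c(50, 120, 300, 800, 1500),
  internet_users = c(30, 60, 150, 300, 500)
)

plot1 <- toy %>%
  ggvis(x = ~population, y = ~internet_users) %>%
  layer_points() %>%                          # the scatterplot
  layer_model_predictions(model = "lm") %>%   # the line of best fit
  add_axis("x", title = "Population (per 100,000)") %>%
  add_axis("y", title = "Internet users (per 100,000)")
```

Typing `plot1` at the R console displays the plot in the viewer or browser.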

Above is the scatterplot together with a linear model (“line of best fit”). If we want the linear model in a more mathematical form, such as y = mx + b, we run this code.

ggvis_04

The fitted linear model (in units of 100,000) is:

Number of Internet Users = 0.2860429 * Population Size + 184.4877923

According to the model, for every one-unit increase in population (that is, 100,000 more people), the number of internet users increases by 0.2860429 * 1 = 0.2860429 units, or about 28,604 users.
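To make the steps concrete, here is a small sketch in base R. The first part fits a line with `lm()` and reads off the coefficients with `coef()`, using made-up data that lies exactly on y = 2 + 0.5x. The second part repeats the unit conversion using the slope reported in this post.

```r
# Hypothetical toy data lying exactly on y = 2 + 0.5 * x.
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2.5, 3.0, 3.5, 4.0))

fit <- lm(y ~ x, data = toy)
coef(fit)   # intercept (b) first, then slope (m)

# Slope reported in the post, in units of 100,000 people.
slope <- 0.2860429

# One extra unit of population is 100,000 people, so the expected
# number of additional internet users is:
extra_users <- slope * 1e5
extra_users
```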

Notice the three rightmost points on the plot. In statistics, such extreme values are called outliers. There are formal ways of determining whether a point is an outlier, but they won’t be discussed here. Informally, a point that is far away from the rest of the points and from the line is likely an outlier.

One way of finding the outliers here is to pick out the points with a population size of over 3000 (per 100 thousand), that is, over 300 million people. The code is below.

ggvis_05

 

Those three points, with large population sizes and large numbers of internet users, belong to China, India and the United States.
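A filter like the one described above might look like the following sketch, which uses dplyr's `filter()`. The data frame `toy`, the country labels and the values are hypothetical; only the threshold of 3000 comes from the post.

```r
library(dplyr)

# Hypothetical toy data in units of 100,000; not the real dataset.
toy <- data.frame(
  country        = c("Country A", "Country B", "Country C", "Country D"),
  population     = c(50, 640, 3100, 12000),
  internet_users = c(30, 400, 2400, 3900)
)

# Keep only the rows whose population exceeds 3000 units
# (more than 300 million people).
big_countries <- toy %>%
  filter(population > 3000)

big_countries
```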

Model 2

In the second model, we compare the number of internet users to the number of Facebook users.

 

ggvis_06

The slope here is positive (upward sloping) but less steep (flatter) than in the first model. Let’s find out what the linear model is.

ggvis_07

 

The linear model here is:

Number of Facebook Users = 0.08383939 * Number of Internet Users + 189.96371821

Again we have three outliers in this second plot.

ggvis_08

The three outliers are from China, India and the United States once again.

 


Notes and Thoughts

I am aware that the linear model should go through the origin: if there is no population, there cannot be any internet users or Facebook users, and no internet users means no Facebook users. However, I do not know at this time how to plot a linear model through the origin (intercept of zero) using ggvis.

In R itself, one can fit a linear model through the origin. As an example we can have:

Number of Facebook Users = 0.1637 * Number of Internet Users
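For reference, a zero-intercept fit in base R can be written by adding `0 +` to the model formula. The sketch below uses hypothetical data that lies exactly on y = 0.25x, so it does not reproduce the 0.1637 above.

```r
# Hypothetical toy data lying exactly on y = 0.25 * x.
toy <- data.frame(
  internet_users = c(100, 200, 400, 800),
  facebook_users = c(25, 50, 100, 200)
)

# "0 +" removes the intercept, forcing the line through the origin.
fit_origin <- lm(facebook_users ~ 0 + internet_users, data = toy)

coef(fit_origin)   # a single coefficient: the slope
```

The ggvis function `layer_model_predictions()` also has a `formula` argument, so passing a zero-intercept formula there may be worth trying, though that has not been verified here.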


References

As someone relearning parts of R along with packages such as dplyr, ggvis, ggplot2 and more, I have to thank the following sources for their valuable information.

  • http://ggvis.rstudio.com/cookbook.html
  • http://ggvis.rstudio.com/axes-legends.html
  • DataCamp