Hello. In this guide, I show how R and the dplyr package can be used to extract information from soccer data. The guide grew out of my experiments with R and its dplyr data manipulation package.
R code and output are provided.
- The Worldcup Data
- Finding Data Using R’s dplyr Package
- Subsetting The Data
- Selecting Data With Subsetted Data
The Worldcup Data
I found the World Cup dataset in the faraway package in R. The dataset is available as the data frame worldcup. Shots, Passes, Tackles and Saves are counts. I do not know why the dataset does not include goals scored by each player. We work with what we have here.
The documentation for the dataset is linked at the end of this guide.
The variable Team is the player's country in the tournament. Position is one of Midfielder, Forward, Defender or Goalkeeper. The variable Time is the number of minutes played during the tournament.
You can install the dplyr package and the faraway package with the code:
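A minimal sketch of the installation step, assuming both packages are fetched from CRAN and a CRAN mirror is already configured:

```r
# Install dplyr and faraway from CRAN (only needs to be done once).
install.packages(c("dplyr", "faraway"))
```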
After installation, we load the packages and the data as follows:
# Analyzing worldcup data from 2010 World Cup using dplyr:
# faraway package:
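A sketch of what the loading step might look like, assuming both packages are already installed. The explicit data() call is a precaution in case the dataset is not loaded lazily:

```r
# Load dplyr for the data manipulation verbs and the %>% pipe.
library(dplyr)

# Load faraway, which ships the worldcup data frame.
library(faraway)
data(worldcup)

# Quick look at the size of the data frame (rows, columns).
dim(worldcup)
```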
We then save the worldcup dataset into a variable called wc_2010 and take a look at the data using the head() and str() functions in R.
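A minimal sketch of this step, assuming the faraway package is installed:

```r
# Make the worldcup data frame available and copy it to wc_2010.
data(worldcup, package = "faraway")
wc_2010 <- worldcup

# First six rows, then the structure (column names and types).
head(wc_2010)
str(wc_2010)
```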
Notice how each row starts with the player’s last name and not with an index number. The player name is not a column of the dataset; it is stored in the row names of the data frame.
The following code copies the player names from the row names into a new column called Player.
Each row is still labelled with the player’s last name, which looks odd once Player is a column. The row names will be replaced by the indices 1, 2, 3 and so on.
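One way to do both steps in base R is sketched below. The column name Player matches the name used later in this guide; the rest is an assumption about how the conversion could be written:

```r
data(worldcup, package = "faraway")
wc_2010 <- worldcup

# Copy the row names (player names) into a regular column.
wc_2010$Player <- rownames(wc_2010)

# Drop the row names so rows are labelled 1, 2, 3, ...
rownames(wc_2010) <- NULL

head(wc_2010)
```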
Finding Data Using R’s dplyr Package
The neat thing about R’s dplyr is that you can easily pull information out of a data frame. Its verbs are very similar to SQL clauses: select() picks columns and filter() picks rows, much like SELECT and WHERE.
The following code lists the players in this World Cup (in two ways).
The head() function shows the first rows of the full dataset. The n = 15 argument was used to show the first 15 rows of the dataset / vector.
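A sketch of the two ways, assuming wc_2010 has been given a Player column as described earlier:

```r
library(dplyr)
data(worldcup, package = "faraway")
wc_2010 <- worldcup
wc_2010$Player <- rownames(wc_2010)
rownames(wc_2010) <- NULL

# Way 1: ordinary nested function calls.
players_a <- head(select(wc_2010, Player), n = 15)

# Way 2: the same steps written with the pipe operator.
players_b <- wc_2010 %>% select(Player) %>% head(n = 15)

players_b
```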
The %>% Pipe Operator
As you may have noticed above, the two lines of code give the same output. When the pipe operator %>% is used, the object in front of %>% is passed as the first argument to the function after %>%.
Instead of select(wc_2010, Player), we use wc_2010 %>% select(Player). The pipe operator is used from here on out.
As another example, instead of having f(x, y) I could use x %>% f(y).
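A tiny illustration of that equivalence, using round() as a stand-in for f (the values here are chosen only for the example):

```r
library(dplyr)  # dplyr re-exports the %>% pipe from magrittr

x <- 3.14159
a <- round(x, 2)      # f(x, y)
b <- x %>% round(2)   # x %>% f(y)

identical(a, b)
```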
We can list the 32 teams in the 2010 World Cup:
When we select from a column, it is possible to have duplicate rows. (In this case, we have multiple players from the same team.) The %>% distinct() part was used after select() to remove duplicates.
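A minimal sketch of the team listing (the variable name teams is only for illustration):

```r
library(dplyr)
data(worldcup, package = "faraway")
wc_2010 <- worldcup

# One row per team: select the column, then drop duplicate rows.
teams <- wc_2010 %>% select(Team) %>% distinct()

teams
nrow(teams)  # number of distinct teams
```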
Suppose we want to select players with a playing time of at least 45 minutes. We also want to know how many players played at least 45 minutes at the tournament. The code is as follows:
The select() command picks columns, while the filter() command in dplyr extracts the rows/observations which meet a condition.
To get players who played at least 45 minutes, we use Time >= 45 as the condition.
To count them, summarise(Count = n()) returns the number of rows that remain after filtering.
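A sketch of both queries, assuming the Player column has been created as described earlier (played_45 and count_45 are illustrative names):

```r
library(dplyr)
data(worldcup, package = "faraway")
wc_2010 <- worldcup
wc_2010$Player <- rownames(wc_2010)
rownames(wc_2010) <- NULL

# Players with at least 45 minutes of playing time.
played_45 <- wc_2010 %>%
  filter(Time >= 45) %>%
  select(Player, Team, Time)
head(played_45)

# How many such players there are.
count_45 <- wc_2010 %>%
  filter(Time >= 45) %>%
  summarise(Count = n())
count_45
```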
Here are the top 10 players with the most playing time.
The top_n() function is used after select(). It keeps the rows with the n largest values in a given column, here n = 10, but it does not sort them, and tied rows are all kept. (In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), although it still works.)
The arrange() function is used last. It orders rows by the values in a column. By default, it arranges from low to high. To arrange from high to low, wrap the column in desc().
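A sketch of the top-10 query, assuming the Player column has been created as described earlier (top_time is an illustrative name):

```r
library(dplyr)
data(worldcup, package = "faraway")
wc_2010 <- worldcup
wc_2010$Player <- rownames(wc_2010)
rownames(wc_2010) <- NULL

top_time <- wc_2010 %>%
  select(Player, Team, Time) %>%
  top_n(10, Time) %>%      # keep the 10 largest Time values (ties kept)
  arrange(desc(Time))      # sort from most to fewest minutes

top_time
```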
Here is the average playing time for a player at the tournament.
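One way this could be computed (the column name Avg_Time is an illustrative choice):

```r
library(dplyr)
data(worldcup, package = "faraway")
wc_2010 <- worldcup

# Mean minutes played across all players in the dataset.
avg_time <- wc_2010 %>% summarise(Avg_Time = mean(Time))
avg_time
```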
Subsetting the Data
The filter() function in dplyr can be used to create smaller datasets which are a subset of the original dataset.
We can create four smaller datasets: one for midfielders, one for forwards (strikers), one for defenders and one for goalkeepers.
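A sketch of the four subsets, assuming the Player column has been created as described earlier. The names midfielders, forwards, defenders and goalkeepers are illustrative:

```r
library(dplyr)
data(worldcup, package = "faraway")
wc_2010 <- worldcup
wc_2010$Player <- rownames(wc_2010)
rownames(wc_2010) <- NULL

# One data frame per playing position.
midfielders <- wc_2010 %>% filter(Position == "Midfielder")
forwards    <- wc_2010 %>% filter(Position == "Forward")
defenders   <- wc_2010 %>% filter(Position == "Defender")
goalkeepers <- wc_2010 %>% filter(Position == "Goalkeeper")

# Number of players in each subset.
c(nrow(midfielders), nrow(forwards), nrow(defenders), nrow(goalkeepers))
```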
Selecting Data With Subsetted Data
Here are the French forwards of the 2010 World Cup.
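A minimal sketch, assuming a forwards subset built with filter() as described above and that the team is labelled "France" in the Team column:

```r
library(dplyr)
data(worldcup, package = "faraway")
wc_2010 <- worldcup
wc_2010$Player <- rownames(wc_2010)
rownames(wc_2010) <- NULL
forwards <- wc_2010 %>% filter(Position == "Forward")

# Forwards whose team is France.
french_forwards <- forwards %>% filter(Team == "France")
french_forwards
```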
Here are the forwards from the countries Spain, England, Germany and Italy.
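The same idea extends to several teams with the %in% operator. This sketch assumes the four team labels below appear in the Team column exactly as written:

```r
library(dplyr)
data(worldcup, package = "faraway")
wc_2010 <- worldcup
wc_2010$Player <- rownames(wc_2010)
rownames(wc_2010) <- NULL
forwards <- wc_2010 %>% filter(Position == "Forward")

# Forwards from any of the four listed countries.
big_four <- c("Spain", "England", "Germany", "Italy")
selected_forwards <- forwards %>% filter(Team %in% big_four)
selected_forwards
```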
Top 10 Goalkeepers With The Most Saves
Top 10 Forwards With The Most Attempted Shots
Top 10 Midfielders With The Most Passes
Top 10 Defenders With The Most Tackles
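The four lists above could be produced with the same select(), top_n() and arrange() pattern used for playing time. This is a sketch, assuming the Player column has been created as described earlier; the result names are illustrative:

```r
library(dplyr)
data(worldcup, package = "faraway")
wc_2010 <- worldcup
wc_2010$Player <- rownames(wc_2010)
rownames(wc_2010) <- NULL

# Goalkeepers with the most saves.
top_keepers <- wc_2010 %>% filter(Position == "Goalkeeper") %>%
  select(Player, Team, Saves) %>% top_n(10, Saves) %>% arrange(desc(Saves))

# Forwards with the most attempted shots.
top_forwards <- wc_2010 %>% filter(Position == "Forward") %>%
  select(Player, Team, Shots) %>% top_n(10, Shots) %>% arrange(desc(Shots))

# Midfielders with the most passes.
top_midfielders <- wc_2010 %>% filter(Position == "Midfielder") %>%
  select(Player, Team, Passes) %>% top_n(10, Passes) %>% arrange(desc(Passes))

# Defenders with the most tackles.
top_defenders <- wc_2010 %>% filter(Position == "Defender") %>%
  select(Player, Team, Tackles) %>% top_n(10, Tackles) %>% arrange(desc(Tackles))

top_keepers
```

Because top_n() keeps tied rows, any of these data frames may contain more than ten rows.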
Top ten lists can have ties.
The dplyr package in R is very useful in extracting information from datasets.
As mentioned before, it is disappointing that there is no information about goals in the worldcup dataset in the faraway library.
R Documentation of worldcup dataset: http://www.rdocumentation.org/packages/faraway/versions/1.0.7/topics/worldcup
Data Wrangling with dplyr and tidyr Cheat Sheet:
The featured image is from https://upload.wikimedia.org/wikipedia/en/thumb/2/21/2010_FIFA_World_Cup_logo.svg/932px-2010_FIFA_World_Cup_logo.svg.png.