With the slow migration of pieces across from https://biscuitchaserfc.blogspot.com/ I am working backwards with the pieces whilst refining them a little.
Next up…pulling both meta and shot data from Understat that can then be plotted in R using ggplot!
At the time of typing theres a few resources for dealing with x/y data thats worth looking into:
- Statsbomb (covered in upcoming posts)
- Canadian Premier League (yet to dive in but a super interesting league and regularly updated data)
- Wyscout (top 5 European Leagues 2017-2018)
- UnderstatR (will use this in this post)
Anyway, the plan:
- Load up the packages required
- Run 2020 team/player data
- Extract 2020 x/y shot locations
- Save to .csv
##install understatR if needed
remotes::install_github('ewenme/understatr')
##load packages
library(tidyverse)
library(understatr)
library(here) ##for saving output - more stable than set_wd()
library(ggrepel)
library(ggsoccer)
library(glue)
Firstly, find the available leagues and seasons:
##pull available leagues
leagues<-get_leagues_meta()
leagues
There is data available back to 2014. Use unique(leagues$league_name)
to view the leagues:
At this point, I will filter out the RFPL and focus on the European top 5 Leagues in 2020 so we are looking at this season’s data (obviously the league/s and season is up to you!)
##remove RFPL to leave top 5 European leagues
leagues<-leagues%>%
filter(!league_name == "RFPL")
##pull team match data
team_data<-map_dfr(unique(leagues$league_name), get_league_teams_stats, year = 2020)
team_data
We now have a load of data for each team, from each match, in the top 5 European Leagues. This is useful in itself but we will take it a step further to generate the meta data for each player in each team:
##pull player meta data
player_data<-map_dfr(unique(team_data$team_name), get_team_players_stats, year = 2020)
player_data
The above will take each individual team from our team_data
and get the player stats - this may take a while as once the code has run, you will have the data of 2599 players.
At this point I normally calculate some Per 90 stats and export to .csv so in the future I can just load the file rather than running the above.
##create some summary stats
summary<-player_data %>%
mutate(nineties = time/90,
npxg_p90 = npxG/nineties,
xa_p90 = xA/nineties)
##save player meta data
write_csv(summary, here("player_meta_2020.csv"))
You can dive into the player data but a quick vis to see the attacking contribution P90 using the summary stats created.
##quick plot npxg_p90 & xa_p90
summary %>%
filter(time>900) %>%
ggplot(aes(x=npxg_p90, y=xa_p90))+
geom_point() +
geom_text_repel(data = summary %>%
filter(npxg_p90>0.4 | xa_p90>0.4),
aes(label = player_name))+
labs(title = "Attacking Contribution",
subtitle = "European Big 5 Leagues // >900mins",
x = "NPxG P90",
y = "xA P90")
The above takes the summary data frame, filters players with over 900minutes only, then plots NPxG P90 against xA P90 before labelling the players above 0.4xA/xG P90. The outcome:
This is obviously a basic use of what is available but gives a quick idea!
Finally, within understatR is the ability to pull the x/y shot location data - this can be done in a few lines of code!
##create playId vector
players<-c(player_data$player_id)
##pull all x/y locations for playerId's
shot_data <- players %>%
map_dfr(.,possibly(get_player_shots,otherwise=NULL))
##filter 2020 shots only
shots_data_2020<-shot_data %>%
filter(year == 2020)
Again, grab a biscuit as this takes a while, but once run you will have ~210k shots. We once more will filter this down to the 2020 season only leaving ~26k shots and hit save:
##save 2020 shot data
write_csv(shots_data_2020, here("shots_data_2020.csv"))
Meta and shot data saved! This can be loaded whenever you require it and saves running the above process each time you wish to work with the data.
As a final step, let’s plot Lewandowski’s shots as he leads NPxG P90 from the above scatter.
##plot Lewandowski non-penalty shots
player_name<-"Robert Lewandowski"
shots_data_2020 %>%
filter(player=={player_name},
!situation=="Penalty") %>%
mutate(X = X*100,
Y = Y*100) %>%
ggplot(aes(x = X, y = 100-Y))+
annotate_pitch()+
geom_point(aes(colour = result, size = xG)) +
coord_flip(xlim = c(50,100),
ylim = c(0,100)) +
theme_pitch() +
labs(title = glue({player_name}),
subtitle = "Shot Locations 2020/2021 - Penalties Removed")
The above takes the shot data, filters for Lewandowski, removes penalties, multiplies the x/y coordinates to fit a 100x100 pitch as we will use for ggsoccer…then plots the shots! The colour variable relates to the shot outcome, with the size indicating the xG of the shot taken. Nothing too ground breaking here but nice to see the visual output of the data!
You can customise as much or as little as you want, but this is the basic groundwork to extract the data and plot.
As always, any questions or queries, let me now!
The full code is on my Github along with below.
library(tidyverse)
library(understatr)
library(here) ##for saving output - more stable than set_wd()
library(ggrepel)
library(ggsoccer)
##pull available leagues
leagues<-get_leagues_meta()
unique(leagues$league_name)
##remove RFPL to leave top 5 European leagues
leagues<-leagues%>%
filter(!league_name == "RFPL")
##pull team match data
team_data<-map_dfr(unique(leagues$league_name), get_league_teams_stats, year = 2020)
##pull player meta data
player_data<-map_dfr(unique(team_data$team_name), get_team_players_stats, year = 2020)
##create some summary stats p90
summary<-player_data %>%
mutate(nineties = time/90,
npxg_p90 = npxG/nineties,
xa_p90 = xA/nineties) %>%
filter(time>900)
##save player meta data
write_csv(summary, here("player_meta_2020.csv"))
##quick plot npxg_p90 & xa_p90
summary %>%
filter(time>900) %>%
ggplot(aes(x=npxg_p90, y=xa_p90))+
geom_point() +
geom_text_repel(data = summary %>%
filter(npxg_p90>0.4 | xa_p90>0.4),
aes(label = player_name))+
labs(title = "Attacking Contribution",
subtitle = "European Big 5 Leagues // >900mins",
x = "NPxG P90",
y = "xA P90")
##create playId vector
players<-c(player_data$player_id)
##pull all location x/y for playerId's
shot_data <- players %>%
map_dfr(.,possibly(get_player_shots,otherwise=NULL))
##filter 2020 shots only
shots_data_2020<-shot_data %>%
filter(year == 2020)
##save 2020 shot data
write_csv(shots_data_2020, here("shots_data_2020.csv"))
##plot Lewandowski non-penalty shots
player_name<-"Robert Lewandowski"
shots_data_2020 %>%
filter(player=={player_name},
!situation=="Penalty") %>%
mutate(X = X*100,
Y = Y*100) %>%
ggplot(aes(x = X, y = 100-Y))+
annotate_pitch()+
geom_point(aes(colour = result, size = xG)) +
coord_flip(xlim = c(50,100),
ylim = c(0,100)) +
theme_pitch() +
labs(title = glue({player}),
subtitle = "Shot Locations 2020/2021 - Penalties Removed")
This has been great, just getting into R for football stats. For some reason, my attacking contribution graphs keep coming out with the wrong axis scale and i have no idea why. But great piece really helped.