The previous post around getting started in R and loading StatsBomb data to plot seemed well received…so I will put some other bits together!
I should caveat that I’m not an incredible coder - I have been coding in R for around 12 months. Whilst I have found stuff that works, it may not be the most efficient/best practice to those more experienced!
Anyway, enough feeling self-conscious, let’s create:
Hopefully at this point opening up RStudio and loading in packages is straightforward enough - we will use some additional packages:
##packages
library(StatsBombR) ##StatsBomb data
library(tidyverse) ##manipulate data
library(ggrepel) ##prevent labels overlapping
library(ggsoccer) ##pitch plot
library(RColorBrewer) ##colourblind colours
##install ggsoccer (run once, before library(ggsoccer), if it is not already installed)
remotes::install_github("torvaney/ggsoccer")
Once the above has run, the StatsBomb data can be loaded in a similar way to the previous post. I will stick with the FAWSL 2019/2020 data.
##load competitions / filter FAWSL 2019/2020
comps<-FreeCompetitions() %>%
filter(competition_id==37, season_name=="2019/2020")
##get matches for competition
matches<-FreeMatches(comps)
##pull available data for all matches and clean
data<-StatsBombFreeEvents(MatchesDF = matches, Parallel = T) %>%
allclean()
data
In order to create the Per 90 stats for shots/goals/Non-Penalty xG, player minutes are required. Again, taking a look into the StatsBombR documentation, there’s a handy get.minutesplayed() function.
##get player minutes
mins<-get.minutesplayed(data)
head(mins)
This is useful as we now have 7 columns outlining the minutes played by the 200+ players in the dataset. Immediate problem…who does the player.id relate to? Both player.name and player.id exist in our data data frame, so we can perform a left_join(). This essentially combines two tables using a shared column:
##join player names and ID
players<-data %>%
distinct(player.id, player.name, team.name) %>%
left_join(mins, by = "player.id")
players
To explain the above:
- take the data data frame
- distinct() = unique rows in the data frame
- left_join() = add the mins data frame by the shared column “player.id”
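To see the distinct() and left_join() steps in isolation, here is a minimal sketch on made-up data. The toy tables and their values are assumptions for illustration only; just the column names mirror the ones used in this post.

```r
library(dplyr)

# Toy event data: one row per event, so players repeat (made-up values)
toy_events <- tibble(
  player.id   = c(1, 1, 2, 2, 3),
  player.name = c("Player A", "Player A", "Player B", "Player B", "Player C"),
  team.name   = c("Team X", "Team X", "Team X", "Team X", "Team Y")
)

# Toy minutes table: one row per player per match; player 3 has no minutes
toy_mins <- tibble(
  player.id     = c(1, 1, 2),
  match_id      = c(101, 102, 101),
  MinutesPlayed = c(90, 45, 60)
)

toy_players <- toy_events %>%
  distinct(player.id, player.name, team.name) %>%  # one row per unique player
  left_join(toy_mins, by = "player.id")            # keeps every player, NA where no match

toy_players

# na.omit() drops the row for the player without any minutes
toy_players_clean <- na.omit(toy_players)
```

Because left_join() keeps every row from the left-hand table, a player with no entry in the minutes table still appears, just with NA in the joined columns.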
Excellent! We now know who the player.id relates to, along with who they play for. There are a host of NA values in the dataset - these can be removed by using na.omit(), leaving us with the data we’re after. Now for some data manipulation to gather some per 90s!
In the previous post I used filter(), mutate(), and select(). The additional functions used will be group_by(), summarise() and arrange(), along with distinct().
##sum player minutes and calculate 90s
mins_stats<-players %>%
group_by(player.name) %>%
summarise(total_mins = round(sum(MinutesPlayed), digits = 2)) %>%
mutate(nineties = round(total_mins/90, digits = 2)) %>%
arrange(-nineties) %>%
distinct() %>%
ungroup()
Taking the players data frame we will:
- group_by() the player.name so all stats are calculated for each individual
- create a new column with summarise(), summing all minutes played by each individual before rounding the result to 2 digits (round(digits = 2))
- add a new column (mutate()) calculating how many 90s have been played
- arrange() = sort from most to fewest 90s
- distinct() = narrow results to unique rows
- ungroup() so future calculations aren’t affected
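The steps above can be sketched on a tiny made-up table, which makes the arithmetic easy to check by hand. The data here is invented for illustration; only the column names follow the post.

```r
library(dplyr)

# Toy per-match minutes (made-up values)
toy_players <- tibble(
  player.name   = c("Player A", "Player A", "Player B"),
  MinutesPlayed = c(90, 45, 60)
)

toy_mins_stats <- toy_players %>%
  group_by(player.name) %>%                               # one group per player
  summarise(total_mins = sum(MinutesPlayed)) %>%          # total minutes per player
  mutate(nineties = round(total_mins / 90, digits = 2)) %>%  # minutes expressed in 90s
  arrange(desc(nineties))                                 # most 90s first

toy_mins_stats
```

Player A has 135 minutes (1.5 nineties) and Player B has 60 minutes (0.67 nineties). With a single grouping column, summarise() already returns one ungrouped row per player, so as far as I can tell the distinct() and ungroup() calls in the post act as a safety net rather than changing the result.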
Noice. The next chunk will look quite hefty, but it uses dplyr functions that have already been used:
##get summary stats - goals/shots/xg - calc P90 stats
summary<-data %>%
mutate(is_shot = ifelse(type.name=="Shot", 1, 0),
is_goal = ifelse(shot.outcome.name=="Goal", 1, 0)) %>%
filter(is_shot == 1) %>%
group_by(player.name) %>%
summarise(shots = sum(is_shot),
goals = sum(is_goal),
npxg = sum(shot.statsbomb_xg)) %>%
left_join(mins_stats, by = "player.name") %>%
ungroup() %>%
mutate(shots_p90 = shots/nineties,
goals_p90 = goals/nineties,
npxg_p90 = npxg/nineties) %>%
filter(nineties>10) %>%
arrange(-npxg_p90)
summary
Hopefully the above is logical! We handled ifelse() in the last post, but using it in this context we are creating 2 new columns (is_shot, is_goal) - if the type.name is “Shot”, “1” will be posted in the new column, otherwise it will be “0”. We then go on to keep only shots (filter(is_shot == 1)), sum the shots/goals/Non-Penalty xG (summarise()) before adding in the mins_stats we created earlier. This then enables us to create the Per 90 stats! Yeeha! Finally we will keep players that have played over 10 90s before arranging npxg_p90 from highest to lowest!
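One thing worth checking: the chunk above sums xG over every shot, so penalties are counted too. Below is a hedged sketch of how they could be excluded before summarising. It assumes a shot.type.name column where penalties are labelled "Penalty" (check this against your own data), and the toy tables are made up for illustration.

```r
library(dplyr)

# Toy shot events (made-up values); shot.type.name is an assumed column
toy_shots <- tibble(
  player.name       = c("Player A", "Player A", "Player A", "Player B"),
  type.name         = "Shot",
  shot.type.name    = c("Open Play", "Penalty", "Open Play", "Open Play"),
  shot.outcome.name = c("Goal", "Goal", "Saved", "Off T"),
  shot.statsbomb_xg = c(0.30, 0.76, 0.10, 0.05)
)

# Toy 90s played per player (made-up values)
toy_nineties <- tibble(
  player.name = c("Player A", "Player B"),
  nineties    = c(2, 1)
)

toy_summary <- toy_shots %>%
  filter(type.name == "Shot",
         shot.type.name != "Penalty") %>%        # drop penalties first
  group_by(player.name) %>%
  summarise(shots = n(),                          # remaining shots per player
            goals = sum(shot.outcome.name == "Goal"),
            npxg  = sum(shot.statsbomb_xg)) %>%
  left_join(toy_nineties, by = "player.name") %>%
  mutate(shots_p90 = shots / nineties,
         goals_p90 = goals / nineties,
         npxg_p90  = npxg / nineties) %>%
  arrange(desc(npxg_p90))

toy_summary
```

In this toy example Player A ends up with 2 shots, 1 goal and 0.4 xG once the penalty is removed, which gives 0.2 NPxG per 90 over 2 nineties.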
At this point there’s been a fair amount of preparing the data - let’s visualise something!
##shots p90/goals p90 scatter
ggplot(data = summary,
aes(x = goals_p90,
y = shots_p90))+
geom_point() +
geom_hline(yintercept = 2,
colour = "red",
alpha = 0.7,
linetype = "dotted")+
geom_vline(xintercept = 0.3,
colour = "red",
alpha = 0.7,
linetype = "dotted")+
geom_text_repel(data = summary %>%
filter(shots_p90 >= 2 | goals_p90>= 0.3),
aes(label = player.name))+
theme_minimal() +
labs(title = "FAWSL Shots/Goals",
subtitle = "2019/2020 // Players >10 90s",
x = "Goals P90",
y = "Shots P90")
I won’t break down the above line for line, but simply we are putting the goals_p90 on the x axis with shots_p90 on the y axis. geom_text_repel() calls the ggrepel package that was loaded earlier…we filter the summary stats for players with shots_p90 greater than or equal to 2 OR (“|”) goals_p90 greater than or equal to 0.3.
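As a quick illustration of the OR filter used for the labels, here is a sketch on made-up numbers (the toy table is an assumption, the thresholds are the ones from the plot):

```r
library(dplyr)

# Toy per-90 stats (made-up values)
toy_summary <- tibble(
  player.name = c("Player A", "Player B", "Player C"),
  shots_p90   = c(2.5, 1.0, 1.8),
  goals_p90   = c(0.2, 0.1, 0.4)
)

# Keep a player if EITHER condition is true
to_label <- toy_summary %>%
  filter(shots_p90 >= 2 | goals_p90 >= 0.3)

to_label
```

Player A passes on shots, Player C passes on goals, and Player B meets neither threshold so would not get a label.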
To take this a step further, I will slice() the top 9 players before plotting their NPxG P90. Because players were arranged in the summary data frame in descending NPxG P90, we can simply keep rows 1-9:
##top 9 players by NPxG P90
summary<-summary %>%
slice(1:9)
##bar plot top 9 players
ggplot(data = summary,
aes(x = npxg_p90,
y = reorder(player.name, npxg_p90)))+
geom_col(fill = "midnightblue") +
geom_text(aes(label = round(npxg_p90, digits = 2)),
colour = "white",
hjust = 1.2,
fontface = "bold",
size = 5) +
labs(title = "FAWSL NPxG P90",
subtitle = "2019/2020",
x = "NPxG P90",
y = "",
caption = "Data: StatsBomb | @biscuitchaser") +
theme_minimal()+
theme(plot.title = element_text(size = 30, face = "bold"),
plot.subtitle = element_text(size = 15),
plot.caption = element_text(size = 10),
axis.title = element_text(size = 12),
axis.text = element_text(size = 12))
Much like the scatter, I won’t go through this forensically; however, the NPxG P90 is plotted on the x axis, whilst the players’ names are on the y axis, ordered (reorder()) from highest to lowest npxg_p90.
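For anyone curious what reorder() is doing behind the scenes, this small sketch on made-up values shows that it returns a factor whose levels are sorted by the second argument:

```r
# Toy data (made-up values)
toy <- data.frame(
  player.name = c("Player A", "Player B", "Player C"),
  npxg_p90    = c(0.45, 0.80, 0.60)
)

# Levels are ordered from the lowest to the highest npxg_p90
ordered_names <- reorder(toy$player.name, toy$npxg_p90)

levels(ordered_names)
```

ggplot draws the first factor level at the bottom of a vertical axis, which is why the player with the highest value ends up at the top of the bar plot.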
Ok! Let’s plot some shots!
All of the x/y location data is in the data data frame, so we will filter to only the top 9 players we identified above before keeping only the shots. Once more we will create an is_goal column to make things easier to plot before keeping selected columns:
##filter location data by top 9 players - shots only
shot_data<-data %>%
filter(player.name %in% unique(summary$player.name),
type.name == "Shot") %>%
mutate(is_goal = ifelse(shot.outcome.name=="Goal", 1, 0) %>%
as.factor()) %>%
select(player.name, location.x, location.y, is_goal, shot.outcome.name, shot.statsbomb_xg, team.name)
view(shot_data)
By using the %in% operator we can filter by multiple values. In this case, we take the unique player.name values from within the summary data frame. The $ refers to the column. Let’s plot!
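A minimal sketch of %in% and $ working together, using made-up tables in place of the real summary and event data:

```r
library(dplyr)

# Toy "top players" table and toy events (made-up values)
toy_top <- tibble(player.name = c("Player A", "Player C"))

toy_events <- tibble(
  player.name = c("Player A", "Player B", "Player C", "Player A"),
  type.name   = c("Shot", "Shot", "Pass", "Shot")
)

# toy_top$player.name pulls the column out as a plain vector;
# %in% then checks each event's player against that vector
toy_shot_data <- toy_events %>%
  filter(player.name %in% unique(toy_top$player.name),
         type.name == "Shot")

toy_shot_data
```

Only Player A’s two shots survive: Player B is not in the top table, and Player C’s only event is a pass.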
We will build this in stages as is the way with ggplot:
##plot half pitch
ggplot()+
annotate_pitch(dimensions = pitch_statsbomb)+
theme_pitch()+
coord_flip(xlim = c(60,120),
ylim = c(0,80))
The above uses the ggsoccer package to plot the base half-pitch. StatsBomb coordinates are different to those of Opta, which uses 100x100…StatsBomb uses 120x80 (removing theme_pitch() shows evidence of this).
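To make the 100x100 versus 120x80 difference concrete, here is a rough sketch of a straight linear rescale. The helper opta_to_statsbomb() is a hypothetical name of my own, not part of any package, and it only changes the scale: it assumes both systems share the same origin and axis direction, which is worth verifying before relying on it.

```r
# Hypothetical helper: stretch 0-100 coordinates onto a 120x80 pitch
# (scale only; assumes origin and axis direction already match)
opta_to_statsbomb <- function(x, y) {
  data.frame(
    x = x / 100 * 120,
    y = y / 100 * 80
  )
}

rescaled <- opta_to_statsbomb(x = c(0, 50, 100), y = c(0, 50, 100))
rescaled
```

The centre spot at (50, 50) on a 100x100 pitch lands on (60, 40) on the 120x80 pitch, which matches the xlim = c(60,120) used above to show only the attacking half.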
##add shots taken
ggplot()+
annotate_pitch(dimensions = pitch_statsbomb)+
theme_pitch()+
coord_flip(xlim = c(60,120),
ylim = c(0,80)) +
geom_point(data = shot_data,
aes(x = location.x,
y = location.y,
colour = is_goal,
size = shot.statsbomb_xg))
geom_point() now adds the shots to the pitch base plot, colouring by goal outcome (is_goal) and varying the point size based on the xG of the chance.
Personally, I like to highlight the goals further, but this next part is optional:
ggplot()+
annotate_pitch(dimensions = pitch_statsbomb)+
theme_pitch()+
coord_flip(xlim = c(60,120),
ylim = c(0,80)) +
geom_point(data = shot_data,
aes(x = location.x,
y = location.y,
colour = is_goal,
size = shot.statsbomb_xg)) +
geom_point(data = shot_data %>%
filter(is_goal==1),
aes(x = location.x,
y = location.y,
size = shot.statsbomb_xg),
shape = 21,
fill = "#66C2A5",
stroke = 0.6,
colour = "black")
By adding a second geom_point() I am filtering by goal only, before:
- Selecting a circle with a separate fill and border (shape = 21)
- Filling with #66C2A5
- Stroke = thickness of the point border
- Colour = colour of point border
Again, this is personal choice, but for me, goals are now more evident. Now to change the goal/no-goal colours and deal with the colour legend:
ggplot()+
annotate_pitch(dimensions = pitch_statsbomb)+
theme_pitch()+
coord_flip(xlim = c(60,120),
ylim = c(0,80)) +
geom_point(data = shot_data,
aes(x = location.x,
y = location.y,
colour = is_goal,
size = shot.statsbomb_xg)) +
geom_point(data = shot_data %>%
filter(is_goal==1),
aes(x = location.x,
y = location.y,
size = shot.statsbomb_xg),
shape = 21,
fill = "#66C2A5",
stroke = 0.6,
colour = "black")+
scale_colour_manual(values = c("#FC8D62", "#66C2A5"),
name = "Shot Outcome",
labels = c("No-Goal", "Goal"))
The two colours have been selected from the RColorBrewer package (you can see how to use it here) and are (hopefully!) colour blind friendly!
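For reference, the two hex codes appear to be the first two colours of RColorBrewer’s "Set2" palette. A small sketch of pulling them out, and of checking the package’s own colour-blind flag for that palette:

```r
library(RColorBrewer)

# First three colours of the Set2 palette (3 is the minimum brewer.pal allows)
set2 <- brewer.pal(n = 3, name = "Set2")
set2[1:2]

# brewer.pal.info lists, per palette, whether it is flagged colour-blind friendly
brewer.pal.info["Set2", "colorblind"]
```

Note the order: the goal colour (#66C2A5) comes first in the palette, whereas scale_colour_manual() above lists the no-goal colour first to match the 0/1 factor levels.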
Great! Now to add some labels and let the magic happen:
ggplot()+
annotate_pitch(dimensions = pitch_statsbomb)+
theme_pitch()+
coord_flip(xlim = c(60,120),
ylim = c(0,80)) +
geom_point(data = shot_data,
aes(x = location.x,
y = location.y,
colour = is_goal,
size = shot.statsbomb_xg)) +
geom_point(data = shot_data %>%
filter(is_goal==1),
aes(x = location.x,
y = location.y,
size = shot.statsbomb_xg),
shape = 21,
fill = "#66C2A5",
stroke = 0.6,
colour = "black")+
scale_colour_manual(values = c("#FC8D62", "#66C2A5"),
name = "Shot Outcome",
labels = c("No-Goal", "Goal"))+
labs(title = "FAWSL Top Non Penalty xG P90 Players",
subtitle = "2019/2020 // Players >10 90s",
caption = "Data: StatsBomb | @biscuitchaser",
size = "Non-Penalty xG")+
facet_wrap(~player.name)
The labs() part is self-explanatory; however, facet_wrap(~player.name) will break the current plot into the 9 top players we found earlier:
Rah! Both Kelly and Williams seemingly scoring from corners…Miedema a goal monster. At this point you could leave the vis alone, but let’s take it a step or two further by adding some of the P90 summary stats using geom_text():
ggplot()+
annotate_pitch(dimensions = pitch_statsbomb)+
theme_pitch()+
coord_flip(xlim = c(60,120),
ylim = c(0,80)) +
geom_point(data = shot_data,
aes(x = location.x,
y = location.y,
colour = is_goal,
size = shot.statsbomb_xg)) +
geom_point(data = shot_data %>%
filter(is_goal==1),
aes(x = location.x,
y = location.y,
size = shot.statsbomb_xg),
shape = 21,
fill = "#66C2A5",
stroke = 0.6,
colour = "black")+
scale_colour_manual(values = c("#FC8D62", "#66C2A5"),
name = "Shot Outcome",
labels = c("No-Goal", "Goal"))+
labs(title = "FAWSL Top Non Penalty xG P90 Players",
subtitle = "2019/2020 // Players >10 90s",
caption = "Data: StatsBomb | @biscuitchaser",
size = "Non-Penalty xG")+
facet_wrap(~player.name) +
geom_text(data = summary,
aes(x=80,
y=15,
label = paste("Shots P90:", round(shots_p90, digits = 2))))+
geom_text(data = summary,
aes(x=74,
y=15,
label = paste("Goals P90:", round(goals_p90, digits = 2))))+
geom_text(data = summary,
aes(x=68,
y=15,
label = paste("NPxG P90:", round(npxg_p90, digits = 2))))
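The label text in each geom_text() layer is built with paste() and round(). Run on its own with made-up numbers, that part looks like this:

```r
# Toy per-90 values (made-up)
toy_summary <- data.frame(
  player.name = c("Player A", "Player B"),
  shots_p90   = c(3.456, 2.124)
)

# paste() joins the prefix and the rounded number with a space
shot_labels <- paste("Shots P90:", round(toy_summary$shots_p90, digits = 2))
shot_labels
```

Because the summary data frame has one row per player and a player.name column, facet_wrap() can place each label in the matching player’s panel.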
Finally, add in theme() to style the overall plot!
ggplot()+
annotate_pitch(dimensions = pitch_statsbomb)+
theme_pitch()+
coord_flip(xlim = c(60,120),
ylim = c(0,80)) +
geom_point(data = shot_data,
aes(x = location.x,
y = location.y,
colour = is_goal,
size = shot.statsbomb_xg)) +
geom_point(data = shot_data %>%
filter(is_goal==1),
aes(x = location.x,
y = location.y,
size = shot.statsbomb_xg),
shape = 21,
fill = "#66C2A5",
stroke = 0.6,
colour = "black")+
scale_colour_manual(values = c("#FC8D62", "#66C2A5"),
name = "Shot Outcome",
labels = c("No-Goal", "Goal"))+
labs(title = "FAWSL Top Non Penalty xG P90 Players",
subtitle = "2019/2020 // Players >10 90s",
caption = "Data: StatsBomb | @biscuitchaser",
size = "Non-Penalty xG")+
facet_wrap(~player.name) +
geom_text(data = summary,
aes(x=80,
y=15,
label = paste("Shots P90:", round(shots_p90, digits = 2))))+
geom_text(data = summary,
aes(x=74,
y=15,
label = paste("Goals P90:", round(goals_p90, digits = 2))))+
geom_text(data = summary,
aes(x=68,
y=15,
label = paste("NPxG P90:", round(npxg_p90, digits = 2))))+
theme(strip.text = element_text(hjust = 0.5, size = 12, face = "bold"),
plot.title = element_text(size = 30, hjust = 0.5, face = "bold"),
plot.subtitle = element_text(size = 15, hjust = 0.5),
plot.caption = element_text(size = 10),
legend.key = element_blank(),
legend.text = element_text(size = 10),
legend.title = element_text(size = 12, face = "bold"),
legend.position = "top",
strip.background = element_blank())
Give the plot a save:
ggsave(plot = last_plot(), "shot_plot.png", dpi = 320, height = 10, width = 12)
##dpi = plot resolution
There we go! This has run a little longer than expected, so thanks for sticking with it if you’ve made it this far!
As always, give it a go and let me know if there are any issues.
##packages
library(StatsBombR)
library(tidyverse)
library(ggrepel)
library(ggsoccer)
library(RColorBrewer)
##load competitions / filter FAWSL 2019/2020
comps<-FreeCompetitions() %>%
filter(competition_id==37, season_name=="2019/2020")
##get matches for competition
matches<-FreeMatches(comps)
##pull available data for all matches and clean
data<-StatsBombFreeEvents(MatchesDF = matches, Parallel = T) %>%
allclean()
##get player minutes
mins<-get.minutesplayed(data)
##join player names and ID
players<-data %>%
distinct(player.id, player.name, team.name) %>%
left_join(mins, by = "player.id") %>%
na.omit()
##sum player minutes and calculate 90s
mins_stats<-players %>%
group_by(player.name) %>%
summarise(total_mins = round(sum(MinutesPlayed), digits = 2)) %>%
mutate(nineties = round(total_mins/90, digits = 2)) %>%
arrange(-nineties) %>%
distinct() %>%
ungroup()
##get summary stats - goals/shots/xg - calc P90 stats
summary<-data %>%
mutate(is_shot = ifelse(type.name=="Shot", 1, 0),
is_goal = ifelse(shot.outcome.name=="Goal", 1, 0)) %>%
filter(is_shot == 1) %>%
group_by(player.name) %>%
summarise(shots = sum(is_shot),
goals = sum(is_goal),
npxg = sum(shot.statsbomb_xg)) %>%
left_join(mins_stats, by = "player.name") %>%
ungroup() %>%
mutate(shots_p90 = shots/nineties,
goals_p90 = goals/nineties,
npxg_p90 = npxg/nineties) %>%
filter(nineties>10) %>%
arrange(-npxg_p90)
##shots p90/goals p90 scatter
ggplot(data = summary,
aes(x = goals_p90,
y = shots_p90))+
geom_point() +
geom_hline(yintercept = 2,
colour = "red",
alpha = 0.7,
linetype = "dotted")+
geom_vline(xintercept = 0.3,
colour = "red",
alpha = 0.7,
linetype = "dotted")+
geom_text_repel(data = summary %>%
filter(shots_p90 >= 2 | goals_p90>= 0.3),
aes(label = player.name))+
theme_minimal() +
labs(title = "FAWSL Shots/Goals",
subtitle = "2019/2020 // Players >10 90s",
x = "Goals P90",
y = "Shots P90")
##top 9 players by NPxG P90
summary<-summary %>%
slice(1:9)
##bar plot top 9 players
ggplot(data = summary,
aes(x = npxg_p90,
y = reorder(player.name, npxg_p90)))+
geom_col(fill = "midnightblue") +
geom_text(aes(label = round(npxg_p90, digits = 2)),
colour = "white",
hjust = 1.2,
fontface = "bold",
size = 5) +
labs(title = "FAWSL NPxG P90",
subtitle = "2019/2020",
x = "NPxG P90",
y = "",
caption = "Data: StatsBomb | @biscuitchaser") +
theme_minimal()+
theme(plot.title = element_text(size = 30, face = "bold"),
plot.subtitle = element_text(size = 15),
plot.caption = element_text(size = 10),
axis.title = element_text(size = 12),
axis.text = element_text(size = 12))
##filter location data by top 9 players - shots only
shot_data<-data %>%
filter(player.name %in% unique(summary$player.name),
type.name == "Shot") %>%
mutate(is_goal = ifelse(shot.outcome.name=="Goal", 1, 0) %>%
as.factor()) %>%
select(player.name, location.x, location.y, is_goal, shot.outcome.name, shot.statsbomb_xg, team.name)
##plot shots!
ggplot()+
annotate_pitch(dimensions = pitch_statsbomb)+
theme_pitch()+
coord_flip(xlim = c(60,120),
ylim = c(0,80)) +
geom_point(data = shot_data,
aes(x = location.x,
y = location.y,
colour = is_goal,
size = shot.statsbomb_xg)) +
geom_point(data = shot_data %>%
filter(is_goal==1),
aes(x = location.x,
y = location.y,
size = shot.statsbomb_xg),
shape = 21,
fill = "#66C2A5",
stroke = 0.6,
colour = "black")+
scale_colour_manual(values = c("#FC8D62", "#66C2A5"),
name = "Shot Outcome",
labels = c("No-Goal", "Goal"))+
labs(title = "FAWSL Top Non Penalty xG P90 Players",
subtitle = "2019/2020 // Players >10 90s",
caption = "Data: StatsBomb | @biscuitchaser",
size = "Non-Penalty xG")+
facet_wrap(~player.name) +
geom_text(data = summary,
aes(x=80,
y=15,
label = paste("Shots P90:", round(shots_p90, digits = 2))))+
geom_text(data = summary,
aes(x=74,
y=15,
label = paste("Goals P90:", round(goals_p90, digits = 2))))+
geom_text(data = summary,
aes(x=68,
y=15,
label = paste("NPxG P90:", round(npxg_p90, digits = 2))))+
theme(strip.text = element_text(hjust = 0.5, size = 12, face = "bold"),
plot.title = element_text(size = 30, hjust = 0.5, face = "bold"),
plot.subtitle = element_text(size = 15, hjust = 0.5),
plot.caption = element_text(size = 10),
legend.key = element_blank(),
legend.text = element_text(size = 10),
legend.title = element_text(size = 12, face = "bold"),
legend.position = "top",
strip.background = element_blank())
ggsave(plot = last_plot(), "shot_plot.png", dpi = 320, height = 10, width = 12)