Football Data & SQL in R...continued!
How to go further with an SQL database in R using football shot data
I have been on a small hiatus busy with other stuff, but I was messaged this week asking if there was a follow up from this piece in December. This also ties in with some other work I have been doing, so thought I would add an additional piece.
The natural next step once creating the SQL database and populating with data, is to automate the process.
General work flow looking like:
Identify data (in this case, Understat shot location data)
Create a SQL database and populate with the above data (in the previous piece)
Create a script to pull the data as it’s updated - updating the SQL database on at user selected intervals
Automate the above
Feed data into rmarkdown/shiny to create insights
This will primarily focus on the 3rd and 4th bullet point…I guess the final bullet point is something to revisit…probably in 6 months time!
Anyway, loading the packages:
library(tidyverse)
library(glue)
library(DBI) ##install.packages("DBI")
library(RSQLite) ##install.packages("RSQLite")
library(here) ##install.packages("here")
library(worldfootballR) ##install.packages("worldfootballR")
Previously I would have used the understatr
package, however worldfootballR
now ties in a load of useful football related data from fbRef, transfermarkt, Understat and Fotmob.
We will next open the connection to our Understat SQL database before checking the tables for all leagues (EPL, La Liga, Ligue 1, Bundesliga, Serie A) exist from 2014-2020. We are missing data from the season starting in 2021 so will use this to add to our existing database.
##create .db ----
con <- dbConnect(drv = RSQLite::SQLite(),
here::here("understat_shot.db"))
##check data exists in .db ----
dbListTables(con)
Yay!
Next to specify the leagues we wish to obtain data for:
##leagues
leagues <- c("EPL",
"La liga",
"Bundesliga",
"Serie A",
"Ligue 1")
Finally, create a function to pull the data:
##create function
get_shots <- function(league_name, season) {
1. ##print league data being obtained
print(glue::glue("Pulling data for {league_name} - {season}"))
2. ##pull data
dat <- understat_league_season_shots(league = league_name,
season_start_year = season)
3. ##edit string for saving files
table_name <- league_name %>%
str_replace(., " ","_") %>%
paste0(., "_", season) %>%
tolower
4. ##save to .csv
write_csv(dat, here::here("data", glue::glue("{table_name}.csv")))
5. ##write data to SQL db
dbWriteTable(con,
name = table_name,
dat,
append = TRUE)
}
Above, I have created a get_shots
function where the league name and season can be provided in order to pull the data. The function will -
Print a message for league and season being run
Pull the data (
dat
)Some string editing for the saved files
Write the data to a .csv - using the
here
package to save within a data folder in the Understat projectWrite the to SQL database
Now this is set up we can use purrr::walk()
to cycle through the leagues for 2021:
##get data
purrr::walk(leagues, ~get_shots(league_name = .x,
season = "2021"),
con)
Finally, check database that 2021 data has been added:
Lovely stuff.
Just change the season to 2022 and save the script!
The Final part is to automate the above, now we have a script that works.
This is something I have recently looked into, so it may not be the ultimate guide, but seems to be something that works! We can use cronR in mac and TaskscheduleR in windows to set a cron job to run the above script. I’m working on mac so will be using cronR
.
Following the set up guide, I will use the shiny add in:
install.packages('cronR')
install.packages('miniUI')
install.packages('shiny')
install.packages('shinyFiles')
Using either Addins in Rstudio or using cronR::cron_rstudioaddin()
the shiny interface can be accessed.
Simply select the file we wish to run (the above saved script), select the time and schedule (once a week Monday morning?) click ‘create job’, ‘done’ and you are away! You can double check the job has been added by looking at ‘Manage existing jobs’.
The job will not run unless the computer/laptop is on, therefore Understat shots will be added to the SQL database once opened.
Your database will now be automatically updated without having to manually run the script! This allows quick access to data with focus on output.
Hopefully this is of use to at least one person!
Full code on github.
The full code:
# Load packages -----------------------------------------------------------
library(worldfootballR)
library(glue)
library(tidyverse)
library(DBI)
library(RSQLite)
library(here)
##connect to db
con <- dbConnect(drv = RSQLite::SQLite(),
here::here("understat_shot.db"))
##leagues
leagues <- c("EPL",
"La liga",
"Bundesliga",
"Serie A",
"Ligue 1")
# function ----------------------------------------------------------------
get_shots <- function(league_name, season) {
##print league data being obtained
print(glue::glue("Pulling data for {league_name} - {season}"))
##pull data
dat <- understat_league_season_shots(league = league_name,
season_start_year = season)
##edit string for saving files
table_name <- league_name %>%
str_replace(., " ","_") %>%
paste0(., "_", season) %>%
tolower
##save to .csv
write_csv(dat, here::here("data", glue::glue("{table_name}.csv")))
##write data to SQL db
dbWriteTable(con,
name = table_name,
dat,
append = TRUE)
}
# pull data ---------------------------------------------------------------
purrr::walk(leagues, ~get_shots(league_name = .x,
season = "2021"),
con)
##disconnect fron db
dbDisconnect(con)