Scrape, process and load football-related data from FBref, Sofascore and Transfermarkt

felipeall/football-data-platform

Football Data Platform

WIP: This project is still under development; more updates to come.

The Football Data Platform is a data aggregation tool for football enthusiasts, analysts, and researchers. It collects football-related data from three platforms: FBref, Sofascore and Transfermarkt. Once fetched, it saves each page's data as JSON files and then loads them into a PostgreSQL database for structured queries and analytics.


Features

  • Data Scraping: Pulls data from Transfermarkt, Sofascore, and FBref with Scrapy spiders.
  • Data Storage: Stores raw webpage data as JSON files.
  • Database Loading: Inserts and structures the scraped data into a PostgreSQL database.
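
The storage and loading steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `save_raw` and `load_rows`, the file layout, the `teams` table and the sample payload are all hypothetical, and the standard-library `sqlite3` module stands in for PostgreSQL so the snippet runs without a database server.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def save_raw(payload: dict, directory: Path, name: str) -> Path:
    """Write one scraped payload to <directory>/<name>.json and return the path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.json"
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path


def load_rows(directory: Path, conn: sqlite3.Connection) -> int:
    """Read every JSON file in the directory and upsert its teams into a table."""
    conn.execute("CREATE TABLE IF NOT EXISTS teams (id INTEGER PRIMARY KEY, name TEXT)")
    count = 0
    for path in sorted(directory.glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        for team in data.get("teams", []):
            # INSERT OR REPLACE keeps the load idempotent when a file is re-processed.
            conn.execute(
                "INSERT OR REPLACE INTO teams (id, name) VALUES (?, ?)",
                (team["id"], team["name"]),
            )
            count += 1
    conn.commit()
    return count


# Hypothetical payload; real scraped pages will have a different shape.
raw_dir = Path(tempfile.mkdtemp())
save_raw({"teams": [{"id": 1, "name": "Team A"}, {"id": 2, "name": "Team B"}]}, raw_dir, "example")
connection = sqlite3.connect(":memory:")
loaded = load_rows(raw_dir, connection)
```

Keeping the raw JSON on disk means the database can be rebuilt from the files without scraping the sites again.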

Prerequisites

  • Python 3.9+
  • Poetry
  • Docker

Installation

  1. Clone this repository:
git clone https://github.com/your-github-username/football-data-platform.git
cd football-data-platform
  2. Create a Poetry virtual environment and install the dependencies:
poetry shell
poetry install --no-root
  3. Create a .env file in the root directory:
cp .env.example .env
  4. Build the Docker image and spin up the containers:
docker compose up -d --build
  5. Run the database migrations:
alembic upgrade head

Scraping

Sofascore

scrapy crawl sofascore -a TOURNAMENT_ID=<tournament_id> -a SEASON_ID=<season_id>

Where <tournament_id> and <season_id> are the tournament and season identifiers, respectively. Both can be found in the URL of the tournament page on Sofascore. If no SEASON_ID is provided, the crawler scrapes all seasons with available data.

Example: LaLiga 23/24

scrapy crawl sofascore -a TOURNAMENT_ID=8 -a SEASON_ID=52376
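
When crawls are launched from a script, the optional season argument described above can be handled with a small helper. The sketch below is only an illustration: `build_crawl_args` is a hypothetical name and not part of the project. It assembles the same `scrapy crawl` command line and leaves out `SEASON_ID` when no season is given.

```python
import shlex
from typing import List, Optional


def build_crawl_args(spider: str, tournament_id: str, season_id: Optional[str] = None) -> List[str]:
    """Return the argv list for a crawl; SEASON_ID is added only when provided."""
    args = ["scrapy", "crawl", spider, "-a", f"TOURNAMENT_ID={tournament_id}"]
    if season_id is not None:
        args += ["-a", f"SEASON_ID={season_id}"]
    return args


# One season of LaLiga, then every season with available data.
single_season = shlex.join(build_crawl_args("sofascore", "8", "52376"))
all_seasons = shlex.join(build_crawl_args("sofascore", "8"))
```

The returned list can be passed directly to `subprocess.run(args, check=True)` from the project directory.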

Transfermarkt

scrapy crawl transfermarkt -a TOURNAMENT_ID=<tournament_id> -a SEASON_ID=<season_id>

Where <tournament_id> and <season_id> are the tournament and season identifiers, respectively. They can be found in the URL of the tournament page on Transfermarkt.

Example: LaLiga 23/24

scrapy crawl transfermarkt -a TOURNAMENT_ID=ES1 -a SEASON_ID=2023
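
As a rough illustration of reading the identifiers from the URL, the sketch below pulls them out of a competition page address. It assumes a URL shaped like `https://www.transfermarkt.com/laliga/startseite/wettbewerb/ES1/plus/?saison_id=2023`, with the tournament code after the `wettbewerb` path segment and the season in the `saison_id` query parameter. Other URL layouts are not handled, and `parse_transfermarkt_url` is a hypothetical helper, not part of the project.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse


def parse_transfermarkt_url(url: str) -> Tuple[str, Optional[str]]:
    """Return (tournament_id, season_id) from an assumed Transfermarkt competition URL."""
    parsed = urlparse(url)
    segments = [segment for segment in parsed.path.split("/") if segment]
    if "wettbewerb" not in segments:
        raise ValueError("no 'wettbewerb' segment found in URL path")
    index = segments.index("wettbewerb")
    if index + 1 >= len(segments):
        raise ValueError("no tournament code after 'wettbewerb'")
    tournament_id = segments[index + 1]
    # The season is optional; None means the URL did not carry a saison_id parameter.
    season_id = parse_qs(parsed.query).get("saison_id", [None])[0]
    return tournament_id, season_id


ids = parse_transfermarkt_url(
    "https://www.transfermarkt.com/laliga/startseite/wettbewerb/ES1/plus/?saison_id=2023"
)
```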

FBref

scrapy crawl <spider_name>

Where <spider_name> is the name of the spider to be executed. The available spiders are:

  • FBrefBRA1
  • FBrefEPL
  • FBrefUCL

Processing

usage: processing [-h] [--full-load] [--debug] [{sofascore,transfermarkt}]

positional arguments:
  {sofascore,transfermarkt}  Source to process data from.

optional arguments:
  -h, --help   show this help message and exit
  --full-load  Process and load all data from the source
  --debug      Enable debug mode.

Example:

python app/processing sofascore --full-load
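
For readers unfamiliar with this style of help text, an `argparse` parser along the following lines would produce an equivalent interface. It is reconstructed from the usage message alone and is not taken from the project's source, so `build_parser` and the handling of a missing source are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented usage of the processing command."""
    parser = argparse.ArgumentParser(prog="processing")
    # nargs="?" makes the source optional, matching the square brackets in the usage line.
    parser.add_argument(
        "source",
        nargs="?",
        choices=["sofascore", "transfermarkt"],
        help="Source to process data from.",
    )
    parser.add_argument("--full-load", action="store_true", help="Process and load all data from the source")
    parser.add_argument("--debug", action="store_true", help="Enable debug mode.")
    return parser


# Equivalent to: python app/processing sofascore --full-load
args = build_parser().parse_args(["sofascore", "--full-load"])
```

With this sketch, `args.source` is `None` when no source is passed; what the project does in that case is not stated in the help text.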
