mlb-pitcher-xK is a data science and machine learning project focused on predicting MLB pitchers' strikeout percentages (K%) for the 2024 season. Using historical pitching data, feature engineering, and statistical modeling, this project aims to derive insights into pitcher performance trends while emphasizing reproducibility and employing best practices in data science.

Problem Statement

The provided dataset in data/k.csv contains only eight columns:

  1. MLBAMID: player's MLB ID
  2. PlayerId: player's FanGraphs ID
  3. Name: player's name
  4. Team: player's team name (NOTE: " - - -" indicates the player played for multiple teams in a season)
  5. Age: player's age in the 2024 season
  6. Season: season year
  7. TBF: Total batters faced for the player-season
  8. K%: Strikeout percentage for the player-season

Objective: Predict each player’s K% for the 2024 season using historical K% and other derived features. The analysis excludes any data from Opening Day 2024 onward.
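One simple example of a feature derived from historical K% is the pitcher's K% from the previous season. The sketch below is illustrative only: the rows are made-up values rather than records from k.csv, K% is assumed to be stored as a fraction, and add_lagged_k is a hypothetical helper, not a function from the bullpen package.

```python
from collections import defaultdict

# Made-up illustrative rows; these are NOT real records from k.csv.
rows = [
    {"MLBAMID": 1, "Season": 2021, "TBF": 600, "K%": 0.250},
    {"MLBAMID": 1, "Season": 2022, "TBF": 650, "K%": 0.270},
    {"MLBAMID": 1, "Season": 2023, "TBF": 620, "K%": 0.260},
    {"MLBAMID": 2, "Season": 2022, "TBF": 300, "K%": 0.200},
    {"MLBAMID": 2, "Season": 2023, "TBF": 400, "K%": 0.220},
]


def add_lagged_k(rows):
    """Attach each player's previous-season K% (None if unavailable)."""
    # Index K% by player and season so the prior season is a dict lookup.
    by_player = defaultdict(dict)
    for r in rows:
        by_player[r["MLBAMID"]][r["Season"]] = r["K%"]

    out = []
    for r in rows:
        prev = by_player[r["MLBAMID"]].get(r["Season"] - 1)
        out.append({**r, "prev_K%": prev})
    return out


lagged = add_lagged_k(rows)
```

Rows without a prior season keep prev_K% as None, so a downstream step has to decide whether to drop or impute them.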


A linear regression model (LinearRegression) and two tree-based models (XGBRegressor and RandomForestRegressor) were developed using:

  • Provided data (k.csv): historical K% and TBF values
  • Supplemental data: Scraped statistics from Baseball Reference Pitcher Data, including advanced metrics like strike percentages and contact rates.

2024 Predictions

Model                   R2      MSE
LinearRegression        0.945   0.00018
XGBRegressor            0.935   0.00021
RandomForestRegressor   0.926   0.00024


The LinearRegression model, which had the highest R2 and the lowest MSE of the three, was chosen as the final model.
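For reference, the two metrics reported above can be computed as in the following sketch. It uses the standard definitions of mean squared error and the coefficient of determination; the sample values are made up for illustration and are not the project's predictions.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up example values (K% expressed as fractions).
y_true = [0.20, 0.25, 0.30]
y_pred = [0.21, 0.25, 0.29]

example_mse = mse(y_true, y_pred)
example_r2 = r2(y_true, y_pred)
```

If K% is stored as a fraction between 0 and 1, an MSE of 0.00018 corresponds to a root mean squared error of about 0.013, or roughly 1.3 percentage points of K%.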

Key Features Used by the Model

  • I/Str: ball-in-play percentage (balls put into play, including home runs / total strikes)
  • Pit/PA: pitches per plate appearance
  • Con: contact percentage ((foul + inplay strikes) / (inplay + foul + swinging strikes))
  • 30%: 3-0 count seen percentage (3-0 counts / PA)
  • L/SO: strikeouts looking
  • F/Str: foul ball strike percentage (pitches fouled off / total strikes seen)
  • Str%: strike percentage (strikes / total pitches; intentional balls included)

Feature   Coefficient
I/Str     -0.0528688
Pit/PA    -0.0143233
Con       -0.0124488
30%       -0.00476233
L/SO       0.00440924
F/Str     -0.00169988
Str%      -0.000350969
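To read these coefficients, it can help to see how a linear model combines them. The sketch below assumes the coefficients apply to standardized (z-scored) features, which is plausible given the scaling step in the pipeline but is not confirmed here. The model's intercept is not listed, and the model may include other features, so the hypothetical k_shift helper only computes the change in predicted K% relative to a pitcher who is average in every listed feature.

```python
# Coefficients copied from the table above.
COEFS = {
    "I/Str": -0.0528688,
    "Pit/PA": -0.0143233,
    "Con": -0.0124488,
    "30%": -0.00476233,
    "L/SO": 0.00440924,
    "F/Str": -0.00169988,
    "Str%": -0.000350969,
}


def k_shift(z_scores):
    """Change in predicted K% relative to an all-average pitcher.

    z_scores maps feature name -> standardized value; features that are
    omitted are treated as average (z = 0). Assumes standardized inputs.
    """
    return sum(COEFS[name] * z for name, z in z_scores.items())


# A pitcher one standard deviation BELOW average in both I/Str and Con.
shift = k_shift({"I/Str": -1.0, "Con": -1.0})
```

Under these assumptions the shift is about +0.065, so allowing fewer balls in play and less contact raises the predicted K%, which matches the negative signs on I/Str and Con.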

Model Performance

The model predicts xK% (expected strikeout percentage) well: predicted and actual K% are strongly correlated (R2 = 0.945 for the linear model).

For an interactive version of the predicted-versus-actual plot, see assets/images/linear-pred-vs-target.html.


[Figures: additional plots based on the predictions.]

Development Process

All analysis and modeling were conducted in Jupyter notebooks (see the notebooks/ directory). The final code was refactored into a Python package, bullpen, for modularity and reproducibility (see src/bullpen/). Key development steps include:

  1. Data Scraping: Extracted supplemental data from Baseball Reference using bullpen.data_utils.
  2. Data Cleaning & Integration: Processed and merged supplemental data with k.csv.
  3. Feature Engineering: Created data processing pipelines for scaling and one-hot encoding features using bullpen.model_utils.
  4. Modeling: Trained and validated models using:
     • Classic cross-validation (using sklearn.model_selection.GridSearchCV)
     • Time-series cross-validation (a custom training loop that splits the data by time)
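The idea behind time-based splitting can be sketched as an expanding window over seasons: each fold trains on earlier seasons and validates on the next one, so no future data leaks into training. This is a minimal illustration, not the project's training loop; season_splits and its min_train parameter are hypothetical names.

```python
def season_splits(seasons, min_train=1):
    """Yield (train_seasons, validation_season) pairs in chronological order.

    Each fold trains on every season strictly before the validation season.
    """
    ordered = sorted(set(seasons))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


# Season labels as they might appear in a Season column (duplicates allowed).
splits = list(season_splits([2021, 2022, 2023, 2021, 2023]))
```

With seasons 2021 through 2023 this yields two folds: train on 2021 and validate on 2022, then train on 2021 and 2022 and validate on 2023.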


These are likely the files you want to look at to familiarize yourself with the analysis.


Sometimes the interactive plots don't render on GitHub. If that is the case, use nbviewer for an enhanced notebook viewing experience.

Scraping Supplementary Pitching Data from Baseball Reference

The provided dataset (k.csv) located in the data/ directory contains essential but limited pitching statistics, with the following eight columns:

  1. MLBAMID: Player's MLB ID
  2. PlayerId: Player's FanGraphs ID
  3. Name: Player's name
  4. Team: Player's team name (Note: " - - -" indicates the player played for multiple teams in a season)
  5. Age: Player's age during the 2024 season
  6. Season: Year of the season
  7. TBF: Total batters faced for the player-season
  8. K%: Strikeout percentage for the player-season

To make accurate predictions of a pitcher's strikeout percentage (K%) for the 2024 season, additional contextual data will likely be required. Fortunately, Baseball Reference offers a comprehensive dataset of MLB pitching statistics: Baseball Reference Pitching Data.

Scraping Utility

To facilitate data collection, a scraping utility has been implemented:

  • bullpen.data_utils.Scraper(): A core scraping tool for Baseball Reference data.
  • bullpen.data_utils.batch_scrape(): A convenience function to scrape data across multiple seasons.

Since the dataset in k.csv covers the seasons from 2021 to 2024, we will limit our scraping to this same range.
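The multi-season pattern can be sketched as a loop that calls a per-season scraper and tags each row with its season. The signatures of bullpen.data_utils.Scraper and batch_scrape are not shown in this document, so the code below does not use them: batch_scrape_sketch and fake_scrape are hypothetical stand-ins, and the row values are made up.

```python
def batch_scrape_sketch(seasons, scrape_season):
    """Collect rows across seasons, tagging each row with its season.

    scrape_season is any callable that takes a season year and returns a
    list of dict rows (a hypothetical interface, for illustration only).
    """
    rows = []
    for season in seasons:
        for row in scrape_season(season):
            rows.append({**row, "Season": season})
    return rows


def fake_scrape(season):
    """Stand-in for a real scraper; returns one made-up row per season."""
    return [{"Name": "Example Pitcher", "Pit/PA": 3.9}]


# 2021 through 2024, matching the season range covered by k.csv.
table = batch_scrape_sketch(range(2021, 2025), fake_scrape)
```

Passing the scraper in as a callable keeps the loop testable without network access.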

Supplemental Data Attributes

The Baseball Reference data contains the following additional attributes, which provide deeper insights into a pitcher's performance:

  1. Rk: Arbitrary rank based on sorting
  2. Name: Player's name
  3. Age: Age as of June 30th of the season year
  4. Tm: Abbreviated team name
  5. IP: Innings pitched
  6. PA: Number of plate appearances (including inning-ending baserunning outs)
  7. Pit: Total pitches in plate appearances
  8. Pit/PA: Pitches per plate appearance
  9. Str: Total strikes (including both in-zone and out-of-zone swings)
  10. Str%: Strike percentage (Str / Pit)
  11. L/Str: Looking strike percentage (Looking strikes / Str)
  12. S/Str: Swinging strike percentage (Swinging strikes / Str)
  13. F/Str: Foul strike percentage (Fouls / Str)
  14. I/Str: Balls in play percentage (Balls in play / Str)
  15. AS/Str: Percentage of strikes swung at ((In-play + Fouls + Swings) / Str)
  16. I/Bll: Intentional ball percentage (Intentional balls / Total balls)
  17. AS/Pit: Swing percentage (Swings / (Pit - Intentional balls))
  18. Con: Contact percentage ((Fouls + In-play) / Swings)
  19. 1st%: First pitch strike percentage (First-pitch strikes / PA)
  20. 30%: Percentage of 3-0 counts seen (3-0 counts / PA)
  21. 30c: Total 3-0 counts
  22. 30s: Strikes in 3-0 counts
  23. 02%: Percentage of 0-2 counts seen (0-2 counts / PA)
  24. 02c: Total 0-2 counts
  25. 02s: Strikes in 0-2 counts
  26. 02h: Hits allowed in 0-2 counts
  27. L/SO: Strikeouts looking
  28. S/SO: Strikeouts swinging
  29. L/SO%: Looking strikeout percentage (Looking SO / Total SO)
  30. 3pK: Three-pitch strikeouts
  31. 4pW: Four-pitch walks
  32. PAu: Plate appearances with unknown outcomes
  33. Pitu: Pitches with unknown ball-strike results
  34. Stru: Strikes with unknown details
  35. Season: Year of the season
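Several of the rate attributes above follow directly from raw pitch counts. The sketch below applies the listed formulas to made-up counts. It assumes that total strikes are the sum of looking strikes, swinging strikes, fouls, and balls in play, and that swings are the sum of swinging strikes, fouls, and balls in play; pitch_rates is a hypothetical helper, not part of bullpen.

```python
def pitch_rates(pa, pit, looking, swinging, fouls, in_play):
    """Compute selected rate stats from raw counts (assumed definitions)."""
    strikes = looking + swinging + fouls + in_play  # assumed makeup of Str
    swings = swinging + fouls + in_play             # assumed makeup of swings
    return {
        "Pit/PA": pit / pa,
        "Str%": strikes / pit,
        "L/Str": looking / strikes,
        "S/Str": swinging / strikes,
        "F/Str": fouls / strikes,
        "I/Str": in_play / strikes,
        "AS/Str": swings / strikes,
        "Con": (fouls + in_play) / swings,
    }


# Made-up counts for illustration; not a real pitcher's season.
rates = pitch_rates(pa=250, pit=1000, looking=170, swinging=110,
                    fouls=180, in_play=180)
```

Under these assumptions L/Str, S/Str, F/Str, and I/Str sum to 1, which is a convenient sanity check on scraped rows.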

Data Partitioning Strategy

Inspired by scikit-learn:

Project Layout

The full layout of the project is shown below -- notably:

  • data/: Location of any provided or collected dataset
    • k.csv: original provided data
    • player_ids.json: A collection of Name to ID mappings (used by bullpen.data_utils.PlayerLookup).
    • supplemental-stats.csv: Scraped data from Baseball Reference Pitcher Data
    • train.csv and test.csv: saved model training and test data after merging k.csv and supplemental-stats.csv together (see 02-data-partitioning.ipynb).
  • models/: Model registry that contains trained model files.
  • notebooks/: Development notebooks that contain data preprocessing, feature engineering, and modeling ideation and implementation.
    • For source code that implements the ideas in notebooks, see src/bullpen/mle_project files
    • The notebooks/html/ directory holds HTML versions of the Jupyter Notebooks.
  • src/: Source code
  • tests/: Unit test suite
$ tree
├── README.html
├── articles
│   └── The Definitive Pitcher Expected K% Formula _ RotoGraphs Fantasy Baseball.pdf
├── assets
│   └── images
├── data
│   ├── k.csv
│   ├── player_ids.json
│   ├── supplemental-stats.csv
│   ├── test.csv
│   └── train.csv
├── models
│   ├── linear.joblib
│   ├── randomforest.joblib
│   └── xgboost.joblib
├── notebooks
│   ├── 00-data-scrape-example.ipynb
│   ├── 01a-data-processing-fixing-names.ipynb
│   ├── 01b-data-processing-merging.ipynb
│   ├── 02-data-partitioning.ipynb
│   ├── 03-feature-engineering.ipynb
│   ├── 04a-modeling-classic-cv.ipynb
│   ├── 04b-modeling-time-series-cv.ipynb
│   ├── 05-final-predictions.ipynb
│   └── html
│       ├── 00-data-scrape-example.html
│       ├── 01a-data-processing-fixing-names.html
│       ├── 01b-data-processing-merging.html
│       ├── 02-data-partitioning.html
│       ├── 03-feature-engineering.html
│       ├── 04a-modeling-classic-cv.html
│       ├── 04b-modeling-time-series-cv.html
│       └── 05-final-predictions.html
├── pyproject.toml
├── src
│   └── bullpen
│       ├──
│       ├──
│       ├──
│       ├──
│       └──
└── tests


  • Create a virtual environment (with Python 3.11+)
  • Activate the virtual environment
  • Clone the repo: git clone
  • Navigate to the project directory: cd mlb-pitcher-xK
  • Install local version via pip install -e .
(mlb-pitcher)$ cd mlb-pitcher-xK

(mlb-pitcher)$ pip install -e .

  • Optional step: run test suite
