A reproducible MLB game and season simulation project using historical data. Built as a senior project exploring how statistical modeling can predict player performance and team outcomes.
TaylorBall implements three progressive stages of baseball simulation:
- Stage 1: Simple Probabilistic Model — Basic win probability using league-average outcomes
- Stage 2: Team-Level Adjustments — Incorporates team offensive and pitching metrics
- Stage 3: Player-Level Simulation — Models individual batter-pitcher matchups using historical stats
Each stage builds on the previous one, with validation comparing simulated seasons against actual historical records.
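As a rough illustration of how the first two stages and the validation step could fit together, here is a minimal sketch. It is not the notebook's code: the function names are hypothetical, the 162-game season length is assumed, and the log5 formula is a well-known matchup estimate swapped in as a stand-in for whatever team-level adjustment the project actually uses.

```python
import random


def log5(p_a: float, p_b: float) -> float:
    """Estimate the chance team A beats team B from each team's win rate.

    This is the standard log5 formula, used here as an assumed stand-in
    for a Stage 2 team-level adjustment.
    """
    num = p_a * (1 - p_b)
    return num / (num + p_b * (1 - p_a))


def simulate_season(win_prob: float, games: int = 162, rng=None) -> int:
    """Count wins over a season of independent games with a fixed win probability."""
    rng = rng or random.Random()
    return sum(rng.random() < win_prob for _ in range(games))


def mean_absolute_error(simulated: dict, actual: dict) -> float:
    """Average absolute gap between simulated and actual win totals per team."""
    return sum(abs(simulated[team] - actual[team]) for team in actual) / len(actual)


# Stage 1 style: every game is a league-average coin flip.
rng = random.Random(42)  # seeded so the run is reproducible
stage1_wins = simulate_season(0.5, rng=rng)

# Stage 2 style: a .600 team facing a .450 opponent (illustrative rates only).
matchup_prob = log5(0.600, 0.450)
stage2_wins = simulate_season(matchup_prob, rng=rng)
```

Seeding the generator keeps repeated runs identical, which fits the project's reproducibility goal. `mean_absolute_error` takes two dictionaries keyed by team, one of simulated win totals and one of actual win totals.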
With pip:

    pip install -r requirements.txt
    jupyter notebook TaylorBall.ipynb

With conda:

    conda env create -f environment.yml
    conda activate taylorball
    jupyter notebook TaylorBall.ipynb

This project uses publicly available baseball data:
- Lahman Database — Historical statistics from 1871-present
- Retrosheet — Play-by-play game logs
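A minimal sketch of how team-season rows might be read from the data follows. It assumes a CSV with the `yearID`, `teamID`, `W`, `L`, `R`, and `RA` columns found in the Lahman `Teams` table. The function name is hypothetical and the sample rows are made up for illustration; they are not real statistics.

```python
import csv
import io


def load_team_seasons(csv_file):
    """Read team-season rows and derive win percentage and run differential."""
    rows = []
    for rec in csv.DictReader(csv_file):
        wins, losses = int(rec["W"]), int(rec["L"])
        runs, runs_allowed = int(rec["R"]), int(rec["RA"])
        rows.append({
            "year": int(rec["yearID"]),
            "team": rec["teamID"],
            "win_pct": wins / (wins + losses),
            "run_diff": runs - runs_allowed,
        })
    return rows


# Made-up sample rows for illustration only, not real statistics.
sample = io.StringIO(
    "yearID,teamID,W,L,R,RA\n"
    "2000,AAA,90,72,800,700\n"
    "2000,BBB,72,90,650,760\n"
)
teams = load_team_seasons(sample)
```

With a downloaded copy of the data, the same function could be given an open file handle, for example `open("Teams.csv", newline="")`, where the path is an assumption about where the file was saved.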
(Add 2-3 bullet points summarizing your most interesting results, e.g.:)
- The Stage 3 model predicted season win totals within X games for Y% of teams
- Run differential proved to be the strongest single predictor of...
- Monte Carlo simulations showed that playoff outcomes have higher variance than...
| | |
|---|---|
| Type | Senior Capstone Project |
| Focus | Baseball Analytics / Sports Data Science |
| Tools | Python, Jupyter, pandas, NumPy, matplotlib |
This project demonstrates end-to-end analytical work: scoping a research question, acquiring and cleaning real-world data, building progressively complex models, and validating results against ground truth.
- Current model uses season-level stats; pitch-by-pitch data could improve accuracy
- Simulation assumes independent at-bats (doesn't model hot/cold streaks)
- Could extend to project prospect performance or evaluate trades
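The independence assumption noted above can be illustrated with a short sketch. Each plate appearance is drawn from the same fixed outcome probabilities regardless of what happened before, which is exactly why hot and cold streaks are not captured. The function names and the outcome probabilities below are hypothetical, chosen only for illustration, and are not taken from the project or from real player data.

```python
import random


def simulate_plate_appearance(probs: dict, rng: random.Random) -> str:
    """Draw one outcome; the draw ignores all previous plate appearances."""
    outcomes = list(probs)
    weights = [probs[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def batters_faced_in_half_inning(probs: dict, rng: random.Random) -> int:
    """Count plate appearances until three outs are recorded."""
    if probs.get("out", 0) <= 0:
        raise ValueError("the 'out' probability must be positive")
    outs = 0
    faced = 0
    while outs < 3:
        faced += 1
        if simulate_plate_appearance(probs, rng) == "out":
            outs += 1
    return faced


# Illustrative probabilities only (they sum to 1.0), not real statistics.
example_probs = {
    "out": 0.68,
    "walk": 0.08,
    "single": 0.15,
    "double": 0.05,
    "triple": 0.005,
    "home_run": 0.035,
}
rng = random.Random(7)  # seeded for reproducibility
faced = batters_faced_in_half_inning(example_probs, rng)
```

Modeling streaks would mean letting `probs` depend on recent outcomes instead of staying constant, which this sketch deliberately does not do.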
MIT License — see LICENSE for details.
Author: Josh Taylor
Contact: joshknowsbaseball