Methodology

How ProPlotFits Models Work

Overview

ProPlotFits uses machine learning models trained on 2+ seasons of NCAA Division I men’s basketball data to predict game outcomes and identify betting value.

Our predictions are generated fresh daily by combining:

  1. Rolling performance metrics from recent games
  2. Advanced basketball analytics (Four Factors, pace-adjusted efficiency)
  3. Statistical modeling (Gradient Boosting, Random Forest, Logistic Regression)
  4. Vegas odds comparison from multiple sportsbooks

Data Pipeline

1. Data Collection

Game Statistics (via hoopR)

  • Box scores for all 350+ D-I teams
  • Updated within 1 hour of game completion
  • Covers 2023-24, 2024-25, and current 2025-26 seasons

Vegas Odds (via The Odds API)

  • Current spreads, totals, and moneylines
  • DraftKings, FanDuel, BetMGM coverage
  • Updated every 6 hours on game days

KenPom Metrics (coming in v2.0)

  • Adjusted offensive/defensive efficiency
  • Strength of schedule
  • Tempo ratings

2. Feature Engineering

For each game, we calculate:

Recent Performance (Rolling Windows)

  • Last 3 games (L3): Points, margins, win %
  • Last 5 games (L5): Full four factors, efficiency metrics
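
As an illustration of the rolling windows above, here is a minimal sketch assuming each team's games are stored oldest first. The function name rolling_mean_before and the sample margins are hypothetical, not part of the ProPlotFits codebase. The detail that matters is that each game's feature is built only from games played before it.

```python
def rolling_mean_before(values, window):
    """For each game i, average the previous `window` values (game i excluded).

    Returns None where no prior games exist, so early-season rows can be
    flagged as unreliable instead of being silently filled.
    """
    out = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]  # strictly before game i
        out.append(sum(prior) / len(prior) if prior else None)
    return out


# Hypothetical scoring margins for one team, oldest game first
margins = [5, -3, 12, 7, -1, 4]
l3_margin = rolling_mean_before(margins, 3)  # L3 feature
l5_margin = rolling_mean_before(margins, 5)  # L5 feature
```

Until a team has played a full window of games, the average is taken over fewer games, which is one reason early-season features are noisier.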

Four Factors Analysis

  • Effective Field Goal % (eFG%)
  • Turnover Rate (TOV%)
  • Offensive Rebound % (ORB%)
  • Free Throw Rate (FTR)
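
For reference, the Four Factors can be sketched from box-score totals using the commonly cited formulas. This is an assumption about the definitions, not a statement of the exact formulas ProPlotFits uses: the 0.44 free-throw weight is the conventional value (some college-focused sources use 0.475), and Free Throw Rate is sometimes defined as FTM/FGA rather than FTA/FGA. The function and argument names are hypothetical.

```python
def four_factors(fgm, fga, fg3m, fta, tov, orb, opp_drb, ft_weight=0.44):
    """Compute the Four Factors from one team's box-score totals.

    ft_weight is an assumed constant for how many possessions a free-throw
    attempt consumes; it is not taken from the source document.
    """
    return {
        "efg": (fgm + 0.5 * fg3m) / fga,               # Effective Field Goal %
        "tov_pct": tov / (fga + ft_weight * fta + tov),  # Turnover Rate
        "orb_pct": orb / (orb + opp_drb),              # Offensive Rebound %
        "ftr": fta / fga,                              # Free Throw Rate
    }


# Hypothetical single-game box score
ff = four_factors(fgm=25, fga=60, fg3m=8, fta=20, tov=12, orb=10, opp_drb=25)
```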

Efficiency Metrics

  • Offensive Efficiency (points per 100 possessions)
  • Defensive Efficiency (opponent points per 100 possessions)
  • Net Efficiency (Offensive - Defensive)
  • Pace (possessions per game)
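
The efficiency metrics above depend on a possession count, which box scores do not record directly. The sketch below uses a widely used estimate (FGA - ORB + TOV + weight × FTA); the estimate itself, the 0.44 weight, and the function names are assumptions for illustration rather than the confirmed ProPlotFits calculation.

```python
def estimate_possessions(fga, orb, tov, fta, ft_weight=0.44):
    """Estimate possessions from box-score totals (assumed formula)."""
    return fga - orb + tov + ft_weight * fta


def per_100(points, possessions):
    """Points scaled to 100 possessions."""
    return 100.0 * points / possessions


# Hypothetical game: team scores 75, opponent scores 68
poss = estimate_possessions(fga=60, orb=10, tov=12, fta=20)
off_eff = per_100(75, poss)   # Offensive Efficiency
def_eff = per_100(68, poss)   # Defensive Efficiency (opponent points)
net_eff = off_eff - def_eff   # Net Efficiency (Offensive - Defensive)
```

Pace would then be the average of the possession estimate across a team's games.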

Matchup Features

  • Offensive vs Defensive matchup ratings
  • Pace differentials
  • Home court advantage adjustment (+3.5 points)
  • Conference game indicator

Prediction Models

Model 1: Spread Prediction (Gradient Boosting)

Target: Home team margin of victory

Algorithm: Gradient Boosted Trees (GBM)

Key Features:

  • Net efficiency differential (most important)
  • Offensive/defensive matchup ratings
  • Recent form (L3 margins)
  • Home court advantage
  • Four factors differentials

Performance Metrics:

  • RMSE: ~11.2 points (backtest)
  • MAE: ~8.7 points
  • R²: 0.31

What This Means:

On average, our spread predictions miss the actual margin by about 9 points (the MAE above). That sounds wide, but the goal is not to nail every margin; it is to identify games where our number differs from the Vegas line by 3+ points.
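
To make the three metrics concrete, here is a minimal sketch of how RMSE, MAE, and R² are computed from predicted and actual margins. The sample numbers are made up for illustration and are not backtest data.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE, and R² for paired actual/predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)               # squared error of the model
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # squared error of "always predict the mean"
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical home-team margins
actual_margins = [10, -4, 7, -12]
predicted_margins = [6, 1, 9, -5]
metrics = regression_metrics(actual_margins, predicted_margins)
```

RMSE is never smaller than MAE because squaring weights large misses more heavily, which is why the two figures above differ.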

Model 2: Total Prediction (Random Forest)

Target: Combined score (both teams)

Algorithm: Random Forest Regression

Key Features:

  • Combined offensive efficiency (both teams)
  • Average pace (most important)
  • Pace differential
  • Recent scoring trends
  • Conference game indicator

Performance Metrics:

  • RMSE: ~13.8 points (backtest)
  • MAE: ~10.9 points
  • R²: 0.24

What This Means:

Totals are harder to predict than spreads (higher RMSE, lower R²), so we focus on games with edges of 5+ points, where the over/under value is clearest.

Model 3: Win Probability (Logistic Regression)

Target: Home team win (binary)

Algorithm: Logistic Regression

Key Features:

  • Predicted margin (from Model 1)
  • Win percentage differential
  • Recent form trends
  • Home court advantage

Performance Metrics:

  • Accuracy: 72% (backtest)
  • AUC-ROC: 0.78
  • Log Loss: 0.54

What This Means:

Our win probability model correctly predicts winners 72% of the time on historical data. High-confidence picks (>75% win prob) are our strongest plays.
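
For readers unfamiliar with the classification metrics, the sketch below shows how accuracy and log loss are computed from predicted home-win probabilities. The probabilities and outcomes are invented for illustration; the 0.5 decision threshold is an assumption.

```python
import math


def accuracy(y_true, probs, threshold=0.5):
    """Share of games where the predicted side (prob >= threshold) matched the result."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, probs))
    return hits / len(y_true)


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood; lower is better."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Hypothetical outcomes (1 = home win) and predicted home-win probabilities
home_won = [1, 0, 1, 1]
win_probs = [0.80, 0.35, 0.55, 0.40]
acc = accuracy(home_won, win_probs)
ll = log_loss(home_won, win_probs)
```

As a reference point, a model that always outputs 50% has a log loss of ln 2 ≈ 0.693, so any value below that indicates the probabilities carry information.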


Value Detection

What is “Edge”?

Edge = | Our Prediction - Vegas Line |

Spread Edge: Difference in predicted margin

  • Example: We predict Duke -12, Vegas has Duke -8 → 4-point edge
  • We look for edges of 3+ points

Total Edge: Difference in predicted total points

  • Example: We predict 145 total, Vegas has O/U 155 → 10-point edge
  • We look for edges of 5+ points
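
The edge definition and thresholds above can be sketched in a few lines. The function and constant names are hypothetical; the numbers reuse the Duke and over/under examples.

```python
SPREAD_THRESHOLD = 3.0  # minimum spread edge, in points
TOTAL_THRESHOLD = 5.0   # minimum total edge, in points


def edge(our_prediction, vegas_line):
    """Edge = | Our Prediction - Vegas Line |"""
    return abs(our_prediction - vegas_line)


def is_value_play(our_prediction, vegas_line, threshold):
    """True when the edge meets or exceeds the threshold."""
    return edge(our_prediction, vegas_line) >= threshold


spread_edge = edge(-12, -8)    # We predict Duke -12, Vegas has Duke -8
total_edge = edge(145, 155)    # We predict 145, Vegas has O/U 155
```

Here spread_edge is 4 and total_edge is 10, so both examples clear their thresholds.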

Why These Thresholds?

Based on backtesting:

  • 3-point spread edges hit at 55-58% (above the ~52.4% break-even rate at -110 odds)
  • 5-point total edges hit at 53-56% (above the ~52.4% break-even rate at -110 odds)
  • Higher edges = higher win rates, but fewer opportunities
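
The profitability claim follows from the break-even arithmetic for American odds. The sketch below shows that calculation; the function names are hypothetical, and it ignores pushes and line movement.

```python
def break_even_rate(american_odds):
    """Win rate needed to break even at the given American odds."""
    if american_odds < 0:
        risk, win = -american_odds, 100
    else:
        risk, win = 100, american_odds
    return risk / (risk + win)


def expected_profit_per_unit(win_rate, american_odds):
    """Expected profit per 1 unit risked, ignoring pushes."""
    payout = 100 / -american_odds if american_odds < 0 else american_odds / 100
    return win_rate * payout - (1 - win_rate)


be = break_even_rate(-110)                      # 110 / 210
ev_55 = expected_profit_per_unit(0.55, -110)    # low end of the spread range
ev_53 = expected_profit_per_unit(0.53, -110)    # low end of the total range
```

At -110 the break-even rate is 110/210 ≈ 52.4%, so a 53% hit rate is only marginally profitable while 55% returns about 5% per unit risked.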

Multi-Sportsbook Averaging

We average odds across DraftKings, FanDuel, and BetMGM to:

  1. Get a consensus market view
  2. Reduce impact of outlier lines
  3. Ensure picks are available at multiple books

You should always shop for the best line at bet time.
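
A minimal sketch of the averaging step, plus the line-shopping idea, is shown below. It assumes spreads are expressed from the home team's perspective (negative means the home team is favored); the sample lines and function names are hypothetical.

```python
from statistics import mean


def consensus_line(lines_by_book):
    """Average the same line across sportsbooks."""
    return mean(lines_by_book.values())


def best_book_for_home(lines_by_book):
    """Book with the most favorable home spread (the largest value)."""
    return max(lines_by_book, key=lines_by_book.get)


# Hypothetical home-team spreads for one game
home_spreads = {"DraftKings": -7.5, "FanDuel": -8.0, "BetMGM": -8.5}
consensus = consensus_line(home_spreads)
best_book = best_book_for_home(home_spreads)
```

In this example the consensus is -8.0, but a bettor backing the home team would prefer the -7.5 at DraftKings, which is why the consensus is used for finding edges and not for placing bets.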


Model Training & Validation

Training Data

  • 2023-24 season: 5,000+ games
  • 2024-25 season: 5,000+ games
  • Current season: Ongoing (retraining weekly)

Validation Approach

  • Time-series split: Train on past, test on future
  • No data leakage: Only use information available pre-game
  • Rolling retraining: Models updated weekly with new games
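
One way to implement a time-series split is walk-forward validation, sketched below. This is an assumed scheme for illustration; the fold count, minimum training size, and function name are hypothetical and not taken from the ProPlotFits pipeline.

```python
def walk_forward_splits(games, n_folds, min_train):
    """Yield (train, test) lists where every test game is later than all training games."""
    games = sorted(games, key=lambda g: g["date"])  # ISO dates sort chronologically
    fold_size = (len(games) - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough games for the requested folds")
    for k in range(n_folds):
        cut = min_train + k * fold_size
        yield games[:cut], games[cut:cut + fold_size]


# Hypothetical schedule: one game per day for ten days
games = [{"date": f"2025-01-{day:02d}"} for day in range(1, 11)]
splits = list(walk_forward_splits(games, n_folds=3, min_train=4))
```

Each fold trains on a growing history and tests on the games that immediately follow, which mirrors weekly retraining more closely than a random shuffle would.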

What We’re NOT Doing

❌ Overfitting to past results
❌ Using future information in predictions
❌ Cherry-picking successful picks for marketing
❌ Guaranteeing profits


Limitations & Transparency

Model Weaknesses

  1. Injuries not accounted for - We use team stats, not player-level data
  2. Coaching changes - Not explicitly modeled
  3. Motivation factors - Can’t quantify “must-win” games
  4. Small sample sizes early - Teams with <5 games have less reliable stats
  5. Non-conference randomness - Lower-quality matchups harder to predict

Why We Share This

We believe in transparent, data-driven betting analysis. No model is perfect, and we won’t pretend ours is. Our edge comes from:

  • Systematic, repeatable process
  • Large sample size predictions
  • Focus on high-edge opportunities
  • Continuous model improvement

Version History

v1.0 (January 2026) - Current

  • Gradient Boosting (spreads)
  • Random Forest (totals)
  • Logistic Regression (win probability)
  • hoopR data pipeline
  • Multi-sportsbook odds integration

v2.0 (Planned - February 2026)

  • KenPom adjusted efficiency integration
  • XGBoost model testing
  • Injury data scraping
  • Historical performance tracking dashboard

Questions?

Want to learn more? Check out:

Found an issue? Open a GitHub issue - we’re actively improving!