Methodology
How ProPlotFits Models Work
Overview
ProPlotFits uses machine learning models trained on 2+ seasons of NCAA Division I men’s basketball data to predict game outcomes and identify betting value.
Our predictions are generated fresh daily by combining:
- Rolling performance metrics from recent games
- Advanced basketball analytics (Four Factors, pace-adjusted efficiency)
- Statistical modeling (Gradient Boosting, Random Forest, Logistic Regression)
- Vegas odds comparison from multiple sportsbooks
Data Pipeline
1. Data Collection
Game Statistics (via hoopR)
- Box scores for all 350+ D-I teams
- Updated within 1 hour of game completion
- Covers 2023-24, 2024-25, and current 2025-26 seasons
Vegas Odds (via The Odds API)
- Real-time spreads, totals, and moneylines
- DraftKings, FanDuel, BetMGM coverage
- Updated every 6 hours on game days
KenPom Metrics (coming in v2.0)
- Adjusted offensive/defensive efficiency
- Strength of schedule
- Tempo ratings
2. Feature Engineering
For each game, we calculate:
Recent Performance (Rolling Windows)
- Last 3 games (L3): Points, margins, win %
- Last 5 games (L5): Full four factors, efficiency metrics
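As a rough illustration of the rolling windows above, the sketch below summarizes only the games played before the one being predicted, so the game itself never feeds its own features. The function name, the feature keys, and the sample margins are hypothetical and are not taken from the ProPlotFits codebase.

```python
from statistics import mean

def rolling_pregame_features(margins, window):
    """Summarize the `window` games played BEFORE each game.

    `margins` is one team's scoring margins, oldest game first.
    Returns one entry per game: a dict of features, or None when
    fewer than `window` prior games exist.
    """
    features = []
    for i in range(len(margins)):
        prior = margins[max(0, i - window):i]  # excludes game i itself
        if len(prior) < window:
            features.append(None)
            continue
        features.append({
            "avg_margin": mean(prior),
            "win_pct": sum(m > 0 for m in prior) / window,
        })
    return features

# Hypothetical margins for one team (positive = win).
margins = [5, -3, 10, 2, -8, 12]
l3 = rolling_pregame_features(margins, window=3)
print(l3[3])  # features for game 4, built from games 1-3 only
```

Returning None for teams without a full window keeps early-season games from silently using a shorter sample.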
Four Factors Analysis
- Effective Field Goal % (eFG%)
- Turnover Rate (TOV%)
- Offensive Rebound % (ORB%)
- Free Throw Rate (FTR)
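For reference, the Four Factors can be computed from a box score with the commonly used formulas sketched below. The exact variants ProPlotFits uses are not stated here, so treat the 0.44 free-throw weight and the FTA/FGA definition of free throw rate as assumptions; the input numbers are made up.

```python
def four_factors(fgm, fga, fg3m, fta, tov, orb, opp_drb):
    """Four Factors for one team in one game, from box-score counts.

    Assumed (common) definitions:
      eFG% = (FGM + 0.5 * 3PM) / FGA
      TOV% = TOV / (FGA + 0.44 * FTA + TOV)
      ORB% = ORB / (ORB + opponent DRB)
      FTR  = FTA / FGA
    """
    return {
        "efg_pct": (fgm + 0.5 * fg3m) / fga,
        "tov_pct": tov / (fga + 0.44 * fta + tov),
        "orb_pct": orb / (orb + opp_drb),
        "ftr": fta / fga,
    }

# Hypothetical box score.
box = four_factors(fgm=28, fga=60, fg3m=8, fta=20, tov=12, orb=10, opp_drb=25)
print({name: round(value, 3) for name, value in box.items()})
```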
Efficiency Metrics
- Offensive Efficiency (points per 100 possessions)
- Defensive Efficiency (opponent points per 100 possessions)
- Net Efficiency (Offensive - Defensive)
- Pace (possessions per game)
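The efficiency metrics above depend on a possession count, which box scores do not record directly. Below is a minimal sketch using a common estimate (field goal attempts minus offensive rebounds plus turnovers plus a weighted share of free throw attempts). The 0.475 weight is an assumption often used for college basketball, not a value confirmed by this page, and the inputs are hypothetical.

```python
def estimate_possessions(fga, orb, tov, fta, ft_weight=0.475):
    """Estimate possessions from box-score counts (assumed formula)."""
    return fga - orb + tov + ft_weight * fta

def efficiency_metrics(points, opp_points, possessions):
    """Points scored and allowed per 100 possessions, plus the difference."""
    off_eff = 100 * points / possessions
    def_eff = 100 * opp_points / possessions
    return {"off_eff": off_eff, "def_eff": def_eff, "net_eff": off_eff - def_eff}

# Hypothetical game: 75-68 win.
poss = estimate_possessions(fga=60, orb=10, tov=12, fta=20)
eff = efficiency_metrics(points=75, opp_points=68, possessions=poss)
print(round(poss, 1), {k: round(v, 1) for k, v in eff.items()})
```

For a single game, the possession estimate itself is the pace figure; averaging it across games gives possessions per game.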
Matchup Features
- Offensive vs Defensive matchup ratings
- Pace differentials
- Home court advantage adjustment (+3.5 points)
- Conference game indicator
Prediction Models
Model 1: Spread Prediction (Gradient Boosting)
Target: Home team margin of victory
Algorithm: Gradient Boosted Trees (GBM)
Key Features:
- Net efficiency differential (most important)
- Offensive/defensive matchup ratings
- Recent form (L3 margins)
- Home court advantage
- Four factors differentials
Performance Metrics:
- RMSE: ~11.2 points (backtest)
- MAE: ~8.7 points
- R²: 0.31
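For readers unfamiliar with these metrics, the sketch below shows how RMSE, MAE, and R² are computed from predicted and actual margins. The four sample games are invented for illustration and have nothing to do with the backtest figures above.

```python
from math import sqrt

def regression_metrics(actual, predicted):
    """RMSE, MAE, and R² for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    return {
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

# Hypothetical home margins and model predictions.
actual = [7, -3, 12, -10]
predicted = [4, 1, 6, -6]
scores = regression_metrics(actual, predicted)
print({k: round(v, 3) for k, v in scores.items()})
```

RMSE penalizes large misses more heavily than MAE, which is why it is the larger of the two numbers.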
What This Means:
Our spread predictions miss the actual margin by about 9 points on average (the MAE above). That sounds wide, but the goal is not to nail every margin; it is to identify games where our number differs from the Vegas line by 3+ points.
Model 2: Total Prediction (Random Forest)
Target: Combined score (both teams)
Algorithm: Random Forest Regression
Key Features:
- Combined offensive efficiency (both teams)
- Average pace (most important)
- Pace differential
- Recent scoring trends
- Conference game indicator
Performance Metrics:
- RMSE: ~13.8 points (backtest)
- MAE: ~10.9 points
- R²: 0.24
What This Means:
Totals are harder for us to predict than spreads: the model explains less of the variance (R² of 0.24 versus 0.31 for spreads). We therefore focus on games with edges of 5+ points, where the over/under value is clearest.
Model 3: Win Probability (Logistic Regression)
Target: Home team win (binary)
Algorithm: Logistic Regression
Key Features:
- Predicted margin (from Model 1)
- Win percentage differential
- Recent form trends
- Home court advantage
Performance Metrics:
- Accuracy: 72% (backtest)
- AUC-ROC: 0.78
- Log Loss: 0.54
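The three metrics above can be computed from predicted win probabilities and actual results as sketched below. The 0.5 decision threshold for accuracy is an assumption, the AUC uses the pairwise-ranking definition, and the five sample games are invented.

```python
from math import log

def classification_metrics(labels, probs, eps=1e-15):
    """Accuracy (0.5 threshold), log loss, and AUC-ROC for binary outcomes."""
    n = len(labels)
    accuracy = sum((p >= 0.5) == bool(y) for y, p in zip(labels, probs)) / n

    # Clip probabilities so log(0) can never occur.
    clipped = [min(max(p, eps), 1 - eps) for p in probs]
    log_loss = -sum(
        y * log(p) + (1 - y) * log(1 - p) for y, p in zip(labels, clipped)
    ) / n

    # AUC: share of (win, loss) pairs where the win got the higher
    # probability, counting ties as half.
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    ranked = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0 for pp in pos for pn in neg
    )
    auc = ranked / (len(pos) * len(neg))
    return {"accuracy": accuracy, "log_loss": log_loss, "auc": auc}

# Hypothetical results: 1 = home win, with predicted home win probabilities.
labels = [1, 0, 1, 1, 0]
probs = [0.8, 0.3, 0.6, 0.4, 0.55]
clf = classification_metrics(labels, probs)
print({k: round(v, 3) for k, v in clf.items()})
```

Accuracy only checks which side of 50% a prediction falls on, while log loss also rewards well-calibrated probabilities.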
What This Means:
Our win probability model correctly predicts winners 72% of the time on historical data. High-confidence picks (>75% win prob) are our strongest plays.
Value Detection
What is “Edge”?
Edge = | Our Prediction - Vegas Line |
Spread Edge: Difference in predicted margin
- Example: We predict Duke -12, Vegas has Duke -8 → 4 point edge
- We look for edges of 3+ points
Total Edge: Difference in predicted total points
- Example: We predict 145 total, Vegas has O/U 155 → 10 point edge
- We look for edges of 5+ points
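The edge definition and thresholds above translate into a few lines of code. This is a minimal sketch: the function and key names are hypothetical, and spreads are assumed to be expressed from the same team's perspective (here, Duke -12 is written as -12).

```python
SPREAD_EDGE_MIN = 3.0  # points, from the spread threshold above
TOTAL_EDGE_MIN = 5.0   # points, from the total threshold above

def edge(prediction, vegas_line):
    """Edge = |our prediction - Vegas line|."""
    return abs(prediction - vegas_line)

def flag_value(pred_spread, vegas_spread, pred_total, vegas_total):
    """Compute both edges and flag those that clear the thresholds."""
    spread_edge = edge(pred_spread, vegas_spread)
    total_edge = edge(pred_total, vegas_total)
    return {
        "spread_edge": spread_edge,
        "spread_play": spread_edge >= SPREAD_EDGE_MIN,
        "total_edge": total_edge,
        "total_play": total_edge >= TOTAL_EDGE_MIN,
        # Only meaningful when total_play is True.
        "total_side": "under" if pred_total < vegas_total else "over",
    }

# The two examples above: Duke -12 vs. -8, and 145 vs. an O/U of 155.
duke = flag_value(pred_spread=-12, vegas_spread=-8, pred_total=145, vegas_total=155)
print(duke)
```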
Why These Thresholds?
Based on backtesting:
- 3-point spread edges hit at 55-58% (profitable at -110 odds, where break-even is about 52.4%)
- 5-point total edges hit at 53-56% (also above the 52.4% break-even at -110)
- Higher edges = higher win rates, but fewer opportunities
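The profitability claim rests on simple arithmetic: at -110 you risk 110 to win 100, so you need to win 110 / 210 of your bets to break even. The sketch below checks that and the expected profit at the hit rates quoted above. The helper names are hypothetical, and the formulas as written apply to negative American odds only.

```python
def break_even_rate(american_odds):
    """Win rate needed to break even at negative American odds (e.g. -110)."""
    risk = abs(american_odds)
    return risk / (risk + 100)

def profit_per_unit_risked(win_rate, american_odds):
    """Expected profit per unit risked at negative American odds."""
    payout = 100 / abs(american_odds)  # profit on a win, per unit risked
    return win_rate * payout - (1 - win_rate)

print(round(break_even_rate(-110), 4))                # about 0.5238
print(round(profit_per_unit_risked(0.55, -110), 4))   # low end of the spread range
print(round(profit_per_unit_risked(0.53, -110), 4))   # low end of the total range
```

Even the low end of each range clears break-even, though the margin at 53% is thin enough that line shopping matters.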
Multi-Sportsbook Averaging
We average odds across DraftKings, FanDuel, and BetMGM to:
- Get a consensus market view
- Reduce impact of outlier lines
- Ensure picks are available at multiple books
You should always shop for the best line at bet time.
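A minimal sketch of the averaging and line-shopping ideas follows. The lines are invented, and the function names and the convention that each value is the home team's spread (negative = favored) are assumptions for illustration.

```python
from statistics import mean

def consensus_spread(home_spreads_by_book):
    """Average the home-team spread across sportsbooks."""
    return mean(home_spreads_by_book.values())

def best_book_for(side, home_spreads_by_book):
    """Book with the most favorable number for the chosen side.

    A home bettor wants the largest home spread (-7.5 beats -8.5);
    an away bettor wants the smallest, since the away team then
    gets the most points.
    """
    pick = max if side == "home" else min
    return pick(home_spreads_by_book, key=home_spreads_by_book.get)

# Hypothetical home-team spreads.
lines = {"DraftKings": -8.0, "FanDuel": -7.5, "BetMGM": -8.5}
print(consensus_spread(lines))       # average across the three books
print(best_book_for("home", lines))  # where to back the home team
print(best_book_for("away", lines))  # where to back the away team
```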
Model Training & Validation
Training Data
- 2023-24 season: 5,000+ games
- 2024-25 season: 5,000+ games
- Current season: Ongoing (retraining weekly)
Validation Approach
- Time-series split: Train on past, test on future
- No data leakage: Only use information available pre-game
- Rolling retraining: Models updated weekly with new games
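One way to picture this validation scheme is a walk-forward split: train on every game before a cutoff date, test on the following week, then move the cutoff forward. The sketch below is an assumption about how such a split could be built, not the project's actual code; the game records and dates are hypothetical.

```python
from datetime import date, timedelta

def walk_forward_splits(games, first_test_day, step_days=7):
    """Yield (train, test) pairs where training games always precede test games.

    `games` is a list of dicts with a "date" key. Each test block
    covers `step_days` days; the training set is everything earlier.
    """
    games = sorted(games, key=lambda g: g["date"])
    last_day = games[-1]["date"]
    start = first_test_day
    while start <= last_day:
        end = start + timedelta(days=step_days)
        train = [g for g in games if g["date"] < start]
        test = [g for g in games if start <= g["date"] < end]
        if train and test:
            yield train, test
        start = end

# Hypothetical schedule: one game per day, January 1-20, 2026.
games = [{"id": i, "date": date(2026, 1, 1) + timedelta(days=i)} for i in range(20)]
splits = list(walk_forward_splits(games, first_test_day=date(2026, 1, 8)))
for train, test in splits:
    print(len(train), "training games ->", len(test), "test games")
```

Because every training game is dated before every test game in a split, results from the test period cannot leak into the model that predicts it.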
What We’re NOT Doing
❌ Overfitting to past results
❌ Using future information in predictions
❌ Cherry-picking successful picks for marketing
❌ Guaranteeing profits
Limitations & Transparency
Model Weaknesses
- Injuries not accounted for: We use team stats, not player-level data
- Coaching changes: Not explicitly modeled
- Motivation factors: Can’t quantify “must-win” games
- Small sample sizes early: Teams with <5 games have less reliable stats
- Non-conference randomness: Lower-quality matchups are harder to predict
Version History
v1.0 (January 2026) - Current
- Gradient Boosting (spreads)
- Random Forest (totals)
- Logistic Regression (win probability)
- hoopR data pipeline
- Multi-sportsbook odds integration
v2.0 (Planned - February 2026)
- KenPom adjusted efficiency integration
- XGBoost model testing
- Injury data scraping
- Historical performance tracking dashboard
Questions?
Want to learn more? Check out:
- GitHub Repository - All code is open source
- About Page - Meet the team
- Today’s Picks - See the models in action
Found an issue? Open a GitHub issue - we’re actively improving!