Methodology

How ProPlotFits Models Work

Overview

ProPlotFits uses machine learning models trained on 2+ seasons of NCAA Division I men’s basketball data to predict game outcomes and identify betting value.

Our predictions are generated fresh daily by combining:

Rolling performance metrics from recent games
Advanced basketball analytics (Four Factors, pace-adjusted efficiency)
Statistical modeling (Gradient Boosting, Random Forest, Logistic Regression)
Vegas odds comparison from multiple sportsbooks

Data Pipeline

1. Data Collection

Game Statistics (via hoopR)

Box scores for all 350+ D-I teams
Updated within 1 hour of game completion
Covers 2023-24, 2024-25, and current 2025-26 seasons

Vegas Odds (via The Odds API)

Real-time spreads, totals, and moneylines
DraftKings, FanDuel, BetMGM coverage
Updated every 6 hours on game days

KenPom Metrics (coming in v2.0)

Adjusted offensive/defensive efficiency
Strength of schedule
Tempo ratings

2. Feature Engineering

For each game, we calculate:

Recent Performance (Rolling Windows)

Last 3 games (L3): Points, margins, win %
Last 5 games (L5): Full four factors, efficiency metrics
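As an illustration, here is a minimal sketch of a rolling-window feature that only looks at games played before the one being predicted. The function name `rolling_form` and the input layout (one team's point margins in chronological order) are assumptions for this example, not the production pipeline.

```python
from statistics import mean

def rolling_form(margins, window):
    """Summarize the `window` games played before each game.

    `margins` holds one team's point margins in chronological order.
    Returns one entry per game: a dict of features, or None when fewer
    than `window` prior games exist (early-season small samples).
    """
    features = []
    for i in range(len(margins)):
        prior = margins[max(0, i - window):i]  # excludes game i itself
        if len(prior) < window:
            features.append(None)
            continue
        features.append({
            "avg_margin": mean(prior),
            "win_pct": sum(m > 0 for m in prior) / window,
        })
    return features

# Hypothetical margins for one team's first five games.
l3 = rolling_form([5, -3, 10, 2, -8], window=3)
```

Slicing up to, but not including, the current game is what keeps the result of the game being predicted out of its own features.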

Four Factors Analysis

Effective Field Goal % (eFG%)
Turnover Rate (TOV%)
Offensive Rebound % (ORB%)
Free Throw Rate (FTR)
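For readers unfamiliar with the Four Factors, the sketch below uses their commonly published box-score definitions. The exact variants are assumptions: some analysts define free throw rate as FTM/FGA rather than FTA/FGA, and the 0.44 free-throw weight in turnover rate is a convention that is sometimes replaced by 0.475 for college games. The function and argument names are hypothetical.

```python
def four_factors(fgm, fg3m, fga, fta, tov, orb, opp_drb):
    """Return one team's four factors for a single game, as fractions."""
    return {
        # Effective FG%: a made three counts as 1.5 made field goals.
        "efg": (fgm + 0.5 * fg3m) / fga,
        # Turnover rate: turnovers per estimated play.
        "tov_pct": tov / (fga + 0.44 * fta + tov),
        # Offensive rebound %: share of available offensive boards secured.
        "orb_pct": orb / (orb + opp_drb),
        # Free throw rate: attempts relative to field goal attempts.
        "ftr": fta / fga,
    }

# Hypothetical box-score line, for illustration only.
ff = four_factors(fgm=28, fg3m=8, fga=60, fta=20, tov=12, orb=10, opp_drb=25)
```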

Efficiency Metrics

Offensive Efficiency (points per 100 possessions)
Defensive Efficiency (opponent points per 100 possessions)
Net Efficiency (Offensive - Defensive)
Pace (possessions per game)
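The efficiency metrics depend on a possession estimate, since box scores do not record possessions directly. A minimal sketch follows, assuming the widely used estimate FGA - ORB + TOV + w * FTA with w = 0.475 (a common college value; the weight we actually use is not stated here). The function names are hypothetical, and the example applies one possession count to both teams for simplicity.

```python
def possessions(fga, orb, tov, fta, ft_weight=0.475):
    """Estimate possessions from box-score totals."""
    return fga - orb + tov + ft_weight * fta

def efficiency(points, poss):
    """Points per 100 possessions."""
    return 100.0 * points / poss

# Hypothetical game: the team scores 75 and allows 68.
poss = possessions(fga=60, orb=10, tov=12, fta=20)
off_eff = efficiency(75, poss)   # offensive efficiency
def_eff = efficiency(68, poss)   # defensive efficiency (opponent points)
net_eff = off_eff - def_eff      # net efficiency
```

Scaling to 100 possessions is what makes fast-paced and slow-paced teams comparable: two teams scoring 75 points are not equally efficient if one needed ten more possessions to get there.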

Matchup Features

Offensive vs Defensive matchup ratings
Pace differentials
Home court advantage adjustment (+3.5 points)
Conference game indicator

Prediction Models

Model 1: Spread Prediction (Gradient Boosting)

Target: Home team margin of victory

Algorithm: Gradient Boosted Trees (GBM)

Key Features:

Net efficiency differential (most important)
Offensive/defensive matchup ratings
Recent form (L3 margins)
Home court advantage
Four factors differentials

Performance Metrics:

RMSE: ~11.2 points (backtest)
MAE: ~8.7 points
R²: 0.31

What This Means:

On average, our spread predictions miss the actual margin by about 9 points (MAE of ~8.7). That error is wide for any single game, so the model is most useful for flagging games where its predicted margin differs from the Vegas line by 3+ points.
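For reference, RMSE, MAE, and R² are computed from paired actual and predicted values as sketched below. The function name and the sample margins are made up for illustration; they are not backtest data.

```python
from math import sqrt

def regression_metrics(actual, predicted):
    """Return RMSE, MAE, and R-squared for paired observations."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return {
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

# Hypothetical home margins and predictions for four games.
m = regression_metrics(actual=[10, -4, 7, -15], predicted=[6, 2, 3, -9])
```

RMSE penalizes large misses more heavily than MAE, which is why it is the larger of the two numbers whenever the errors are not all the same size.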

Model 2: Total Prediction (Random Forest)

Target: Combined score (both teams)

Algorithm: Random Forest Regression

Key Features:

Combined offensive efficiency (both teams)
Average pace (most important)
Pace differential
Recent scoring trends
Conference game indicator

Performance Metrics:

RMSE: ~13.8 points (backtest)
MAE: ~10.9 points
R²: 0.24

What This Means:

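Totals are harder to predict than spreads: the backtest R² is lower (0.24 vs. 0.31) and the errors are larger. We therefore focus on games where our predicted total differs from the Vegas total by 5+ points, where we see clearer over/under value.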

Model 3: Win Probability (Logistic Regression)

Target: Home team win (binary)

Algorithm: Logistic Regression

Key Features:

Predicted margin (from Model 1)
Win percentage differential
Recent form trends
Home court advantage

Performance Metrics:

Accuracy: 72% (backtest)
AUC-ROC: 0.78
Log Loss: 0.54

What This Means:

Our win probability model correctly predicts winners 72% of the time on historical data. High-confidence picks (>75% win prob) are our strongest plays.
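Accuracy and log loss measure different things: accuracy only checks which side was picked, while log loss also penalizes overconfident probabilities. A minimal sketch, with hypothetical function names and made-up sample values:

```python
from math import log

def accuracy(home_won, p_home):
    """Share of games where the side with probability >= 0.5 won."""
    picks = [1 if p >= 0.5 else 0 for p in p_home]
    return sum(int(y == pick) for y, pick in zip(home_won, picks)) / len(home_won)

def log_loss(home_won, p_home, eps=1e-15):
    """Mean negative log-likelihood of the observed outcomes."""
    total = 0.0
    for y, p in zip(home_won, p_home):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total -= y * log(p) + (1 - y) * log(1 - p)
    return total / len(home_won)

# Hypothetical outcomes (1 = home win) and predicted home win probabilities.
y = [1, 0, 1, 1]
p = [0.8, 0.3, 0.6, 0.4]
acc = accuracy(y, p)
ll = log_loss(y, p)
```

As a point of comparison, a model that always outputs 0.5 has a log loss of ln 2 ≈ 0.693, so lower values indicate probabilities that carry real information.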

Value Detection

What is “Edge”?

Edge = | Our Prediction - Vegas Line |

Spread Edge: Difference in predicted margin

Example: We predict Duke -12, Vegas has Duke -8 → 4-point edge
We look for edges of 3+ points

Total Edge: Difference in predicted total points

Example: We predict 145 total, Vegas has O/U 155 → 10-point edge
We look for edges of 5+ points
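The edge definition and thresholds above translate directly into code. In this sketch the function names are hypothetical, spreads are written from the home team's side (negative means the home team is favored), and Duke is treated as the home team purely for the sake of the example.

```python
SPREAD_THRESHOLD = 3.0  # points, per the spread threshold above
TOTAL_THRESHOLD = 5.0   # points, per the total threshold above

def edge(model_value, market_value):
    """Absolute gap between the model's number and the market's."""
    return abs(model_value - market_value)

def spread_pick(model_home_line, market_home_line, threshold=SPREAD_THRESHOLD):
    """Return 'home', 'away', or None when the edge is below threshold."""
    if edge(model_home_line, market_home_line) < threshold:
        return None
    # A more negative model line means the model rates the home team higher.
    return "home" if model_home_line < market_home_line else "away"

def total_pick(model_total, market_total, threshold=TOTAL_THRESHOLD):
    """Return 'over', 'under', or None when the edge is below threshold."""
    if edge(model_total, market_total) < threshold:
        return None
    return "over" if model_total > market_total else "under"

duke_edge = edge(-12, -8)             # 4-point spread edge
duke_pick = spread_pick(-12, -8)      # model likes the home side more than Vegas
total_side = total_pick(145, 155)     # model total is below the market number
```

The absolute value gives the size of the disagreement; the sign of the difference, handled in the two pick functions, is what determines which side to take.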

Why These Thresholds?

Based on backtesting:

Spread edges of 3+ points hit at 55-58% (profitable at -110 odds, where break-even is about 52.4%)
Total edges of 5+ points hit at 53-56% (profitable at -110 odds)
Higher edges = higher win rates, but fewer opportunities
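To see why these hit rates matter, it helps to work out the break-even point at standard -110 pricing, where a bettor risks 110 to win 100. The sketch below is plain arithmetic with hypothetical function names; it ignores line movement, limits, and variance.

```python
def breakeven_win_rate(american_odds):
    """Win rate at which a bet at these odds has zero expected profit."""
    if american_odds < 0:
        risk, win = -american_odds, 100
    else:
        risk, win = 100, american_odds
    return risk / (risk + win)

def expected_roi(win_rate, american_odds):
    """Expected profit per unit risked at a given long-run win rate."""
    if american_odds < 0:
        profit_per_unit = 100 / -american_odds
    else:
        profit_per_unit = american_odds / 100
    return win_rate * profit_per_unit - (1 - win_rate)

be = breakeven_win_rate(-110)        # 110 / 210, about 0.524
roi_55 = expected_roi(0.55, -110)    # about +5% per unit risked
roi_53 = expected_roi(0.53, -110)    # about +1.2% per unit risked
```

The margin above break-even is thin at the low end of each range, which is why small changes in hit rate make a large difference to long-run results.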

Multi-Sportsbook Averaging

We average odds across DraftKings, FanDuel, and BetMGM to:

Get a consensus market view
Reduce impact of outlier lines
Ensure picks are available at multiple books

You should always shop for the best line at bet time.
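A minimal sketch of both ideas, the consensus average and line shopping, assuming spreads are stored per book from the home team's side. The function names and the sample lines are hypothetical, not actual quotes from any sportsbook.

```python
from statistics import mean

def consensus_line(lines_by_book):
    """Average the same market's line across sportsbooks."""
    return mean(lines_by_book.values())

def best_home_spread(lines_by_book):
    """Book offering the most favorable number for a bet on the home team.

    For the home side, a higher spread is better: -7.5 beats -8.5.
    """
    book = max(lines_by_book, key=lines_by_book.get)
    return book, lines_by_book[book]

# Hypothetical home-team spreads for one game.
home_spreads = {"DraftKings": -7.5, "FanDuel": -8.0, "BetMGM": -8.5}
consensus = consensus_line(home_spreads)
best_book, best_line = best_home_spread(home_spreads)
```

The consensus is what our edge is measured against; the best available line is what you would actually bet, and it can be a full point better than the average.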

Model Training & Validation

Training Data

2023-24 season: 5,000+ games
2024-25 season: 5,000+ games
Current season: Ongoing (retraining weekly)

Validation Approach

Time-series split: Train on past, test on future
No data leakage: Only use information available pre-game
Rolling retraining: Models updated weekly with new games
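The combination of a time-ordered split and weekly retraining is often called walk-forward validation. Below is a minimal sketch under assumed names (`walk_forward_folds`, a `date` key on each game record); it shows the splitting logic only, not how our models are fit.

```python
from datetime import date, timedelta

def walk_forward_folds(games, first_cutoff, step_days=7):
    """Yield (train, test) pairs where every test game follows all train games.

    Each fold trains on games before the cutoff and tests on the next
    `step_days` of games, then the cutoff moves forward.
    """
    last_date = max(g["date"] for g in games)
    cutoff = first_cutoff
    while cutoff <= last_date:
        next_cutoff = cutoff + timedelta(days=step_days)
        train = [g for g in games if g["date"] < cutoff]
        test = [g for g in games if cutoff <= g["date"] < next_cutoff]
        if train and test:
            yield train, test
        cutoff = next_cutoff

# Hypothetical schedule: five game dates in January 2026.
games = [{"date": date(2026, 1, d)} for d in (1, 5, 9, 12, 16)]
folds = list(walk_forward_folds(games, first_cutoff=date(2026, 1, 8)))
```

Unlike a random shuffle, this never lets a later game inform a prediction about an earlier one, which is the leakage the validation approach is designed to rule out.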

What We’re NOT Doing

❌ Overfitting to past results
❌ Using future information in predictions
❌ Cherry-picking successful picks for marketing
❌ Guaranteeing profits

Limitations & Transparency

Model Weaknesses

Injuries not accounted for - We use team stats, not player-level data
Coaching changes - Not explicitly modeled
Motivation factors - Can’t quantify “must-win” games
Small sample sizes early - Teams with <5 games have less reliable stats
Non-conference randomness - Lower-quality matchups harder to predict

Version History

v1.0 (January 2026) - Current

Gradient Boosting (spreads)
Random Forest (totals)
Logistic Regression (win probability)
hoopR data pipeline
Multi-sportsbook odds integration

v2.0 (Planned - February 2026)

KenPom adjusted efficiency integration
XGBoost model testing
Injury data scraping
Historical performance tracking dashboard

Questions?

Want to learn more? Check out:

GitHub Repository - All code is open source
About Page - Meet the team
Today’s Picks - See the models in action

Found an issue? Open a GitHub issue - we’re actively improving!
