A Tale of Two Hawks Part 3

college-basketball
machine-learning
tutorial
kansas
miami-oh
How to train, validate and deploy predictive models
Author

ProPlotFits

Published

February 19, 2026

Introduction

Why am I trying to see who would win in a college basketball game between Kansas of the Big 12 and Miami (OH) of the MAC? These teams haven’t played each other since 2011. Those questions are answered in Part 1 and Part 2.

Here, the goal is to find out what would happen if Kansas and Miami were to play each other.


The Three Prediction Tasks

We have three outcomes that require three models:

Model 1: Binary Classification - Will one specific team win?
Model 2: Continuous Regression (Difference) - By how much might the score differ between teams?
Model 3: Continuous Regression (Sum) - What might the combined scoring output look like?

Each model uses the same underlying features but optimizes for different objectives. Together they give us a complete picture of what might happen.


Step 1: Prepare Historical Data

We need clean, structured data where each row represents one team’s side of one game (so each game appears twice, once from each team’s perspective).

Code
library(tidyverse)
library(hoopR)
library(janitor)
library(caret)

# Load historical data (2024 and 2025 seasons)
box_scores_2024 <- load_mbb_team_box(seasons = 2024) %>% clean_names()
box_scores_2025 <- load_mbb_team_box(seasons = 2025) %>% clean_names()

# Combine seasons
all_games <- bind_rows(
  box_scores_2024 %>% mutate(season = 2024),
  box_scores_2025 %>% mutate(season = 2025)
)

glimpse(all_games)
Rows: 25,052
Columns: 57
$ game_id                           <int> 401638645, 401638645, 401638644, 401…
$ season                            <dbl> 2024, 2024, 2024, 2024, 2024, 2024, …
$ season_type                       <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ game_date                         <date> 2024-04-08, 2024-04-08, 2024-04-06,…
$ game_date_time                    <dttm> 2024-04-08 21:20:00, 2024-04-08 21:…
$ team_id                           <int> 2509, 41, 333, 41, 152, 2509, 282, 2…
$ team_uid                          <chr> "s:40~l:41~t:2509", "s:40~l:41~t:41"…
$ team_slug                         <chr> "purdue-boilermakers", "uconn-huskie…
$ team_location                     <chr> "Purdue", "UConn", "Alabama", "UConn…
$ team_name                         <chr> "Boilermakers", "Huskies", "Crimson …
$ team_abbreviation                 <chr> "PUR", "CONN", "ALA", "CONN", "NCSU"…
$ team_display_name                 <chr> "Purdue Boilermakers", "UConn Huskie…
$ team_short_display_name           <chr> "Purdue", "UConn", "Alabama", "UConn…
$ team_color                        <chr> "000000", "0c2340", "9e1632", "0c234…
$ team_alternate_color              <chr> "cfb991", "f1f2f3", "ffffff", "f1f2f…
$ team_logo                         <chr> "https://a.espncdn.com/i/teamlogos/n…
$ team_home_away                    <chr> "away", "home", "away", "home", "awa…
$ team_score                        <int> 60, 75, 72, 86, 50, 63, 77, 79, 67, …
$ team_winner                       <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRU…
$ assists                           <int> 8, 18, 9, 20, 10, 13, 22, 12, 11, 18…
$ blocks                            <int> 3, 4, 5, 8, 3, 2, 1, 6, 8, 6, 0, 4, …
$ defensive_rebounds                <int> 19, 21, 21, 25, 22, 30, 29, 24, 24, …
$ fast_break_points                 <chr> "0", "2", "0", "2", "2", "0", "9", "…
$ field_goal_pct                    <dbl> 44.4, 48.4, 44.8, 50.0, 36.8, 40.0, …
$ field_goals_made                  <int> 24, 30, 26, 31, 21, 22, 28, 28, 25, …
$ field_goals_attempted             <int> 54, 62, 58, 62, 57, 55, 60, 62, 65, …
$ flagrant_fouls                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ fouls                             <int> 15, 18, 15, 17, 13, 8, 16, 17, 9, 14…
$ free_throw_pct                    <dbl> 73.3, 81.8, 81.8, 77.8, 75.0, 90.0, …
$ free_throws_made                  <int> 11, 9, 9, 14, 3, 9, 9, 17, 12, 12, 5…
$ free_throws_attempted             <int> 15, 11, 11, 18, 4, 10, 10, 19, 17, 1…
$ largest_lead                      <chr> "2", "18", "5", "16", "0", "20", "7"…
$ offensive_rebounds                <int> 9, 14, 8, 12, 6, 11, 8, 7, 8, 13, 5,…
$ points_in_paint                   <chr> "40", "44", "26", "38", "20", "24", …
$ steals                            <int> 3, 3, 2, 4, 8, 5, 3, 8, 7, 9, 3, 5, …
$ team_turnovers                    <int> 0, 2, 1, 0, 0, 2, 1, 0, 0, 2, 0, 0, …
$ technical_fouls                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, …
$ three_point_field_goal_pct        <dbl> 14.3, 27.3, 47.8, 40.0, 26.3, 40.0, …
$ three_point_field_goals_made      <int> 1, 6, 11, 10, 5, 10, 12, 6, 5, 8, 17…
$ three_point_field_goals_attempted <int> 7, 22, 23, 25, 19, 25, 32, 17, 26, 2…
$ total_rebounds                    <int> 28, 35, 29, 37, 28, 41, 37, 31, 32, …
$ total_technical_fouls             <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, …
$ total_turnovers                   <int> 9, 8, 8, 4, 11, 16, 15, 8, 10, 11, 9…
$ turnover_points                   <chr> "13", "11", "6", "8", "10", "11", "1…
$ turnovers                         <int> 9, 8, 8, 4, 11, 16, 15, 8, 10, 11, 9…
$ opponent_team_id                  <int> 41, 2509, 41, 333, 2509, 152, 2550, …
$ opponent_team_uid                 <chr> "s:40~l:41~t:41", "s:40~l:41~t:2509"…
$ opponent_team_slug                <chr> "uconn-huskies", "purdue-boilermaker…
$ opponent_team_location            <chr> "UConn", "Purdue", "UConn", "Alabama…
$ opponent_team_name                <chr> "Huskies", "Boilermakers", "Huskies"…
$ opponent_team_abbreviation        <chr> "CONN", "PUR", "CONN", "ALA", "PUR",…
$ opponent_team_display_name        <chr> "UConn Huskies", "Purdue Boilermaker…
$ opponent_team_short_display_name  <chr> "UConn", "Purdue", "UConn", "Alabama…
$ opponent_team_color               <chr> "0c2340", "000000", "0c2340", "9e163…
$ opponent_team_alternate_color     <chr> "f1f2f3", "cfb991", "f1f2f3", "fffff…
$ opponent_team_logo                <chr> "https://a.espncdn.com/i/teamlogos/n…
$ opponent_team_score               <int> 75, 60, 86, 72, 63, 50, 79, 77, 84, …

Step 2: Feature Engineering

We calculate team-level statistics that capture performance quality.

Code
# Calculate four factors and efficiency metrics
team_features <- all_games %>%
  mutate(
    # Four Factors
    efg_pct = (field_goals_made + 0.5 * three_point_field_goals_made) / 
              field_goals_attempted,
    tov_pct = turnovers / 
              (field_goals_attempted + 0.44 * free_throws_attempted + turnovers),
    orb_pct = offensive_rebounds / 
              (offensive_rebounds + defensive_rebounds),
    ft_rate = free_throws_made / field_goals_attempted,
    
    # Efficiency
    possessions = field_goals_attempted - offensive_rebounds + 
                  turnovers + 0.44 * free_throws_attempted,
    ortg = (team_score / possessions) * 100,
    drtg = (opponent_team_score / possessions) * 100,
    net_rating = ortg - drtg,
    pace = possessions
  ) %>%
  select(
    season, game_date, game_id,
    team_display_name, team_id, team_home_away,
    team_score, opponent_team_score,
    efg_pct, tov_pct, orb_pct, ft_rate,
    ortg, drtg, net_rating, pace
  )

# Calculate rolling averages (last 5 games)
team_rolling <- team_features %>%
  arrange(team_id, game_date) %>%
  group_by(team_id) %>%
  mutate(
    efg_L5 = lag(zoo::rollmean(efg_pct, k = 5, fill = NA, align = "right")),
    tov_L5 = lag(zoo::rollmean(tov_pct, k = 5, fill = NA, align = "right")),
    orb_L5 = lag(zoo::rollmean(orb_pct, k = 5, fill = NA, align = "right")),
    ftr_L5 = lag(zoo::rollmean(ft_rate, k = 5, fill = NA, align = "right")),
    ortg_L5 = lag(zoo::rollmean(ortg, k = 5, fill = NA, align = "right")),
    drtg_L5 = lag(zoo::rollmean(drtg, k = 5, fill = NA, align = "right")),
    net_L5 = lag(zoo::rollmean(net_rating, k = 5, fill = NA, align = "right")),
    pace_L5 = lag(zoo::rollmean(pace, k = 5, fill = NA, align = "right"))
  ) %>%
  ungroup() %>%
  drop_na(efg_L5)  # Only keep games where we have rolling history

head(team_rolling)
# A tibble: 6 × 24
  season game_date   game_id team_display_name team_id team_home_away team_score
   <dbl> <date>        <int> <chr>               <int> <chr>               <int>
1   2024 2023-11-29   4.02e8 Auburn Tigers           2 home                   74
2   2024 2023-12-03   4.02e8 Auburn Tigers           2 away                   64
3   2024 2023-12-09   4.02e8 Auburn Tigers           2 away                  104
4   2024 2023-12-13   4.02e8 Auburn Tigers           2 home                   87
5   2024 2023-12-17   4.02e8 Auburn Tigers           2 home                   91
6   2024 2023-12-22   4.02e8 Auburn Tigers           2 home                   82
# ℹ 17 more variables: opponent_team_score <int>, efg_pct <dbl>, tov_pct <dbl>,
#   orb_pct <dbl>, ft_rate <dbl>, ortg <dbl>, drtg <dbl>, net_rating <dbl>,
#   pace <dbl>, efg_L5 <dbl>, tov_L5 <dbl>, orb_L5 <dbl>, ftr_L5 <dbl>,
#   ortg_L5 <dbl>, drtg_L5 <dbl>, net_L5 <dbl>, pace_L5 <dbl>

Key concept: We use lagged rolling averages so we only use information available before each game. This prevents data leakage.
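That lag matters more than it looks. Here is a tiny base-R sketch (toy numbers, a 3-game window instead of 5 for brevity) showing that the lagged rolling average for game 4 only sees games 1–3, never game 4’s own score:

```r
# Toy illustration of the leakage guard: the feature for game i must only
# use games played before game i.
pts <- c(70, 80, 90, 100, 110)

# Right-aligned rolling mean, like zoo::rollmean(..., align = "right")
roll <- sapply(seq_along(pts), function(i) {
  if (i < 3) NA else mean(pts[(i - 2):i])
})

# Shift back one game, like dplyr::lag(): game i now sees games 1..(i - 1)
lagged <- c(NA, roll[-length(roll)])

lagged[4]  # mean of games 1-3 = 80; game 4's own score never leaks in
```

Without the `lag()`, game 4’s feature would include game 4’s own result, and the model would look far better in backtests than it ever could in production.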


Step 3: Adjust Data Granularity

Models predict games, not individual team performances. We need to pivot from team-game level to game level.

Code
# Separate home and away
home_games <- team_rolling %>%
  filter(team_home_away == "home") %>%
  select(
    game_id, season, game_date,
    home_team = team_display_name,
    home_score = team_score,
    home_efg = efg_L5, home_tov = tov_L5, home_orb = orb_L5, home_ftr = ftr_L5,
    home_ortg = ortg_L5, home_drtg = drtg_L5, home_net = net_L5, home_pace = pace_L5
  )

away_games <- team_rolling %>%
  filter(team_home_away == "away") %>%
  select(
    game_id,
    away_team = team_display_name,
    away_score = team_score,
    away_efg = efg_L5, away_tov = tov_L5, away_orb = orb_L5, away_ftr = ftr_L5,
    away_ortg = ortg_L5, away_drtg = drtg_L5, away_net = net_L5, away_pace = pace_L5
  )

# Join and calculate outcomes
games <- home_games %>%
  inner_join(away_games, by = "game_id") %>%
  mutate(
    # Outcome variables
    home_win = as.factor(if_else(home_score > away_score, "Yes", "No")),
    score_diff = home_score - away_score,  # positive = home won
    total_score = home_score + away_score,
    
    # Matchup differentials
    efg_diff = home_efg - away_efg,
    tov_diff = home_tov - away_tov,
    orb_diff = home_orb - away_orb,
    ftr_diff = home_ftr - away_ftr,
    ortg_diff = home_ortg - away_ortg,
    drtg_diff = home_drtg - away_drtg,
    net_diff = home_net - away_net,
    pace_avg = (home_pace + away_pace) / 2
  ) %>%
  drop_na()

glimpse(games)
Rows: 10,706
Columns: 34
$ game_id     <int> 401574556, 401583797, 401583798, 401583799, 401583800, 401…
$ season      <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024…
$ game_date   <date> 2023-11-29, 2023-12-13, 2023-12-17, 2023-12-22, 2023-12-3…
$ home_team   <chr> "Auburn Tigers", "Auburn Tigers", "Auburn Tigers", "Auburn…
$ home_score  <int> 74, 87, 91, 82, 101, 88, 66, 93, 82, 81, 99, 101, 59, 78, …
$ home_efg    <dbl> 0.5411528, 0.4976382, 0.5032872, 0.5060294, 0.5290879, 0.5…
$ home_tov    <dbl> 0.15635339, 0.11422651, 0.09844881, 0.08669590, 0.10987814…
$ home_orb    <dbl> 0.3092027, 0.3436161, 0.3289103, 0.3318864, 0.3005450, 0.2…
$ home_ftr    <dbl> 0.3115414, 0.3278700, 0.2922931, 0.2797149, 0.2985918, 0.3…
$ home_ortg   <dbl> 115.9966, 116.5411, 118.5297, 119.7185, 120.0627, 129.3485…
$ home_drtg   <dbl> 93.00273, 91.96987, 91.83074, 97.11057, 97.06614, 94.68838…
$ home_net    <dbl> 22.993876, 24.571234, 26.698965, 22.607937, 22.996599, 34.…
$ home_pace   <dbl> 71.112, 68.808, 69.352, 69.800, 71.048, 72.024, 70.832, 69…
$ away_team   <chr> "Virginia Tech Hokies", "UNC Asheville Bulldogs", "USC Tro…
$ away_score  <int> 57, 62, 75, 62, 66, 68, 55, 78, 59, 54, 81, 61, 70, 63, 78…
$ away_efg    <dbl> 0.4951475, 0.4959706, 0.5466798, 0.4568202, 0.5823009, 0.5…
$ away_tov    <dbl> 0.1462222, 0.1420995, 0.1595115, 0.1317813, 0.1235595, 0.1…
$ away_orb    <dbl> 0.2392959, 0.2960297, 0.3057633, 0.3065729, 0.2551598, 0.2…
$ away_ftr    <dbl> 0.3091771, 0.3705427, 0.2814960, 0.1790102, 0.2515047, 0.1…
$ away_ortg   <dbl> 105.03712, 108.30775, 111.93998, 99.49453, 122.49550, 110.…
$ away_drtg   <dbl> 98.71165, 97.90928, 107.47446, 98.58639, 103.04818, 107.07…
$ away_net    <dbl> 6.3254647, 10.3984684, 4.4655215, 0.9081459, 19.4473169, 3…
$ away_pace   <dbl> 68.552, 73.464, 71.960, 70.416, 70.336, 67.512, 67.024, 73…
$ home_win    <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes…
$ score_diff  <int> 17, 25, 16, 20, 35, 20, 11, 15, 23, 27, 18, 40, -11, 15, 1…
$ total_score <int> 131, 149, 166, 144, 167, 156, 121, 171, 141, 135, 180, 162…
$ efg_diff    <dbl> 0.046005368, 0.001667557, -0.043392585, 0.049209190, -0.05…
$ tov_diff    <dbl> 0.0101312371, -0.0278730341, -0.0610627094, -0.0450853666,…
$ orb_diff    <dbl> 0.069906866, 0.047586434, 0.023146954, 0.025313589, 0.0453…
$ ftr_diff    <dbl> 0.002364308, -0.042672706, 0.010797117, 0.100704608, 0.047…
$ ortg_diff   <dbl> 10.9594823, 8.2333595, 6.5897241, 20.2239733, -2.4327520, …
$ drtg_diff   <dbl> -5.708929, -5.939406, -15.643720, -1.475818, -5.982034, -1…
$ net_diff    <dbl> 16.6684115, 14.1727659, 22.2334439, 21.6997914, 3.5492819,…
$ pace_avg    <dbl> 69.832, 71.136, 70.656, 70.108, 70.692, 69.768, 68.928, 71…

Now each row is one game with features for both teams and three outcome variables.


Step 4: Train-Test Split

We use 2024 season for training and 2025 season for testing. This simulates real-world usage: train on history, predict the future.

Code
train_data <- games %>% filter(season == 2024)
test_data <- games %>% filter(season == 2025)

cat("Training games:", nrow(train_data), "\n")
Training games: 4889 
Code
cat("Testing games:", nrow(test_data), "\n")
Testing games: 5817 

Model 1: Binary Outcome Prediction

Goal: Estimate probability that the home team wins

Algorithm: Logistic Regression

Code
# Select features for classification
features_binary <- c(
  "efg_diff", "tov_diff", "orb_diff", "ftr_diff",
  "net_diff", "pace_avg"
)

# Prepare data
train_binary <- train_data %>%
  select(home_win, all_of(features_binary)) %>%
  drop_na()

test_binary <- test_data %>%
  select(home_win, all_of(features_binary)) %>%
  drop_na()

# Train model
set.seed(42)
model_binary <- train(
  home_win ~ .,
  data = train_binary,
  method = "glm",
  family = "binomial",
  trControl = trainControl(
    method = "cv",
    number = 5,
    classProbs = TRUE,
    summaryFunction = twoClassSummary
  ),
  metric = "ROC"
)

# Evaluate
pred_binary <- predict(model_binary, test_binary, type = "prob")
test_binary$pred_prob <- pred_binary$Yes
test_binary$pred_class <- predict(model_binary, test_binary)

# Accuracy
accuracy <- mean(test_binary$pred_class == test_binary$home_win)
cat("Test Set Accuracy:", round(accuracy * 100, 1), "%\n")
Test Set Accuracy: 68.4 %
Code
# Confusion Matrix
confusionMatrix(test_binary$pred_class, test_binary$home_win)
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No   741  524
       Yes 1315 3237
                                          
               Accuracy : 0.6839          
                 95% CI : (0.6717, 0.6958)
    No Information Rate : 0.6466          
    P-Value [Acc > NIR] : 1.062e-09       
                                          
                  Kappa : 0.2422          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.3604          
            Specificity : 0.8607          
         Pos Pred Value : 0.5858          
         Neg Pred Value : 0.7111          
             Prevalence : 0.3534          
         Detection Rate : 0.1274          
   Detection Prevalence : 0.2175          
      Balanced Accuracy : 0.6105          
                                          
       'Positive' Class : No              
                                          

Interpretation: I’m actually happy with this output, because it should spur us to investigate our data more.

According to the confusion matrix, the model predicted a home win in 1,315 games that the home team actually lost. The model is likely over-relying on home-court advantage, which works slightly better than a coin flip but is not nuanced enough to recognize when the away team is actually better.

Ultimately, if I were to develop a machine learning model to help me set moneylines, I would go for one that outputs probabilities rather than straight-up Yes/No outcomes.
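To make that concrete, here is a hypothetical helper (my own sketch, not part of this post’s pipeline) that turns a win probability into a fair no-vig American moneyline. A hard Yes/No answer gives you nothing to price with; a probability does:

```r
# Hypothetical helper: convert a win probability p into a fair American
# moneyline. Favorites (p >= 0.5) are quoted negative, underdogs positive.
prob_to_moneyline <- function(p) {
  stopifnot(p > 0, p < 1)
  if (p >= 0.5) -100 * p / (1 - p) else 100 * (1 - p) / p
}

prob_to_moneyline(0.684)  # a 68.4% favorite: about -216
prob_to_moneyline(0.434)  # a 43.4% underdog: about +130
```

A classifier that only emits “Yes” would collapse both of those lines into the same answer, which is exactly why probability outputs are the more useful deliverable here.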


Model 2: Score Differential Prediction

Goal: What is the likely difference in final scores?

Algorithm: Gradient Boosting Machine

Code
# Select features
features_diff <- c(
  "efg_diff", "tov_diff", "orb_diff", "ftr_diff",
  "ortg_diff", "drtg_diff", "net_diff", "pace_avg"
)

# Prepare data
train_diff <- train_data %>%
  select(score_diff, all_of(features_diff)) %>%
  drop_na()

test_diff <- test_data %>%
  select(score_diff, all_of(features_diff)) %>%
  drop_na()

# Train model
set.seed(42)
model_diff <- train(
  score_diff ~ .,
  data = train_diff,
  method = "gbm",
  trControl = trainControl(method = "cv", number = 5),
  verbose = FALSE,
  tuneLength = 3
)

# Evaluate
pred_diff <- predict(model_diff, test_diff)
test_diff$predicted <- pred_diff

# Mean Absolute Error
mae_diff <- mean(abs(test_diff$score_diff - test_diff$predicted))
cat("Test Set MAE:", round(mae_diff, 2), "points\n")
Test Set MAE: 10.55 points
Code
# Residuals plot
test_diff %>%
  ggplot(aes(x = predicted, y = score_diff)) +
  geom_point(alpha = 0.4) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  theme_minimal() +
  labs(
    title = "Model 2: Predicted vs Actual Score Differential",
    x = "Predicted Differential",
    y = "Actual Differential",
    caption = "Red line = perfect prediction"
  )

Interpretation: On average, predictions are off by about 10.55 points. That’s the model’s typical error.


Model 3: Combined Score Prediction

Goal: Estimate the likely total combined score.

Algorithm: Random Forest (robust to outliers, handles interactions well)

Code
# Select features
features_total <- c(
  "home_ortg", "away_ortg", "home_drtg", "away_drtg",
  "home_pace", "away_pace", "pace_avg"
)

# Prepare data
train_total <- train_data %>%
  select(total_score, all_of(features_total)) %>%
  drop_na()

test_total <- test_data %>%
  select(total_score, all_of(features_total)) %>%
  drop_na()

# Train model
set.seed(42)
model_total <- train(
  total_score ~ .,
  data = train_total,
  method = "rf",
  trControl = trainControl(method = "cv", number = 5),
  tuneLength = 3,
  ntree = 100
)

# Evaluate
pred_total <- predict(model_total, test_total)
test_total$predicted <- pred_total

# Mean Absolute Error
mae_total <- mean(abs(test_total$total_score - test_total$predicted))
cat("Test Set MAE:", round(mae_total, 2), "points\n")
Test Set MAE: 13.71 points
Code
# Residuals plot
test_total %>%
  ggplot(aes(x = predicted, y = total_score)) +
  geom_point(alpha = 0.4) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  theme_minimal() +
  labs(
    title = "Model 3: Predicted vs Actual Total Score",
    x = "Predicted Total",
    y = "Actual Total",
    caption = "Red line = perfect prediction"
  )

Interpretation: Total score predictions average about 13.7 points of error.


Step 5: Apply Models to Kansas vs. Miami

Now let’s use our trained models to analyze a hypothetical matchup.

Code
# Get current season data
current_season <- load_mbb_team_box(seasons = 2026) %>% 
  clean_names() %>%
  mutate(season = 2026)

# Filter to Kansas and Miami
hawks_2026 <- current_season %>%
  filter(team_display_name %in% c("Miami (OH) RedHawks", "Kansas Jayhawks"))

# Calculate features (same process as training data)
hawks_features <- hawks_2026 %>%
  mutate(
    efg_pct = (field_goals_made + 0.5 * three_point_field_goals_made) / 
              field_goals_attempted,
    tov_pct = turnovers / 
              (field_goals_attempted + 0.44 * free_throws_attempted + turnovers),
    orb_pct = offensive_rebounds / 
              (offensive_rebounds + defensive_rebounds),
    ft_rate = free_throws_made / field_goals_attempted,
    possessions = field_goals_attempted - offensive_rebounds + 
                  turnovers + 0.44 * free_throws_attempted,
    ortg = (team_score / possessions) * 100,
    drtg = (opponent_team_score / possessions) * 100,
    net_rating = ortg - drtg,
    pace = possessions
  ) %>%
  arrange(team_id, game_date) %>%
  group_by(team_id) %>%
  mutate(
    # Note: no lag() here — at prediction time we want each team's form
    # through its most recent game, so the current rolling average is correct
    efg_L5 = zoo::rollmean(efg_pct, k = 5, fill = NA, align = "right"),
    tov_L5 = zoo::rollmean(tov_pct, k = 5, fill = NA, align = "right"),
    orb_L5 = zoo::rollmean(orb_pct, k = 5, fill = NA, align = "right"),
    ftr_L5 = zoo::rollmean(ft_rate, k = 5, fill = NA, align = "right"),
    ortg_L5 = zoo::rollmean(ortg, k = 5, fill = NA, align = "right"),
    drtg_L5 = zoo::rollmean(drtg, k = 5, fill = NA, align = "right"),
    net_L5 = zoo::rollmean(net_rating, k = 5, fill = NA, align = "right"),
    pace_L5 = zoo::rollmean(pace, k = 5, fill = NA, align = "right")
  ) %>%
  ungroup()

# Get latest stats for each team
kansas_latest <- hawks_features %>%
  filter(team_display_name == "Kansas Jayhawks") %>%
  arrange(desc(game_date)) %>%
  slice(1)

miami_latest <- hawks_features %>%
  filter(team_display_name == "Miami (OH) RedHawks") %>%
  arrange(desc(game_date)) %>%
  slice(1)

# Create matchup data (assuming Kansas at home)
matchup <- tibble(
  home_team = "Kansas Jayhawks",
  away_team = "Miami (OH) RedHawks",
  home_efg = kansas_latest$efg_L5,
  home_tov = kansas_latest$tov_L5,
  home_orb = kansas_latest$orb_L5,
  home_ftr = kansas_latest$ftr_L5,
  home_ortg = kansas_latest$ortg_L5,
  home_drtg = kansas_latest$drtg_L5,
  home_net = kansas_latest$net_L5,
  home_pace = kansas_latest$pace_L5,
  away_efg = miami_latest$efg_L5,
  away_tov = miami_latest$tov_L5,
  away_orb = miami_latest$orb_L5,
  away_ftr = miami_latest$ftr_L5,
  away_ortg = miami_latest$ortg_L5,
  away_drtg = miami_latest$drtg_L5,
  away_net = miami_latest$net_L5,
  away_pace = miami_latest$pace_L5
) %>%
  mutate(
    efg_diff = home_efg - away_efg,
    tov_diff = home_tov - away_tov,
    orb_diff = home_orb - away_orb,
    ftr_diff = home_ftr - away_ftr,
    ortg_diff = home_ortg - away_ortg,
    drtg_diff = home_drtg - away_drtg,
    net_diff = home_net - away_net,
    pace_avg = (home_pace + away_pace) / 2
  )


# Generate predictions
pred_win_prob <- predict(model_binary, matchup, type = "prob")$Yes
pred_diff <- predict(model_diff, matchup)
pred_total <- predict(model_total, matchup)

# Display results
results <- tibble(
  Metric = c("Home Win Probability", "Expected Score Differential", "Expected Total Score"),
  Value = c(
    paste0(round(pred_win_prob * 100, 1), "%"),
    paste0(if_else(pred_diff > 0, "+", ""), round(pred_diff, 1), " points"),
    paste0(round(pred_total, 1), " points")
  )
)

results %>%
  knitr::kable(
    caption = "Kansas (Home) vs Miami: Model Predictions",
    align = c("l", "r")
  )
Kansas (Home) vs Miami: Model Predictions

Metric                         Value
Home Win Probability           43.4%
Expected Score Differential    -2.4 points
Expected Total Score           151 points

Interpreting the Results

Win Probability: Based on recent form and matchup metrics, the model estimates the likelihood of home-team success. Given that my model tends to default to the home team winning, a 43.4% home win probability for Kansas is quite a shocker.

Score Differential: The expected margin. Positive means home team favored, negative means away team favored.

Total Score: The expected combined output from both teams. Influenced by pace and efficiency.
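As a back-of-envelope check (my own approximation, not the model itself), the total should land near the shared pace times the two teams’ offensive ratings, since each rating is points per 100 possessions:

```r
# Rough sanity check with illustrative numbers (assumed, not the teams'
# actual stats): total points ~ possessions * (ORtg_home + ORtg_away) / 100
pace      <- 70    # shared possessions estimate
home_ortg <- 110   # home points per 100 possessions
away_ortg <- 105   # away points per 100 possessions

approx_total <- pace * (home_ortg + away_ortg) / 100
approx_total  # 150.5, in the neighborhood of the model's 151
```

When the random forest’s output and a two-line pace-times-efficiency estimate agree, that is a small vote of confidence that the model is using the features sensibly.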

Take-away: I’m very surprised that the models point to a hypothetical game between Miami and Kansas at Allen Fieldhouse in Lawrence being essentially a toss-up that leans Miami, even though the predicted margin is well within our established margin of error.

The reason I say toss-up is the margin model (expected score differential). That model showed an MAE of 10.55 points, and the predicted -2.4-point margin is much smaller than that. I would not count on Miami actually beating the Jayhawks unless the prediction for the matchup came out at, say, -11. In which case, yeah, I’d declare that I’m picking Miami.


What Makes These Models Useful

They’re trained on real data: Thousands of historical games with known outcomes.

They use predictive features: Rolling averages prevent overfitting to single-game variance.

They’re conservative: MAE of 10-14 points means we know our uncertainty.

They’re explainable: Coefficients and feature importance show what drives predictions.


Feature Importance

Let’s see what matters most for score total predictions:

Code
# Extract feature importance from Random Forest model
importance <- varImp(model_total, scale = FALSE)

# Plot
plot(importance, top = 8, main = "Top Features: Total Score Model")


Limitations and Honest Assessment

What these models don’t account for:

  • Injuries or lineup changes
  • Motivational factors (rivalry games, tournament pressure)
  • Referee tendencies
  • Weather or travel fatigue
  • Matchup-specific adjustments

Why we’re transparent about error rates:

An MAE of 10 points on differential means roughly 2 out of 3 predictions fall within 10 points of actual. That’s useful but not clairvoyant. We show our work so you can judge the quality yourself.
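That “2 out of 3” figure is a rule of thumb, not a theorem; the exact share depends on the error distribution. Under one reasonable assumption — Laplace-distributed errors, whose scale parameter equals the MAE — the share within 10 points works out to 1 - e^-1, about 63%:

```r
# Under a Laplace error assumption (mine, not a measured fact), the scale
# parameter b equals the MAE, and P(|error| <= x) = 1 - exp(-x / b).
mae <- 10
b   <- mae                     # Laplace scale = MAE
coverage <- 1 - exp(-mae / b)  # probability a prediction lands within 10 pts

round(coverage, 3)  # 0.632
```

Heavier-tailed errors would push that share lower and thinner tails would push it higher, which is another reason to eyeball the residual plots rather than trust a single summary number.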

How to use these predictions:

They’re inputs to your own analysis, not commandments. If our model says Team A is slightly favored but you know their starting point guard is injured, adjust accordingly. Statistical models provide baselines; domain knowledge provides context.


Reproducibility

All code in this post is reproducible. You can run it yourself with:

  1. Install packages: hoopR, tidyverse, caret, janitor
  2. Copy the code chunks sequentially
  3. Adjust team names to analyze any matchup

The data comes from ESPN via hoopR (free). The models are standard algorithms (no proprietary methods). The methodology is documented here.


Conclusion

This concludes my three part series on building college basketball prediction models.


Want to see these predictions daily? We apply this exact pipeline to generate analysis for premium subscribers. Methodology transparent. Results tracked. No black boxes.