Model Deep Dive

Jim Pleuss & Dusty Turner — "Beat Navy" — 2021–2026

01 Model Overview

Men's Best Model
0.1871
XGBoost PRUNED-25
25 features · grid=50
Experiment #174
Women's Best Model
0.1336
XGBoost RANK_SEED+momentum
Experiment #152
Total Experiments
196+
XGBoost, Random Forest, GLM,
ensembles, unified, pruned variants
Training Window
2003–2025
Train: 2003–2019
Test: 2021–2025
2020 excluded (COVID)

02 Top 25 Features by Permutation Importance

Rank Feature Importance Value

03 Competition History & Model Evolution

Built by Jim Pleuss and Dusty Turner, competing as "Beat Navy" on Kaggle since 2021.

2021
Random Forest — First year competing. Composite rankings (SAG, POM, MOR, WLK, RPI), efficiency metrics, conference records, quad wins, and seeds. Raw A/B feature pairs (each stat entered twice — once per team).
Kaggle: #2 / 707 (top 0.3%) — Score: 0.55585. One spot from winning the whole thing.
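The raw A/B layout can be sketched in a few lines — a minimal Python example with made-up team stats (`adj_off`, `adj_def` are illustrative names, not the actual 2021 feature set):

```python
import pandas as pd

# Hypothetical per-team stats; names and values are illustrative only.
stats = {"Gonzaga": {"adj_off": 118.2, "adj_def": 94.1},
         "Baylor":  {"adj_off": 116.5, "adj_def": 95.3}}

def raw_ab_row(team_a, team_b, a_won):
    """2021-style row: every stat enters twice, once per team (A_*, B_*)."""
    row = {f"A_{k}": v for k, v in stats[team_a].items()}
    row.update({f"B_{k}": v for k, v in stats[team_b].items()})
    row["A_won"] = int(a_won)
    return row

# Mirroring each game (teams swapped) is one common way to keep the
# learned boundary symmetric in the A/B ordering.
train = pd.DataFrame([raw_ab_row("Gonzaga", "Baylor", a_won=False),
                      raw_ab_row("Baylor", "Gonzaga", a_won=True)])
print(train.shape)  # prints (2, 5)
```

Note the doubled columns (A_adj_off, B_adj_off, …): the model must learn on its own that only the gap between the two teams matters — the motivation for the differenced features adopted in 2026.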
2022
Same Random Forest pipeline. No major architecture changes from 2021.
What changed: Minor feature tweaks only. Same raw A/B approach.
Kaggle: #430 / 930 (top 46.2%) — Score: 0.65263. Chaotic tournament with many upsets.
2023
Continued Random Forest approach.
What changed: Kaggle switched its evaluation metric from log loss to Brier score. Model unchanged — not recalibrated for the new metric.
Kaggle: #737 / 1,033 (top 71.3%) — Score: 0.22203.
2024
Random Forest with 28 raw A/B features, grid=5. Predicted Purdue as champion (actual winner: UConn).
What changed: Introduced team clustering — hierarchical clustering on 5 efficiency metrics (off eff, def eff, 3pt%, possessions, FT rate) to create 5 "play style" groups. The idea: certain team styles match up well against others (rock-paper-scissors). Also experimented with neural networks (PyTorch) for both genders, but RF won. Added web scraping from teamrankings.com.
Kaggle: #26 / 821 (top 3.2%) — Score: 0.05779. Strong bounce-back.
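The play-style clustering step can be sketched as follows — a toy example using scipy's hierarchical (Ward) clustering on a random stand-in for the 5-metric team profiles. The data, linkage method, and standardization are assumptions; only the "5 metrics → 5 style groups" shape comes from the pipeline described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Stand-in team profiles: off eff, def eff, 3pt%, possessions, FT rate.
X = rng.normal(size=(60, 5))
# Standardize so no single metric dominates the distance calculation.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Ward linkage on the 5 efficiency metrics, cut into 5 "play style" groups.
Z = linkage(X, method="ward")
styles = fcluster(Z, t=5, criterion="maxclust")
print(sorted(set(styles.tolist())))
```

Each team's style label can then enter the matchup features, letting the model learn rock-paper-scissors effects between style groups.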
2025
Did not compete. Jim and Dusty were focused on their PhD defenses.
What changed (off-season): Used the gap year to completely rebuild the pipeline. Switched data processing from tidyverse to data.table (1.7x speedup). Added SQLite experiment tracking database. Built 7-tab Shiny dashboard. Deployed everything to a Raspberry Pi 5. Ran 196+ experiments testing XGBoost, logistic regression, stacking, calibration (Platt, isotonic), push-away, shrinkage, and unified gender models.
No Kaggle entry — PhD year
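A minimal sketch of what SQLite experiment tracking can look like — the schema and column names are assumptions, not the project's actual database (the 0.1941 RF figure is the backtest Brier cited later on this page):

```python
import sqlite3

# Toy schema; the real tracking database's layout is an assumption here.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE experiments (
        id INTEGER PRIMARY KEY,
        model TEXT, gender TEXT,
        n_features INTEGER, grid_size INTEGER, brier REAL
    )""")
conn.executemany(
    "INSERT INTO experiments (model, gender, n_features, grid_size, brier) "
    "VALUES (?, ?, ?, ?, ?)",
    [("rf_2024", "M", 28, 5, 0.1941),
     ("xgboost_pruned_25", "M", 25, 50, 0.1871)])
conn.commit()

# "Best men's model so far" -- the kind of query a dashboard tab might issue.
# (SQLite returns the bare `model` column from the row holding the MIN.)
best = conn.execute(
    "SELECT model, MIN(brier) FROM experiments WHERE gender = 'M'").fetchone()
print(best)  # prints ('xgboost_pruned_25', 0.1871)
```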
2026
XGBoost PRUNED-25 — the culmination of 196+ experiments and 5 years of iteration.
What changed:
  • Model: Random Forest → XGBoost (XGBoost won every head-to-head experiment)
  • Features: Raw A/B pairs → differenced features (team_A − team_B)
  • New features: Elo ratings, GLM team quality, travel distance, coach experience, program tournament history, conference strength, consistency metrics
  • Feature selection: Expanded to 144 features, then pruned to 25 via permutation importance — less is more (0.1871 vs 0.1944 with all 144)
  • Key discovery: glm_quality_diff is 5x more important than any other feature
  • Tuning: Grid size 5 → 50 hyperparameter combos
  • Women's model: Separate pipeline with self-computed Elo + RPI (no Massey Ordinals needed)
  • Post-processing: Tested calibration, push-away, shrinkage — all hurt. Raw predictions are optimal
  • Simulation: Monte Carlo bracket generator (5M+ brackets)
Brier: 0.1871 (men) / 0.1336 (women) — Best backtest scores ever. Kaggle: TBD (deadline Mar 19)
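The prune-via-permutation-importance step can be sketched as follows — a toy loop using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost, on synthetic data where only the first three columns carry signal. Dimensions, model, and the "top 3" cutoff (the analogue of top 25) are all illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 800, 12                       # stand-in for the real 144-feature matrix
X = rng.normal(size=(n, p))
# Only the first 3 columns drive the outcome in this toy setup.
y = (X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2]
     + rng.normal(scale=0.5, size=n)) > 0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Permutation importance: shuffle one column at a time on held-out data
# and measure how much the score drops.
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
keep = np.argsort(imp.importances_mean)[::-1][:3]   # prune to the top features
print(keep.tolist())
```

Refitting on only the kept columns is the "less is more" step: noisy features that the shuffle barely hurts are dropped before the final fit.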

04 2024 RF vs 2026 XGBoost — Side by Side

Visual bracket comparisons showing how the 2026 XGBoost model differs from the 2024 Random Forest. Click through to explore each view in detail.

2024 RF Predictions
RF Model
The 2024 Random Forest bracket — 28 raw A/B features, grid=5. Predicted Purdue as champion.
2026 XGBoost Predictions
XGBoost Model
The 2026 XGBoost PRUNED-25 bracket — 25 differenced features, grid=50. Applied to the 2024 field for comparison.
Actual 2024 Results
Ground Truth
What actually happened in the 2024 tournament. UConn won its second consecutive title.
Detailed Comparison
RF vs XGBoost
Side-by-side view highlighting model disagreements and where each model got it right or wrong.

05 Per-Season Backtest Results

Season RF Brier XGB Brier Winner

06 Live 2026 Excursion — How Would Our Old Models Do?

We retrained the exact models from 2021 and 2024 on 2026 team data and scored them against actual tournament results. Same algorithms, same features, same hyperparameters — just pointed at this year's bracket.

  • 2021 Model: Random Forest, 28 raw A/B features, grid=3, tuned on accuracy.
  • 2024 Model: Random Forest, 28 raw A/B features, grid=5, tuned on Brier.
  • 2026 Model: XGBoost PRUNED-25, 25 differenced features, grid=50, tuned on Brier.

Metric 2021 RF 2024 RF 2026 XGBoost

Per-Round Brier Scores

Round Games 2021 RF 2024 RF 2026 XGBoost

Games Where Models Split

Games where at least one model was right and another wrong. Y = correct, N = wrong.

Game 2021 2024 2026 Result

Biggest Prediction Disagreements

The 10 games with the largest gap between the 2024 RF and 2026 XGBoost predictions.

Game 2024 RF 2026 XGB Actual Winner

Estimated Kaggle Leaderboard Position

Each men's model is combined with our women's model score (held constant) to estimate a full-submission score. Leaderboard: 3,114 teams.
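A sketch of the combination arithmetic, assuming the leaderboard score is a game-count-weighted mean of the men's and women's Brier scores — the counts below are illustrative, and with equal counts the combination reduces to a simple average:

```python
def combined_brier(men_brier, women_brier, n_men=63, n_women=63):
    """Game-count-weighted mean Brier; 63 games per bracket is illustrative."""
    return (men_brier * n_men + women_brier * n_women) / (n_men + n_women)

# 2026 men's model + women's score held constant at 0.1336.
score = combined_brier(0.1871, 0.1336)
print(round(score, 4))
```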

Model Est. Kaggle Score Est. Rank Percentile
Key Insight

Through the first two rounds, the 2024 RF's conservative predictions (closer to 0.50) are paying off — when upsets happen, it gets punished less. The 2026 XGBoost makes sharper, more confident predictions that win on chalk games but cost more on upsets. On the historical backtest (2021–2025), the 2026 model is clearly superior (0.1871 vs 0.1941 Brier). As later rounds bring tighter matchups, expect the XGBoost to pull ahead.
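The trade-off in toy numbers — forecasts of 0.60 (conservative) vs 0.90 (sharp) for the favorite; these are illustrative values, not either model's actual predictions:

```python
def brier(p, outcome):
    """Brier contribution of one game: (forecast - outcome)^2."""
    return round((p - outcome) ** 2, 4)

# Chalk game: the favorite wins (outcome = 1).
print(brier(0.60, 1), brier(0.90, 1))  # prints 0.16 0.01 -- sharp wins big
# Upset: the favorite loses (outcome = 0).
print(brier(0.60, 0), brier(0.90, 0))  # prints 0.36 0.81 -- conservative loses less
```

A sharp model gains 0.15 per chalk game but gives back 0.45 per upset, so an upset-heavy early bracket favors the conservative forecaster even when the sharp model is better calibrated overall.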

07 Kaggle Competition History

08 Key Findings