Model Deep Dive

Jim Pleuss & Dusty Turner — "Beat Navy" — 2021–2026

01 Model Overview

Men's Best Model
0.1871
XGBoost PRUNED-25
25 features · grid=50
Experiment #174
Women's Best Model
0.1336
XGBoost RANK_SEED+momentum
Experiment #152
Total Experiments
196+
XGBoost, Random Forest, GLM,
ensembles, unified, pruned variants
Training Window
2003–2025
Train: 2003–2019
Test: 2021–2025
2020 excluded (COVID)

02 Top 25 Features by Permutation Importance

Rank Feature Importance Value

03 Competition History & Model Evolution

Built by Jim Pleuss and Dusty Turner, competing as "Beat Navy" on Kaggle since 2021.

2021
Random Forest — First year competing. Composite rankings (SAG, POM, MOR, WLK, RPI), efficiency metrics, conference records, quad wins, and seeds. Raw A/B feature pairs (each stat entered twice — once per team).
Kaggle: #2 / 707 (top 0.3%) — Score: 0.55585. One spot from winning the whole thing.
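The raw A/B layout can be sketched in a few lines — a minimal Python example with made-up team stats (`adj_off`, `adj_def` are illustrative names, not the actual 2021 feature set):

```python
import pandas as pd

# Hypothetical per-team stats; names and values are illustrative only.
stats = {"Gonzaga": {"adj_off": 118.2, "adj_def": 94.1},
         "Baylor":  {"adj_off": 116.5, "adj_def": 95.3}}

def raw_ab_row(team_a, team_b, a_won):
    """2021-style row: every stat enters twice, once per team (A_*, B_*)."""
    row = {f"A_{k}": v for k, v in stats[team_a].items()}
    row.update({f"B_{k}": v for k, v in stats[team_b].items()})
    row["A_won"] = int(a_won)
    return row

# Mirroring each game (teams swapped) is one common way to keep the
# learned boundary symmetric in the A/B ordering.
train = pd.DataFrame([raw_ab_row("Gonzaga", "Baylor", a_won=False),
                      raw_ab_row("Baylor", "Gonzaga", a_won=True)])
print(train.shape)  # prints (2, 5)
```

Note the doubled columns (A_adj_off, B_adj_off, …): the model must learn on its own that only the gap between the two teams matters — the motivation for the differenced features adopted in 2026.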
2022
Same Random Forest pipeline. No major architecture changes from 2021.
What changed: Minor feature tweaks only. Same raw A/B approach.
Kaggle: #430 / 930 (top 46.2%) — Score: 0.65263. Chaotic tournament with many upsets.
2023
Continued Random Forest approach.
What changed: Kaggle switched its evaluation metric from log loss to Brier score. Model unchanged — not recalibrated for the new metric.
Kaggle: #737 / 1,033 (top 71.3%) — Score: 0.22203.
2024
Random Forest with 28 raw A/B features, grid=5. Predicted Purdue as champion (actual winner: UConn).
What changed: Introduced team clustering — hierarchical clustering on 5 efficiency metrics (off eff, def eff, 3pt%, possessions, FT rate) to create 5 "play style" groups. The idea: certain team styles match up well against others (rock-paper-scissors). Also experimented with neural networks (PyTorch) for both genders, but RF won. Added web scraping from teamrankings.com.
Kaggle: #26 / 821 (top 3.2%) — Score: 0.05779. Strong bounce-back.
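The play-style clustering step can be sketched as follows — a toy example using scipy's hierarchical (Ward) clustering on a random stand-in for the 5-metric team profiles. The data, linkage method, and standardization are assumptions; only the "5 metrics → 5 style groups" shape comes from the pipeline described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Stand-in team profiles: off eff, def eff, 3pt%, possessions, FT rate.
X = rng.normal(size=(60, 5))
# Standardize so no single metric dominates the distance calculation.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Ward linkage on the 5 efficiency metrics, cut into 5 "play style" groups.
Z = linkage(X, method="ward")
styles = fcluster(Z, t=5, criterion="maxclust")
print(sorted(set(styles.tolist())))
```

Each team's style label can then enter the matchup features, letting the model learn rock-paper-scissors effects between style groups.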
2025
Did not compete. Jim and Dusty were focused on their PhD defenses.
What changed (off-season): Used the gap year to completely rebuild the pipeline. Switched data processing from tidyverse to data.table (1.7x speedup). Added SQLite experiment tracking database. Built 7-tab Shiny dashboard. Deployed everything to a Raspberry Pi 5. Ran 196+ experiments testing XGBoost, logistic regression, stacking, calibration (Platt, isotonic), push-away, shrinkage, and unified gender models.
No Kaggle entry — PhD year
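A minimal sketch of what SQLite experiment tracking can look like — the schema and column names are assumptions, not the project's actual database (the 0.1941 RF figure is the backtest Brier cited later on this page):

```python
import sqlite3

# Toy schema; the real tracking database's layout is an assumption here.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE experiments (
        id INTEGER PRIMARY KEY,
        model TEXT, gender TEXT,
        n_features INTEGER, grid_size INTEGER, brier REAL
    )""")
conn.executemany(
    "INSERT INTO experiments (model, gender, n_features, grid_size, brier) "
    "VALUES (?, ?, ?, ?, ?)",
    [("rf_2024", "M", 28, 5, 0.1941),
     ("xgboost_pruned_25", "M", 25, 50, 0.1871)])
conn.commit()

# "Best men's model so far" -- the kind of query a dashboard tab might issue.
# (SQLite returns the bare `model` column from the row holding the MIN.)
best = conn.execute(
    "SELECT model, MIN(brier) FROM experiments WHERE gender = 'M'").fetchone()
print(best)  # prints ('xgboost_pruned_25', 0.1871)
```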
2026
XGBoost PRUNED-25 — the culmination of 196+ experiments and 5 years of iteration.
What changed:
  • Model: Random Forest → XGBoost (XGBoost won every head-to-head experiment)
  • Features: Raw A/B pairs → differenced features (team_A − team_B)
  • New features: Elo ratings, GLM team quality, travel distance, coach experience, program tournament history, conference strength, consistency metrics
  • Feature selection: Expanded to 144 features, then pruned to 25 via permutation importance — less is more (0.1871 vs 0.1944 with all 144)
  • Key discovery: glm_quality_diff is 5x more important than any other feature
  • Tuning: Grid size 5 → 50 hyperparameter combos
  • Women's model: Separate pipeline with self-computed Elo + RPI (no Massey Ordinals needed)
  • Post-processing: Tested calibration, push-away, shrinkage — all hurt. Raw predictions are optimal
  • Simulation: Monte Carlo bracket generator (5M+ brackets)
Brier: 0.1871 (men) / 0.1336 (women) — Best backtest scores ever. Kaggle: TBD (deadline Mar 19)
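The prune-via-permutation-importance step can be sketched as follows — a toy loop using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost, on synthetic data where only the first three columns carry signal. Dimensions, model, and the "top 3" cutoff (the analogue of top 25) are all illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 800, 12                       # stand-in for the real 144-feature matrix
X = rng.normal(size=(n, p))
# Only the first 3 columns drive the outcome in this toy setup.
y = (X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2]
     + rng.normal(scale=0.5, size=n)) > 0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Permutation importance: shuffle one column at a time on held-out data
# and measure how much the score drops.
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
keep = np.argsort(imp.importances_mean)[::-1][:3]   # prune to the top features
print(keep.tolist())
```

Refitting on only the kept columns is the "less is more" step: noisy features that the shuffle barely hurts are dropped before the final fit.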

04 2024 RF vs 2026 XGBoost — Side by Side

Visual bracket comparisons showing how the 2026 XGBoost model differs from the 2024 Random Forest. Click through to explore each view in detail.

2024 RF Predictions
RF Model
The 2024 Random Forest bracket — 28 raw A/B features, grid=5. Predicted Purdue as champion.
2026 XGBoost Predictions
XGBoost Model
The 2026 XGBoost PRUNED-25 bracket — 25 differenced features, grid=50. Applied to the 2024 field for comparison.
Actual 2024 Results
Ground Truth
What actually happened in the 2024 tournament. UConn won its second consecutive title.
Detailed Comparison
RF vs XGBoost
Side-by-side view highlighting model disagreements and where each model got it right or wrong.

05 Per-Season Backtest Results

Season RF Brier XGB Brier Winner

06 Live 2026 Excursion — How Would Our Old Models Do?

We retrained the exact models from 2021 and 2024 on 2026 team data and scored them against actual tournament results. Same algorithms, same features, same hyperparameters — just pointed at this year's bracket.

  • 2021 Model: Random Forest, 28 raw A/B features, grid=3, tuned on accuracy.
  • 2024 Model: Random Forest, 28 raw A/B features, grid=5, tuned on Brier.
  • 2026 Model: XGBoost PRUNED-25, 25 differenced features, grid=50, tuned on Brier.

Metric 2021 RF 2024 RF 2026 XGBoost

Per-Round Brier Scores

Round Games 2021 RF 2024 RF 2026 XGBoost

Games Where Models Split

Games where at least one model was right and another wrong. Y = correct, N = wrong.

Game 2021 2024 2026 Result

Biggest Prediction Disagreements

The 10 games with the largest gap between the 2024 RF and 2026 XGBoost predictions.

Game 2024 RF 2026 XGB Actual Winner

Estimated Kaggle Leaderboard Position

Each men's model is combined with our women's model score (held constant) to estimate a full-submission score. Leaderboard: 3,114 teams.
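A sketch of the combination arithmetic, assuming the leaderboard score is a game-count-weighted mean of the men's and women's Brier scores — the counts below are illustrative, and with equal counts the combination reduces to a simple average:

```python
def combined_brier(men_brier, women_brier, n_men=63, n_women=63):
    """Game-count-weighted mean Brier; 63 games per bracket is illustrative."""
    return (men_brier * n_men + women_brier * n_women) / (n_men + n_women)

# 2026 men's model + women's score held constant at 0.1336.
score = combined_brier(0.1871, 0.1336)
print(round(score, 4))
```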

Model Est. Kaggle Score Est. Rank Percentile
Key Insight

Through the first two rounds, the 2024 RF's conservative predictions (closer to 0.50) are paying off — when upsets happen, it gets punished less. The 2026 XGBoost makes sharper, more confident predictions that win on chalk games but cost more on upsets. On the historical backtest (2021–2025), the 2026 model is clearly superior (0.1871 vs 0.1941 Brier). As later rounds bring tighter matchups, expect the XGBoost to pull ahead.
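The trade-off in toy numbers — forecasts of 0.60 (conservative) vs 0.90 (sharp) for the favorite; these are illustrative values, not either model's actual predictions:

```python
def brier(p, outcome):
    """Brier contribution of one game: (forecast - outcome)^2."""
    return round((p - outcome) ** 2, 4)

# Chalk game: the favorite wins (outcome = 1).
print(brier(0.60, 1), brier(0.90, 1))  # prints 0.16 0.01 -- sharp wins big
# Upset: the favorite loses (outcome = 0).
print(brier(0.60, 0), brier(0.90, 0))  # prints 0.36 0.81 -- conservative loses less
```

A sharp model gains 0.15 per chalk game but gives back 0.45 per upset, so an upset-heavy early bracket favors the conservative forecaster even when the sharp model is better calibrated overall.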

07 Kaggle Competition History

08 Key Findings