User Guide¶
Getting Started¶
This guide picks up where the README.md (project root) leaves off.
Once you have installed the project, the typical workflow is:
Sync data — download NCAA game results from Kaggle (and optionally ESPN):
python sync.py --source all --dest data/
Train a model — fit a prediction model on historical seasons:
python -m ncaa_eval.cli train --model elo
python -m ncaa_eval.cli train --model xgboost
Common options:
| Flag | Default | Description |
|---|---|---|
| --model | (required) | Registered model name (elo, xgboost, or custom) |
| --start-year | 2015 | First training season (inclusive) |
| --end-year | 2025 | Last training season (inclusive) |
| --data-dir | data/ | Path to synced Parquet files |
| --output-dir | data/ | Where to write run artifacts |
| --config | None | JSON file overriding model hyperparameters |
Generate predictions — produce win-probability CSVs from a trained model:
python -m ncaa_eval.cli predict --run-id <run-id> --season 2025
python -m ncaa_eval.cli predict --run-id <run-id> --season 2025 --output preds.csv
Common options:
| Flag | Default | Description |
|---|---|---|
| --run-id | (required) | Model run ID (from the training step output) |
| --season | (required) | Target season year |
| --data-dir | data/ | Path to synced Parquet files |
| --output | None | Output CSV path (omit to write to stdout) |
Stateful models (Elo) produce pairwise probabilities for all C(n,2) team combinations. Stateless models (XGBoost, LogisticRegression) produce game-level predictions for every game in the season dataset.
Output format:
season,team_a_id,team_b_id,pred_win_prob
2025,1101,1102,0.6234
2025,1101,1103,0.4512
Explore results in the dashboard — launch the Streamlit app:
streamlit run dashboard/app.py
The sidebar lets you select a tournament year, model run, and scoring format. All pages update automatically when you change these filters.
Iterate — retrain with different hyperparameters, compare on the Leaderboard, inspect calibration in Model Deep Dive, and use the Bracket Visualizer and Pool Scorer to turn predictions into bracket picks.
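The prediction CSV from the workflow above can be consumed with the standard library alone. This is a minimal sketch, assuming the four-column layout shown in the output format; the `load_predictions` helper is illustrative and not part of ncaa_eval, and reading `pred_win_prob` as the probability that team A wins is an assumption the guide does not state explicitly.

```python
import csv
import io
from math import comb

# Sample rows copied from the output format shown in the guide.
SAMPLE = """season,team_a_id,team_b_id,pred_win_prob
2025,1101,1102,0.6234
2025,1101,1103,0.4512
"""


def load_predictions(text):
    """Map (season, team_a_id, team_b_id) -> pred_win_prob.

    Assumption: the probability refers to team A winning.
    """
    reader = csv.DictReader(io.StringIO(text))
    return {
        (int(r["season"]), int(r["team_a_id"]), int(r["team_b_id"])): float(r["pred_win_prob"])
        for r in reader
    }


preds = load_predictions(SAMPLE)
print(preds[(2025, 1101, 1102)])  # 0.6234

# A stateful model emits one row per unordered pair: C(n, 2) rows for n teams.
print(comb(64, 2))  # 2016
```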
Evaluation Metrics¶
The platform evaluates models on four complementary metrics. Each captures a different aspect of prediction quality.
Log Loss¶
What it measures: How well predicted probabilities match actual outcomes, with heavy penalties for confident wrong predictions.
Formula:
$$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[y_i \ln(p_i) + (1-y_i) \ln(1-p_i)\bigr]$$
Interpretation:
| Value | Meaning |
|---|---|
| 0.0 | Perfect: every prediction was 0% or 100% and correct |
| ~0.50 | Good: a well-calibrated model typically lands here |
| 0.693 | Random baseline (equivalent to predicting 50% for every game) |
| > 0.693 | Worse than guessing: the model is actively misleading |
Tip
Log Loss is the primary ranking metric in the Kaggle March Machine Learning Mania competition. A score of 0.55 means your model is meaningfully better than random but has room to improve.
Warning
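Log Loss punishes confident wrong predictions severely: the penalty grows without bound as the probability assigned to the actual outcome approaches zero. A single game where you predicted 99% and the other team won contributes −ln(0.01) ≈ 4.6 to the sum before averaging, about 3.9 more than a 50% prediction would have cost. That excess cancels the gain from roughly 21 correct predictions at 60% confidence, each of which beats the 50% baseline by only about 0.18.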
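The formula and the reference values in the table can be reproduced in a few lines. This is a minimal pure-Python sketch of the metric, not the platform's own implementation; the clipping constant `eps` is an assumption added to keep `log(0)` from occurring.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


outcomes = [1, 0, 1, 1]
print(round(log_loss(outcomes, [0.5] * 4), 3))  # 0.693, the coin-flip baseline
print(round(log_loss([0], [0.99]), 2))          # 4.61, one 99%-confident miss
```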
Brier Score¶
What it measures: Mean squared error of probability predictions — a gentler alternative to Log Loss.
Formula:
$$\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2$$
Interpretation:
| Value | Meaning |
|---|---|
| 0.0 | Perfect |
| ~0.20 | Good model |
| 0.25 | Random baseline (predicting 50% for every game) |
| > 0.25 | Worse than guessing |
Brier Score is more forgiving of confident wrong predictions than Log Loss because it uses squared error instead of logarithmic error. A 99%-confident wrong prediction adds 0.98 to Brier (vs. 4.6 to Log Loss).
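A short sketch of the Brier formula makes the two reference numbers above easy to check. It is a plain restatement of the equation, not the platform's code.

```python
def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


print(brier_score([1, 0], [0.5, 0.5]))     # 0.25, the coin-flip baseline
print(round(brier_score([0], [0.99]), 4))  # 0.9801, one 99%-confident miss
```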
ROC-AUC¶
What it measures: Discrimination — can the model distinguish winners from losers? Equivalently: if you pick a random winning team and a random losing team, what is the probability the model assigns a higher win probability to the winner?
Formula: Area under the Receiver Operating Characteristic curve.
Interpretation:
| Value | Meaning |
|---|---|
| 1.0 | Perfect discrimination |
| ~0.75 | Good model |
| 0.5 | Random: no discrimination ability |
| < 0.5 | Inversely correlated (predicting losers as winners) |
Warning
ROC-AUC does not measure calibration. A model can have perfect AUC (1.0) but terrible calibration, e.g., predicting 51% for every team that goes on to win and 49% for every team that goes on to lose: the ranking is flawless, yet the probabilities are nowhere near the observed win rates. Always pair AUC with calibration metrics (ECE, reliability diagrams).
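The pairwise definition of AUC given above translates directly into code. The sketch below is a quadratic-time illustration (fine for small samples), not the platform's implementation; it shows that only the ordering of the predictions matters, not their magnitude.

```python
def roc_auc(y_true, p_pred):
    """Fraction of (winner, loser) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


y = [1, 1, 0, 0]
print(roc_auc(y, [0.51, 0.52, 0.49, 0.48]))  # 1.0, despite timid probabilities
print(roc_auc(y, [0.5, 0.5, 0.5, 0.5]))      # 0.5, no discrimination
```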
Expected Calibration Error (ECE)¶
What it measures: How well predicted probabilities correspond to actual win rates. If a model says “70% win probability” for 100 games, about 70 should actually be wins.
Formula:
$$\text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \left| \text{acc}(b) - \text{conf}(b) \right|$$
where predictions are binned into $B$ equal-width bins, $n_b$ is the count in bin $b$, $\text{acc}(b)$ is the observed win rate, and $\text{conf}(b)$ is the mean predicted probability.
Interpretation:
| Value | Meaning |
|---|---|
| 0.0 | Perfectly calibrated |
| < 0.03 | Excellent calibration |
| 0.03–0.08 | Reasonable calibration |
| > 0.10 | Poor calibration: predictions are systematically off |
Tip
ECE is the single best number for answering “can I trust these probabilities at face value?” A low ECE means you can use the model’s probability outputs directly for bet sizing, pool strategy, and expected-value calculations.
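The binning in the ECE formula can be sketched as follows. This assumes equal-width bins as described above; the default of 10 bins is an assumption, since the guide does not state the bin count the platform uses.

```python
def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Weighted average gap between observed win rate and mean prediction per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        acc = sum(y for y, _ in members) / len(members)
        conf = sum(p for _, p in members) / len(members)
        ece += len(members) / n * abs(acc - conf)
    return ece


# Ten games predicted at 70%: 7 wins is calibrated, 10 wins is under-confident.
print(round(expected_calibration_error([1] * 7 + [0] * 3, [0.7] * 10), 3))  # 0.0
print(round(expected_calibration_error([1] * 10, [0.7] * 10), 3))           # 0.3
```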
Model Types¶
NCAA_eval supports two model paradigms through a common abstract base class.
Stateful Models¶
Stateful models maintain internal ratings that update game-by-game through a season. They process games sequentially — the order matters.
Reference implementation: Elo (elo)
The built-in Elo model tracks a rating for every team. After each game, ratings shift based on the outcome and margin of victory:
Winners gain rating points; losers lose them
Upset victories produce larger rating swings
Ratings mean-revert between seasons (configurable fraction)
Separate K-factors for early-season, regular-season, and tournament games
Best for: Capturing in-season trajectory and momentum. Elo is simple, interpretable, and requires no feature engineering.
Key hyperparameters:
| Parameter | Default | Description |
|---|---|---|
| | 1500 | Starting rating for new teams |
| | 56.0 | K-factor for the first 20 games |
| | 38.0 | K-factor for regular-season games |
| | 47.5 | K-factor for tournament games |
| | 0.25 | Fraction pulled toward mean between seasons |
Tip
You can override any hyperparameter via a JSON config file:
python -m ncaa_eval.cli train --model elo --config my_elo_config.json
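The rating behaviour described in this section can be illustrated with the classic Elo update. This is a sketch under stated assumptions, not the built-in model: it uses the textbook logistic curve with a 400-point scale, ignores margin of victory, and every function and parameter name here is hypothetical. The defaults (1500, 38.0, 0.25) are taken from the hyperparameter table.

```python
def expected_score(rating_a, rating_b):
    """Win probability for A under the classic logistic Elo curve (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, a_won, k=38.0):
    """Return both ratings after one game; points gained by one side are lost by the other."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def revert_to_mean(rating, mean=1500.0, fraction=0.25):
    """Pull a rating part of the way back to the mean between seasons."""
    return rating + fraction * (mean - rating)


print(update(1500.0, 1500.0, a_won=True))  # (1519.0, 1481.0)
fav, dog = update(1700.0, 1500.0, a_won=False)
print(round(1700.0 - fav, 1))              # 28.9: an upset moves ratings more than 19
print(revert_to_mean(1700.0))              # 1650.0
```

An even matchup moves each rating by half the K-factor, while the upset moves them by about three quarters of it, which is the "larger rating swings" behaviour listed above.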
Stateless Models¶
Stateless models are standard batch-trained classifiers. They take a feature matrix as input and produce probability predictions — game order does not matter.
Reference implementation: XGBoost (xgboost)
The built-in XGBoost model uses gradient-boosted decision trees trained on feature snapshots. Features are computed by the feature engineering pipeline (Epic 4) and include team statistics, strength-of-schedule metrics, and graph-based centrality measures.
Best for: Combining many feature dimensions for maximum predictive accuracy. XGBoost typically outperforms Elo when strong features are available.
Key hyperparameters:
| Parameter | Default | Description |
|---|---|---|
| | 500 | Maximum number of boosting rounds |
| | 5 | Maximum tree depth |
| | 0.05 | Step size shrinkage |
| | 50 | Stop if validation loss doesn’t improve |
Plugin Registry¶
Models register themselves via the @register_model("name") decorator. To create a custom model:
Subclass Model (stateless) or StatefulModel (stateful)
Implement the required methods:
  Stateless (Model): fit, predict_proba, save, load, get_config
  Stateful (StatefulModel): _predict_one, update, start_season, get_state, set_state, save, load, get_config; fit and predict_proba are provided by the template (_predict_one is the per-pair hook that predict_proba calls)
Decorate with @register_model("my_model")
Import the module before training so the decorator fires
The CLI discovers all registered models automatically:
# List available models
python -c "from ncaa_eval.model import list_models; print(list_models())"
For implementation details, see the API Reference.
Interpreting Results¶
Reliability Diagrams¶
A reliability diagram is the visual counterpart to ECE. It plots predicted probabilities against observed win rates:
X-axis: Predicted probability (grouped into bins, e.g., 0–10%, 10–20%, …)
Y-axis: Actual win rate within each bin
Perfect calibration line: The 45° diagonal — if your model says 70%, 70% of those games should be wins
How to read the diagram:
| Pattern | Meaning | Action |
|---|---|---|
| Points on the diagonal | Well-calibrated | No action needed |
| Points above the diagonal | Under-confident: actual win rates exceed predictions | Model could be sharper |
| Points below the diagonal | Over-confident: predictions overstate win likelihood | Model needs calibration |
| S-shaped curve (below the diagonal at low probabilities, above it at high probabilities) | Probabilities are too moderate on both ends | Retrain with calibration regularization; a negative Upset Aggression slider value (chalk mode) sharpens predictions |
| Flat line near 0.5 | Model lacks discrimination | Improve features or model architecture |
Tip
The Model Deep Dive page in the dashboard shows reliability diagrams with per-year drill-down. Compare diagrams across years to check whether calibration is stable or drifts.
Calibration in Plain Language¶
A well-calibrated model is one you can take at face value. When it says “Duke has a 65% chance of beating UNC,” that means that in a large sample of similar matchups, Duke would win about 65% of the time.
Why calibration matters for bracket pools:
Pool strategy depends on knowing how likely outcomes are, not just which team is favored
Expected-point calculations multiply advancement probabilities by scoring weights — if probabilities are wrong, the strategy is wrong
An over-confident model will undercount upsets, leading you to pick too much chalk
Over-Confidence vs. Under-Confidence¶
| Type | Symptom | Reliability Diagram | Impact on Brackets |
|---|---|---|---|
| Over-confident | Predictions are too extreme (90% when reality is 70%) | Points below diagonal at high probabilities | Too much chalk; undervalues upsets |
| Under-confident | Predictions are too moderate (55% when reality is 70%) | Points above diagonal at high probabilities | Picks too many upsets; misses value in favorites |
| Well-calibrated | Predictions match reality | Points on or near diagonal | Bracket strategy reflects true odds |
Tournament Simulation¶
Monte Carlo Methodology¶
The platform simulates the full 64-team NCAA tournament bracket using two methods:
Analytical (Phylourny algorithm): Computes exact advancement probabilities via a post-order tree traversal. Fast and deterministic — no random sampling needed. Best for expected-point calculations where you want precise values.
Monte Carlo simulation: Runs thousands of independent tournament simulations (default 10,000). Each simulation randomly resolves every game using the model’s pairwise win probabilities. Produces:
Advancement probabilities — fraction of simulations each team reaches each round
Score distributions — histogram of total bracket points across all simulations
Confidence intervals — percentile-based ranges for expected outcomes
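The two approaches can be compared on a toy four-team bracket. This is a sketch, not the platform's Phylourny implementation: it computes only champion probabilities (not every round), the team strengths are hypothetical, and the rule P(a beats b) = s_a / (s_a + s_b) is an assumption chosen for illustration.

```python
import random


def exact_champion_probs(teams, win_prob):
    """Post-order traversal: returns {team: P(team wins this sub-bracket)}."""
    if len(teams) == 1:
        return {teams[0]: 1.0}
    mid = len(teams) // 2
    left = exact_champion_probs(teams[:mid], win_prob)
    right = exact_champion_probs(teams[mid:], win_prob)
    out = {}
    for a, pa in left.items():
        out[a] = pa * sum(pb * win_prob(a, b) for b, pb in right.items())
    for b, pb in right.items():
        out[b] = pb * sum(pa * win_prob(b, a) for a, pa in left.items())
    return out


def simulate_champion(teams, win_prob, rng):
    """Play one random tournament (adjacent teams meet) and return the champion."""
    alive = list(teams)
    while len(alive) > 1:
        alive = [a if rng.random() < win_prob(a, b) else b
                 for a, b in zip(alive[::2], alive[1::2])]
    return alive[0]


# Hypothetical strengths for illustration only.
strength = {"A": 4.0, "B": 1.0, "C": 2.0, "D": 2.0}


def win_prob(a, b):
    return strength[a] / (strength[a] + strength[b])


teams = ["A", "B", "C", "D"]
exact = exact_champion_probs(teams, win_prob)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_sims = 10_000
counts = {t: 0 for t in teams}
for _ in range(n_sims):
    counts[simulate_champion(teams, win_prob, rng)] += 1

for t in teams:
    print(t, round(exact[t], 3), counts[t] / n_sims)
```

The exact column is deterministic; the simulated column fluctuates around it with an error that shrinks as the number of simulations grows.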
Bracket Distribution¶
When you run a Monte Carlo simulation, the platform computes a full score distribution for the “chalk bracket” (picking the pre-game favorite in every matchup). This answers: “If the model’s probabilities are correct, what range of scores should I expect?”
Key statistics shown on the Pool Scorer page:
| Statistic | What It Tells You |
|---|---|
| Median | The score you’d most typically get |
| Mean | Average expected score (weighted by probability) |
| 5th percentile | Worst-case scenario (lots of upsets) |
| 95th percentile | Best-case scenario (mostly chalk) |
| Std Dev | How much scores vary across simulated outcomes |
Expected Points¶
Expected Points (EP) combines advancement probabilities with a scoring rule to answer: “How many bracket points is each team worth?”
$$\text{EP}_i = \sum_{r=0}^{5} P(\text{team } i \text{ wins round } r) \times \text{points}(r)$$
Teams with high EP are valuable picks — they are likely to advance far and those rounds are worth many points. The Bracket Visualizer’s Expected Points table ranks all 64 teams by EP under your chosen scoring rule.
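As a worked instance of the EP sum, the sketch below multiplies a per-round win profile by a point schedule. The advancement probabilities are hypothetical numbers for a strong 1-seed, and the point schedule assumes the Standard doubling format.

```python
STANDARD_POINTS = [1, 2, 4, 8, 16, 32]  # rounds r = 0..5 (R64 through NCG)


def expected_points(win_round_probs, points=STANDARD_POINTS):
    """Sum over rounds of P(team wins round r) * points(r)."""
    return sum(p * pts for p, pts in zip(win_round_probs, points))


# Hypothetical advancement profile, for illustration only.
one_seed = [0.99, 0.85, 0.65, 0.45, 0.30, 0.18]
print(round(expected_points(one_seed), 2))  # 19.45
```

Note how the championship term (0.18 × 32 = 5.76) contributes more than the near-certain first-round win (0.99 × 1).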
Tournament Scoring¶
The platform supports three built-in scoring systems and lets you define custom rules.
Standard Scoring (ESPN-style)¶
The most common pool format. Points double each round:
| Round | Abbrev. | Games | Points | Max Points |
|---|---|---|---|---|
| Round of 64 | R64 | 32 | 1 | 32 |
| Round of 32 | R32 | 16 | 2 | 32 |
| Sweet 16 | S16 | 8 | 4 | 32 |
| Elite Eight | E8 | 4 | 8 | 32 |
| Final Four | F4 | 2 | 16 | 32 |
| Championship | NCG | 1 | 32 | 32 |
| Total | | 63 | | 192 |
Worked example: You correctly pick 20 R64 games, 10 R32 games, 4 S16 games, 2 E8 games, 1 F4 game, and the champion:
20×1 + 10×2 + 4×4 + 2×8 + 1×16 + 1×32 = 20 + 20 + 16 + 16 + 16 + 32 = **120 points**
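The worked example above can be checked with a small scoring helper. The dictionary layout and the `bracket_score` name are illustrative choices, not the platform's API.

```python
STANDARD = {"R64": 1, "R32": 2, "S16": 4, "E8": 8, "F4": 16, "NCG": 32}
GAMES = {"R64": 32, "R32": 16, "S16": 8, "E8": 4, "F4": 2, "NCG": 1}


def bracket_score(correct_picks, points):
    """Total points for a bracket given the number of correct picks per round."""
    return sum(correct_picks[r] * points[r] for r in points)


picks = {"R64": 20, "R32": 10, "S16": 4, "E8": 2, "F4": 1, "NCG": 1}
print(bracket_score(picks, STANDARD))  # 120
print(bracket_score(GAMES, STANDARD))  # 192, a perfect bracket
```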
Fibonacci Scoring¶
Rewards later rounds less steeply than Standard:
| Round | Points |
|---|---|
| R64 | 2 |
| R32 | 3 |
| S16 | 5 |
| E8 | 8 |
| F4 | 13 |
| NCG | 21 |
| Total (perfect) | 231 |
Fibonacci scoring shifts weight toward the early rounds. Picking the champion is worth 21 points (vs. 32 in Standard), and the championship-to-R64 point ratio is 10.5:1 rather than 32:1, so a perfect Round of 64 accounts for 64 of the 231 available points (about 28%, vs. about 17% in Standard).
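The perfect-bracket total and the relative weight of late versus early rounds can be verified directly from the two point schedules. This is a quick arithmetic check; the helper name is illustrative.

```python
GAMES = [32, 16, 8, 4, 2, 1]        # games per round, R64 through NCG
STANDARD = [1, 2, 4, 8, 16, 32]
FIBONACCI = [2, 3, 5, 8, 13, 21]


def perfect_total(points):
    """Score of a bracket with every pick correct."""
    return sum(g * p for g, p in zip(GAMES, points))


print(perfect_total(FIBONACCI))  # 231
print(perfect_total(STANDARD))   # 192

# Championship points relative to a single Round of 64 game.
print(FIBONACCI[-1] / FIBONACCI[0], STANDARD[-1] / STANDARD[0])  # 10.5 32.0
```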
Seed-Difference Bonus¶
Standard base points plus an upset bonus equal to the seed difference when the lower-seeded team wins:
| Round | Base Points | Upset Bonus |
|---|---|---|
| R64 | 1 | + \|seed_winner − seed_loser\| if upset |
| R32 | 2 | + \|seed_winner − seed_loser\| if upset |
| S16 | 4 | + \|seed_winner − seed_loser\| if upset |
| E8 | 8 | + \|seed_winner − seed_loser\| if upset |
| F4 | 16 | + \|seed_winner − seed_loser\| if upset |
| NCG | 32 | + \|seed_winner − seed_loser\| if upset |
Worked example: A 12-seed beats a 5-seed in the R64. You get 1 (base) + 7 (seed diff) = 8 points for that single game — the same as getting an Elite Eight pick right under Standard scoring.
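The per-game rule can be written as a small function. This sketch assumes an upset means the winner carries the larger seed number; `game_points` is an illustrative name, not the platform's API.

```python
BASE = {"R64": 1, "R32": 2, "S16": 4, "E8": 8, "F4": 16, "NCG": 32}


def game_points(round_name, winner_seed, loser_seed):
    """Base points for the round plus the seed difference when the winner is the worse seed."""
    is_upset = winner_seed > loser_seed  # larger seed number = lower-seeded team
    bonus = winner_seed - loser_seed if is_upset else 0
    return BASE[round_name] + bonus


print(game_points("R64", 12, 5))  # 8: base 1 + bonus 7
print(game_points("R64", 5, 12))  # 1: the favorite won, no bonus
```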
Tip
Seed-Difference Bonus rewards contrarian picks. If you are in a large pool where most people pick chalk, this scoring format lets you differentiate by picking well-chosen upsets.
Custom Scoring¶
The Pool Scorer page lets you define custom per-round point values. Check “Use custom scoring” and enter your pool’s specific point schedule. This is useful for pools with non-standard formats (e.g., 1-2-3-5-8-13 or 10-20-40-80-160-320).
Dashboard Guide¶
The dashboard has four main pages organized into two sections.
Lab: Backtest Leaderboard¶
Purpose: Compare all trained models side-by-side.
What you see:
KPI cards at the top showing the best score for each metric across all runs, with delta indicators comparing best vs. worst
Sortable table with every model run’s Log Loss, Brier Score, ROC-AUC, and ECE
Color-coded cells (red-yellow-green gradient) for quick visual comparison
If a year filter is active, metrics are shown for that year only; otherwise, metrics are averaged across all evaluated years
How to use it:
Select a tournament year in the sidebar (or leave unset for aggregate view)
Click any row to navigate to the Model Deep Dive for that run
Lab: Model Deep Dive¶
Purpose: Diagnose a single model’s calibration, accuracy, and feature behavior.
What you see:
Reliability diagram — calibration visualization (see Interpreting Results above)
Year drill-down — select a specific fold year to see how calibration varies over time
Per-year metric summary — table of Log Loss, Brier, AUC, and ECE broken out by year, with gradient coloring
Feature importance (XGBoost only) — horizontal bar chart showing which features contribute most to predictions
Hyperparameters — JSON view of the model’s configuration
How to use it:
Click a model run on the Leaderboard, or select a run in the sidebar
Use the year dropdown to compare calibration across different seasons
For XGBoost models, check feature importance to understand what drives predictions
Presentation: Bracket Visualizer¶
Purpose: Turn model predictions into a visual bracket with advancement probabilities.
What you see:
Most-likely bracket — interactive HTML bracket tree showing predicted winners at every matchup, with win probabilities displayed
Advancement heatmap — color-coded grid showing each team’s probability of reaching each round
Pairwise win probabilities — expandable section where you can pick any two teams and see the head-to-head probability
Expected Points table — all 64 teams ranked by expected points under the selected scoring rule
Score distribution (Monte Carlo only) — histogram of possible bracket scores
How to use it:
Select a model run, tournament year, and scoring format in the sidebar
Choose “Analytical (exact)” for speed or “Monte Carlo” for score distributions
If using Monte Carlo, adjust the number of simulations (more = more precise, but slower)
Use the Expected Points table to identify high-value picks
Expand “Pairwise Win Probabilities” to investigate specific matchups
User-Editable Bracket¶
You can override the model’s picks for any matchup and build your own bracket.
Editing picks:
Expand the Edit Picks section below the bracket visualization
Each game is shown as a selectbox with the two participating teams
Select a different winner to override the model’s prediction
Overridden games show the model’s original pick in a tooltip (“Model: Team X”)
How cascading works:
When you override a game, all downstream matchups automatically update:
The overridden winner advances to the next round
Later-round matchups are re-resolved: if the game has its own override and the overridden winner is still a valid participant, the override is kept; otherwise the model’s prediction (argmax of pairwise probability) is used
The champion, log-likelihood, and bracket tree all reflect your edits
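The cascading rule above can be sketched on a four-team bracket. This is an illustration of the described behaviour, not the dashboard's code: the `(round_index, game_index)` override keys, the function name, and the team strengths are all hypothetical.

```python
def resolve_bracket(teams, win_prob, overrides=None):
    """Resolve a single-elimination bracket round by round.

    `overrides` maps (round_index, game_index) -> team picked by the user.
    Returns the list of winners for each round.
    """
    overrides = overrides or {}
    rounds, alive, rnd = [], list(teams), 0
    while len(alive) > 1:
        winners = []
        for game, (a, b) in enumerate(zip(alive[::2], alive[1::2])):
            pick = overrides.get((rnd, game))
            if pick not in (a, b):  # no override, or the picked team is not in this game
                pick = a if win_prob(a, b) >= 0.5 else b  # fall back to the model's favorite
            winners.append(pick)
        rounds.append(winners)
        alive, rnd = winners, rnd + 1
    return rounds


# Hypothetical strengths; P(a beats b) = s_a / (s_a + s_b).
strength = {"A": 4.0, "B": 1.0, "C": 2.0, "D": 3.0}


def win_prob(a, b):
    return strength[a] / (strength[a] + strength[b])


teams = ["A", "B", "C", "D"]
print(resolve_bracket(teams, win_prob))                               # [['A', 'D'], ['A']]
print(resolve_bracket(teams, win_prob, {(0, 0): "B"}))                # [['B', 'D'], ['D']]
print(resolve_bracket(teams, win_prob, {(0, 0): "B", (1, 0): "A"}))   # [['B', 'D'], ['D']]
```

In the last call the final-round override for A is ignored because A was eliminated by the first-round override, so the model's pick between B and D is used instead.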
Visual distinction:
Overridden games in the bracket tree display a golden border and a “USER” badge, making it easy to see which picks are yours vs. the model’s. An info bar shows the number of overridden picks (e.g., “3 of 63 picks overridden by user”).
Resetting:
Click the Reset to Model Predictions button (visible when overrides exist) to clear all overrides and revert to the model’s most-likely bracket.
When overrides are cleared automatically:
Overrides are invalidated and cleared whenever you change any parameter that affects the underlying bracket probabilities:
Model run (sidebar)
Tournament year (sidebar)
Scoring format (sidebar)
Upset Aggression slider
Seed-Weight slider
An info message confirms when overrides have been reset.
Pool Scorer integration:
When you navigate to the Pool Scorer page with overrides active, the Pool Scorer scores your user-edited bracket (not the model’s) against Monte Carlo simulations. The CSV export also reflects your edits.
Presentation: Pool Scorer¶
Purpose: Score your bracket against thousands of simulated tournament outcomes to understand your expected point distribution.
What you see:
Scoring configuration — choose a built-in rule or define custom per-round points
Outcome summary — median, mean, standard deviation, min/max, and percentile metrics for your bracket’s point total
Score distribution histogram — visual distribution of how your bracket would score across all simulated outcomes
CSV export — download your bracket as a CSV file for submission to your pool
How to use it:
Select a model run and tournament year in the sidebar
Configure your pool’s scoring rules (or use the default Standard scoring)
Adjust the number of Monte Carlo simulations (10,000 is a good default)
Click “Analyze Outcomes” to run the simulation
Review the outcome summary to understand your expected score range
Download the bracket CSV for your pool submission
Game Theory Sliders¶
Two sliders allow you to adjust the bracket strategy without retraining. Both sliders update the bracket visualization, advancement heatmap, expected points table, and pairwise probabilities in real time. The Monte Carlo score distribution (if enabled) is not affected — it represents the model’s original predictions of tournament outcomes.
Upset Aggression: Negative values reinforce favorites (chalk); positive values make upsets more likely (chaos).
Seed-Weight: 0% = pure model predictions; 100% = pure historical seed win rates.
Upset Aggression (range: −5 to +5, default: 0)
Controls whether your bracket picks favor chalk (favorites) or chaos (upsets):
| Setting | Temperature | Effect |
|---|---|---|
| −5 | 0.31 | Extreme chalk: nearly every favorite wins |
| −3 | 0.50 | Strong chalk: favorites heavily reinforced |
| 0 | 1.00 | Neutral: model probabilities unchanged |
| +3 | 2.00 | Strong chaos: probabilities compress toward 50/50 |
| +5 | 3.17 | Extreme chaos: nearly every game is a coin flip |
Mathematically, this applies a power transform to every win probability:
$$p' = \frac{p^{1/T}}{p^{1/T} + (1-p)^{1/T}} \quad \text{where } T = 2^{v/3}$$
Here $v$ is the slider setting and $T$ is the temperature listed in the table above.
A probability of exactly 50% is never moved. Favorites remain favorites — the transform preserves the ranking of all probabilities.
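The transform is short enough to try by hand. The sketch below implements the formula exactly as written, with `slider_value` standing for the Upset Aggression setting; the function names are illustrative.

```python
def temperature(slider_value):
    """T = 2 ** (v / 3): negative settings give T < 1, positive settings T > 1."""
    return 2.0 ** (slider_value / 3.0)


def adjust(p, slider_value):
    """Apply the power transform to a single win probability."""
    inv_t = 1.0 / temperature(slider_value)
    num = p ** inv_t
    return num / (num + (1.0 - p) ** inv_t)


# Print the temperature and the adjusted value of a 70% favorite per setting.
for v in (-5, -3, 0, 3, 5):
    print(v, round(temperature(v), 2), round(adjust(0.70, v), 3))

print(adjust(0.5, 5))  # 0.5: an even matchup is never moved
```

At a setting of −3 the temperature is 0.5, so the transform squares both sides: a 70% favorite becomes 0.49 / (0.49 + 0.09), or about 84%.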
Seed-Weight (range: 0% to 100%, default: 0%)
Blends the model’s predictions with historical seed-vs-seed win rates:
Setting |
Effect |
|---|---|
0% |
Pure model predictions |
25% |
75% model + 25% historical seed performance |
50% |
Equal blend of model and seed history |
100% |
Ignore the model entirely; use historical seed win rates |
This is useful when you believe the tournament seeding committee has information your model doesn’t capture, or when your model makes predictions that diverge significantly from historical seed performance.
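The blend in the table reads as a weighted average, which can be sketched as below. The linear form is an assumption inferred from the "75% model + 25% historical" row; the guide does not give the exact formula, and the input probabilities are hypothetical.

```python
def blend(model_prob, seed_prob, seed_weight):
    """Weighted average; seed_weight 0.0 = pure model, 1.0 = pure seed history."""
    if not 0.0 <= seed_weight <= 1.0:
        raise ValueError("seed_weight must be between 0 and 1")
    return (1.0 - seed_weight) * model_prob + seed_weight * seed_prob


# Hypothetical matchup: the model says 0.55, seed history says 0.65.
print(round(blend(0.55, 0.65, 0.25), 3))  # 0.575
print(blend(0.55, 0.65, 0.0))             # 0.55, pure model
print(blend(0.55, 0.65, 1.0))             # 0.65, pure seed history
```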
Tip
In a small pool (< 10 people), pick chalk — you want the most likely bracket. In a large pool (50+ people), increase Upset Aggression to differentiate your bracket from the crowd.
Troubleshooting¶
Kaggle Authentication¶
Symptom: AuthenticationError: credentials not found when running python sync.py.
Fix:
Ensure you have a Kaggle API token saved at ~/.kaggle/access_token (not the legacy kaggle.json format). The token starts with KGAT_.
Set permissions: chmod 600 ~/.kaggle/access_token.
Verify your Kaggle account has phone verification completed and you have accepted the competition rules for March Machine Learning Mania.
See the README — Kaggle API Authentication for step-by-step setup.
ESPN Rate Limits and Transient Failures¶
Symptom: NetworkError or warnings about failed team fetches during ESPN sync.
Fix:
The ESPN connector automatically retries transient failures with exponential backoff (up to 3 attempts per request). If failures persist:
Check your internet connection.
ESPN’s public API occasionally rate-limits aggressive requests. Wait a few minutes and re-run with python sync.py --source espn --dest data/.
The sync engine logs warnings for teams that fail after all retries but continues processing the remaining teams. A small number of failures is normal and does not invalidate the dataset.
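For readers unfamiliar with the retry pattern mentioned above, here is a generic sketch of exponential backoff with a three-attempt cap. It is not the ESPN connector's code: the function name, the one-second base delay, and the use of `ConnectionError` as the transient failure are all assumptions.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on ConnectionError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)


calls = []


def flaky():
    """Stand-in request that fails twice before succeeding."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient")
    return "ok"


delays = []  # record the waits instead of actually sleeping
print(fetch_with_retry(flaky, sleep=delays.append), delays)  # ok [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the example instant to run and makes the backoff schedule easy to inspect.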
Parquet Version Mismatches¶
Symptom: ArrowInvalid or ArrowIOError when reading cached Parquet files.
Fix:
This can happen when Parquet files were written by a different version of PyArrow than the one currently installed. Delete the cached files and re-sync:
rm -rf data/*.parquet
python sync.py --source all --dest data/
If the error persists, ensure your environment has a consistent PyArrow version:
pip install --upgrade pyarrow
Conda / Poetry Environment Issues¶
Symptom: ModuleNotFoundError: No module named 'ncaa_eval' or Poetry commands fail.
Fix:
Ensure the conda environment is activated: conda activate ncaa_eval.
Install dependencies into the conda env (not a Poetry virtualenv):
POETRY_VIRTUALENVS_CREATE=false poetry install
If poetry is not found, install it inside the conda env:
pip install poetry
Verify the package is importable: python -c "import ncaa_eval; print('OK')".