User Guide

Getting Started

This guide picks up where the README.md (project root) left off. Once you have installed the project, the typical workflow is:

  1. Sync data — download NCAA game results from Kaggle (and optionally ESPN):

    python sync.py --source all --dest data/
    
  2. Train a model — fit a prediction model on historical seasons:

    python -m ncaa_eval.cli train --model elo
    python -m ncaa_eval.cli train --model xgboost
    

    Common options:

    | Flag | Default | Description |
    |------|---------|-------------|
    | --model | (required) | Registered model name (elo, xgboost, or custom) |
    | --start-year | 2015 | First training season (inclusive) |
    | --end-year | 2025 | Last training season (inclusive) |
    | --data-dir | data/ | Path to synced Parquet files |
    | --output-dir | data/ | Where to write run artifacts |
    | --config | None | JSON file overriding model hyperparameters |

  3. Generate predictions — produce win-probability CSVs from a trained model:

    python -m ncaa_eval.cli predict --run-id <run-id> --season 2025
    python -m ncaa_eval.cli predict --run-id <run-id> --season 2025 --output preds.csv
    

    Common options:

    | Flag | Default | Description |
    |------|---------|-------------|
    | --run-id | (required) | Model run ID (from the training step output) |
    | --season | (required) | Target season year |
    | --data-dir | data/ | Path to synced Parquet files |
    | --output | None | Output CSV path (omit to write to stdout) |

    Stateful models (Elo) produce pairwise probabilities for all C(n,2) team combinations. Stateless models (XGBoost, LogisticRegression) produce game-level predictions for every game in the season dataset.

    Output format:

    season,team_a_id,team_b_id,pred_win_prob
    2025,1101,1102,0.6234
    2025,1101,1103,0.4512
    
  4. Explore results in the dashboard — launch the Streamlit app:

    streamlit run dashboard/app.py
    

    The sidebar lets you select a tournament year, model run, and scoring format. All pages update automatically when you change these filters.

  5. Iterate — retrain with different hyperparameters, compare on the Leaderboard, inspect calibration in Model Deep Dive, and use the Bracket Visualizer and Pool Scorer to turn predictions into bracket picks.

Evaluation Metrics

The platform evaluates models on four complementary metrics. Each captures a different aspect of prediction quality.

Log Loss

What it measures: How well predicted probabilities match actual outcomes, with heavy penalties for confident wrong predictions.

Formula:

$$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[y_i \ln(p_i) + (1-y_i) \ln(1-p_i)\bigr]$$

Interpretation:

| Value | Meaning |
|-------|---------|
| 0.0 | Perfect — every prediction was 0% or 100% and correct |
| ~0.50 | Good — a well-calibrated model typically lands here |
| 0.693 | Random baseline (equivalent to predicting 50% for every game) |
| > 0.693 | Worse than guessing — the model is actively misleading |

Tip

Log Loss is the primary ranking metric in the Kaggle March Machine Learning Mania competition. A score of 0.55 means your model is meaningfully better than random but has room to improve.

Warning

Log Loss punishes confident wrong predictions harshly. A single game where you predicted 99% and the other team won adds roughly 4.6 to the summed loss, enough to erase the gains (relative to the random baseline) of about 25 correct predictions made at 60% confidence.
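
A minimal sketch of the per-game term from the formula above, illustrating the penalty described in this warning:

import numpy as np

def game_log_loss(y, p):
    # Per-game contribution: y is the 0/1 outcome, p is the predicted win probability.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(game_log_loss(1, 0.60))  # correct at 60% confidence -> ~0.51
print(game_log_loss(0, 0.50))  # random baseline           -> ~0.693
print(game_log_loss(0, 0.99))  # confident and wrong       -> ~4.6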

Brier Score

What it measures: Mean squared error of probability predictions — a gentler alternative to Log Loss.

Formula:

$$\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2$$

Interpretation:

| Value | Meaning |
|-------|---------|
| 0.0 | Perfect |
| ~0.20 | Good model |
| 0.25 | Random baseline (predicting 50% for every game) |
| > 0.25 | Worse than guessing |

Brier Score is more forgiving of confident wrong predictions than Log Loss because it uses squared error instead of logarithmic error. A 99%-confident wrong prediction adds 0.98 to Brier (vs. 4.6 to Log Loss).
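
The same comparison as a quick sketch, using the per-game squared-error term from the formula above:

def game_brier(y, p):
    # Per-game contribution: y is the 0/1 outcome, p is the predicted win probability.
    return (p - y) ** 2

print(game_brier(0, 0.99))  # confident and wrong -> ~0.98 (vs. ~4.6 under Log Loss)
print(game_brier(0, 0.50))  # random baseline     -> 0.25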

ROC-AUC

What it measures: Discrimination — can the model distinguish winners from losers? Equivalently: if you pick a random winning team and a random losing team, what is the probability the model assigns a higher win probability to the winner?

Formula: Area under the Receiver Operating Characteristic curve.

Interpretation:

| Value | Meaning |
|-------|---------|
| 1.0 | Perfect discrimination |
| ~0.75 | Good model |
| 0.5 | Random — no discrimination ability |
| < 0.5 | Inversely correlated (predicting losers as winners) |

Warning

ROC-AUC does not measure calibration. A model can have perfect AUC (1.0) but terrible calibration — e.g., predicting 99% for every game that the favored team wins and 1% for every upset. Always pair AUC with calibration metrics (ECE, reliability diagrams).

Expected Calibration Error (ECE)

What it measures: How well predicted probabilities correspond to actual win rates. If a model says “70% win probability” for 100 games, about 70 should actually be wins.

Formula:

$$\text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \left| \text{acc}(b) - \text{conf}(b) \right|$$

where predictions are binned into $B$ equal-width bins, $n_b$ is the count in bin $b$, $\text{acc}(b)$ is the observed win rate, and $\text{conf}(b)$ is the mean predicted probability.

Interpretation:

| Value | Meaning |
|-------|---------|
| 0.0 | Perfectly calibrated |
| < 0.03 | Excellent calibration |
| 0.03–0.08 | Reasonable calibration |
| > 0.10 | Poor calibration — predictions are systematically off |

Tip

ECE is the single best number for answering “can I trust these probabilities at face value?” A low ECE means you can use the model’s probability outputs directly for bet sizing, pool strategy, and expected-value calculations.
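
For reference, a minimal implementation of the formula above with equal-width bins (the platform's own binning settings may differ):

import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    # y_true: 0/1 outcomes; y_prob: predicted win probabilities.
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (y_prob >= lo) & (y_prob < hi) if hi < 1.0 else (y_prob >= lo)
        if in_bin.any():
            acc = y_true[in_bin].mean()    # observed win rate in the bin
            conf = y_prob[in_bin].mean()   # mean predicted probability in the bin
            ece += in_bin.mean() * abs(acc - conf)
    return ece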

Model Types

NCAA_eval supports two model paradigms through a common abstract base class.

Stateful Models

Stateful models maintain internal ratings that update game-by-game through a season. They process games sequentially — the order matters.

Reference implementation: Elo (elo)

The built-in Elo model tracks a rating for every team. After each game, ratings shift based on the outcome and margin of victory:

  • Winners gain rating points; losers lose them

  • Upset victories produce larger rating swings

  • Ratings mean-revert between seasons (configurable fraction)

  • Separate K-factors for early-season, regular-season, and tournament games

Best for: Capturing in-season trajectory and momentum. Elo is simple, interpretable, and requires no feature engineering.
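
For intuition, here is the core update in textbook Elo form. The built-in model additionally scales updates by margin of victory and uses the K-factor schedule listed below, so treat this as an illustration rather than the exact implementation:

def elo_expected(rating_a, rating_b):
    # Expected score (win probability) for team A under the standard Elo logistic curve.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, a_won, k=38.0):
    # Winners gain what losers lose; bigger surprises produce bigger swings.
    delta = k * ((1.0 if a_won else 0.0) - elo_expected(rating_a, rating_b))
    return rating_a + delta, rating_b - delta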

Key hyperparameters:

| Parameter | Default | Description |
|-----------|---------|-------------|
| initial_rating | 1500 | Starting rating for new teams |
| k_early | 56.0 | K-factor for the first 20 games |
| k_regular | 38.0 | K-factor for regular-season games |
| k_tournament | 47.5 | K-factor for tournament games |
| mean_reversion_fraction | 0.25 | Fraction pulled toward mean between seasons |

Tip

You can override any hyperparameter via a JSON config file: python -m ncaa_eval.cli train --model elo --config my_elo_config.json
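
For example, a hypothetical my_elo_config.json that softens in-season updates and reduces between-season mean reversion (the keys are assumed to map one-to-one onto the hyperparameter names above):

{
  "k_regular": 30.0,
  "k_tournament": 40.0,
  "mean_reversion_fraction": 0.1
}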

Stateless Models

Stateless models are standard batch-trained classifiers. They take a feature matrix as input and produce probability predictions — game order does not matter.

Reference implementation: XGBoost (xgboost)

The built-in XGBoost model uses gradient-boosted decision trees trained on feature snapshots. Features are computed by the feature engineering pipeline (Epic 4) and include team statistics, strength-of-schedule metrics, and graph-based centrality measures.

Best for: Combining many feature dimensions for maximum predictive accuracy. XGBoost typically outperforms Elo when strong features are available.

Key hyperparameters:

| Parameter | Default | Description |
|-----------|---------|-------------|
| n_estimators | 500 | Maximum number of boosting rounds |
| max_depth | 5 | Maximum tree depth |
| learning_rate | 0.05 | Step size shrinkage |
| early_stopping_rounds | 50 | Stop if validation loss doesn't improve |

Plugin Registry

Models register themselves via the @register_model("name") decorator. To create a custom model:

  1. Subclass Model (stateless) or StatefulModel (stateful)

  2. Implement the required methods:

    • Stateless (Model): fit, predict_proba, save, load, get_config

    • Stateful (StatefulModel): _predict_one, update, start_season, get_state, set_state, save, load, get_config. fit and predict_proba are provided by the template (_predict_one is the per-pair hook that predict_proba calls).

  3. Decorate with @register_model("my_model")

  4. Import the module before training so the decorator fires

The CLI discovers all registered models automatically:

# List available models
python -c "from ncaa_eval.model import list_models; print(list_models())"

For implementation details, see the API Reference.
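
Putting those steps together, a minimal stateless-model sketch. It assumes Model and register_model are exported from ncaa_eval.model alongside list_models; the method signatures shown are illustrative, so check the API Reference for the real ones:

# my_model.py (hypothetical plugin module)
from ncaa_eval.model import Model, register_model

@register_model("my_model")
class MyModel(Model):
    def fit(self, X, y):
        ...  # train on the feature matrix

    def predict_proba(self, X):
        ...  # return a win probability per row

    def save(self, path):
        ...

    def load(self, path):
        ...

    def get_config(self):
        return {}  # hyperparameters recorded with the run

Once the module is imported (step 4), my_model appears in list_models() and can be passed to the train command via --model my_model.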

Interpreting Results

Reliability Diagrams

A reliability diagram is the visual counterpart to ECE. It plots predicted probabilities against observed win rates:

  • X-axis: Predicted probability (grouped into bins, e.g., 0–10%, 10–20%, …)

  • Y-axis: Actual win rate within each bin

  • Perfect calibration line: The 45° diagonal — if your model says 70%, 70% of those games should be wins

How to read the diagram:

| Pattern | Meaning | Action |
|---------|---------|--------|
| Points on the diagonal | Well-calibrated | No action needed |
| Points above the diagonal | Under-confident — actual win rates exceed predictions | Model could be sharper |
| Points below the diagonal | Over-confident — predictions overstate win likelihood | Model needs calibration |
| S-shaped curve | Probabilities are too moderate at both ends (under-confident at the extremes) | Retrain with calibration regularization; use a negative Upset Aggression slider value (chalk mode) to sharpen predictions |
| Flat line near 0.5 | Model lacks discrimination | Improve features or model architecture |

Tip

The Model Deep Dive page in the dashboard shows reliability diagrams with per-year drill-down. Compare diagrams across years to check whether calibration is stable or drifts.
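
To produce a similar plot outside the dashboard, here is a generic sketch using scikit-learn's calibration_curve. The data below is synthetic; substitute your own outcomes and predicted win probabilities (for example, joined from preds.csv):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(0.0, 1.0, 500)   # predicted win probabilities
y_true = rng.random(500) < y_prob     # synthetic, well-calibrated outcomes

obs_rate, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot([0, 1], [0, 1], linestyle="--", label="Perfect calibration")
plt.plot(mean_pred, obs_rate, marker="o", label="Model")
plt.xlabel("Predicted probability")
plt.ylabel("Observed win rate")
plt.legend()
plt.show()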

Calibration in Plain Language

A well-calibrated model is one you can take at face value. When it says “Duke has a 65% chance of beating UNC,” that means that in a large sample of similar matchups, Duke would win about 65% of the time.

Why calibration matters for bracket pools:

  • Pool strategy depends on knowing how likely outcomes are, not just which team is favored

  • Expected-point calculations multiply advancement probabilities by scoring weights — if probabilities are wrong, the strategy is wrong

  • An over-confident model will undercount upsets, leading you to pick too much chalk

Over-Confidence vs. Under-Confidence

| Type | Symptom | Reliability Diagram | Impact on Brackets |
|------|---------|---------------------|--------------------|
| Over-confident | Predictions are too extreme (90% when reality is 70%) | Points below diagonal at high probabilities | Too much chalk; undervalues upsets |
| Under-confident | Predictions are too moderate (55% when reality is 70%) | Points above diagonal at high probabilities | Picks too many upsets; misses value in favorites |
| Well-calibrated | Predictions match reality | Points on or near diagonal | Bracket strategy reflects true odds |

Tournament Simulation

Monte Carlo Methodology

The platform simulates the full 64-team NCAA tournament bracket using two methods:

Analytical (Phylourny algorithm): Computes exact advancement probabilities via a post-order tree traversal. Fast and deterministic — no random sampling needed. Best for expected-point calculations where you want precise values.

Monte Carlo simulation: Runs thousands of independent tournament simulations (default 10,000). Each simulation randomly resolves every game using the model’s pairwise win probabilities. Produces:

  • Advancement probabilities — fraction of simulations each team reaches each round

  • Score distributions — histogram of total bracket points across all simulations

  • Confidence intervals — percentile-based ranges for expected outcomes
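
As an illustration of the resolution step inside each simulation (team IDs and probability echo the prediction CSV example above; the rest is made up):

import numpy as np

rng = np.random.default_rng()

def simulate_game(team_a, team_b, p_a_wins):
    # Sample the winner from the model's pairwise win probability.
    return team_a if rng.random() < p_a_wins else team_b

winner = simulate_game(1101, 1102, 0.6234)
# One simulated tournament repeats this for all 63 games; advancement probabilities are
# the fraction of simulations in which each team reaches each round.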

Bracket Distribution

When you run a Monte Carlo simulation, the platform computes a full score distribution for the “chalk bracket” (picking the pre-game favorite in every matchup). This answers: “If the model’s probabilities are correct, what range of scores should I expect?”

Key statistics shown on the Pool Scorer page:

| Statistic | What It Tells You |
|-----------|-------------------|
| Median | The score you’d most typically get |
| Mean | Average expected score (weighted by probability) |
| 5th percentile | Worst-case scenario (lots of upsets) |
| 95th percentile | Best-case scenario (mostly chalk) |
| Std Dev | How much scores vary across simulated outcomes |

Expected Points

Expected Points (EP) combines advancement probabilities with a scoring rule to answer: “How many bracket points is each team worth?”

$$\text{EP}_i = \sum_{r=0}^{5} P(\text{team } i \text{ wins round } r) \times \text{points}(r)$$

Teams with high EP are valuable picks — they are likely to advance far and those rounds are worth many points. The Bracket Visualizer’s Expected Points table ranks all 64 teams by EP under your chosen scoring rule.
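
For example, with made-up advancement probabilities and the Standard per-round points described in the next section:

advance_prob = [0.95, 0.80, 0.60, 0.40, 0.25, 0.15]  # hypothetical P(team wins round r), r = 0..5
points = [1, 2, 4, 8, 16, 32]                         # Standard scoring per round

expected_points = sum(p * pts for p, pts in zip(advance_prob, points))
print(expected_points)  # 16.95 expected bracket points from this team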

Tournament Scoring

The platform supports three built-in scoring systems and lets you define custom rules.

Standard Scoring (ESPN-style)

The most common pool format. Points double each round:

| Round | Abbrev. | Games | Points | Max Points |
|-------|---------|-------|--------|------------|
| Round of 64 | R64 | 32 | 1 | 32 |
| Round of 32 | R32 | 16 | 2 | 32 |
| Sweet 16 | S16 | 8 | 4 | 32 |
| Elite Eight | E8 | 4 | 8 | 32 |
| Final Four | F4 | 2 | 16 | 32 |
| Championship | NCG | 1 | 32 | 32 |
| Total | | 63 | | 192 |

Worked example: You correctly pick 20 R64 games, 10 R32 games, 4 S16 games, 2 E8 games, 1 F4 game, and the champion:

20×1 + 10×2 + 4×4 + 2×8 + 1×16 + 1×32 = 20 + 20 + 16 + 16 + 16 + 32 = **120 points**

Fibonacci Scoring

Rewards later rounds more steeply than Standard:

| Round | Points |
|-------|--------|
| R64 | 2 |
| R32 | 3 |
| S16 | 5 |
| E8 | 8 |
| F4 | 13 |
| NCG | 21 |
| Total (perfect) | 231 |

Fibonacci scoring gives more credit for getting later rounds right. Picking the champion is worth 21 points (vs. 32 in Standard), but the ratio of late-round-to-early-round points is higher.

Seed-Difference Bonus

Standard base points plus an upset bonus equal to the seed difference when the lower-seeded team wins:

| Round | Base Points | Upset Bonus |
|-------|-------------|-------------|
| R64 | 1 | + abs(seed_winner - seed_loser) if upset |
| R32 | 2 | + abs(seed_winner - seed_loser) if upset |
| S16 | 4 | + abs(seed_winner - seed_loser) if upset |
| E8 | 8 | + abs(seed_winner - seed_loser) if upset |
| F4 | 16 | + abs(seed_winner - seed_loser) if upset |
| NCG | 32 | + abs(seed_winner - seed_loser) if upset |

Worked example: A 12-seed beats a 5-seed in the R64. You get 1 (base) + 7 (seed diff) = 8 points for that single game — the same as getting an Elite Eight pick right under Standard scoring.

Tip

Seed-Difference Bonus rewards contrarian picks. If you are in a large pool where most people pick chalk, this scoring format lets you differentiate by picking well-chosen upsets.

Custom Scoring

The Pool Scorer page lets you define custom per-round point values. Check “Use custom scoring” and enter your pool’s specific point schedule. This is useful for pools with non-standard formats (e.g., 1-2-3-5-8-13 or 10-20-40-80-160-320).

Dashboard Guide

The dashboard has four main pages organized into two sections.

Lab: Backtest Leaderboard

Purpose: Compare all trained models side-by-side.

What you see:

  • KPI cards at the top showing the best score for each metric across all runs, with delta indicators comparing best vs. worst

  • Sortable table with every model run’s Log Loss, Brier Score, ROC-AUC, and ECE

  • Color-coded cells (red-yellow-green gradient) for quick visual comparison

  • If a year filter is active, metrics are shown for that year only; otherwise, metrics are averaged across all evaluated years

How to use it:

  1. Select a tournament year in the sidebar (or leave unset for aggregate view)

  2. Click any row to navigate to the Model Deep Dive for that run

Lab: Model Deep Dive

Purpose: Diagnose a single model’s calibration, accuracy, and feature behavior.

What you see:

  • Reliability diagram — calibration visualization (see Interpreting Results above)

  • Year drill-down — select a specific fold year to see how calibration varies over time

  • Per-year metric summary — table of Log Loss, Brier, AUC, and ECE broken out by year, with gradient coloring

  • Feature importance (XGBoost only) — horizontal bar chart showing which features contribute most to predictions

  • Hyperparameters — JSON view of the model’s configuration

How to use it:

  1. Click a model run on the Leaderboard, or select a run in the sidebar

  2. Use the year dropdown to compare calibration across different seasons

  3. For XGBoost models, check feature importance to understand what drives predictions

Presentation: Bracket Visualizer

Purpose: Turn model predictions into a visual bracket with advancement probabilities.

What you see:

  • Most-likely bracket — interactive HTML bracket tree showing predicted winners at every matchup, with win probabilities displayed

  • Advancement heatmap — color-coded grid showing each team’s probability of reaching each round

  • Pairwise win probabilities — expandable section where you can pick any two teams and see the head-to-head probability

  • Expected Points table — all 64 teams ranked by expected points under the selected scoring rule

  • Score distribution (Monte Carlo only) — histogram of possible bracket scores

How to use it:

  1. Select a model run, tournament year, and scoring format in the sidebar

  2. Choose “Analytical (exact)” for speed or “Monte Carlo” for score distributions

  3. If using Monte Carlo, adjust the number of simulations (more = more precise, but slower)

  4. Use the Expected Points table to identify high-value picks

  5. Expand “Pairwise Win Probabilities” to investigate specific matchups

User-Editable Bracket

You can override the model’s picks for any matchup and build your own bracket.

Editing picks:

  1. Expand the Edit Picks section below the bracket visualization

  2. Each game is shown as a selectbox with the two participating teams

  3. Select a different winner to override the model’s prediction

  4. Overridden games show the model’s original pick in a tooltip (“Model: Team X”)

How cascading works:

When you override a game, all downstream matchups automatically update:

  • The overridden winner advances to the next round

  • Later-round matchups are re-resolved: if the game has its own override and the overridden winner is still a valid participant, the override is kept; otherwise the model’s prediction (argmax of pairwise probability) is used

  • The champion, log-likelihood, and bracket tree all reflect your edits

Visual distinction:

Overridden games in the bracket tree display a golden border and a “USER” badge, making it easy to see which picks are yours vs. the model’s. An info bar shows the number of overridden picks (e.g., “3 of 63 picks overridden by user”).

Resetting:

Click the Reset to Model Predictions button (visible when overrides exist) to clear all overrides and revert to the model’s most-likely bracket.

When overrides are cleared automatically:

Overrides are invalidated and cleared whenever you change any parameter that affects the underlying bracket probabilities:

  • Model run (sidebar)

  • Tournament year (sidebar)

  • Scoring format (sidebar)

  • Upset Aggression slider

  • Seed-Weight slider

An info message confirms when overrides have been reset.

Pool Scorer integration:

When you navigate to the Pool Scorer page with overrides active, the Pool Scorer scores your user-edited bracket (not the model’s) against Monte Carlo simulations. The CSV export also reflects your edits.

Presentation: Pool Scorer

Purpose: Score your bracket against thousands of simulated tournament outcomes to understand your expected point distribution.

What you see:

  • Scoring configuration — choose a built-in rule or define custom per-round points

  • Outcome summary — median, mean, standard deviation, min/max, and percentile metrics for your bracket’s point total

  • Score distribution histogram — visual distribution of how your bracket would score across all simulated outcomes

  • CSV export — download your bracket as a CSV file for submission to your pool

How to use it:

  1. Select a model run and tournament year in the sidebar

  2. Configure your pool’s scoring rules (or use the default Standard scoring)

  3. Adjust the number of Monte Carlo simulations (10,000 is a good default)

  4. Click “Analyze Outcomes” to run the simulation

  5. Review the outcome summary to understand your expected score range

  6. Download the bracket CSV for your pool submission

Game Theory Sliders

Two sliders allow you to adjust the bracket strategy without retraining. Both sliders update the bracket visualization, advancement heatmap, expected points table, and pairwise probabilities in real time. The Monte Carlo score distribution (if enabled) is not affected — it represents the model’s original predictions of tournament outcomes.

  • Upset Aggression: Negative values reinforce favorites (chalk); positive values make upsets more likely (chaos).

  • Seed-Weight: 0% = pure model predictions; 100% = pure historical seed win rates.

Upset Aggression (range: −5 to +5, default: 0)

Controls whether your bracket picks favor chalk (favorites) or chaos (upsets):

| Setting | Temperature | Effect |
|---------|-------------|--------|
| −5 | 0.31 | Extreme chalk — nearly every favorite wins |
| −3 | 0.50 | Strong chalk — favorites heavily reinforced |
| 0 | 1.00 | Neutral — model probabilities unchanged |
| +3 | 2.00 | Strong chaos — probabilities compress toward 50/50 |
| +5 | 3.17 | Extreme chaos — nearly every game is a coin flip |

Mathematically, this applies a power transform to every win probability:

$$p' = \frac{p^{1/T}}{p^{1/T} + (1-p)^{1/T}} \quad \text{where } T = 2^{v/3}$$

A probability of exactly 50% is never moved. Favorites remain favorites — the transform preserves the ranking of all probabilities.
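
A quick sketch of the transform, reproducing the behaviour shown in the table above:

def adjust_win_prob(p, v):
    # v < 0 sharpens toward the favorite (chalk); v > 0 compresses toward 50/50 (chaos).
    t = 2 ** (v / 3)                  # temperature
    a = p ** (1 / t)
    b = (1 - p) ** (1 / t)
    return a / (a + b)

print(adjust_win_prob(0.75, -3))  # ~0.90: favorite reinforced
print(adjust_win_prob(0.75, 0))   # 0.75: unchanged
print(adjust_win_prob(0.75, 3))   # ~0.63: pulled toward a coin flip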

Seed-Weight (range: 0% to 100%, default: 0%)

Blends the model’s predictions with historical seed-vs-seed win rates:

| Setting | Effect |
|---------|--------|
| 0% | Pure model predictions |
| 25% | 75% model + 25% historical seed performance |
| 50% | Equal blend of model and seed history |
| 100% | Ignore the model entirely; use historical seed win rates |

This is useful when you believe the tournament seeding committee has information your model doesn’t capture, or when your model makes predictions that diverge significantly from historical seed performance.
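
The blend implied by the table is a simple linear interpolation. A sketch with hypothetical probabilities:

def blend_with_seed_history(p_model, p_seed, weight):
    # weight = 0.0 keeps the model's probability; weight = 1.0 uses the historical seed rate.
    return (1 - weight) * p_model + weight * p_seed

print(blend_with_seed_history(0.80, 0.64, 0.25))  # 0.76: 75% model + 25% seed history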

Tip

In a small pool (< 10 people), pick chalk — you want the most likely bracket. In a large pool (50+ people), increase Upset Aggression to differentiate your bracket from the crowd.

Troubleshooting

Kaggle Authentication

Symptom: AuthenticationError: credentials not found when running python sync.py.

Fix:

  1. Ensure you have a Kaggle API token saved at ~/.kaggle/access_token (not the legacy kaggle.json format). The token starts with KGAT_.

  2. Set permissions: chmod 600 ~/.kaggle/access_token.

  3. Verify your Kaggle account has phone verification completed and you have accepted the competition rules for March Machine Learning Mania.

See the README — Kaggle API Authentication for step-by-step setup.

ESPN Rate Limits and Transient Failures

Symptom: NetworkError or warnings about failed team fetches during ESPN sync.

Fix:

The ESPN connector automatically retries transient failures with exponential backoff (up to 3 attempts per request). If failures persist:

  • Check your internet connection.

  • ESPN’s public API occasionally rate-limits aggressive requests. Wait a few minutes and re-run with python sync.py --source espn --dest data/.

  • The sync engine logs warnings for teams that fail after all retries but continues processing the remaining teams. A small number of failures is normal and does not invalidate the dataset.

Parquet Version Mismatches

Symptom: ArrowInvalid or ArrowIOError when reading cached Parquet files.

Fix:

This can happen when Parquet files were written by a different version of PyArrow than the one currently installed. Delete the cached files and re-sync:

rm -rf data/*.parquet
python sync.py --source all --dest data/

If the error persists, ensure your environment has a consistent PyArrow version:

pip install --upgrade pyarrow

Conda / Poetry Environment Issues

Symptom: ModuleNotFoundError: No module named 'ncaa_eval' or Poetry commands fail.

Fix:

  1. Ensure the conda environment is activated: conda activate ncaa_eval.

  2. Install dependencies into the conda env (not a Poetry virtualenv):

    POETRY_VIRTUALENVS_CREATE=false poetry install
    
  3. If poetry is not found, install it inside the conda env:

    pip install poetry
    
  4. Verify the package is importable: python -c "import ncaa_eval; print('OK')".