ncaa_eval.transform package

Module contents

Feature engineering and data transformation module.

class ncaa_eval.transform.BatchRatingSolver(*, margin_cap: int = 25, ridge_lambda: float = 20.0, srs_max_iter: int = 10000)[source]

Bases: object

Batch rating solver that produces full-season opponent-adjusted ratings.

All solvers accept a pre-loaded DataFrame of compact regular-season games (caller must filter to is_tournament == False before passing).

Parameters:
  • margin_cap – Maximum point margin applied per game (default 25).

  • ridge_lambda – Regularization strength for Ridge solver (default 20.0).

  • srs_max_iter – Maximum iterations for SRS fixed-point convergence (default 10,000).

compute_colley(games_df: DataFrame) DataFrame[source]

Compute Colley Matrix ratings (win/loss only, no margin).

Parameters:

games_df – DataFrame with columns w_team_id, l_team_id (regular-season games only; scores not used).

Returns:

DataFrame with columns ["team_id", "colley_rating"] (bounded [0, 1]).

compute_ridge(games_df: DataFrame) DataFrame[source]

Compute Ridge regression ratings (regularized SRS).

Parameters:

games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score (regular-season games only).

Returns:

DataFrame with columns ["team_id", "ridge_rating"].

compute_srs(games_df: DataFrame) DataFrame[source]

Compute SRS (Simple Rating System) ratings via fixed-point iteration.

Parameters:

games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score (regular-season games only).

Returns:

DataFrame with columns ["team_id", "srs_rating"] (zero-centered).
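
The fixed-point recurrence behind compute_srs() can be sketched in pure Python. This is a toy illustration only, not the BatchRatingSolver API: margin capping and the DataFrame interface are omitted, and srs_fixed_point is a hypothetical helper. Each team's rating is its average game margin plus the average rating of its opponents, iterated to convergence and then zero-centered:

```python
def srs_fixed_point(games, max_iter=10000, tol=1e-9):
    """games: list of (winner_id, loser_id, margin) tuples."""
    teams = {t for g in games for t in g[:2]}
    margins = {t: [] for t in teams}
    opponents = {t: [] for t in teams}
    for w, l, m in games:
        margins[w].append(m)
        opponents[w].append(l)
        margins[l].append(-m)
        opponents[l].append(w)
    avg_margin = {t: sum(margins[t]) / len(margins[t]) for t in teams}

    ratings = {t: 0.0 for t in teams}
    for _ in range(max_iter):
        # rating_t = own avg margin + mean rating of opponents faced
        new = {
            t: avg_margin[t]
            + sum(ratings[o] for o in opponents[t]) / len(opponents[t])
            for t in teams
        }
        if max(abs(new[t] - ratings[t]) for t in teams) < tol:
            ratings = new
            break
        ratings = new

    # Zero-center, matching the documented output.
    mean = sum(ratings.values()) / len(ratings)
    return {t: r - mean for t, r in ratings.items()}
```

With three teams where 1 beats 2 by 10, 2 beats 3 by 10, and 1 beats 3 by 20, the ratings converge to roughly +10, 0, and −10.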

class ncaa_eval.transform.Calibrator(*args, **kwargs)[source]

Bases: Protocol

Protocol for probability calibration transforms.

Both IsotonicCalibrator and SigmoidCalibrator structurally satisfy this protocol.

fit(y_true: ndarray[tuple[Any, ...], dtype[float64]], y_prob: ndarray[tuple[Any, ...], dtype[float64]]) None[source]

Fit the calibrator on observed labels and predicted probabilities.

transform(y_prob: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[float64]][source]

Transform raw probabilities into calibrated probabilities.

class ncaa_eval.transform.ChronologicalDataServer(repository: Repository)[source]

Bases: object

Serves game data in strict chronological order for walk-forward modeling.

Wraps a Repository and enforces temporal boundaries so that callers cannot accidentally access future game data during walk-forward validation.

Parameters:

repository – The data store from which games are retrieved.

Example:

from ncaa_eval.ingest.repository import ParquetRepository
from ncaa_eval.transform.serving import ChronologicalDataServer

repo = ParquetRepository(Path("data/"))
server = ChronologicalDataServer(repo)
season = server.get_chronological_season(2023)
for daily_batch in server.iter_games_by_date(2023):
    process(daily_batch)
get_chronological_season(year: int, cutoff_date: date | None = None) SeasonGames[source]

Return all games for year sorted ascending by (date, game_id).

Applies optional temporal cutoff so callers cannot retrieve games that had not yet been played as of a given date. This is the primary leakage-prevention mechanism for walk-forward model training.

Parameters:
  • year – Season year (e.g., 2023 for the 2022-23 season).

  • cutoff_date – If provided, only games on or before this date are returned. Must not be in the future.

Returns:

SeasonGames with games sorted by (date, game_id) and the has_tournament flag reflecting known tournament cancellations.

Raises:

ValueError – If cutoff_date is strictly after today’s date.
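
The cutoff semantics reduce to a filter plus a stable sort. The helper below is a hypothetical stand-in that sketches those semantics on plain (date, game_id) pairs, not the actual implementation:

```python
from datetime import date

def chronological_cutoff(games, cutoff_date=None):
    """games: iterable of (game_date, game_id) pairs."""
    # Reject future cutoffs, mirroring the documented ValueError.
    if cutoff_date is not None and cutoff_date > date.today():
        raise ValueError("cutoff_date must not be in the future")
    # Keep only games on or before the cutoff, sorted ascending
    # by (date, game_id).
    kept = [g for g in games if cutoff_date is None or g[0] <= cutoff_date]
    return sorted(kept)
```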

iter_games_by_date(year: int, cutoff_date: date | None = None) Iterator[list[Game]][source]

Yield batches of games grouped by calendar date, in chronological order.

Each yielded list contains all games played on a single calendar date. Dates with no games are skipped. Applies the same cutoff_date semantics as get_chronological_season().

Parameters:
  • year – Season year.

  • cutoff_date – Optional temporal cutoff (must not be in the future).

Yields:

Non-empty list[Game] for each calendar date, in ascending order.

class ncaa_eval.transform.ConferenceLookup(lookup: dict[tuple[int, int], str])[source]

Bases: object

Lookup table for team conference membership by (season, team_id).

Wraps MTeamConferences.csv into a dict-backed structure.

Parameters:

lookup – Mapping of (season, team_id) → conf_abbrev.

classmethod from_csv(path: Path) ConferenceLookup[source]

Construct from MTeamConferences.csv.

Columns required: Season, TeamID, ConfAbbrev.

Parameters:

path – Path to MTeamConferences.csv.

Returns:

Initialised ConferenceLookup.

get(season: int, team_id: int) str | None[source]

Return the conference abbreviation for (season, team_id), or None.

Parameters:
  • season – Season year.

  • team_id – Canonical Kaggle TeamID.

Returns:

Conference abbreviation string, or None if not found.
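
A minimal sketch of the dict-backed structure that from_csv() builds. The inline CSV is a toy stand-in for MTeamConferences.csv and its values are illustrative:

```python
import csv
import io

# Toy stand-in for MTeamConferences.csv.
csv_text = "Season,TeamID,ConfAbbrev\n2023,1101,acc\n2023,1102,big_ten\n"

# Keyed by (season, team_id), exactly as the lookup parameter describes.
lookup = {
    (int(row["Season"]), int(row["TeamID"])): row["ConfAbbrev"]
    for row in csv.DictReader(io.StringIO(csv_text))
}
```

A miss returns None via dict.get, matching the documented get() behaviour.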

class ncaa_eval.transform.CoverageGateResult(primary_systems: tuple[str, ...], fallback_used: bool, fallback_reason: str, recommended_systems: tuple[str, ...])[source]

Bases: object

Result of the Massey Ordinals coverage gate check.

primary_systems

The four primary composite systems (SAG, POM, MOR, WLK).

Type:

tuple[str, …]

fallback_used

True when SAG or WLK is missing for one or more seasons 2003–2026 and the fallback composite is recommended.

Type:

bool

fallback_reason

Human-readable description of why the fallback was triggered (empty string when fallback_used=False).

Type:

str

recommended_systems

The system names the caller should use for composite computation — either the primary composite or the confirmed-full-coverage fallback (MOR, POM, DOL).

Type:

tuple[str, …]

fallback_reason: str
fallback_used: bool
primary_systems: tuple[str, ...]
recommended_systems: tuple[str, ...]
class ncaa_eval.transform.DetailedResultsLoader(df: DataFrame)[source]

Bases: object

Loads detailed box-score results and provides per-team game views.

Reads MRegularSeasonDetailedResults.csv and MNCAATourneyDetailedResults.csv into a combined long-format DataFrame with one row per (team, game).

Box-score stats are only available from the 2003 season onwards. Pre-2003 seasons return empty DataFrames from get_team_season().

classmethod from_csvs(regular_path: Path, tourney_path: Path) DetailedResultsLoader[source]

Construct a loader from the two Kaggle detailed-results CSV paths.

Parameters:
  • regular_path – Path to MRegularSeasonDetailedResults.csv.

  • tourney_path – Path to MNCAATourneyDetailedResults.csv.

Returns:

DetailedResultsLoader instance with combined data.

get_season_long_format(season: int) DataFrame[source]

Return all games for a season in long format.

Parameters:

season – Season year (e.g., 2023).

Returns:

DataFrame sorted by (day_num, team_id), reset index.

get_team_season(team_id: int, season: int) DataFrame[source]

Return all games for one team in one season, sorted by day_num.

Parameters:
  • team_id – Canonical Kaggle TeamID integer.

  • season – Season year (e.g., 2023).

Returns:

DataFrame sorted by day_num ascending, reset index. Returns empty DataFrame if team or season not found.

class ncaa_eval.transform.EloConfig(initial_rating: float = 1500.0, k_early: float = 56.0, k_regular: float = 38.0, k_tournament: float = 47.5, early_game_threshold: int = 20, margin_exponent: float = 0.85, max_margin: int = 25, home_advantage_elo: float = 3.5, mean_reversion_fraction: float = 0.25)[source]

Bases: object

Frozen configuration for the Elo feature engine.

All K-factor, margin scaling, home-court, and mean-reversion parameters are configurable with sensible defaults matching the Silver/SBCB model.

early_game_threshold: int = 20
home_advantage_elo: float = 3.5
initial_rating: float = 1500.0
k_early: float = 56.0
k_regular: float = 38.0
k_tournament: float = 47.5
margin_exponent: float = 0.85
max_margin: int = 25
mean_reversion_fraction: float = 0.25
class ncaa_eval.transform.EloFeatureEngine(config: EloConfig, conference_lookup: ConferenceLookup | None = None)[source]

Bases: object

Game-by-game Elo rating engine.

Parameters:
  • config – Frozen Elo configuration.

  • conference_lookup – Optional conference lookup for season mean-reversion. When None, mean-reversion falls back to global mean.

apply_season_mean_reversion(season: int) None[source]

Regress each team toward its conference mean (or global mean).

Groups all rated teams by conference via ConferenceLookup, computes each conference’s mean rating, then shifts every team’s rating a fraction mean_reversion_fraction of the way toward its conference mean. Teams with no conference entry fall back to the global mean; when no ConferenceLookup is provided, all teams use the global mean. The call is a no-op when no prior ratings exist.

static expected_score(rating_a: float, rating_b: float) float[source]

Logistic expected score for team A against team B.

expected = 1 / (1 + 10^((r_b − r_a) / 400))

get_all_ratings() dict[int, float][source]

Return a copy of the current ratings dict.

get_game_counts() dict[int, int][source]

Return a copy of the current game-counts dict.

get_rating(team_id: int) float[source]

Return current Elo rating for team_id (initial_rating if unseen).

has_ratings() bool[source]

Return True if at least one team has a rating.

predict_matchup(team_a_id: int, team_b_id: int) float[source]

Return P(team_a wins) using the Elo expected-score formula.

process_season(games: list[Game], season: int) pd.DataFrame[source]

Process all games for a season, returning before-ratings per game.

Calls start_new_season(season) if prior ratings exist (i.e., this is not the very first season).

Parameters:
  • games – Games sorted in chronological order.

  • season – Season year.

Returns:

DataFrame with columns [game_id, elo_w_before, elo_l_before].

reset_game_counts() None[source]

Reset per-team game counts for a new season (affects variable K).

set_game_counts(counts: dict[int, int]) None[source]

Replace all game counts with counts.

set_ratings(ratings: dict[int, float]) None[source]

Replace all ratings with ratings.

start_new_season(season: int) None[source]

Orchestrate season transition: mean-reversion then reset counts.

update_game(w_team_id: int, l_team_id: int, w_score: int, l_score: int, loc: str, is_tournament: bool, *, num_ot: int = 0) tuple[float, float][source]

Process one game and update ratings.

Snapshots before-ratings for feature use, applies home-court effective-rating adjustment to expected-score computation, computes the margin-of-victory multiplier and variable K-factor, then mutates internal rating state for both teams.

Parameters:
  • w_team_id – Winner team ID.

  • l_team_id – Loser team ID.

  • w_score – Winner final score (raw).

  • l_score – Loser final score (raw).

  • loc – "H" (winner home), "A" (winner away), "N" (neutral).

  • is_tournament – Whether this is a tournament game.

  • num_ot – Number of overtime periods (used for margin rescaling).

Returns:

Tuple of (elo_w_before, elo_l_before) — the winner’s and loser’s ratings before this game’s update, suitable for use as walk-forward feature values.
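
The expected-score formula and the core rating exchange are easy to verify by hand. The sketch below shows one plain update with the regular-season K-factor default; the margin-of-victory multiplier, variable K, and home-court adjustment that update_game() also applies are omitted here:

```python
def expected_score(rating_a, rating_b):
    # Documented logistic form: 1 / (1 + 10^((r_b - r_a) / 400))
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# One plain update with k_regular = 38.0 (the documented default).
k = 38.0
r_w, r_l = 1550.0, 1450.0
e_w = expected_score(r_w, r_l)          # winner's pre-game win probability
r_w_new = r_w + k * (1.0 - e_w)          # winner gains
r_l_new = r_l - k * (1.0 - e_w)          # loser loses the same amount
```

Note that the plain update conserves total rating; a 100-point favourite carries an expected score of about 0.64.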

class ncaa_eval.transform.FeatureBlock(*values)[source]

Bases: Enum

Individual feature building blocks that can be activated.

BATCH_RATING = 'batch_rating'
ELO = 'elo'
GRAPH = 'graph'
ORDINAL = 'ordinal'
SEED = 'seed'
SEQUENTIAL = 'sequential'
class ncaa_eval.transform.FeatureConfig(sequential_windows: tuple[int, ...] = (5, 10, 20), ewma_alphas: tuple[float, ...] = (0.15, 0.2), graph_features_enabled: bool = True, batch_rating_types: tuple[BatchRatingType, ...] = ('srs', 'ridge', 'colley'), ordinal_systems: tuple[str, ...] | None = None, ordinal_composite: OrdinalCompositeMethod | None = 'simple_average', matchup_deltas: bool = True, gender_scope: GenderScope = 'M', dataset_scope: DatasetScope = 'kaggle', elo_enabled: bool = False, elo_config: EloConfig | None = None)[source]

Bases: object

Declarative specification of which feature blocks and parameters to use.

sequential_windows

Rolling window sizes for sequential features (e.g., (5, 10, 20)).

Type:

tuple[int, …]

ewma_alphas

EWMA smoothing factors for sequential features (e.g., (0.15, 0.20)).

Type:

tuple[float, …]

graph_features_enabled

Whether to compute graph centrality features (PageRank, etc.).

Type:

bool

batch_rating_types

Which batch rating systems to include ("srs", "ridge", "colley").

Type:

tuple[BatchRatingType, …]

ordinal_systems

Massey ordinal systems to use; None means use coverage-gate defaults.

Type:

tuple[str, …] | None

ordinal_composite

Composite method: "simple_average", "weighted", "pca", or None to disable.

Type:

OrdinalCompositeMethod | None

matchup_deltas

Whether to compute team_A − team_B deltas for matchup features.

Type:

bool

gender_scope

"M" for men’s, "W" for women’s.

Type:

GenderScope

dataset_scope

"kaggle" for Kaggle-only games, "all" for Kaggle + ESPN enrichment.

Type:

DatasetScope

active_blocks() frozenset[FeatureBlock][source]

Return the set of feature blocks that are currently enabled.

Checks each configuration flag (sequential windows, graph enabled, batch rating types, ordinal composite, Elo enabled) and adds the corresponding FeatureBlock enum value to a set, with SEED always included.

batch_rating_types: tuple[BatchRatingType, ...] = ('srs', 'ridge', 'colley')
dataset_scope: DatasetScope = 'kaggle'
elo_config: EloConfig | None = None
elo_enabled: bool = False
ewma_alphas: tuple[float, ...] = (0.15, 0.2)
gender_scope: GenderScope = 'M'
graph_features_enabled: bool = True
matchup_deltas: bool = True
ordinal_composite: OrdinalCompositeMethod | None = 'simple_average'
ordinal_systems: tuple[str, ...] | None = None
sequential_windows: tuple[int, ...] = (5, 10, 20)
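
For example, a configuration that narrows the sequential windows, pins two ordinal systems, and switches on the Elo block (all field names per the signature above; the chosen values are illustrative):

```python
from ncaa_eval.transform import EloConfig, FeatureConfig

config = FeatureConfig(
    sequential_windows=(5, 10),
    ordinal_systems=("SAG", "POM"),
    ordinal_composite="simple_average",
    elo_enabled=True,
    elo_config=EloConfig(k_regular=38.0),
)
blocks = config.active_blocks()  # SEED is always included
```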
class ncaa_eval.transform.GraphTransformer(margin_cap: int = 25, recency_window_days: int = 20, recency_multiplier: float = 1.5)[source]

Bases: object

Transform game DataFrames into graph-based centrality features.

Provides both batch (build + compute in one call) and incremental (add_game_to_graph) update strategies for walk-forward backtesting efficiency.

Typical walk-forward usage (Story 4.7):

transformer = GraphTransformer()
graph = nx.DiGraph()
prev_pagerank: dict[int, float] | None = None

for game in chronological_games:
    transformer.add_game_to_graph(graph, ...)
    features_df = transformer.compute_features(graph, pagerank_init=prev_pagerank)
    prev_pagerank = dict(zip(features_df["team_id"], features_df["pagerank"]))

add_game_to_graph(graph: DiGraph, w_team_id: int, l_team_id: int, margin: int, day_num: int, reference_day_num: int) None[source]

Add a single game to an existing graph in-place.

Supports incremental walk-forward updates without rebuilding the full graph. Edge direction: l_team_id → w_team_id (loser votes for winner).

Parameters:
  • graph – Existing nx.DiGraph to update in-place.

  • w_team_id – Winner team ID.

  • l_team_id – Loser team ID.

  • margin – Margin of victory (absolute score difference).

  • day_num – Day number of the game.

  • reference_day_num – Reference day for recency window evaluation.

build_graph(games_df: DataFrame, reference_day_num: int | None = None) DiGraph[source]

Build a season graph from a games DataFrame.

Parameters:
  • games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score, day_num.

  • reference_day_num – Reference day for recency weighting. Defaults to max day_num.

Returns:

nx.DiGraph with loser→winner edges and “weight” attribute.

compute_features(graph: DiGraph, pagerank_init: dict[int, float] | None = None) DataFrame[source]

Compute all centrality features for every team node in the graph.

Parameters:
  • graph – nx.DiGraph with loser→winner edges.

  • pagerank_init – Optional warm-start dict (team_id → probability) from a previous compute_pagerank() call. Reduces PageRank iterations from ~30–50 to ~2–5.

Returns:

pd.DataFrame with columns ["team_id", "pagerank", "betweenness_centrality", "hits_hub", "hits_authority", "clustering_coefficient"], one row per team node. Returns empty DataFrame with correct columns if graph has no nodes.

transform(games_df: DataFrame, reference_day_num: int | None = None, pagerank_init: dict[int, float] | None = None) DataFrame[source]

Convenience method: build graph then compute all centrality features.

Parameters:
  • games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score, day_num.

  • reference_day_num – Reference day for recency weighting. Defaults to max day_num.

  • pagerank_init – Optional warm-start dict for PageRank.

Returns:

pd.DataFrame with centrality features, one row per team. Returns empty DataFrame with correct columns if games_df is empty.

class ncaa_eval.transform.IsotonicCalibrator[source]

Bases: object

Non-parametric monotonic probability calibration.

Wraps sklearn.isotonic.IsotonicRegression with y_min=0.0, y_max=1.0, and out_of_bounds="clip" for probability bounds.

Example:

cal = IsotonicCalibrator()
cal.fit(y_true_train, y_prob_train)
calibrated = cal.transform(y_prob_test)
fit(y_true: ndarray[tuple[Any, ...], dtype[float64]], y_prob: ndarray[tuple[Any, ...], dtype[float64]]) None[source]

Fit the isotonic regression on training fold predictions.

Parameters:
  • y_true – Binary labels (0 or 1) from the training fold.

  • y_prob – Model-predicted probabilities from the training fold.

transform(y_prob: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[float64]][source]

Apply calibration to test fold predictions.

Parameters:

y_prob – Model-predicted probabilities to calibrate.

Returns:

Calibrated probabilities in [0, 1].

Raises:

RuntimeError – If fit() has not been called.

class ncaa_eval.transform.MasseyOrdinalsStore(df: DataFrame)[source]

Bases: object

DataFrame-backed store for Massey Ordinal ranking systems.

Ingests MMasseyOrdinals.csv and provides temporal filtering, coverage gate validation, composite computation (Options A–D), and per-system normalization.

Parameters:

df – Raw DataFrame with columns [Season, RankingDayNum, SystemName, TeamID, OrdinalRank].

composite_pca(season: int, day_num: int, n_components: int | None = None, min_variance: float = 0.9) DataFrame[source]

Option C: PCA reduction of all available systems.

When n_components=None, automatically selects the minimum number of components needed to capture min_variance of total variance.

Parameters:
  • season – Season year.

  • day_num – Temporal cutoff (inclusive).

  • n_components – Number of principal components to retain. None triggers automatic selection based on min_variance.

  • min_variance – Minimum cumulative explained variance required when n_components=None (default 0.90 = 90%).

Returns:

DataFrame with columns PC1, PC2, ... indexed by TeamID. Rows with any NaN system value are dropped before PCA.

composite_simple_average(season: int, day_num: int, systems: list[str]) Series[source]

Option A: simple average of ordinal ranks across systems per team.

Parameters:
  • season – Season year.

  • day_num – Temporal cutoff (inclusive).

  • systems – List of system names to average.

Returns:

Series indexed by TeamID with mean ordinal rank per team.

composite_weighted(season: int, day_num: int, weights: dict[str, float]) Series[source]

Option B: weighted average of ordinal ranks using caller-supplied weights.

Weights are normalized to sum to 1 before computation.

Parameters:
  • season – Season year.

  • day_num – Temporal cutoff (inclusive).

  • weights – Mapping of system name → weight (any positive floats). Must not be empty.

Returns:

Series indexed by TeamID with weighted ordinal rank per team.

Raises:

ValueError – If weights is empty.
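
The weight normalization can be illustrated with plain dicts. The ranks below are toy values; the real method operates on the stored DataFrame:

```python
# Toy ordinal ranks for two teams across two systems.
ranks = {"SAG": {1101: 5, 1102: 20}, "POM": {1101: 7, 1102: 18}}
weights = {"SAG": 2.0, "POM": 1.0}

# Normalize weights to sum to 1, as documented.
total = sum(weights.values())
norm = {s: w / total for s, w in weights.items()}

# Weighted ordinal rank per team.
composite = {
    team: sum(norm[s] * ranks[s][team] for s in ranks)
    for team in ranks["SAG"]
}
```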

classmethod from_csv(path: Path) MasseyOrdinalsStore[source]

Construct from MMasseyOrdinals.csv.

Columns required: Season, RankingDayNum, SystemName, TeamID, OrdinalRank.

Parameters:

path – Path to MMasseyOrdinals.csv.

Returns:

Initialised MasseyOrdinalsStore.

get_snapshot(season: int, day_num: int, systems: list[str] | None = None) DataFrame[source]

Return wide-format ordinal ranks as of day_num for season.

For each (SystemName, TeamID) pair, uses the latest RankingDayNum that is ≤ day_num. Returns a DataFrame with TeamID as index and one column per ranking system.

Parameters:
  • season – Season year.

  • day_num – Inclusive upper bound on RankingDayNum.

  • systems – If provided, only include these system names. None returns all available systems.

Returns:

Wide-format DataFrame (index=TeamID, columns=SystemName). Empty DataFrame if no records satisfy the filters.

normalize_percentile(season: int, day_num: int, system: str) Series[source]

Return per-season percentile rank for system bounded to [0, 1].

Computed as OrdinalRank / n_teams where n_teams is the number of teams with a rank in this season/system snapshot.

Parameters:
  • season – Season year.

  • day_num – Temporal cutoff (inclusive).

  • system – System name.

Returns:

Series indexed by TeamID with percentile values in [0, 1].

normalize_rank_delta(snapshot: DataFrame, team_a: int, team_b: int, system: str) float[source]

Return ordinal rank delta for a matchup between team_a and team_b.

A positive result means team_a is ranked worse (higher rank number = worse) than team_b in this system.

Parameters:
  • snapshot – Wide-format snapshot DataFrame (index=TeamID, columns=SystemName) from get_snapshot().

  • team_a – First team’s canonical TeamID.

  • team_b – Second team’s canonical TeamID.

  • system – System name column to use.

Returns:

snapshot.loc[team_a, system] - snapshot.loc[team_b, system]

normalize_zscore(season: int, day_num: int, system: str) Series[source]

Return per-season z-score for system.

Computed as (rank - mean_rank) / std_rank across all teams in the snapshot.

Parameters:
  • season – Season year.

  • day_num – Temporal cutoff (inclusive).

  • system – System name.

Returns:

Series indexed by TeamID with z-score values (mean ≈ 0, std ≈ 1).
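
Both normalizations are simple per-snapshot transforms, sketched here with toy ranks. The population standard deviation is used below; the library may instead use the sample convention (ddof=1):

```python
# Toy snapshot: four teams ranked 1-4 in one system.
ranks = {1101: 1, 1102: 2, 1103: 3, 1104: 4}
n = len(ranks)

# Percentile: OrdinalRank / n_teams, bounded to [0, 1].
pct = {t: r / n for t, r in ranks.items()}

# Z-score: (rank - mean_rank) / std_rank across the snapshot.
mean = sum(ranks.values()) / n
std = (sum((r - mean) ** 2 for r in ranks.values()) / n) ** 0.5
z = {t: (r - mean) / std for t, r in ranks.items()}
```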

pre_tournament_snapshot(season: int, systems: list[str] | None = None) DataFrame[source]

Option D: pre-tournament snapshot using ordinals with RankingDayNum ≤ 128.

DayNum 128 corresponds approximately to Selection Sunday in the Kaggle calendar. Only ordinals available before the tournament begins are used.

Parameters:
  • season – Season year.

  • systems – If provided, only include these system names.

Returns:

Wide-format DataFrame in the same structure as get_snapshot().

run_coverage_gate() CoverageGateResult[source]

Check whether SAG and WLK cover all seasons 2003–2026.

If either system has gaps the fallback composite (MOR, POM, DOL) is recommended instead of the primary composite (SAG, POM, MOR, WLK).

Returns:

CoverageGateResult with coverage findings and the recommended system list.
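
The gate logic amounts to a set-coverage check. The helper below is a hypothetical stand-alone sketch of that check, not the store's API:

```python
def coverage_gate(seasons_by_system):
    """seasons_by_system: dict mapping system name -> set of covered seasons."""
    primary = ("SAG", "POM", "MOR", "WLK")
    fallback = ("MOR", "POM", "DOL")
    required = set(range(2003, 2027))  # seasons 2003-2026 inclusive

    # Fallback triggers when SAG or WLK has any season gap.
    gaps = sorted(
        s for s in ("SAG", "WLK")
        if not required <= seasons_by_system.get(s, set())
    )
    if gaps:
        return fallback, f"coverage gaps in {gaps}"
    return primary, ""
```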

class ncaa_eval.transform.SeasonGames(year: int, games: list[Game], has_tournament: bool)[source]

Bases: object

Result of a chronological season query.

year

Season year (e.g., 2023 for the 2022-23 season).

Type:

int

games

All qualifying games sorted ascending by (date, game_id).

Type:

list[ncaa_eval.ingest.schema.Game]

has_tournament

False only for known no-tournament years (e.g., 2020 COVID cancellation). Signals to downstream walk-forward splitters that tournament evaluation should be skipped for this season.

Type:

bool

games: list[Game]
has_tournament: bool
year: int
class ncaa_eval.transform.SequentialTransformer(windows: list[int] | None = None, alphas: list[float] | None = None, alpha_fast: float = 0.2, alpha_slow: float = 0.1, stats: tuple[str, ...] | None = None)[source]

Bases: object

Orchestrates all sequential feature computation steps.

Applies OT rescaling, time-decay weighting, rolling windows, EWMA, momentum, streak, per-possession normalization, and Four Factors to a per-team game history in chronological order.

All features respect temporal ordering — no feature for game N uses data from games N+1 or later.

transform(team_games: DataFrame, reference_day_num: int | None = None) DataFrame[source]

Compute all sequential features for a team’s game history.

Input must be sorted by day_num ascending to ensure temporal integrity (no future data leakage).

Orchestration order (critical for correctness):

  1. OT rescaling (before any aggregation)
  2. Time-decay weights
  3. Rolling stats (on OT-rescaled stats, with weights)
  4. EWMA (on OT-rescaled stats)
  5. Momentum
  6. Streak (on original won column)
  7. Possessions + per-possession stats
  8. Four Factors

Parameters:
  • team_games – Per-team game DataFrame sorted by day_num.

  • reference_day_num – Reference day for time-decay weights. Defaults to the last game’s day_num.

Returns:

New DataFrame with all feature columns appended to originals. Preserves input row order.

class ncaa_eval.transform.SigmoidCalibrator[source]

Bases: object

Parametric Platt scaling for probability calibration.

Uses logistic regression to fit a sigmoid function mapping raw probabilities to calibrated probabilities. More robust than isotonic regression for small samples (<1000).

Example:

cal = SigmoidCalibrator()
cal.fit(y_true_train, y_prob_train)
calibrated = cal.transform(y_prob_test)
fit(y_true: ndarray[tuple[Any, ...], dtype[float64]], y_prob: ndarray[tuple[Any, ...], dtype[float64]]) None[source]

Fit Platt scaling parameters on training fold predictions.

Parameters:
  • y_true – Binary labels (0 or 1) from the training fold.

  • y_prob – Model-predicted probabilities from the training fold.

transform(y_prob: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[float64]][source]

Apply sigmoid calibration to test fold predictions.

Parameters:

y_prob – Model-predicted probabilities to calibrate.

Returns:

Calibrated probabilities in [0, 1].

Raises:

RuntimeError – If fit() has not been called.

class ncaa_eval.transform.StatefulFeatureServer(config: FeatureConfig, data_server: ChronologicalDataServer, *, seed_table: TourneySeedTable | None = None, ordinals_store: MasseyOrdinalsStore | None = None, elo_engine: EloFeatureEngine | None = None)[source]

Bases: object

Combines feature building blocks into a single feature matrix.

Supports two consumption modes:

  • batch — compute all features for an entire season at once (suitable for stateless models like XGBoost).

  • stateful — iterate game-by-game, accumulating state incrementally (suitable for Elo-style models; placeholder until Story 4.8).

Parameters:
  • config – Declarative specification of which feature blocks to activate.

  • data_server – Chronological data serving layer wrapping the Repository.

  • seed_table – Tournament seed lookup table (optional; needed for seed features).

  • ordinals_store – Massey ordinals store (optional; needed for ordinal features).

  • elo_engine – Elo feature engine (optional; needed when elo_enabled=True).

serve_season_features(year: int, mode: Literal['batch', 'stateful'] = 'batch') DataFrame[source]

Build the feature matrix for a full season.

Parameters:
  • year – Season year (e.g. 2023 for the 2022-23 season).

  • mode"batch" or "stateful".

Returns:

One row per game with metadata, feature deltas, and the target label.

class ncaa_eval.transform.TeamNameNormalizer(spellings: dict[str, int])[source]

Bases: object

Maps diverse team name spellings to canonical TeamID integers.

Wraps the MTeamSpellings.csv lookup table. Matching is case-insensitive. On a miss, a WARNING is logged with any close prefix matches and None is returned (no exception raised). The lookup is idempotent: calling normalize() twice with the same input returns the same result.

Parameters:

spellings – Pre-lowercased mapping of team_name_spelling → team_id.

classmethod from_csv(path: Path) TeamNameNormalizer[source]

Construct from MTeamSpellings.csv.

Columns required: TeamNameSpelling, TeamID.

Parameters:

path – Path to MTeamSpellings.csv.

Returns:

Initialised TeamNameNormalizer.

normalize(name: str) int | None[source]

Look up name and return its canonical TeamID, or None on miss.

Parameters:

name – Team name string (any case).

Returns:

Canonical TeamID integer, or None if not found.
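
The case-insensitive, miss-tolerant lookup can be sketched with a toy spellings dict (warning-logging on a miss omitted; the TeamID value is illustrative):

```python
# Toy pre-lowercased spellings table.
spellings = {"north carolina": 1314, "unc": 1314, "n carolina": 1314}

def normalize(name):
    # Lowercase lookup; returns None on a miss instead of raising.
    return spellings.get(name.lower())
```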

class ncaa_eval.transform.TourneySeed(season: int, team_id: int, seed_str: str, region: str, seed_num: int, is_play_in: bool)[source]

Bases: object

Structured representation of a single NCAA Tournament seed entry.

season

Season year (e.g., 2023).

Type:

int

team_id

Canonical Kaggle TeamID integer.

Type:

int

seed_str

Raw seed string as it appears in MNCAATourneySeeds.csv (e.g., "W01", "X11a").

Type:

str

region

Single-character region code: W, X, Y, or Z.

Type:

str

seed_num

Seed number 1–16.

Type:

int

is_play_in

True when the seed has an 'a' or 'b' suffix, indicating a First Four play-in game.

Type:

bool

is_play_in: bool
region: str
season: int
seed_num: int
seed_str: str
team_id: int
class ncaa_eval.transform.TourneySeedTable(seeds: dict[tuple[int, int], TourneySeed])[source]

Bases: object

Lookup table for NCAA Tournament seeds by (season, team_id).

Wraps MNCAATourneySeeds.csv into a dict-backed structure. Each seed is stored as a TourneySeed frozen dataclass.

Parameters:

seeds – Mapping of (season, team_id) → TourneySeed.

all_seeds(season: int | None = None) list[TourneySeed][source]

Return all stored seeds, optionally filtered to a single season.

Parameters:

season – If provided, only seeds for this season are returned.

Returns:

List of TourneySeed objects.

classmethod from_csv(path: Path) TourneySeedTable[source]

Construct from MNCAATourneySeeds.csv.

Columns required: Season, Seed, TeamID.

Uses itertuples (not iterrows) for per-row string parsing — acceptable because the per-row operation (parse_seed) contains branching logic that cannot be vectorized.

Parameters:

path – Path to MNCAATourneySeeds.csv.

Returns:

Initialised TourneySeedTable.

get(season: int, team_id: int) TourneySeed | None[source]

Return the TourneySeed for (season, team_id), or None.

Parameters:
  • season – Season year.

  • team_id – Canonical Kaggle TeamID.

Returns:

Matching TourneySeed, or None if not found.
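
The per-row seed parsing that from_csv() performs can be sketched as follows; parse_seed here is a stand-alone stand-in illustrating the seed_str fields documented on TourneySeed:

```python
def parse_seed(seed_str):
    """Split e.g. 'W01' or 'X11a' into (region, seed_num, is_play_in)."""
    region = seed_str[0]                     # W, X, Y, or Z
    is_play_in = seed_str[-1] in ("a", "b")  # First Four suffix
    digits = seed_str[1:-1] if is_play_in else seed_str[1:]
    return region, int(digits), is_play_in
```

The suffix branch is the logic that cannot be vectorized, which is why the loader iterates rows with itertuples.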

ncaa_eval.transform.apply_ot_rescaling(team_games: DataFrame, stats: tuple[str, ...] = ('fgm', 'fga', 'fgm3', 'fga3', 'ftm', 'fta', 'oreb', 'dreb', 'ast', 'to', 'stl', 'blk', 'pf', 'score', 'opp_score')) DataFrame[source]

Rescale all counting stats to 40-minute equivalent for OT games.

Applies stat_adj = stat × 40 / (40 + 5 × num_ot). Regulation games (num_ot = 0) are unchanged (multiplier = 1.0).

Returns a copy; does not modify the input DataFrame in-place.

Parameters:
  • team_games – Per-team game DataFrame containing a num_ot column.

  • stats – Tuple of stat column names to rescale.

Returns:

Copy of team_games with rescaled stat columns.
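
The rescaling multiplier is easy to verify by hand:

```python
def ot_multiplier(num_ot):
    # stat_adj = stat * 40 / (40 + 5 * num_ot)
    return 40.0 / (40.0 + 5.0 * num_ot)

# A 90-point double-overtime scoring line maps to 90 * 0.8 = 72
# points at 40-minute pace; regulation games are unchanged.
```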

ncaa_eval.transform.build_season_graph(games_df: DataFrame, margin_cap: int = 25, reference_day_num: int | None = None, recency_window_days: int = 20, recency_multiplier: float = 1.5) DiGraph[source]

Build a directed graph from season game results.

Edges: loser → winner (loser ‘votes for’ winner quality, PageRank metaphor). Edge weight: min(margin, margin_cap) × optional_recency_multiplier.

Parallel edges (same team pair playing multiple times) are aggregated by summing their weights before passing to nx.from_pandas_edgelist().

Parameters:
  • games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score, day_num.

  • margin_cap – Maximum margin-of-victory to use as edge weight (prevents blowout distortion).

  • reference_day_num – Day number used to compute recency window. Defaults to max day_num in games_df if None.

  • recency_window_days – Games within this many days of reference_day_num get recency boost.

  • recency_multiplier – Weight multiplier for recent games within recency_window_days.

Returns:

nx.DiGraph with edges loser→winner and “weight” attribute on each edge. Returns empty DiGraph if games_df is empty.

ncaa_eval.transform.compute_betweenness_centrality(G: DiGraph) dict[int, float][source]

Compute betweenness centrality for each team in the graph.

Captures structural “bridge” position — distinct signal from PageRank (strength) and SoS (schedule quality).

Parameters:

G – Directed graph.

Returns:

dict mapping team_id → betweenness centrality score. Empty dict if graph has no nodes.

ncaa_eval.transform.compute_clustering_coefficient(G: DiGraph) dict[int, float][source]

Compute undirected clustering coefficient for each team.

Schedule diversity metric: low clustering = broad cross-conference scheduling. Uses undirected conversion so that mutual matchups count once (natural interpretation).

Parameters:

G – Directed graph (converted to undirected internally).

Returns:

dict mapping team_id → clustering coefficient. Empty dict if graph has no nodes.

ncaa_eval.transform.compute_colley_ratings(games_df: DataFrame) DataFrame[source]

Compute Colley Matrix win/loss-only ratings.

Parameters:

games_df – DataFrame with columns w_team_id, l_team_id (regular-season games only; scores not used).

Returns:

DataFrame with columns ["team_id", "colley_rating"].
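The standard Colley system solves C r = b, where C has 2 + games_played on the diagonal, minus the pairwise game count off-diagonal, and b_i = 1 + (wins_i − losses_i)/2. A minimal numpy sketch of that textbook construction (not the library's solver) is:

```python
import numpy as np
import pandas as pd

def colley(games_df: pd.DataFrame) -> pd.DataFrame:
    teams = sorted(set(games_df["w_team_id"]) | set(games_df["l_team_id"]))
    idx = {t: i for i, t in enumerate(teams)}
    n = len(teams)
    C = 2.0 * np.eye(n)   # Laplace-smoothing "+2" on the diagonal
    b = np.ones(n)        # b_i = 1 + (wins_i - losses_i) / 2
    for w, l in zip(games_df["w_team_id"], games_df["l_team_id"]):
        i, j = idx[w], idx[l]
        C[i, i] += 1; C[j, j] += 1   # each game adds to both diagonals
        C[i, j] -= 1; C[j, i] -= 1   # and subtracts from the pair entry
        b[i] += 0.5; b[j] -= 0.5
    r = np.linalg.solve(C, b)
    return pd.DataFrame({"team_id": teams, "colley_rating": r})

games = pd.DataFrame({"w_team_id": [1, 1, 2], "l_team_id": [2, 3, 3]})
ratings = colley(games)  # 1 beat 2 and 3, 2 beat 3 -> ratings 0.7, 0.5, 0.3
```

Ratings are bounded in [0, 1] and average 0.5, which is why margin of victory never enters the system.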

ncaa_eval.transform.compute_ewma_stats(team_games: DataFrame, alphas: list[float], stats: tuple[str, ...]) DataFrame[source]

Compute EWMA features for all specified alphas and stats.

Uses adjust=False for standard exponential smoothing: value_t = α × obs_t + (1−α) × value_{t−1}

Parameters:
  • team_games – Per-team game DataFrame (sorted by day_num ascending).

  • alphas – List of smoothing factors (e.g., [0.15, 0.20]).

  • stats – Tuple of stat column names.

Returns:

DataFrame with columns ewma_{alpha_str}_{stat} where alpha_str replaces the decimal point with ‘p’ (e.g., ewma_0p15_score).
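The recursion above is exactly pandas ewm(alpha=..., adjust=False); a one-alpha, one-stat sketch with the documented column-naming convention:

```python
import pandas as pd

# Sorted by day_num ascending, as the function requires.
team_games = pd.DataFrame({"score": [70.0, 80.0, 90.0]})
alpha = 0.2

# adjust=False gives value_t = alpha * obs_t + (1 - alpha) * value_{t-1}.
ewma = team_games["score"].ewm(alpha=alpha, adjust=False).mean()
ewma = ewma.rename("ewma_0p20_score")  # decimal point replaced by 'p'
```

Here the sequence is 70, then 0.2·80 + 0.8·70 = 72, then 0.2·90 + 0.8·72 = 75.6.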

ncaa_eval.transform.compute_four_factors(team_games: DataFrame, possessions: Series) DataFrame[source]

Compute Dean Oliver’s Four Factors efficiency ratios.

  • efg_pct: Effective field goal % = (FGM + 0.5 × FGM3) / FGA

  • orb_pct: Offensive rebound % = OR / (OR + opp_DR)

  • ftr: Free throw rate = FTA / FGA

  • to_pct: Turnover % = TO / possessions

All denominators are guarded against zero (returns NaN when zero).

Parameters:
  • team_games – Per-team game DataFrame with box-score columns.

  • possessions – Series of possession counts (used for TO%).

Returns:

DataFrame with columns ["efg_pct", "orb_pct", "ftr", "to_pct"].
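A minimal sketch of the four ratios with zero-denominator guards; the box-score column names (fgm, fga, fgm3, fta, oreb, opp_dreb, to) are assumed from the signatures above:

```python
import numpy as np
import pandas as pd

tg = pd.DataFrame({
    "fgm": [25], "fga": [55], "fgm3": [8], "fta": [20],
    "oreb": [10], "opp_dreb": [25], "to": [12],
})
possessions = pd.Series([68.0])

# Zero denominators become NaN rather than raising or producing inf.
fga = tg["fga"].replace(0, np.nan)
four = pd.DataFrame({
    "efg_pct": (tg["fgm"] + 0.5 * tg["fgm3"]) / fga,
    "orb_pct": tg["oreb"] / (tg["oreb"] + tg["opp_dreb"]).replace(0, np.nan),
    "ftr": tg["fta"] / fga,
    "to_pct": tg["to"] / possessions.replace(0, np.nan),
})
```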

ncaa_eval.transform.compute_game_weights(day_nums: Series, reference_day_num: int | None = None) Series[source]

BartTorvik time-decay weights: 1% per day after 40 days old; floor 60%.

Formula: weight = max(0.6, 1 − 0.01 × max(0, days_ago − 40))

Parameters:
  • day_nums – Series of game day numbers (ascending order).

  • reference_day_num – Reference point for days_ago. Defaults to max(day_nums).

Returns:

Series of weights in [0.6, 1.0] for each game.

ncaa_eval.transform.compute_hits(G: DiGraph, max_iter: int = 100) tuple[dict[int, float], dict[int, float]][source]

Compute HITS hub and authority scores for each team.

Authority ≈ PageRank (r≈0.908 correlation). Hub = “quality schedule despite losses” — distinct signal. Both are returned from a single nx.hits() call.

Parameters:
  • G – Directed graph.

  • max_iter – Maximum iterations for HITS power iteration.

Returns:

Tuple of (hub_dict, authority_dict), each mapping team_id → score. Returns uniform 0.0 scores for all nodes if graph has no edges. Returns uniform 1/n scores on convergence failure (with warning logged).

ncaa_eval.transform.compute_momentum(team_games: DataFrame, alpha_fast: float, alpha_slow: float, stats: tuple[str, ...]) DataFrame[source]

Compute ewma_fast − ewma_slow momentum for each stat.

Positive momentum means recent performance is above the longer-term trend (improving form into tournament).

Parameters:
  • team_games – Per-team game DataFrame (sorted by day_num ascending).

  • alpha_fast – Fast EWMA smoothing factor (larger → more reactive).

  • alpha_slow – Slow EWMA smoothing factor (smaller → smoother baseline).

  • stats – Tuple of stat column names.

Returns:

DataFrame with columns momentum_{stat}.

ncaa_eval.transform.compute_pagerank(G: DiGraph, alpha: float = 0.85, nstart: dict[int, float] | None = None) dict[int, float][source]

Compute PageRank for each team in the graph.

Captures transitive win-chain strength (2 hops vs. SoS 1 hop). Peer-reviewed NCAA validation: 71.6% vs. 64.2% naive win-ratio (Matthews et al. 2021).

Parameters:
  • G – Directed graph with loser→winner edges and “weight” attribute.

  • alpha – Damping factor (teleportation probability = 1 - alpha).

  • nstart – Optional warm-start dictionary (team_id → initial probability). Warm-starting from a previous solution typically converges in 2–5 iterations instead of 30–50.

Returns:

dict mapping team_id → PageRank score. Empty dict if graph has no nodes.

ncaa_eval.transform.compute_per_possession_stats(team_games: DataFrame, stats: tuple[str, ...], possessions: Series) DataFrame[source]

Normalize counting stats by possessions (per-100 possessions).

Parameters:
  • team_games – Per-team game DataFrame.

  • stats – Tuple of stat column names to normalize.

  • possessions – Series of possession counts (NaN for guard rows).

Returns:

DataFrame with columns {stat}_per100.

ncaa_eval.transform.compute_possessions(team_games: DataFrame) Series[source]

Compute possession count: FGA − OR + TO + 0.44 × FTA.

Zero or negative possession counts (rare in short fixtures) are replaced with NaN to prevent division-by-zero downstream.

Parameters:

team_games – Per-team game DataFrame with box-score columns.

Returns:

Series named "possessions".
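A minimal sketch of the possession estimate and its non-positive-count guard, assuming the box-score columns named above:

```python
import numpy as np
import pandas as pd

tg = pd.DataFrame({"fga": [55], "oreb": [10], "to": [12], "fta": [20]})

# Standard estimate: FGA - OR + TO + 0.44 * FTA.
poss = (tg["fga"] - tg["oreb"] + tg["to"] + 0.44 * tg["fta"]).rename("possessions")
# Guard: non-positive counts become NaN to avoid division by zero downstream.
poss = poss.mask(poss <= 0, np.nan)
```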

ncaa_eval.transform.compute_ridge_ratings(games_df: DataFrame, *, lam: float = 20.0, margin_cap: int = 25) DataFrame[source]

Compute Ridge regression ratings.

Parameters:
  • games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score (regular-season games only).

  • lam – Ridge regularization parameter λ (default 20.0).

  • margin_cap – Maximum point margin cap per game (default 25).

Returns:

DataFrame with columns ["team_id", "ridge_rating"].

ncaa_eval.transform.compute_rolling_stats(team_games: DataFrame, windows: list[int], stats: tuple[str, ...], weights: Series | None = None) DataFrame[source]

Compute rolling mean features for all specified windows and stats.

No future data leakage: rolling window at position i only uses rows at positions ≤ i (pandas rolling default closed=’right’).

Parameters:
  • team_games – Per-team game DataFrame (sorted by day_num ascending).

  • windows – List of window sizes (e.g., [5, 10, 20]).

  • stats – Tuple of stat column names.

  • weights – Optional per-game weights for weighted rolling mean.

Returns:

DataFrame with columns rolling_{w}_{stat} and rolling_full_{stat} (expanding mean).

ncaa_eval.transform.compute_srs_ratings(games_df: DataFrame, *, margin_cap: int = 25, max_iter: int = 10000) DataFrame[source]

Compute SRS ratings using default solver config.

Parameters:
  • games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score (regular-season games only).

  • margin_cap – Maximum point margin cap per game (default 25).

  • max_iter – Maximum SRS iterations (default 10,000).

Returns:

DataFrame with columns ["team_id", "srs_rating"].
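The SRS fixed point satisfies rating_i = mean_margin_i + mean(opponent ratings); a minimal, zero-centered iteration sketch (not the library's solver, which may differ in details such as stopping criteria):

```python
import pandas as pd

def srs(games_df: pd.DataFrame, margin_cap: int = 25,
        max_iter: int = 10000, tol: float = 1e-9) -> pd.Series:
    m = (games_df["w_score"] - games_df["l_score"]).clip(upper=margin_cap)
    # Per-team signed margins: +margin for the winner, -margin for the loser.
    long = pd.concat([
        pd.DataFrame({"team": games_df["w_team_id"],
                      "opp": games_df["l_team_id"], "margin": m}),
        pd.DataFrame({"team": games_df["l_team_id"],
                      "opp": games_df["w_team_id"], "margin": -m}),
    ])
    mean_margin = long.groupby("team")["margin"].mean()
    ratings = mean_margin.copy()
    for _ in range(max_iter):
        opp_mean = (long.assign(opp_r=long["opp"].map(ratings))
                        .groupby("team")["opp_r"].mean())
        new = mean_margin + opp_mean
        new -= new.mean()                     # keep ratings zero-centered
        done = (new - ratings).abs().max() < tol
        ratings = new
        if done:
            break
    return ratings.rename("srs_rating")

games = pd.DataFrame({
    "w_team_id": [1, 1, 2], "l_team_id": [2, 3, 3],
    "w_score": [80, 90, 75], "l_score": [70, 60, 70],
})
r = srs(games)  # margins 10, 25 (capped from 30), 5
```

For this toy schedule the fixed point is r1 = 35/3, r2 = −5/3, r3 = −10, which sums to zero as documented.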

ncaa_eval.transform.compute_streak(won: Series) Series[source]

Compute signed win/loss streak.

Returns +N for a winning streak of N games, −N for a losing streak. Vectorized using cumsum-based grouping; no iterrows.

Parameters:

won – Boolean Series of game outcomes (True = win), sorted by day_num.

Returns:

Integer Series named "streak".
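The cumsum-based grouping mentioned above can be sketched as follows (streak here is an illustrative stand-in for the library function):

```python
import pandas as pd

def streak(won: pd.Series) -> pd.Series:
    """Signed win/loss streak: +N on a win run, -N on a loss run."""
    # A new group starts every time the outcome flips; cumcount within
    # each group gives the run length, signed by the outcome.
    group = (won != won.shift()).cumsum()
    run = won.groupby(group).cumcount() + 1
    return run.where(won, -run).rename("streak")

s = streak(pd.Series([True, True, False, True]))  # -> 1, 2, -1, 1
```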

ncaa_eval.transform.parse_seed(season: int, team_id: int, seed_str: str) TourneySeed[source]

Parse a raw tournament seed string into a structured TourneySeed.

Seed strings from MNCAATourneySeeds.csv follow the pattern [WXYZ][0-9]{2}[ab]?:

  • "W01" → region=”W”, seed_num=1, is_play_in=False

  • "X16a" → region=”X”, seed_num=16, is_play_in=True

  • "Y11b" → region=”Y”, seed_num=11, is_play_in=True

Parameters:
  • season – Season year.

  • team_id – Canonical Kaggle TeamID.

  • seed_str – Raw seed string (e.g., "W01", "X11a").

Returns:

Fully parsed TourneySeed.

Raises:

ValueError – If seed_str is shorter than 3 characters.
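A regex sketch of the documented [WXYZ][0-9]{2}[ab]? pattern, returning a plain dict rather than the library's TourneySeed model (the parse helper is hypothetical):

```python
import re

SEED_RE = re.compile(r"^([WXYZ])(\d{2})([ab])?$")

def parse(seed_str: str) -> dict:
    """Parse a raw seed string; a trailing 'a'/'b' marks a play-in team."""
    s = seed_str.strip()
    m = SEED_RE.match(s)
    if len(s) < 3 or m is None:
        raise ValueError(f"malformed seed string: {seed_str!r}")
    region, num, suffix = m.groups()
    return {"region": region, "seed_num": int(num), "is_play_in": suffix is not None}

parse("W01")   # regular 1-seed in region W
parse("X16a")  # play-in 16-seed in region X
```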

ncaa_eval.transform.rescale_overtime(score: int, num_ot: int) float[source]

Rescale a game score to a 40-minute equivalent for OT normalization.

Overtime games inflate per-game scoring statistics because they involve more than 40 minutes of play. The standard correction (Edwards 2021) normalizes every game to a 40-minute basis:

adjusted = raw_score × 40 / (40 + 5 × num_ot)

Parameters:
  • score – Raw final score (not adjusted).

  • num_ot – Number of overtime periods played (0 for regulation).

Returns:

Score normalized to a 40-minute equivalent.

Examples

>>> rescale_overtime(75, 0)   # Regulation: no change
75.0
>>> rescale_overtime(80, 1)   # 1 OT: 80 × 40 / 45 ≈ 71.11
71.11111111111111