ncaa_eval.transform package
Submodules
- ncaa_eval.transform.calibration module
- ncaa_eval.transform.constants module
- ncaa_eval.transform.elo module
  - EloConfig
  - EloFeatureEngine
    - EloFeatureEngine.apply_season_mean_reversion()
    - EloFeatureEngine.expected_score()
    - EloFeatureEngine.get_all_ratings()
    - EloFeatureEngine.get_game_counts()
    - EloFeatureEngine.get_rating()
    - EloFeatureEngine.has_ratings()
    - EloFeatureEngine.predict_matchup()
    - EloFeatureEngine.process_season()
    - EloFeatureEngine.reset_game_counts()
    - EloFeatureEngine.set_game_counts()
    - EloFeatureEngine.set_ratings()
    - EloFeatureEngine.start_new_season()
    - EloFeatureEngine.update_game()
- ncaa_eval.transform.feature_serving module
  - FeatureBlock
  - FeatureConfig
    - FeatureConfig.sequential_windows
    - FeatureConfig.ewma_alphas
    - FeatureConfig.graph_features_enabled
    - FeatureConfig.batch_rating_types
    - FeatureConfig.ordinal_systems
    - FeatureConfig.ordinal_composite
    - FeatureConfig.matchup_deltas
    - FeatureConfig.gender_scope
    - FeatureConfig.dataset_scope
    - FeatureConfig.active_blocks()
    - FeatureConfig.elo_config
    - FeatureConfig.elo_enabled
  - StatefulFeatureServer
- ncaa_eval.transform.graph module
- ncaa_eval.transform.normalization module
  - ConferenceLookup
  - CoverageGateResult
  - MasseyOrdinalsStore
    - MasseyOrdinalsStore.composite_pca()
    - MasseyOrdinalsStore.composite_simple_average()
    - MasseyOrdinalsStore.composite_weighted()
    - MasseyOrdinalsStore.from_csv()
    - MasseyOrdinalsStore.get_snapshot()
    - MasseyOrdinalsStore.normalize_percentile()
    - MasseyOrdinalsStore.normalize_rank_delta()
    - MasseyOrdinalsStore.normalize_zscore()
    - MasseyOrdinalsStore.pre_tournament_snapshot()
    - MasseyOrdinalsStore.run_coverage_gate()
  - TeamNameNormalizer
  - TourneySeed
  - TourneySeedTable
  - parse_seed()
- ncaa_eval.transform.opponent module
- ncaa_eval.transform.sequential module
- ncaa_eval.transform.serving module
Module contents
Feature engineering and data transformation module.
- class ncaa_eval.transform.BatchRatingSolver(*, margin_cap: int = 25, ridge_lambda: float = 20.0, srs_max_iter: int = 10000)[source]¶
Bases: object
Batch rating solver that produces full-season opponent-adjusted ratings.
All solvers accept a pre-loaded DataFrame of compact regular-season games (the caller must filter to is_tournament == False before passing it in).
- Parameters:
margin_cap – Maximum point margin applied per game (default 25).
ridge_lambda – Regularization strength for Ridge solver (default 20.0).
srs_max_iter – Maximum iterations for SRS fixed-point convergence (default 10,000).
- compute_colley(games_df: DataFrame) DataFrame[source]¶
Compute Colley Matrix ratings (win/loss only, no margin).
- Parameters:
games_df – DataFrame with columns w_team_id, l_team_id (regular-season games only; scores not used).
- Returns:
DataFrame with columns ["team_id", "colley_rating"] (bounded [0, 1]).
- compute_ridge(games_df: DataFrame) DataFrame[source]¶
Compute Ridge regression ratings (regularized SRS).
- Parameters:
games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score (regular-season games only).
- Returns:
DataFrame with columns ["team_id", "ridge_rating"].
- compute_srs(games_df: DataFrame) DataFrame[source]¶
Compute SRS (Simple Rating System) ratings via fixed-point iteration.
- Parameters:
games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score (regular-season games only).
- Returns:
DataFrame with columns ["team_id", "srs_rating"] (zero-centered).
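The fixed-point idea behind SRS can be sketched in plain Python: each team's rating is its average capped margin plus the average rating of its opponents, iterated until the ratings stop moving. This is a minimal illustration, not the library's implementation; the function name `srs_ratings`, the tolerance `tol`, and the per-step re-centering are assumptions. Plain iteration like this can oscillate on schedules where teams split into two groups that only play across groups, so the example uses a three-team round robin.

```python
from collections import defaultdict


def srs_ratings(games, margin_cap=25, max_iter=10_000, tol=1e-9):
    """Hypothetical SRS sketch: rating = avg capped margin + avg opponent rating.

    `games` is an iterable of (w_team_id, l_team_id, w_score, l_score) tuples.
    """
    margins = defaultdict(list)
    opponents = defaultdict(list)
    for winner, loser, w_score, l_score in games:
        margin = min(w_score - l_score, margin_cap)  # cap blowouts
        margins[winner].append(margin)
        margins[loser].append(-margin)
        opponents[winner].append(loser)
        opponents[loser].append(winner)

    if not margins:
        return {}

    avg_margin = {t: sum(m) / len(m) for t, m in margins.items()}
    ratings = dict.fromkeys(avg_margin, 0.0)
    for _ in range(max_iter):
        updated = {
            t: avg_margin[t] + sum(ratings[o] for o in opponents[t]) / len(opponents[t])
            for t in ratings
        }
        # Re-center so the ratings stay zero-centered (assumption of this sketch).
        mean = sum(updated.values()) / len(updated)
        updated = {t: r - mean for t, r in updated.items()}
        shift = max(abs(updated[t] - ratings[t]) for t in ratings)
        ratings = updated
        if shift < tol:
            break
    return ratings


# Team 1 beats 2 by 10, team 2 beats 3 by 10, team 1 beats 3 by 20.
games = [(1, 2, 80, 70), (2, 3, 75, 65), (1, 3, 90, 70)]
ratings = srs_ratings(games)
```

For this schedule the ratings settle near +10, 0, and -10 for teams 1, 2, and 3.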
- class ncaa_eval.transform.Calibrator(*args, **kwargs)[source]¶
Bases: Protocol
Protocol for probability calibration transforms.
Both IsotonicCalibrator and SigmoidCalibrator structurally satisfy this protocol.
- class ncaa_eval.transform.ChronologicalDataServer(repository: Repository)[source]¶
Bases: object
Serves game data in strict chronological order for walk-forward modeling.
Wraps a Repository and enforces temporal boundaries so that callers cannot accidentally access future game data during walk-forward validation.
- Parameters:
repository – The data store from which games are retrieved.
Example:
    from pathlib import Path

    from ncaa_eval.ingest.repository import ParquetRepository
    from ncaa_eval.transform.serving import ChronologicalDataServer

    repo = ParquetRepository(Path("data/"))
    server = ChronologicalDataServer(repo)
    season = server.get_chronological_season(2023)
    for daily_batch in server.iter_games_by_date(2023):
        process(daily_batch)  # process() stands in for caller-defined handling
- get_chronological_season(year: int, cutoff_date: date | None = None) SeasonGames[source]¶
Return all games for year sorted ascending by (date, game_id).
Applies optional temporal cutoff so callers cannot retrieve games that had not yet been played as of a given date. This is the primary leakage-prevention mechanism for walk-forward model training.
- Parameters:
year – Season year (e.g., 2023 for the 2022-23 season).
cutoff_date – If provided, only games on or before this date are returned. Must not be in the future.
- Returns:
SeasonGames with games sorted by (date, game_id) and the has_tournament flag reflecting known tournament cancellations.
- Raises:
ValueError – If cutoff_date is strictly after today’s date.
- iter_games_by_date(year: int, cutoff_date: date | None = None) Iterator[list[Game]][source]¶
Yield batches of games grouped by calendar date, in chronological order.
Each yielded list contains all games played on a single calendar date. Dates with no games are skipped. Applies the same cutoff_date semantics as get_chronological_season().
- Parameters:
year – Season year.
cutoff_date – Optional temporal cutoff (must not be in the future).
- Yields:
Non-empty list[Game] for each calendar date, in ascending order.
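The cutoff and per-date batching semantics described for get_chronological_season() and iter_games_by_date() can be illustrated with a small stdlib-only sketch. `SketchGame` and `iter_by_date` are hypothetical stand-ins, not the library's Game type or method; only the behaviour stated above (sort by (date, game_id), reject a future cutoff, yield non-empty per-date lists) is modelled.

```python
from datetime import date
from itertools import groupby
from typing import Iterator, NamedTuple, Optional


class SketchGame(NamedTuple):
    """Hypothetical stand-in for the library's Game record."""
    game_id: str
    date: date


def iter_by_date(
    games: list[SketchGame], cutoff_date: Optional[date] = None
) -> Iterator[list[SketchGame]]:
    # A cutoff strictly after today is rejected, mirroring the documented ValueError.
    if cutoff_date is not None and cutoff_date > date.today():
        raise ValueError("cutoff_date must not be in the future")
    ordered = sorted(games, key=lambda g: (g.date, g.game_id))
    if cutoff_date is not None:
        ordered = [g for g in ordered if g.date <= cutoff_date]
    # groupby only yields dates that actually have games, so empty dates are skipped.
    for _, batch in groupby(ordered, key=lambda g: g.date):
        yield list(batch)


games = [
    SketchGame("g3", date(2022, 11, 8)),
    SketchGame("g1", date(2022, 11, 7)),
    SketchGame("g2", date(2022, 11, 7)),
]
batches = list(iter_by_date(games))
early = list(iter_by_date(games, cutoff_date=date(2022, 11, 7)))
```

Filtering on the cutoff before grouping means a walk-forward caller never sees a batch from a date it has not yet reached.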
- class ncaa_eval.transform.ConferenceLookup(lookup: dict[tuple[int, int], str])[source]¶
Bases: object
Lookup table for team conference membership by (season, team_id).
Wraps MTeamConferences.csv into a dict-backed structure.
- Parameters:
lookup – Mapping of (season, team_id) → conf_abbrev.
- classmethod from_csv(path: Path) ConferenceLookup[source]¶
Construct from MTeamConferences.csv.
Columns required: Season, TeamID, ConfAbbrev.
- Parameters:
path – Path to MTeamConferences.csv.
- Returns:
Initialised ConferenceLookup.
- class ncaa_eval.transform.CoverageGateResult(primary_systems: tuple[str, ...], fallback_used: bool, fallback_reason: str, recommended_systems: tuple[str, ...])[source]¶
Bases: object
Result of the Massey Ordinals coverage gate check.
- primary_systems¶
The four primary composite systems (SAG, POM, MOR, WLK).
- Type:
tuple[str, …]
- fallback_used¶
True when SAG or WLK is missing for one or more seasons in 2003–2026, in which case the fallback composite is recommended.
- Type:
bool
- fallback_reason¶
Human-readable description of why the fallback was triggered (empty string when fallback_used=False).
- Type:
str
- recommended_systems¶
The system names the caller should use for composite computation — either the primary composite or the confirmed-full-coverage fallback (MOR, POM, DOL).
- Type:
tuple[str, …]
- fallback_reason: str¶
- fallback_used: bool¶
- primary_systems: tuple[str, ...]¶
- recommended_systems: tuple[str, ...]¶
- class ncaa_eval.transform.DetailedResultsLoader(df: DataFrame)[source]¶
Bases: object
Loads detailed box-score results and provides per-team game views.
Reads MRegularSeasonDetailedResults.csv and MNCAATourneyDetailedResults.csv into a combined long-format DataFrame with one row per (team, game).
Box-score stats are only available from the 2003 season onwards. Pre-2003 seasons return empty DataFrames from get_team_season().
- classmethod from_csvs(regular_path: Path, tourney_path: Path) DetailedResultsLoader[source]¶
Construct a loader from the two Kaggle detailed-results CSV paths.
- Parameters:
regular_path – Path to MRegularSeasonDetailedResults.csv.
tourney_path – Path to MNCAATourneyDetailedResults.csv.
- Returns:
DetailedResultsLoader instance with combined data.
- get_season_long_format(season: int) DataFrame[source]¶
Return all games for a season in long format.
- Parameters:
season – Season year (e.g., 2023).
- Returns:
DataFrame sorted by (day_num, team_id), with the index reset.
- get_team_season(team_id: int, season: int) DataFrame[source]¶
Return all games for one team in one season, sorted by day_num.
- Parameters:
team_id – Canonical Kaggle TeamID integer.
season – Season year (e.g., 2023).
- Returns:
DataFrame sorted by day_num ascending, with the index reset. Returns an empty DataFrame if the team or season is not found.
- class ncaa_eval.transform.EloConfig(initial_rating: float = 1500.0, k_early: float = 56.0, k_regular: float = 38.0, k_tournament: float = 47.5, early_game_threshold: int = 20, margin_exponent: float = 0.85, max_margin: int = 25, home_advantage_elo: float = 3.5, mean_reversion_fraction: float = 0.25)[source]¶
Bases: object
Frozen configuration for the Elo feature engine.
All K-factor, margin scaling, home-court, and mean-reversion parameters are configurable with sensible defaults matching the Silver/SBCB model.
- early_game_threshold: int = 20¶
- home_advantage_elo: float = 3.5¶
- initial_rating: float = 1500.0¶
- k_early: float = 56.0¶
- k_regular: float = 38.0¶
- k_tournament: float = 47.5¶
- margin_exponent: float = 0.85¶
- max_margin: int = 25¶
- mean_reversion_fraction: float = 0.25¶
- class ncaa_eval.transform.EloFeatureEngine(config: EloConfig, conference_lookup: ConferenceLookup | None = None)[source]¶
Bases: object
Game-by-game Elo rating engine.
- Parameters:
config – Frozen Elo configuration.
conference_lookup – Optional conference lookup for season mean-reversion. When None, mean-reversion falls back to the global mean.
- apply_season_mean_reversion(season: int) None[source]¶
Regress each team toward its conference mean (or global mean).
Groups all rated teams by conference via ConferenceLookup, computes each conference’s mean rating, then shifts every team’s rating a fraction mean_reversion_fraction of the way toward its conference mean. Teams with no conference entry fall back to the global mean; when no ConferenceLookup is provided, all teams use the global mean. This is a no-op when no prior ratings exist.
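The mean-reversion rule above can be sketched as follows. This is an illustration under stated assumptions rather than the engine's code: `mean_revert` and the plain dict arguments (`ratings`, `conference_of`) are hypothetical names, and the update is taken to be `rating + fraction * (target - rating)`.

```python
from collections import defaultdict


def mean_revert(ratings, conference_of, fraction=0.25):
    """Hypothetical sketch: pull each rating `fraction` of the way to its target mean.

    ratings:        team_id -> current rating
    conference_of:  team_id -> conference abbreviation (teams may be missing)
    """
    if not ratings:
        return {}  # no prior ratings: nothing to revert

    global_mean = sum(ratings.values()) / len(ratings)

    by_conf = defaultdict(list)
    for team, rating in ratings.items():
        conf = conference_of.get(team)
        if conf is not None:
            by_conf[conf].append(rating)
    conf_mean = {conf: sum(v) / len(v) for conf, v in by_conf.items()}

    reverted = {}
    for team, rating in ratings.items():
        # Teams without a conference entry fall back to the global mean.
        target = conf_mean.get(conference_of.get(team), global_mean)
        reverted[team] = rating + fraction * (target - rating)
    return reverted


ratings = {1: 1600.0, 2: 1400.0, 3: 1800.0}
reverted = mean_revert(ratings, {1: "acc", 2: "acc"})
```

Teams 1 and 2 move toward their shared conference mean of 1500, while team 3 has no conference entry and moves toward the global mean of 1600.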
- static expected_score(rating_a: float, rating_b: float) float[source]¶
Logistic expected score for team A against team B.
expected = 1 / (1 + 10^((r_b − r_a) / 400))
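As a quick check of the formula, the expression can be written as a standalone function. This mirrors the documented formula only; it is a sketch and does not include the home-court adjustment that update_game() applies.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Logistic expected score for team A against team B (400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


even = expected_score(1500.0, 1500.0)      # equal ratings
favourite = expected_score(1900.0, 1500.0)  # 400-point edge
underdog = expected_score(1500.0, 1900.0)
```

Equal ratings give 0.5, and the two sides of any matchup sum to 1, which is why predict_matchup() can report a single P(team_a wins).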
- get_rating(team_id: int) float[source]¶
Return current Elo rating for team_id (initial_rating if unseen).
- predict_matchup(team_a_id: int, team_b_id: int) float[source]¶
Return P(team_a wins) using the Elo expected-score formula.
- process_season(games: list[Game], season: int) pd.DataFrame[source]¶
Process all games for a season, returning before-ratings per game.
Calls start_new_season(season) if prior ratings exist (i.e., this is not the very first season).
- Parameters:
games – Games sorted in chronological order.
season – Season year.
- Returns:
DataFrame with columns [game_id, elo_w_before, elo_l_before].
- start_new_season(season: int) None[source]¶
Orchestrate season transition: mean-reversion then reset counts.
- update_game(w_team_id: int, l_team_id: int, w_score: int, l_score: int, loc: str, is_tournament: bool, *, num_ot: int = 0) tuple[float, float][source]¶
Process one game and update ratings.
Snapshots before-ratings for feature use, applies home-court effective-rating adjustment to expected-score computation, computes the margin-of-victory multiplier and variable K-factor, then mutates internal rating state for both teams.
- Parameters:
w_team_id – Winner team ID.
l_team_id – Loser team ID.
w_score – Winner final score (raw).
l_score – Loser final score (raw).
loc – "H" (winner home), "A" (winner away), "N" (neutral).
is_tournament – Whether this is a tournament game.
num_ot – Number of overtime periods (used for margin rescaling).
- Returns:
Tuple of (elo_w_before, elo_l_before): the winner’s and loser’s ratings before this game’s update, suitable for use as walk-forward feature values.
- class ncaa_eval.transform.FeatureBlock(*values)[source]¶
Bases: Enum
Individual feature building blocks that can be activated.
- BATCH_RATING = 'batch_rating'¶
- ELO = 'elo'¶
- GRAPH = 'graph'¶
- ORDINAL = 'ordinal'¶
- SEED = 'seed'¶
- SEQUENTIAL = 'sequential'¶
- class ncaa_eval.transform.FeatureConfig(sequential_windows: tuple[int, ...] = (5, 10, 20), ewma_alphas: tuple[float, ...] = (0.15, 0.2), graph_features_enabled: bool = True, batch_rating_types: tuple[BatchRatingType, ...] = ('srs', 'ridge', 'colley'), ordinal_systems: tuple[str, ...] | None = None, ordinal_composite: OrdinalCompositeMethod | None = 'simple_average', matchup_deltas: bool = True, gender_scope: GenderScope = 'M', dataset_scope: DatasetScope = 'kaggle', elo_enabled: bool = False, elo_config: EloConfig | None = None)[source]¶
Bases: object
Declarative specification of which feature blocks and parameters to use.
- sequential_windows¶
Rolling window sizes for sequential features (e.g., (5, 10, 20)).
- Type:
tuple[int, …]
- ewma_alphas¶
EWMA smoothing factors for sequential features (e.g., (0.15, 0.20)).
- Type:
tuple[float, …]
- graph_features_enabled¶
Whether to compute graph centrality features (PageRank, etc.).
- Type:
bool
- batch_rating_types¶
Which batch rating systems to include ("srs", "ridge", "colley").
- Type:
tuple[BatchRatingType, …]
- ordinal_systems¶
Massey ordinal systems to use; None means use coverage-gate defaults.
- Type:
tuple[str, …] | None
- ordinal_composite¶
Composite method: "simple_average", "weighted", "pca", or None to disable.
- Type:
OrdinalCompositeMethod | None
- matchup_deltas¶
Whether to compute team_A − team_B deltas for matchup features.
- Type:
bool
- gender_scope¶
"M" for men’s, "W" for women’s.
- Type:
GenderScope
- dataset_scope¶
"kaggle" for Kaggle-only games, "all" for Kaggle + ESPN enrichment.
- Type:
DatasetScope
- active_blocks() frozenset[FeatureBlock][source]¶
Return the set of feature blocks that are currently enabled.
Checks each configuration flag (sequential windows, graph enabled, batch rating types, ordinal composite, Elo enabled) and adds the corresponding FeatureBlock enum value to a set, with SEED always included.
- batch_rating_types: tuple[BatchRatingType, ...] = ('srs', 'ridge', 'colley')¶
- dataset_scope: DatasetScope = 'kaggle'¶
- elo_enabled: bool = False¶
- ewma_alphas: tuple[float, ...] = (0.15, 0.2)¶
- gender_scope: GenderScope = 'M'¶
- graph_features_enabled: bool = True¶
- matchup_deltas: bool = True¶
- ordinal_composite: OrdinalCompositeMethod | None = 'simple_average'¶
- ordinal_systems: tuple[str, ...] | None = None¶
- sequential_windows: tuple[int, ...] = (5, 10, 20)¶
- class ncaa_eval.transform.GraphTransformer(margin_cap: int = 25, recency_window_days: int = 20, recency_multiplier: float = 1.5)[source]¶
Bases: object
Transform game DataFrames into graph-based centrality features.
Provides both batch (build + compute in one call) and incremental (add_game_to_graph) update strategies for walk-forward backtesting efficiency.
- Typical walk-forward usage (Story 4.7):
    import networkx as nx

    transformer = GraphTransformer()
    graph = nx.DiGraph()
    prev_pagerank: dict[int, float] | None = None
    for game in chronological_games:
        transformer.add_game_to_graph(graph, …)
        features_df = transformer.compute_features(graph, pagerank_init=prev_pagerank)
        # Warm-start the next PageRank run from the current scores.
        prev_pagerank = dict(zip(features_df["team_id"], features_df["pagerank"]))
- add_game_to_graph(graph: DiGraph, w_team_id: int, l_team_id: int, margin: int, day_num: int, reference_day_num: int) None[source]¶
Add a single game to an existing graph in-place.
Supports incremental walk-forward updates without rebuilding the full graph. Edge direction: l_team_id → w_team_id (loser votes for winner).
- Parameters:
graph – Existing nx.DiGraph to update in-place.
w_team_id – Winner team ID.
l_team_id – Loser team ID.
margin – Margin of victory (absolute score difference).
day_num – Day number of the game.
reference_day_num – Reference day for recency window evaluation.
- build_graph(games_df: DataFrame, reference_day_num: int | None = None) DiGraph[source]¶
Build a season graph from a games DataFrame.
- Parameters:
games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score, day_num.
reference_day_num – Reference day for recency weighting. Defaults to max day_num.
- Returns:
nx.DiGraph with loser→winner edges and “weight” attribute.
- compute_features(graph: DiGraph, pagerank_init: dict[int, float] | None = None) DataFrame[source]¶
Compute all centrality features for every team node in the graph.
- Parameters:
graph – nx.DiGraph with loser→winner edges.
pagerank_init – Optional warm-start dict (team_id → probability) from a previous compute_pagerank() call. Reduces PageRank iterations from ~30–50 to ~2–5.
- Returns:
pd.DataFrame with columns ["team_id", "pagerank", "betweenness_centrality", "hits_hub", "hits_authority", "clustering_coefficient"], one row per team node. Returns an empty DataFrame with the correct columns if the graph has no nodes.
- transform(games_df: DataFrame, reference_day_num: int | None = None, pagerank_init: dict[int, float] | None = None) DataFrame[source]¶
Convenience method: build graph then compute all centrality features.
- Parameters:
games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score, day_num.
reference_day_num – Reference day for recency weighting. Defaults to max day_num.
pagerank_init – Optional warm-start dict for PageRank.
- Returns:
pd.DataFrame with centrality features, one row per team. Returns empty DataFrame with correct columns if games_df is empty.
- class ncaa_eval.transform.IsotonicCalibrator[source]¶
Bases: object
Non-parametric monotonic probability calibration.
Wraps sklearn.isotonic.IsotonicRegression with y_min=0.0, y_max=1.0, and out_of_bounds="clip" for probability bounds.
Example:
    cal = IsotonicCalibrator()
    cal.fit(y_true_train, y_prob_train)
    calibrated = cal.transform(y_prob_test)
- fit(y_true: ndarray[tuple[Any, ...], dtype[float64]], y_prob: ndarray[tuple[Any, ...], dtype[float64]]) None[source]¶
Fit the isotonic regression on training fold predictions.
- Parameters:
y_true – Binary labels (0 or 1) from the training fold.
y_prob – Model-predicted probabilities from the training fold.
- transform(y_prob: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Apply calibration to test fold predictions.
- Parameters:
y_prob – Model-predicted probabilities to calibrate.
- Returns:
Calibrated probabilities in [0, 1].
- Raises:
RuntimeError – If fit() has not been called.
- class ncaa_eval.transform.MasseyOrdinalsStore(df: DataFrame)[source]¶
Bases: object
DataFrame-backed store for Massey Ordinal ranking systems.
Ingests MMasseyOrdinals.csv and provides temporal filtering, coverage gate validation, composite computation (Options A–D), and per-system normalization.
- Parameters:
df – Raw DataFrame with columns [Season, RankingDayNum, SystemName, TeamID, OrdinalRank].
- composite_pca(season: int, day_num: int, n_components: int | None = None, min_variance: float = 0.9) DataFrame[source]¶
Option C: PCA reduction of all available systems.
When n_components=None, automatically selects the minimum number of components needed to capture min_variance of total variance.
- Parameters:
season – Season year.
day_num – Temporal cutoff (inclusive).
n_components – Number of principal components to retain. None triggers automatic selection based on min_variance.
min_variance – Minimum cumulative explained variance required when n_components=None (default 0.90 = 90%).
- Returns:
DataFrame with columns PC1, PC2, ... indexed by TeamID. Rows with any NaN system value are dropped before PCA.
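The automatic selection rule (smallest number of components whose cumulative explained variance reaches min_variance) can be sketched independently of any PCA library. The helper name `n_components_for` and the input list of per-component variance ratios are assumptions for illustration; how the store obtains those ratios is not shown here.

```python
from itertools import accumulate


def n_components_for(explained_variance_ratio, min_variance=0.9):
    """Hypothetical helper: smallest k whose cumulative ratio reaches min_variance.

    `explained_variance_ratio` is assumed to be sorted in descending order,
    as PCA implementations conventionally report it.
    """
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= min_variance:
            return k
    # Threshold never reached (e.g. ratios sum below min_variance): keep everything.
    return len(explained_variance_ratio)


k_default = n_components_for([0.6, 0.25, 0.1, 0.05])        # cumulative: 0.6, 0.85, 0.95
k_loose = n_components_for([0.6, 0.25, 0.1, 0.05], 0.8)     # 0.85 already clears 0.8
```

With the default 90% threshold the first example keeps three components; loosening the threshold to 80% keeps two.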
- composite_simple_average(season: int, day_num: int, systems: list[str]) Series[source]¶
Option A: simple average of ordinal ranks across systems per team.
- Parameters:
season – Season year.
day_num – Temporal cutoff (inclusive).
systems – List of system names to average.
- Returns:
Series indexed by TeamID with mean ordinal rank per team.
- composite_weighted(season: int, day_num: int, weights: dict[str, float]) Series[source]¶
Option B: weighted average of ordinal ranks using caller-supplied weights.
Weights are normalized to sum to 1 before computation.
- Parameters:
season – Season year.
day_num – Temporal cutoff (inclusive).
weights – Mapping of system name → weight (any positive floats). Must not be empty.
- Returns:
Series indexed by TeamID with weighted ordinal rank per team.
- Raises:
ValueError – If weights is empty.
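A minimal sketch of the weighting step: the caller's weights are rescaled to sum to 1 and then applied to each team's ordinal ranks. `weighted_composite` and the nested-dict input are hypothetical; the sketch also assumes every team has a rank in every weighted system, which the real method may handle differently.

```python
def weighted_composite(ranks, weights):
    """Hypothetical sketch of a normalized weighted average of ordinal ranks.

    ranks:   team_id -> {system_name: ordinal_rank}
    weights: system_name -> positive weight (need not sum to 1)
    """
    if not weights:
        raise ValueError("weights must not be empty")
    total = sum(weights.values())
    normalized = {system: w / total for system, w in weights.items()}
    return {
        team: sum(normalized[system] * by_system[system] for system in normalized)
        for team, by_system in ranks.items()
    }


ranks = {1101: {"POM": 10, "MOR": 20}, 1102: {"POM": 40, "MOR": 40}}
composite = weighted_composite(ranks, {"POM": 3.0, "MOR": 1.0})
```

Weights of 3 and 1 become 0.75 and 0.25, so team 1101 gets 0.75 × 10 + 0.25 × 20 = 12.5.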
- classmethod from_csv(path: Path) MasseyOrdinalsStore[source]¶
Construct from MMasseyOrdinals.csv.
Columns required: Season, RankingDayNum, SystemName, TeamID, OrdinalRank.
- Parameters:
path – Path to MMasseyOrdinals.csv.
- Returns:
Initialised MasseyOrdinalsStore.
- get_snapshot(season: int, day_num: int, systems: list[str] | None = None) DataFrame[source]¶
Return wide-format ordinal ranks as of day_num for season.
For each (SystemName, TeamID) pair, uses the latest RankingDayNum that is ≤ day_num. Returns a DataFrame with TeamID as index and one column per ranking system.
- Parameters:
season – Season year.
day_num – Inclusive upper bound on RankingDayNum.
systems – If provided, only include these system names. None returns all available systems.
- Returns:
Wide-format DataFrame (index=TeamID, columns=SystemName). Empty DataFrame if no records satisfy the filters.
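The "latest ranking on or before day_num" rule is the part that prevents leakage, so it is worth seeing in isolation. The sketch below uses plain tuples and nested dicts instead of DataFrames; `snapshot` and the record layout are assumptions chosen to match the documented column order.

```python
def snapshot(records, season, day_num, systems=None):
    """Hypothetical sketch: latest rank per (system, team) with RankingDayNum <= day_num.

    records: iterable of (Season, RankingDayNum, SystemName, TeamID, OrdinalRank)
    Returns: team_id -> {system_name: ordinal_rank}  (a wide-format view)
    """
    latest = {}  # (system, team) -> (ranking_day, rank)
    for rec_season, ranking_day, system, team, rank in records:
        if rec_season != season or ranking_day > day_num:
            continue  # wrong season, or published after the cutoff
        if systems is not None and system not in systems:
            continue
        key = (system, team)
        if key not in latest or ranking_day > latest[key][0]:
            latest[key] = (ranking_day, rank)

    wide = {}
    for (system, team), (_, rank) in latest.items():
        wide.setdefault(team, {})[system] = rank
    return wide


records = [
    (2023, 100, "POM", 1101, 12),
    (2023, 120, "POM", 1101, 9),
    (2023, 130, "POM", 1101, 5),   # after the cutoff below, so ignored
    (2023, 110, "MOR", 1101, 15),
]
wide = snapshot(records, season=2023, day_num=128)
```

The day-130 POM rank is excluded, so the snapshot reports the day-120 value instead.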
- normalize_percentile(season: int, day_num: int, system: str) Series[source]¶
Return per-season percentile rank for system bounded to [0, 1].
Computed as OrdinalRank / n_teams where n_teams is the number of teams with a rank in this season/system snapshot.
- Parameters:
season – Season year.
day_num – Temporal cutoff (inclusive).
system – System name.
- Returns:
Series indexed by TeamID with percentile values in [0, 1].
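A small worked example of the OrdinalRank / n_teams computation, using a plain dict in place of the snapshot Series. The function name `percentile_ranks` is hypothetical.

```python
def percentile_ranks(ranks):
    """Hypothetical sketch: divide each ordinal rank by the number of ranked teams."""
    n_teams = len(ranks)
    return {team: rank / n_teams for team, rank in ranks.items()}


pct = percentile_ranks({1101: 1, 1102: 2, 1103: 3, 1104: 4})
```

With four ranked teams the best team maps to 0.25 and the worst to 1.0, so lower values still mean better teams.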
- normalize_rank_delta(snapshot: DataFrame, team_a: int, team_b: int, system: str) float[source]¶
Return ordinal rank delta for a matchup between team_a and team_b.
A positive result means team_a is ranked worse (higher rank number = worse) than team_b in this system.
- Parameters:
snapshot – Wide-format snapshot DataFrame (index=TeamID, columns=SystemName) from get_snapshot().
team_a – First team’s canonical TeamID.
team_b – Second team’s canonical TeamID.
system – System name column to use.
- Returns:
snapshot.loc[team_a, system] - snapshot.loc[team_b, system]
- normalize_zscore(season: int, day_num: int, system: str) Series[source]¶
Return per-season z-score for system.
Computed as (rank - mean_rank) / std_rank across all teams in the snapshot.
- Parameters:
season – Season year.
day_num – Temporal cutoff (inclusive).
system – System name.
- Returns:
Series indexed by TeamID with z-score values (mean ≈ 0, std ≈ 1).
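The z-score formula can be checked by hand with the standard library. Whether the store uses the population or the sample standard deviation is not stated above; this sketch assumes the population form (`statistics.pstdev`), and `zscore_ranks` is a hypothetical name.

```python
from statistics import fmean, pstdev


def zscore_ranks(ranks):
    """Hypothetical sketch: (rank - mean_rank) / std_rank over one snapshot.

    Uses the population standard deviation, which is an assumption.
    """
    mean_rank = fmean(ranks.values())
    std_rank = pstdev(ranks.values())
    return {team: (rank - mean_rank) / std_rank for team, rank in ranks.items()}


z = zscore_ranks({1101: 1, 1102: 2, 1103: 3})
```

The middle-ranked team lands at exactly 0 and the other two are symmetric around it.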
- pre_tournament_snapshot(season: int, systems: list[str] | None = None) DataFrame[source]¶
Option D: pre-tournament snapshot using ordinals from RankingDayNum ≤ 128.
DayNum 128 corresponds approximately to Selection Sunday in the Kaggle calendar. Only ordinals available before the tournament begins are used.
- Parameters:
season – Season year.
systems – If provided, only include these system names.
- Returns:
Wide-format DataFrame in the same structure as get_snapshot().
- run_coverage_gate() CoverageGateResult[source]¶
Check whether SAG and WLK cover all seasons 2003–2026.
If either system has gaps the fallback composite (MOR, POM, DOL) is recommended instead of the primary composite (SAG, POM, MOR, WLK).
- Returns:
CoverageGateResult with coverage findings and the recommended system list.
- class ncaa_eval.transform.SeasonGames(year: int, games: list[Game], has_tournament: bool)[source]¶
Bases: object
Result of a chronological season query.
- year¶
Season year (e.g., 2023 for the 2022-23 season).
- Type:
int
- games¶
All qualifying games sorted ascending by (date, game_id).
- Type:
list[Game]
- has_tournament¶
False only for known no-tournament years (e.g., 2020 COVID cancellation). Signals to downstream walk-forward splitters that tournament evaluation should be skipped for this season.
- Type:
bool
- has_tournament: bool¶
- year: int¶
- class ncaa_eval.transform.SequentialTransformer(windows: list[int] | None = None, alphas: list[float] | None = None, alpha_fast: float = 0.2, alpha_slow: float = 0.1, stats: tuple[str, ...] | None = None)[source]¶
Bases: object
Orchestrates all sequential feature computation steps.
Applies OT rescaling, time-decay weighting, rolling windows, EWMA, momentum, streak, per-possession normalization, and Four Factors to a per-team game history in chronological order.
All features respect temporal ordering — no feature for game N uses data from games N+1 or later.
- transform(team_games: DataFrame, reference_day_num: int | None = None) DataFrame[source]¶
Compute all sequential features for a team’s game history.
Input must be sorted by day_num ascending to ensure temporal integrity (no future data leakage).
Orchestration order (critical for correctness):
1. OT rescaling (before any aggregation)
2. Time-decay weights
3. Rolling stats (on OT-rescaled stats, with weights)
4. EWMA (on OT-rescaled stats)
5. Momentum
6. Streak (on original won column)
7. Possessions + per-possession stats
8. Four Factors
- Parameters:
team_games – Per-team game DataFrame sorted by day_num.
reference_day_num – Reference day for time-decay weights. Defaults to the last game’s day_num.
- Returns:
New DataFrame with all feature columns appended to originals. Preserves input row order.
- class ncaa_eval.transform.SigmoidCalibrator[source]¶
Bases: object
Parametric Platt scaling for probability calibration.
Uses logistic regression to fit a sigmoid function mapping raw probabilities to calibrated probabilities. More robust than isotonic regression for small samples (<1000).
Example:
    cal = SigmoidCalibrator()
    cal.fit(y_true_train, y_prob_train)
    calibrated = cal.transform(y_prob_test)
- fit(y_true: ndarray[tuple[Any, ...], dtype[float64]], y_prob: ndarray[tuple[Any, ...], dtype[float64]]) None[source]¶
Fit Platt scaling parameters on training fold predictions.
- Parameters:
y_true – Binary labels (0 or 1) from the training fold.
y_prob – Model-predicted probabilities from the training fold.
- transform(y_prob: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Apply sigmoid calibration to test fold predictions.
- Parameters:
y_prob – Model-predicted probabilities to calibrate.
- Returns:
Calibrated probabilities in [0, 1].
- Raises:
RuntimeError – If fit() has not been called.
- class ncaa_eval.transform.StatefulFeatureServer(config: FeatureConfig, data_server: ChronologicalDataServer, *, seed_table: TourneySeedTable | None = None, ordinals_store: MasseyOrdinalsStore | None = None, elo_engine: EloFeatureEngine | None = None)[source]¶
Bases: object
Combines feature building blocks into a single feature matrix.
Supports two consumption modes:
batch: compute all features for an entire season at once (suitable for stateless models like XGBoost).
stateful: iterate game-by-game, accumulating state incrementally (suitable for Elo-style models; placeholder until Story 4.8).
- Parameters:
config – Declarative specification of which feature blocks to activate.
data_server – Chronological data serving layer wrapping the Repository.
seed_table – Tournament seed lookup table (optional; needed for seed features).
ordinals_store – Massey ordinals store (optional; needed for ordinal features).
elo_engine – Elo feature engine (optional; needed when elo_enabled=True).
- serve_season_features(year: int, mode: Literal['batch', 'stateful'] = 'batch') DataFrame[source]¶
Build the feature matrix for a full season.
- Parameters:
year – Season year (e.g. 2023 for the 2022-23 season).
mode – "batch" or "stateful".
- Returns:
One row per game with metadata, feature deltas, and the target label.
- class ncaa_eval.transform.TeamNameNormalizer(spellings: dict[str, int])[source]¶
Bases: object
Maps diverse team name spellings to canonical TeamID integers.
Wraps the MTeamSpellings.csv lookup table. Matching is case-insensitive. On a miss, a WARNING is logged with any close prefix matches and None is returned (no exception raised). The lookup is idempotent: calling normalize() twice with the same input returns the same result.
- Parameters:
spellings – Pre-lowercased mapping of team_name_spelling → team_id.
- classmethod from_csv(path: Path) TeamNameNormalizer[source]¶
Construct from MTeamSpellings.csv.
Columns required: TeamNameSpelling, TeamID.
- Parameters:
path – Path to MTeamSpellings.csv.
- Returns:
Initialised TeamNameNormalizer.
- class ncaa_eval.transform.TourneySeed(season: int, team_id: int, seed_str: str, region: str, seed_num: int, is_play_in: bool)[source]¶
Bases: object
Structured representation of a single NCAA Tournament seed entry.
- season¶
Season year (e.g., 2023).
- Type:
int
- team_id¶
Canonical Kaggle TeamID integer.
- Type:
int
- seed_str¶
Raw seed string as it appears in MNCAATourneySeeds.csv (e.g., "W01", "X11a").
- Type:
str
- region¶
Single-character region code: W, X, Y, or Z.
- Type:
str
- seed_num¶
Seed number 1–16.
- Type:
int
- is_play_in¶
True when the seed has an 'a' or 'b' suffix, indicating a First Four play-in game.
- Type:
bool
- is_play_in: bool¶
- region: str¶
- season: int¶
- seed_num: int¶
- seed_str: str¶
- team_id: int¶
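The seed-string layout described by these fields (region letter, two-digit seed, optional 'a'/'b' play-in suffix) can be illustrated with a small regex parser. This is a sketch of the format only: `parse_seed_sketch` is a hypothetical name, and the signature and error handling of the package's own parse_seed() are not documented here and may differ.

```python
import re

# Region W/X/Y/Z, two-digit seed number, optional play-in suffix 'a' or 'b'.
_SEED_PATTERN = re.compile(r"^([WXYZ])(\d{2})([ab]?)$")


def parse_seed_sketch(seed_str: str) -> tuple[str, int, bool]:
    """Hypothetical parser returning (region, seed_num, is_play_in)."""
    match = _SEED_PATTERN.match(seed_str)
    if match is None:
        raise ValueError(f"unrecognised seed string: {seed_str!r}")
    region, digits, suffix = match.groups()
    seed_num = int(digits)
    if not 1 <= seed_num <= 16:
        raise ValueError(f"seed number out of range: {seed_num}")
    return region, seed_num, suffix != ""


top_seed = parse_seed_sketch("W01")
play_in = parse_seed_sketch("X11a")
```

The two documented examples parse to a regular 1-seed in region W and an 11-seed play-in entry in region X.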
- class ncaa_eval.transform.TourneySeedTable(seeds: dict[tuple[int, int], TourneySeed])[source]¶
Bases: object
Lookup table for NCAA Tournament seeds by (season, team_id).
Wraps MNCAATourneySeeds.csv into a dict-backed structure. Each seed is stored as a TourneySeed frozen dataclass.
- Parameters:
seeds – Mapping of (season, team_id) → TourneySeed.
- all_seeds(season: int | None = None) list[TourneySeed][source]¶
Return all stored seeds, optionally filtered to a single season.
- Parameters:
season – If provided, only seeds for this season are returned.
- Returns:
List of TourneySeed objects.
- classmethod from_csv(path: Path) TourneySeedTable[source]¶
Construct from MNCAATourneySeeds.csv.
Columns required: Season, Seed, TeamID.
Uses itertuples (not iterrows) for per-row string parsing, which is acceptable because the per-row operation (parse_seed) contains branching logic that cannot be vectorized.
- Parameters:
path – Path to MNCAATourneySeeds.csv.
- Returns:
Initialised TourneySeedTable.
- get(season: int, team_id: int) TourneySeed | None[source]¶
Return the TourneySeed for (season, team_id), or None.
- Parameters:
season – Season year.
team_id – Canonical Kaggle TeamID.
- Returns:
Matching TourneySeed, or None if not found.
- ncaa_eval.transform.apply_ot_rescaling(team_games: DataFrame, stats: tuple[str, ...] = ('fgm', 'fga', 'fgm3', 'fga3', 'ftm', 'fta', 'oreb', 'dreb', 'ast', 'to', 'stl', 'blk', 'pf', 'score', 'opp_score')) DataFrame[source]¶
Rescale all counting stats to 40-minute equivalent for OT games.
Applies: stat_adj = stat × 40 / (40 + 5 × num_ot)
Regulation games (num_ot=0) are unchanged (multiplier = 1.0).
Returns a copy; does not modify the input DataFrame in-place.
- Parameters:
team_games – Per-team game DataFrame containing a num_ot column.
stats – Tuple of stat column names to rescale.
- Returns:
Copy of team_games with rescaled stat columns.
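To make the rescaling concrete, here is the same formula applied to plain dict rows instead of a DataFrame. `ot_multiplier` and `rescale_row` are hypothetical helpers for illustration; like the documented function, the sketch returns a copy rather than mutating its input.

```python
def ot_multiplier(num_ot: int) -> float:
    """40-minute-equivalent multiplier: 40 / (40 + 5 * num_ot)."""
    return 40.0 / (40.0 + 5.0 * num_ot)


def rescale_row(row: dict, stats: tuple[str, ...]) -> dict:
    """Hypothetical sketch: return a copy of `row` with the given stats rescaled."""
    factor = ot_multiplier(row["num_ot"])
    rescaled = dict(row)  # copy, so the caller's row is untouched
    for stat in stats:
        rescaled[stat] = row[stat] * factor
    return rescaled


game = {"num_ot": 1, "score": 90, "opp_score": 88}
adjusted = rescale_row(game, ("score", "opp_score"))
```

A single overtime stretches the game to 45 minutes, so a 90-point total is scaled by 40/45 to about 80 points.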
- ncaa_eval.transform.build_season_graph(games_df: DataFrame, margin_cap: int = 25, reference_day_num: int | None = None, recency_window_days: int = 20, recency_multiplier: float = 1.5) DiGraph[source]¶
Build a directed graph from season game results.
Edges: loser → winner (loser ‘votes for’ winner quality, PageRank metaphor). Edge weight: min(margin, margin_cap) × optional_recency_multiplier.
Parallel edges (same team pair playing multiple times) are aggregated by summing their weights before passing to nx.from_pandas_edgelist().
- Parameters:
games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score, day_num.
margin_cap – Maximum margin-of-victory to use as edge weight (prevents blowout distortion).
reference_day_num – Day number used to compute recency window. Defaults to max day_num in games_df if None.
recency_window_days – Games within this many days of reference_day_num get recency boost.
recency_multiplier – Weight multiplier for recent games within recency_window_days.
- Returns:
nx.DiGraph with edges loser→winner and “weight” attribute on each edge. Returns empty DiGraph if games_df is empty.
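The edge-weight rule described for build_season_graph can be sketched without networkx as a dictionary keyed by (loser, winner). This is only an illustration: the helper name `season_edge_weights` is hypothetical, and the exact boundary of the recency window (here: at most recency_window_days before the reference day, inclusive) is an assumption.

```python
from collections import defaultdict


def season_edge_weights(games, margin_cap=25, reference_day_num=None,
                        recency_window_days=20, recency_multiplier=1.5):
    """Map (loser_id, winner_id) -> summed edge weight."""
    if not games:
        return {}
    if reference_day_num is None:
        reference_day_num = max(g["day_num"] for g in games)
    weights = defaultdict(float)
    for g in games:
        margin = min(g["w_score"] - g["l_score"], margin_cap)
        # Assumption: a game is "recent" if it falls within the window, inclusive.
        is_recent = reference_day_num - g["day_num"] <= recency_window_days
        weight = margin * (recency_multiplier if is_recent else 1.0)
        # Repeat matchups between the same pair are summed into one edge.
        weights[(g["l_team_id"], g["w_team_id"])] += weight
    return dict(weights)


games = [
    {"w_team_id": 1, "l_team_id": 2, "w_score": 80, "l_score": 50, "day_num": 10},
    {"w_team_id": 1, "l_team_id": 2, "w_score": 70, "l_score": 60, "day_num": 95},
    {"w_team_id": 2, "l_team_id": 3, "w_score": 65, "l_score": 64, "day_num": 100},
]
edges = season_edge_weights(games)
```

Here the 30-point blowout is capped at 25 and is too old for the boost, the 10-point rematch is boosted to 15, and the two are summed on the single edge 2 → 1.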
- ncaa_eval.transform.compute_betweenness_centrality(G: DiGraph) → dict[int, float]
Compute betweenness centrality for each team in the graph.
Captures structural “bridge” position — distinct signal from PageRank (strength) and SoS (schedule quality).
- Parameters:
G – Directed graph.
- Returns:
dict mapping team_id → betweenness centrality score. Empty dict if graph has no nodes.
- ncaa_eval.transform.compute_clustering_coefficient(G: DiGraph) → dict[int, float]
Compute undirected clustering coefficient for each team.
Schedule diversity metric: low clustering = broad cross-conference scheduling. Uses undirected conversion so that mutual matchups count once (natural interpretation).
- Parameters:
G – Directed graph (converted to undirected internally).
- Returns:
dict mapping team_id → clustering coefficient. Empty dict if graph has no nodes.
- ncaa_eval.transform.compute_colley_ratings(games_df: DataFrame) → DataFrame
Compute Colley Matrix win/loss-only ratings.
- Parameters:
games_df – DataFrame with columns w_team_id, l_team_id (regular-season games only; scores not used).
- Returns:
DataFrame with columns ["team_id", "colley_rating"].
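The entry above does not spell out the equations, so the following is only a sketch of the standard Colley formulation, in which each team i satisfies (2 + n_i) · r_i − Σ r_opponent = 1 + (w_i − l_i) / 2. It uses Gauss-Seidel sweeps instead of a direct linear solve purely to stay dependency-free; the helper name `colley_sketch`, the tuple-based input, and the sweep count are all assumptions, not the library's implementation.

```python
def colley_sketch(games, sweeps=200):
    """games: list of (winner_id, loser_id) pairs. Returns {team_id: rating}."""
    teams = sorted({team for game in games for team in game})
    wins = {t: 0 for t in teams}
    losses = {t: 0 for t in teams}
    opponents = {t: [] for t in teams}  # one entry per game played
    for winner, loser in games:
        wins[winner] += 1
        losses[loser] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)

    ratings = {t: 0.5 for t in teams}  # 0.5 is the Colley prior for every team
    for _ in range(sweeps):
        for t in teams:
            rhs = 1 + (wins[t] - losses[t]) / 2
            opp_sum = sum(ratings[o] for o in opponents[t])
            ratings[t] = (rhs + opp_sum) / (2 + len(opponents[t]))
    return ratings


ratings = colley_sketch([(1, 2)])  # team 1 beats team 2 once
```

For a single game the system is small enough to solve by hand: the winner lands at 5/8 and the loser at 3/8, which the sweeps converge to.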
- ncaa_eval.transform.compute_ewma_stats(team_games: DataFrame, alphas: list[float], stats: tuple[str, ...]) → DataFrame
Compute EWMA features for all specified alphas and stats.
Uses adjust=False for standard exponential smoothing:
value_t = α × obs_t + (1−α) × value_{t−1}
- Parameters:
team_games – Per-team game DataFrame (sorted by day_num ascending).
alphas – List of smoothing factors (e.g., [0.15, 0.20]).
stats – Tuple of stat column names.
- Returns:
DataFrame with columns ewma_{alpha_str}_{stat}, where alpha_str replaces the decimal point with 'p' (e.g., ewma_0p15_score).
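The recurrence and the column-naming rule above can be illustrated with a small plain-Python sketch. Both helper names are hypothetical, the first observation is assumed to seed the series (as pandas does with adjust=False), and how alphas other than the documented 0.15 example are rendered in column names is not confirmed by the source.

```python
def ewma(values, alpha):
    """Exponential smoothing: value_t = alpha * obs_t + (1 - alpha) * value_{t-1}."""
    smoothed = []
    for i, obs in enumerate(values):
        if i == 0:
            smoothed.append(float(obs))  # first observation seeds the series
        else:
            smoothed.append(alpha * obs + (1 - alpha) * smoothed[-1])
    return smoothed


def ewma_column_name(alpha, stat):
    """Build a column name with the decimal point replaced by 'p'."""
    return f"ewma_{str(alpha).replace('.', 'p')}_{stat}"


scores = [10, 20, 30]
smoothed = ewma(scores, alpha=0.5)          # [10.0, 15.0, 22.5]
column = ewma_column_name(0.15, "score")    # "ewma_0p15_score"
```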
- ncaa_eval.transform.compute_four_factors(team_games: DataFrame, possessions: Series) → DataFrame
Compute Dean Oliver’s Four Factors efficiency ratios.
efg_pct: Effective field goal % = (FGM + 0.5 × FGM3) / FGA
orb_pct: Offensive rebound % = OR / (OR + opp_DR)
ftr: Free throw rate = FTA / FGA
to_pct: Turnover % = TO / possessions
All denominators are guarded against zero (returns NaN when zero).
- Parameters:
team_games – Per-team game DataFrame with box-score columns.
possessions – Series of possession counts (used for TO%).
- Returns:
DataFrame with columns ["efg_pct", "orb_pct", "ftr", "to_pct"].
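As a worked instance of the four ratios above, here is a single-game sketch in plain Python. The function and argument names are illustrative assumptions (the documented function reads these values from DataFrame columns); the zero-denominator guard mirrors the NaN behaviour described in the entry.

```python
import math


def safe_div(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def four_factors(fgm, fgm3, fga, fta, oreb, opp_dreb, to, possessions):
    return {
        "efg_pct": safe_div(fgm + 0.5 * fgm3, fga),
        "orb_pct": safe_div(oreb, oreb + opp_dreb),
        "ftr": safe_div(fta, fga),
        "to_pct": safe_div(to, possessions),
    }


game = four_factors(fgm=21, fgm3=8, fga=50, fta=20,
                    oreb=10, opp_dreb=30, to=12, possessions=60)
# efg_pct = 25 / 50, orb_pct = 10 / 40, ftr = 20 / 50, to_pct = 12 / 60
```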
- ncaa_eval.transform.compute_game_weights(day_nums: Series, reference_day_num: int | None = None) → Series
BartTorvik time-decay weights: a game loses 1% of its weight per day once it is more than 40 days old, down to a floor of 60%.
Formula:
weight = max(0.6, 1 − 0.01 × max(0, days_ago − 40))
- Parameters:
day_nums – Series of game day numbers (ascending order).
reference_day_num – Reference point for days_ago. Defaults to max(day_nums).
- Returns:
Series of weights in [0.6, 1.0] for each game.
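A quick way to sanity-check the decay formula is to evaluate it on a handful of day numbers. The sketch below uses a plain list instead of a pandas Series, and the helper name `game_weights` is an assumption.

```python
def game_weights(day_nums, reference_day_num=None):
    """weight = max(0.6, 1 - 0.01 * max(0, days_ago - 40)) for each game."""
    if reference_day_num is None:
        reference_day_num = max(day_nums)
    weights = []
    for day in day_nums:
        days_ago = reference_day_num - day
        weights.append(max(0.6, 1 - 0.01 * max(0, days_ago - 40)))
    return weights


# Games played 120, 60, 40 and 0 days before the reference day (day 130).
weights = game_weights([10, 70, 90, 130])
```

The oldest game hits the 0.6 floor, the 60-day-old game is weighted about 0.8, and anything 40 days old or newer keeps full weight.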
- ncaa_eval.transform.compute_hits(G: DiGraph, max_iter: int = 100) → tuple[dict[int, float], dict[int, float]]
Compute HITS hub and authority scores for each team.
Authority ≈ PageRank (r≈0.908 correlation). Hub = “quality schedule despite losses” — distinct signal. Both are returned from a single nx.hits() call.
- Parameters:
G – Directed graph.
max_iter – Maximum iterations for HITS power iteration.
- Returns:
Tuple of (hub_dict, authority_dict), each mapping team_id → score. Returns uniform 0.0 scores for all nodes if graph has no edges. Returns uniform 1/n scores on convergence failure (with warning logged).
- ncaa_eval.transform.compute_momentum(team_games: DataFrame, alpha_fast: float, alpha_slow: float, stats: tuple[str, ...]) → DataFrame
Compute ewma_fast − ewma_slow momentum for each stat.
Positive momentum means recent performance is above the longer-term trend (improving form into tournament).
- Parameters:
team_games – Per-team game DataFrame (sorted by day_num ascending).
alpha_fast – Fast EWMA smoothing factor (larger → more reactive).
alpha_slow – Slow EWMA smoothing factor (smaller → smoother baseline).
stats – Tuple of stat column names.
- Returns:
DataFrame with columns momentum_{stat}.
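The fast-minus-slow idea behind compute_momentum can be seen on a short score sequence. This sketch assumes both EWMAs use the adjust=False recurrence documented for compute_ewma_stats; the helper names are hypothetical.

```python
def ewma(values, alpha):
    """adjust=False exponential smoothing, seeded with the first observation."""
    smoothed = []
    for i, obs in enumerate(values):
        prev = smoothed[-1] if i else float(obs)
        smoothed.append(alpha * obs + (1 - alpha) * prev)
    return smoothed


def momentum(values, alpha_fast, alpha_slow):
    """Fast EWMA minus slow EWMA at every game."""
    fast = ewma(values, alpha_fast)
    slow = ewma(values, alpha_slow)
    return [f - s for f, s in zip(fast, slow)]


# Flat scoring followed by a jump: momentum turns positive on the last game.
trend = momentum([60, 60, 80], alpha_fast=0.5, alpha_slow=0.25)
```

After the jump to 80 the fast average reaches 70 while the slow one is only at 65, so the final momentum value is +5.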
- ncaa_eval.transform.compute_pagerank(G: DiGraph, alpha: float = 0.85, nstart: dict[int, float] | None = None) → dict[int, float]
Compute PageRank for each team in the graph.
Captures transitive win-chain strength (2 hops vs. SoS 1 hop). Peer-reviewed NCAA validation: 71.6% vs. 64.2% naive win-ratio (Matthews et al. 2021).
- Parameters:
G – Directed graph with loser→winner edges and “weight” attribute.
alpha – Damping factor (teleportation probability = 1 - alpha).
nstart – Optional warm-start dictionary (team_id → initial probability). Initializing with the previous solution lets the solver converge in 2–5 iterations instead of 30–50.
- Returns:
dict mapping team_id → PageRank score. Empty dict if graph has no nodes.
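To show why a warm start helps, the sketch below runs a weighted PageRank power iteration on a dictionary of loser → winner edges and reports how many iterations it needed. It is a stand-in written for illustration, not the networkx-based implementation: the name `pagerank_sketch`, the uniform handling of teams with no out-edges, and the tolerance are assumptions, and the iteration counts it produces will differ from the figures quoted above.

```python
def pagerank_sketch(edges, alpha=0.85, nstart=None, tol=1e-10, max_iter=200):
    """edges maps (loser_id, winner_id) -> positive weight.

    Returns (scores, iterations_used).
    """
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}, 0
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    if nstart is None:
        rank = {node: 1.0 / n for node in nodes}
    else:
        total = sum(nstart.get(node, 0.0) for node in nodes)
        rank = {node: nstart.get(node, 0.0) / total for node in nodes}

    for iteration in range(1, max_iter + 1):
        # Undefeated teams have no out-edges; spread their rank evenly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        new_rank = {node: (1 - alpha) / n + alpha * dangling / n for node in nodes}
        for (src, dst), weight in edges.items():
            new_rank[dst] += alpha * rank[src] * weight / out_weight[src]
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            return rank, iteration
    return rank, max_iter


edges = {(2, 1): 10.0, (3, 1): 5.0, (3, 2): 5.0}  # team 1 is undefeated
scores, cold_iters = pagerank_sketch(edges)
_, warm_iters = pagerank_sketch(edges, nstart=scores)  # restart from the solution
```

Restarting from the converged scores finishes in fewer iterations than the uniform start, which is the effect the nstart parameter is meant to exploit when the graph changes only slightly between calls.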
- ncaa_eval.transform.compute_per_possession_stats(team_games: DataFrame, stats: tuple[str, ...], possessions: Series) → DataFrame
Normalize counting stats by possessions (per-100 possessions).
- Parameters:
team_games – Per-team game DataFrame.
stats – Tuple of stat column names to normalize.
possessions – Series of possession counts (NaN for guard rows).
- Returns:
DataFrame with columns {stat}_per100.
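The per-100 normalization is easiest to follow together with the possession estimate it depends on (FGA − OR + TO + 0.44 × FTA, as documented for compute_possessions). The following single-game sketch uses hypothetical helper names and scalar inputs instead of DataFrame columns.

```python
import math


def possessions(fga, oreb, to, fta):
    """FGA - OR + TO + 0.44 * FTA; zero or negative results become NaN."""
    value = fga - oreb + to + 0.44 * fta
    return value if value > 0 else math.nan


def per100(stat, poss):
    """Counting stat per 100 possessions; NaN possessions propagate as NaN."""
    return stat / poss * 100


poss = possessions(fga=55, oreb=10, to=14, fta=25)  # 55 - 10 + 14 + 11
score_per100 = per100(77, poss)                     # 77 points on ~70 possessions
```

With roughly 70 possessions, 77 points works out to about 110 points per 100 possessions; a guard row with NaN possessions yields NaN instead of raising a division error.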
- ncaa_eval.transform.compute_possessions(team_games: DataFrame) → Series
Compute possession count: FGA − OR + TO + 0.44 × FTA.
Zero or negative possession counts (rare in short fixtures) are replaced with NaN to prevent division-by-zero downstream.
- Parameters:
team_games – Per-team game DataFrame with box-score columns.
- Returns:
Series named "possessions".
- ncaa_eval.transform.compute_ridge_ratings(games_df: DataFrame, *, lam: float = 20.0, margin_cap: int = 25) → DataFrame
Compute Ridge regression ratings.
- Parameters:
games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score (regular-season games only).
lam – Ridge regularization parameter λ (default 20.0).
margin_cap – Maximum point margin cap per game (default 25).
- Returns:
DataFrame with columns ["team_id", "ridge_rating"].
- ncaa_eval.transform.compute_rolling_stats(team_games: DataFrame, windows: list[int], stats: tuple[str, ...], weights: Series | None = None) → DataFrame
Compute rolling mean features for all specified windows and stats.
No future data leakage: rolling window at position i only uses rows at positions ≤ i (pandas rolling default closed='right').
- Parameters:
team_games – Per-team game DataFrame (sorted by day_num ascending).
windows – List of window sizes (e.g., [5, 10, 20]).
stats – Tuple of stat column names.
weights – Optional per-game weights for weighted rolling mean.
- Returns:
DataFrame with columns rolling_{w}_{stat} and rolling_full_{stat} (expanding mean).
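The no-leakage property can be checked with a plain-Python sketch of the unweighted case: each output at position i looks only at rows up to and including i. The helper names are hypothetical, and returning NaN until a full window is available is an assumption based on the pandas default, not something the entry confirms.

```python
import math


def rolling_mean(values, window):
    """Trailing mean over the last `window` rows, current row included."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(math.nan)  # assumption: incomplete windows are missing
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


def expanding_mean(values):
    """Mean of all rows seen so far (the rolling_full_{stat} analogue)."""
    out, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        out.append(total / count)
    return out


scores = [70, 80, 90, 60]
rolling_2 = rolling_mean(scores, window=2)   # [nan, 75.0, 85.0, 75.0]
rolling_full = expanding_mean(scores)        # [70.0, 75.0, 80.0, 75.0]
```

Changing the final score alters only the final entry of each output, which is what "no future data leakage" means in practice.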
- ncaa_eval.transform.compute_srs_ratings(games_df: DataFrame, *, margin_cap: int = 25, max_iter: int = 10000) → DataFrame
Compute SRS ratings using default solver config.
- Parameters:
games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score (regular-season games only).
margin_cap – Maximum point margin cap per game (default 25).
max_iter – Maximum SRS iterations (default 10,000).
- Returns:
DataFrame with columns ["team_id", "srs_rating"].
- ncaa_eval.transform.compute_streak(won: Series) → Series
Compute signed win/loss streak.
Returns +N for a winning streak of N games, −N for a losing streak. Vectorized using cumsum-based grouping; no iterrows.
- Parameters:
won – Boolean Series of game outcomes (True = win), sorted by day_num.
- Returns:
Integer Series named "streak".
- ncaa_eval.transform.parse_seed(season: int, team_id: int, seed_str: str) → TourneySeed
Parse a raw tournament seed string into a structured TourneySeed.
Seed strings from MNCAATourneySeeds.csv follow the pattern [WXYZ][0-9]{2}[ab]?:
"W01" → region="W", seed_num=1, is_play_in=False
"X16a" → region="X", seed_num=16, is_play_in=True
"Y11b" → region="Y", seed_num=11, is_play_in=True
- Parameters:
season – Season year.
team_id – Canonical Kaggle TeamID.
seed_str – Raw seed string (e.g., "W01", "X11a").
- Returns:
Fully parsed TourneySeed.
- Raises:
ValueError – If seed_str is shorter than 3 characters.
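The parsing rules above can be sketched in a few lines. The `Seed` dataclass below is a hypothetical stand-in for TourneySeed (its exact fields are not shown in this section), and `parse_seed_sketch` is an illustrative name, not the library function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seed:  # hypothetical stand-in for TourneySeed
    season: int
    team_id: int
    region: str
    seed_num: int
    is_play_in: bool


def parse_seed_sketch(season, team_id, seed_str):
    """Split a seed string such as 'X16a' into region, number and play-in flag."""
    if len(seed_str) < 3:
        raise ValueError(f"seed string too short: {seed_str!r}")
    return Seed(
        season=season,
        team_id=team_id,
        region=seed_str[0],
        seed_num=int(seed_str[1:3]),
        is_play_in=len(seed_str) > 3,  # a trailing 'a' or 'b' marks a play-in seed
    )


top_seed = parse_seed_sketch(2023, 1101, "W01")   # team_id value is made up
play_in = parse_seed_sketch(2023, 1102, "X16a")
```

The first call yields region "W", seed 1, no play-in; the second yields region "X", seed 16, with the play-in flag set.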
- ncaa_eval.transform.rescale_overtime(score: int, num_ot: int) → float
Rescale a game score to a 40-minute equivalent for OT normalization.
Overtime games inflate per-game scoring statistics because they involve more than 40 minutes of play. The standard correction (Edwards 2021) normalises every game to a 40-minute basis:
adjusted = raw_score × 40 / (40 + 5 × num_ot)
- Parameters:
score – Raw final score (not adjusted).
num_ot – Number of overtime periods played (0 for regulation).
- Returns:
Score normalised to a 40-minute equivalent.
Examples
>>> rescale_overtime(75, 0)  # Regulation: no change
75.0
>>> rescale_overtime(80, 1)  # 1 OT: 80 × 40 / 45 ≈ 71.11
71.11111111111111