ncaa_eval.transform.normalization module¶
Canonical team ID mapping, lookup tables, and Massey Ordinals ingestion.
Provides normalization and lookup infrastructure for the feature pipeline:
- TeamNameNormalizer: maps diverse team name spellings to canonical TeamID integers using MTeamSpellings.csv.
- TourneySeedTable: wraps MNCAATourneySeeds.csv into a structured (season, team_id) → TourneySeed lookup.
- ConferenceLookup: wraps MTeamConferences.csv into a (season, team_id) → conf_abbrev lookup.
- MasseyOrdinalsStore: DataFrame-backed store for MMasseyOrdinals.csv with temporal filtering, coverage gate, and composite computation methods.
Design invariants:
- No imports from ncaa_eval.ingest — this module is a pure CSV-loading layer.
- No df.iterrows() — vectorized pandas operations throughout; itertuples
is acceptable only for non-vectorizable dict construction with string parsing.
- mypy --strict compliant: all types fully annotated, no bare Any.
- class ncaa_eval.transform.normalization.ConferenceLookup(lookup: dict[tuple[int, int], str])[source]¶
Bases: object
Lookup table for team conference membership by (season, team_id).
Wraps MTeamConferences.csv into a dict-backed structure.
- Parameters:
lookup – Mapping of (season, team_id) → conf_abbrev.
- classmethod from_csv(path: Path) ConferenceLookup[source]¶
Construct from MTeamConferences.csv. Columns required: Season, TeamID, ConfAbbrev.
- Parameters:
path – Path to MTeamConferences.csv.
- Returns:
Initialised ConferenceLookup.
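As a rough illustration of the dict-backed structure described above, the following standard-library sketch builds a (season, team_id) → conf_abbrev mapping from CSV text. It is not the module's implementation (which loads the file with pandas); the function name load_conference_lookup and the sample rows are hypothetical.

```python
import csv
import io
from typing import TextIO

# Hypothetical sample rows in the MTeamConferences.csv column layout.
SAMPLE_CSV = """Season,TeamID,ConfAbbrev
2023,1101,wac
2023,1104,sec
2024,1101,wac
"""


def load_conference_lookup(handle: TextIO) -> dict[tuple[int, int], str]:
    """Build a (season, team_id) -> conf_abbrev dict from CSV text."""
    reader = csv.DictReader(handle)
    return {
        (int(row["Season"]), int(row["TeamID"])): row["ConfAbbrev"]
        for row in reader
    }


lookup = load_conference_lookup(io.StringIO(SAMPLE_CSV))
print(lookup.get((2023, 1104)))  # sec
print(lookup.get((2022, 1104)))  # None: no row for that season/team pair
```

Keying on the (season, team_id) tuple means a team that changes conference between seasons simply has a different value under each season's key.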
- class ncaa_eval.transform.normalization.CoverageGateResult(primary_systems: tuple[str, ...], fallback_used: bool, fallback_reason: str, recommended_systems: tuple[str, ...])[source]¶
Bases: object
Result of the Massey Ordinals coverage gate check.
- primary_systems¶
The four primary composite systems (SAG, POM, MOR, WLK).
- Type:
tuple[str, …]
- fallback_used¶
True when SAG or WLK are missing for one or more seasons 2003–2026 and the fallback composite is recommended.
- Type:
bool
- fallback_reason¶
Human-readable description of why the fallback was triggered (empty string when fallback_used=False).
- Type:
str
- recommended_systems¶
The system names the caller should use for composite computation — either the primary composite or the confirmed-full-coverage fallback (MOR, POM, DOL).
- Type:
tuple[str, …]
- class ncaa_eval.transform.normalization.MasseyOrdinalsStore(df: DataFrame)[source]¶
Bases: object
DataFrame-backed store for Massey Ordinal ranking systems.
Ingests MMasseyOrdinals.csv and provides temporal filtering, coverage gate validation, composite computation (Options A–D), and per-system normalization.
- Parameters:
df – Raw DataFrame with columns [Season, RankingDayNum, SystemName, TeamID, OrdinalRank].
- composite_pca(season: int, day_num: int, n_components: int | None = None, min_variance: float = 0.9) DataFrame[source]¶
Option C: PCA reduction of all available systems.
When n_components=None, automatically selects the minimum number of components needed to capture min_variance of total variance.
- Parameters:
season – Season year.
day_num – Temporal cutoff (inclusive).
n_components – Number of principal components to retain. None triggers automatic selection based on min_variance.
min_variance – Minimum cumulative explained variance required when n_components=None (default 0.90 = 90%).
- Returns:
DataFrame with columns PC1, PC2, ... indexed by TeamID. Rows with any NaN system value are dropped before PCA.
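The automatic selection rule for n_components=None can be sketched in isolation. The helper below is hypothetical (select_n_components is not part of the module) and assumes the per-component explained variance ratios are already available, sorted in descending order, as a PCA routine would typically report them.

```python
from itertools import accumulate
from typing import Sequence


def select_n_components(
    explained_variance_ratio: Sequence[float], min_variance: float = 0.9
) -> int:
    """Smallest k whose cumulative explained variance reaches min_variance."""
    cumulative_sums = accumulate(explained_variance_ratio)
    for k, cumulative in enumerate(cumulative_sums, start=1):
        # Small tolerance so floating-point round-off does not add a component.
        if cumulative >= min_variance - 1e-12:
            return k
    # Threshold never reached: keep every component.
    return len(explained_variance_ratio)


# Hypothetical variance ratios for five ranking systems.
ratios = [0.62, 0.21, 0.09, 0.05, 0.03]
print(select_n_components(ratios))        # 3 (0.62 + 0.21 + 0.09 = 0.92)
print(select_n_components(ratios, 0.80))  # 2 (0.62 + 0.21 = 0.83)
```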
- composite_simple_average(season: int, day_num: int, systems: list[str]) Series[source]¶
Option A: simple average of ordinal ranks across systems per team.
- Parameters:
season – Season year.
day_num – Temporal cutoff (inclusive).
systems – List of system names to average.
- Returns:
Series indexed by TeamID with mean ordinal rank per team.
- composite_weighted(season: int, day_num: int, weights: dict[str, float]) Series[source]¶
Option B: weighted average of ordinal ranks using caller-supplied weights.
Weights are normalized to sum to 1 before computation.
- Parameters:
season – Season year.
day_num – Temporal cutoff (inclusive).
weights – Mapping of system name → weight (any positive floats). Must not be empty.
- Returns:
Series indexed by TeamID with weighted ordinal rank per team.
- Raises:
ValueError – If weights is empty.
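The weight normalization in Option B can be illustrated with plain dictionaries. This is a sketch rather than the pandas-based method: weighted_composite is a hypothetical name, the ranks are made up, and restricting the output to teams ranked by every weighted system is an assumption of the sketch.

```python
def weighted_composite(
    ranks_by_system: dict[str, dict[int, float]], weights: dict[str, float]
) -> dict[int, float]:
    """Weighted mean ordinal rank per team; weights are rescaled to sum to 1."""
    if not weights:
        raise ValueError("weights must not be empty")
    total = sum(weights.values())
    normalized = {system: w / total for system, w in weights.items()}
    # Assumption: only teams ranked by every weighted system get a composite.
    teams = set.intersection(*(set(ranks_by_system[s]) for s in normalized))
    return {
        team: sum(normalized[s] * ranks_by_system[s][team] for s in normalized)
        for team in sorted(teams)
    }


# Hypothetical ordinal ranks for two systems and two teams.
ranks = {
    "POM": {1101: 10, 1104: 2},
    "MOR": {1101: 14, 1104: 4},
}
composite = weighted_composite(ranks, {"POM": 3.0, "MOR": 1.0})
print(composite)  # {1101: 11.0, 1104: 2.5}

# Equal weights reduce to the simple average of Option A.
equal = weighted_composite(ranks, {"POM": 1.0, "MOR": 1.0})
print(equal)  # {1101: 12.0, 1104: 3.0}
```

Because the weights are rescaled first, {"POM": 3.0, "MOR": 1.0} and {"POM": 0.75, "MOR": 0.25} give the same composite.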
- classmethod from_csv(path: Path) MasseyOrdinalsStore[source]¶
Construct from MMasseyOrdinals.csv. Columns required: Season, RankingDayNum, SystemName, TeamID, OrdinalRank.
- Parameters:
path – Path to MMasseyOrdinals.csv.
- Returns:
Initialised MasseyOrdinalsStore.
- get_snapshot(season: int, day_num: int, systems: list[str] | None = None) DataFrame[source]¶
Return wide-format ordinal ranks as of day_num for season.
For each (SystemName, TeamID) pair, uses the latest RankingDayNum that is ≤ day_num. Returns a DataFrame with TeamID as index and one column per ranking system.
- Parameters:
season – Season year.
day_num – Inclusive upper bound on RankingDayNum.
systems – If provided, only include these system names. None returns all available systems.
- Returns:
Wide-format DataFrame (index=TeamID, columns=SystemName). Empty DataFrame if no records satisfy the filters.
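The "latest rank on or before day_num" rule can be sketched without pandas. The function below is a hypothetical stand-in that returns nested dicts instead of a wide DataFrame; the record tuples follow the MMasseyOrdinals.csv column order and the values are invented.

```python
from typing import Iterable, Optional

Record = tuple[int, int, str, int, int]  # Season, RankingDayNum, SystemName, TeamID, OrdinalRank


def snapshot(
    records: Iterable[Record],
    season: int,
    day_num: int,
    systems: Optional[list[str]] = None,
) -> dict[int, dict[str, int]]:
    """Return {team_id: {system: rank}} using the latest rank with day <= day_num."""
    latest: dict[tuple[str, int], tuple[int, int]] = {}  # (system, team) -> (day, rank)
    for rec_season, day, system, team, rank in records:
        if rec_season != season or day > day_num:
            continue  # wrong season, or published after the cutoff
        if systems is not None and system not in systems:
            continue
        key = (system, team)
        if key not in latest or day > latest[key][0]:
            latest[key] = (day, rank)
    wide: dict[int, dict[str, int]] = {}
    for (system, team), (_, rank) in latest.items():
        wide.setdefault(team, {})[system] = rank
    return wide


rows: list[Record] = [
    (2023, 100, "POM", 1101, 12),
    (2023, 120, "POM", 1101, 9),
    (2023, 133, "POM", 1101, 7),   # after the cutoff used below, so ignored
    (2023, 110, "MOR", 1101, 15),
    (2022, 120, "POM", 1101, 30),  # different season, so ignored
]
snap = snapshot(rows, season=2023, day_num=128)
print(snap)  # {1101: {'POM': 9, 'MOR': 15}}
```

The day 133 rank is excluded even though it is the most recent one, which is the point of the temporal cutoff: a feature built for a given day never sees rankings published later.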
- normalize_percentile(season: int, day_num: int, system: str) Series[source]¶
Return per-season percentile rank for system bounded to [0, 1].
Computed as OrdinalRank / n_teams where n_teams is the number of teams with a rank in this season/system snapshot.
- Parameters:
season – Season year.
day_num – Temporal cutoff (inclusive).
system – System name.
- Returns:
Series indexed by TeamID with percentile values in [0, 1].
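The percentile formula can be checked by hand on a small snapshot. This sketch applies OrdinalRank / n_teams to a plain dict; percentile_ranks is a hypothetical name and the ranks are invented.

```python
def percentile_ranks(ranks: dict[int, int]) -> dict[int, float]:
    """OrdinalRank / n_teams for every team in one season/system snapshot."""
    n_teams = len(ranks)
    return {team: rank / n_teams for team, rank in ranks.items()}


# Hypothetical snapshot: four teams ranked 1 through 4 by one system.
ranks = {1101: 1, 1104: 2, 1112: 4, 1116: 3}
pct = percentile_ranks(ranks)
print(pct)  # {1101: 0.25, 1104: 0.5, 1112: 1.0, 1116: 0.75}
```

When ranks run from 1 to n_teams, this formula maps the best team to 1 / n_teams and the worst to 1.0, so lower values indicate better-ranked teams.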
- normalize_rank_delta(snapshot: DataFrame, team_a: int, team_b: int, system: str) float[source]¶
Return ordinal rank delta for a matchup between team_a and team_b.
A positive result means team_a is ranked worse (higher rank number = worse) than team_b in this system.
- Parameters:
snapshot – Wide-format snapshot DataFrame (index=TeamID, columns=SystemName) from get_snapshot().
team_a – First team’s canonical TeamID.
team_b – Second team’s canonical TeamID.
system – System name column to use.
- Returns:
snapshot.loc[team_a, system] - snapshot.loc[team_b, system]
- normalize_zscore(season: int, day_num: int, system: str) Series[source]¶
Return per-season z-score for system.
Computed as (rank - mean_rank) / std_rank across all teams in the snapshot.
- Parameters:
season – Season year.
day_num – Temporal cutoff (inclusive).
system – System name.
- Returns:
Series indexed by TeamID with z-score values (mean ≈ 0, std ≈ 1).
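A small standard-library sketch of the z-score formula follows. The documentation does not say whether std_rank is the sample or population standard deviation; the sketch assumes the sample version (statistics.stdev), and zscores is a hypothetical name.

```python
from statistics import mean, stdev


def zscores(ranks: dict[int, float]) -> dict[int, float]:
    """(rank - mean_rank) / std_rank across all teams in one snapshot."""
    mu = mean(ranks.values())
    # Assumption: sample standard deviation (ddof=1); statistics.pstdev
    # would give the population version instead.
    sigma = stdev(ranks.values())
    return {team: (rank - mu) / sigma for team, rank in ranks.items()}


# Hypothetical snapshot: five teams ranked 1 through 5.
ranks = {1101: 1, 1104: 2, 1112: 3, 1116: 4, 1120: 5}
z = zscores(ranks)
for team, value in z.items():
    print(team, round(value, 3))
```

As with the raw ordinals, a negative z-score here means a better-than-average rank, because lower rank numbers are better.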
- pre_tournament_snapshot(season: int, systems: list[str] | None = None) DataFrame[source]¶
Option D: pre-tournament snapshot using ordinals from RankingDayNum ≤ 128.
DayNum 128 corresponds approximately to Selection Sunday in the Kaggle calendar. Only ordinals available before the tournament begins are used.
- Parameters:
season – Season year.
systems – If provided, only include these system names.
- Returns:
Wide-format DataFrame in the same structure as get_snapshot().
- run_coverage_gate() CoverageGateResult[source]¶
Check whether SAG and WLK cover all seasons 2003–2026.
If either system has gaps the fallback composite (MOR, POM, DOL) is recommended instead of the primary composite (SAG, POM, MOR, WLK).
- Returns:
CoverageGateResult with coverage findings and the recommended system list.
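The gate logic described above can be sketched as a pure function over "which seasons does each system cover". This is an illustration only: coverage_gate and its plain-tuple return value are hypothetical stand-ins for the method and CoverageGateResult, the wording of the reason string is invented, and the coverage data in the example is made up.

```python
from typing import Iterable

REQUIRED_SEASONS = range(2003, 2027)  # seasons 2003 through 2026 inclusive
PRIMARY = ("SAG", "POM", "MOR", "WLK")
FALLBACK = ("MOR", "POM", "DOL")


def coverage_gate(
    seasons_by_system: dict[str, Iterable[int]],
) -> tuple[bool, str, tuple[str, ...]]:
    """Return (fallback_used, fallback_reason, recommended_systems)."""
    gaps: dict[str, list[int]] = {}
    for system in ("SAG", "WLK"):
        covered = set(seasons_by_system.get(system, ()))
        missing = sorted(set(REQUIRED_SEASONS) - covered)
        if missing:
            gaps[system] = missing
    if not gaps:
        return False, "", PRIMARY
    reason = "; ".join(f"{s} missing seasons {m}" for s, m in gaps.items())
    return True, reason, FALLBACK


# Hypothetical coverage: both systems complete, so the primary composite is kept.
full = {"SAG": range(2003, 2027), "WLK": range(2003, 2027)}
print(coverage_gate(full))

# Hypothetical coverage: SAG stops after 2023, so the fallback is recommended.
partial = {"SAG": range(2003, 2024), "WLK": range(2003, 2027)}
print(coverage_gate(partial))
```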
- class ncaa_eval.transform.normalization.TeamNameNormalizer(spellings: dict[str, int])[source]¶
Bases: object
Maps diverse team name spellings to canonical TeamID integers.
Wraps the MTeamSpellings.csv lookup table. Matching is case-insensitive. On a miss, a WARNING is logged with any close prefix matches and None is returned (no exception raised). The lookup is idempotent: calling normalize() twice with the same input returns the same result.
- Parameters:
spellings – Pre-lowercased mapping of team_name_spelling → team_id.
- classmethod from_csv(path: Path) TeamNameNormalizer[source]¶
Construct from MTeamSpellings.csv. Columns required: TeamNameSpelling, TeamID.
- Parameters:
path – Path to MTeamSpellings.csv.
- Returns:
Initialised TeamNameNormalizer.
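The miss-handling behaviour described for this class (log a warning, return None, never raise) might look roughly like the sketch below. It is a standalone function rather than the class itself; the logger name, the spellings, and the rule used for "close prefix matches" (known spellings sharing the first four characters) are all assumptions.

```python
import logging
from typing import Optional

logger = logging.getLogger("normalizer_sketch")  # hypothetical logger name


def normalize(spellings: dict[str, int], name: str) -> Optional[int]:
    """Case-insensitive lookup; log a WARNING and return None on a miss."""
    key = name.lower()
    team_id = spellings.get(key)
    if team_id is None:
        # Assumption: "close" means sharing the first four characters.
        close = sorted(s for s in spellings if s.startswith(key[:4]))[:5]
        logger.warning("No TeamID for %r; close prefix matches: %s", name, close)
    return team_id


# Hypothetical pre-lowercased spellings table.
spellings = {"duke": 1181, "north carolina": 1314, "unc": 1314}

print(normalize(spellings, "DUKE"))               # 1181
print(normalize(spellings, "North Carolina St"))  # None, with a warning logged
```

Returning None instead of raising lets a batch pipeline keep going and collect unmatched names from the log, at the cost of requiring callers to check for None.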
- class ncaa_eval.transform.normalization.TourneySeed(season: int, team_id: int, seed_str: str, region: str, seed_num: int, is_play_in: bool)[source]¶
Bases: object
Structured representation of a single NCAA Tournament seed entry.
- season¶
Season year (e.g., 2023).
- Type:
int
- team_id¶
Canonical Kaggle TeamID integer.
- Type:
int
- seed_str¶
Raw seed string as it appears in MNCAATourneySeeds.csv (e.g., "W01", "X11a").
- Type:
str
- region¶
Single-character region code: W, X, Y, or Z.
- Type:
str
- seed_num¶
Seed number 1–16.
- Type:
int
- is_play_in¶
True when the seed has an 'a' or 'b' suffix, indicating a First Four play-in game.
- Type:
bool
- class ncaa_eval.transform.normalization.TourneySeedTable(seeds: dict[tuple[int, int], TourneySeed])[source]¶
Bases: object
Lookup table for NCAA Tournament seeds by (season, team_id).
Wraps MNCAATourneySeeds.csv into a dict-backed structure. Each seed is stored as a TourneySeed frozen dataclass.
- Parameters:
seeds – Mapping of (season, team_id) → TourneySeed.
- all_seeds(season: int | None = None) list[TourneySeed][source]¶
Return all stored seeds, optionally filtered to a single season.
- Parameters:
season – If provided, only seeds for this season are returned.
- Returns:
List of TourneySeed objects.
- classmethod from_csv(path: Path) TourneySeedTable[source]¶
Construct from MNCAATourneySeeds.csv. Columns required: Season, Seed, TeamID.
Uses itertuples (not iterrows) for per-row string parsing, which is acceptable because the per-row operation (parse_seed) contains branching logic that cannot be vectorized.
- Parameters:
path – Path to MNCAATourneySeeds.csv.
- Returns:
Initialised TourneySeedTable.
- get(season: int, team_id: int) TourneySeed | None[source]¶
Return the TourneySeed for (season, team_id), or None.
- Parameters:
season – Season year.
team_id – Canonical Kaggle TeamID.
- Returns:
Matching TourneySeed, or None if not found.
- ncaa_eval.transform.normalization.parse_seed(season: int, team_id: int, seed_str: str) TourneySeed[source]¶
Parse a raw tournament seed string into a structured TourneySeed.
Seed strings from MNCAATourneySeeds.csv follow the pattern [WXYZ][0-9]{2}[ab]?:
- "W01" → region="W", seed_num=1, is_play_in=False
- "X16a" → region="X", seed_num=16, is_play_in=True
- "Y11b" → region="Y", seed_num=11, is_play_in=True
- Parameters:
season – Season year.
team_id – Canonical Kaggle TeamID.
seed_str – Raw seed string (e.g., "W01", "X11a").
- Returns:
Fully parsed TourneySeed.
- Raises:
ValueError – If seed_str is shorter than 3 characters.
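The parsing rules above are simple enough to sketch in a few lines. The Seed dataclass and parse_seed_sketch below are hypothetical stand-ins that mirror the documented fields and error condition. They are not the module's source, and the error message text is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seed:
    """Stand-in with the same fields as the documented TourneySeed."""
    season: int
    team_id: int
    seed_str: str
    region: str
    seed_num: int
    is_play_in: bool


def parse_seed_sketch(season: int, team_id: int, seed_str: str) -> Seed:
    """Split a seed string such as "X16a" into region, number, and play-in flag."""
    if len(seed_str) < 3:
        raise ValueError(f"seed string too short: {seed_str!r}")
    return Seed(
        season=season,
        team_id=team_id,
        seed_str=seed_str,
        region=seed_str[0],                     # W, X, Y, or Z
        seed_num=int(seed_str[1:3]),            # two-digit seed, e.g. "01" -> 1
        is_play_in=seed_str[3:] in ("a", "b"),  # empty suffix -> False
    )


# Hypothetical team IDs, for illustration only.
print(parse_seed_sketch(2023, 1101, "W01"))
print(parse_seed_sketch(2023, 1104, "X16a"))
```

Freezing the dataclass makes each parsed seed hashable and immutable, which suits a lookup table whose entries should not change after loading.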