ncaa_eval.transform.normalization module

Canonical team ID mapping, lookup tables, and Massey Ordinals ingestion.

Provides normalization and lookup infrastructure for the feature pipeline:

TeamNameNormalizer — maps diverse team name spellings to canonical TeamID integers using MTeamSpellings.csv.
TourneySeedTable — wraps MNCAATourneySeeds.csv into a structured (season, team_id) → TourneySeed lookup.
ConferenceLookup — wraps MTeamConferences.csv into a (season, team_id) → conf_abbrev lookup.
MasseyOrdinalsStore — DataFrame-backed store for MMasseyOrdinals.csv with temporal filtering, coverage gate, and composite computation methods.

Design invariants:

No imports from ncaa_eval.ingest: this module is a pure CSV-loading layer.
No df.iterrows(): vectorized pandas operations throughout; itertuples is acceptable only for non-vectorizable dict construction with string parsing.
mypy --strict compliant: all types fully annotated, no bare Any.

class ncaa_eval.transform.normalization.ConferenceLookup(lookup: dict[tuple[int, int], str])

Bases: object

Lookup table for team conference membership by (season, team_id).

Wraps MTeamConferences.csv into a dict-backed structure.

Parameters: lookup – Mapping of (season, team_id) → conf_abbrev.

classmethod from_csv(path: Path) → ConferenceLookup

Construct from MTeamConferences.csv.

Columns required: Season, TeamID, ConfAbbrev.

Parameters: path – Path to MTeamConferences.csv.
Returns: Initialised ConferenceLookup.

get(season: int, team_id: int) → str | None

Return the conference abbreviation for (season, team_id), or None.

Parameters:

season – Season year.
team_id – Canonical Kaggle TeamID.

Returns:

Conference abbreviation string, or None if not found.
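
The dict-backed lookup described above can be illustrated with a minimal standard-library sketch. It is not the module's implementation: the function name build_conference_lookup and the sample rows are made up for illustration, and only the column names (Season, TeamID, ConfAbbrev) come from the documentation.

```python
import csv
import io
from collections.abc import Iterable

# Made-up sample rows in the documented MTeamConferences.csv column layout.
SAMPLE_CSV = """Season,TeamID,ConfAbbrev
2022,1101,wac
2023,1101,wac
2023,1104,sec
"""


def build_conference_lookup(lines: Iterable[str]) -> dict[tuple[int, int], str]:
    """Build a (season, team_id) -> conf_abbrev mapping from CSV lines."""
    reader = csv.DictReader(lines)
    return {
        (int(row["Season"]), int(row["TeamID"])): row["ConfAbbrev"]
        for row in reader
    }


lookup = build_conference_lookup(io.StringIO(SAMPLE_CSV))

# dict.get returns None on a miss, which mirrors the documented get() contract.
sec_team = lookup.get((2023, 1104))
missing = lookup.get((1999, 1104))
```

Keying the dict on the (season, team_id) tuple keeps each lookup a single hash probe and makes the "not found" case a plain None rather than an exception.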

class ncaa_eval.transform.normalization.CoverageGateResult(primary_systems: tuple[str, ...], fallback_used: bool, fallback_reason: str, recommended_systems: tuple[str, ...])

Bases: object

Result of the Massey Ordinals coverage gate check.

primary_systems

The four primary composite systems (SAG, POM, MOR, WLK).

Type: tuple[str, ...]

fallback_used

True when SAG or WLK is missing for one or more seasons in 2003–2026 and the fallback composite is recommended.

Type: bool

fallback_reason

Human-readable description of why the fallback was triggered (empty string when fallback_used=False).

Type: str

recommended_systems

The system names the caller should use for composite computation: either the primary composite (SAG, POM, MOR, WLK) or the confirmed-full-coverage fallback (MOR, POM, DOL).

Type: tuple[str, ...]

class ncaa_eval.transform.normalization.MasseyOrdinalsStore(df: DataFrame)

Bases: object

DataFrame-backed store for Massey Ordinal ranking systems.

Ingests MMasseyOrdinals.csv and provides temporal filtering, coverage gate validation, composite computation (Options A–D), and per-system normalization.

Parameters: df – Raw DataFrame with columns [Season, RankingDayNum, SystemName, TeamID, OrdinalRank].

composite_pca(season: int, day_num: int, n_components: int | None = None, min_variance: float = 0.9) → DataFrame

Option C: PCA reduction of all available systems.

When n_components=None, automatically selects the minimum number of components needed to capture min_variance of total variance.

Parameters:

season – Season year.
day_num – Temporal cutoff (inclusive).
n_components – Number of principal components to retain. None triggers automatic selection based on min_variance.
min_variance – Minimum cumulative explained variance required when n_components=None (default 0.90 = 90%).

Returns:

DataFrame with columns PC1, PC2, ... indexed by TeamID. Rows with any NaN system value are dropped before PCA.
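
The automatic component selection can be sketched with NumPy alone. This is an assumption-laden illustration, not the method's actual code: the function name pca_auto is hypothetical, the input is a plain array rather than a snapshot DataFrame, columns are centered but not rescaled, and the real implementation may rely on a different PCA routine.

```python
import numpy as np


def pca_auto(values: np.ndarray, min_variance: float = 0.9) -> np.ndarray:
    """Project rows onto the fewest principal components reaching min_variance."""
    # Drop rows (teams) that have a NaN in any system column.
    complete = values[~np.isnan(values).any(axis=1)]
    centered = complete - complete.mean(axis=0)
    # Squared singular values of the centered matrix are proportional to the
    # variance explained by each principal component.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    cumulative = np.cumsum(s**2 / np.sum(s**2))
    # Smallest k whose cumulative explained variance is >= min_variance,
    # capped at the number of available components.
    k = min(int(np.searchsorted(cumulative, min_variance)) + 1, len(s))
    return centered @ vt[:k].T


# Made-up ranks: six teams (rows) in three systems (columns), one missing value.
values = np.array(
    [
        [1, 2, 1],
        [2, 1, 3],
        [3, 3, 2],
        [4, 5, np.nan],
        [5, 4, 5],
        [6, 6, 6],
    ],
    dtype=float,
)

projected = pca_auto(values)  # one row per complete team, one column per PC
```

Because the ranking systems tend to agree with one another, a small number of components is usually enough to reach the threshold, which is the motivation for selecting the count from min_variance instead of fixing it.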

composite_simple_average(season: int, day_num: int, systems: list[str]) → Series

Option A: simple average of ordinal ranks across systems per team.

Parameters:

season – Season year.
day_num – Temporal cutoff (inclusive).
systems – List of system names to average.

Returns:

Series indexed by TeamID with mean ordinal rank per team.

composite_weighted(season: int, day_num: int, weights: dict[str, float]) → Series

Option B: weighted average of ordinal ranks using caller-supplied weights.

Weights are normalized to sum to 1 before computation.

Parameters:

season – Season year.
day_num – Temporal cutoff (inclusive).
weights – Mapping of system name → weight (any positive floats). Must not be empty.

Returns:

Series indexed by TeamID with weighted ordinal rank per team.

Raises:

ValueError – If weights is empty.
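
The weight normalization can be shown for a single team in plain Python. The helper name weighted_rank and the rank values are hypothetical; how the real method treats a team that is missing from one of the weighted systems is not documented, so this sketch assumes every system is present.

```python
def weighted_rank(ranks: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of one team's ordinal ranks; weights are rescaled to sum to 1."""
    if not weights:
        raise ValueError("weights must not be empty")
    total = sum(weights.values())
    return sum(ranks[system] * w / total for system, w in weights.items())


# Made-up ranks for one team in three systems.
team_ranks = {"POM": 4.0, "MOR": 8.0, "DOL": 6.0}

# Raw weights 2:1:1 are rescaled to 0.5, 0.25, 0.25.
score = weighted_rank(team_ranks, {"POM": 2.0, "MOR": 1.0, "DOL": 1.0})

# Equal weights reduce to the simple average of the three ranks.
equal = weighted_rank(team_ranks, {"POM": 1.0, "MOR": 1.0, "DOL": 1.0})
```

Here score is 4·0.5 + 8·0.25 + 6·0.25 = 5.5, while the equal-weight case gives the plain mean of 6.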

classmethod from_csv(path: Path) → MasseyOrdinalsStore

Construct from MMasseyOrdinals.csv.

Columns required: Season, RankingDayNum, SystemName, TeamID, OrdinalRank.

Parameters: path – Path to MMasseyOrdinals.csv.
Returns: Initialised MasseyOrdinalsStore.

get_snapshot(season: int, day_num: int, systems: list[str] | None = None) → DataFrame

Return wide-format ordinal ranks as of day_num for season.

For each (SystemName, TeamID) pair, uses the latest RankingDayNum that is ≤ day_num. Returns a DataFrame with TeamID as index and one column per ranking system.

Parameters:

season – Season year.
day_num – Inclusive upper bound on RankingDayNum.
systems – If provided, only include these system names. None returns all available systems.

Returns:

Wide-format DataFrame (index=TeamID, columns=SystemName). Empty DataFrame if no records satisfy the filters.
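
One vectorized way to obtain "the latest rank at or before day_num" is to filter, sort, deduplicate, and pivot. The sketch below is an assumption about how this could be done with pandas, not a copy of the method; the function name snapshot and the sample rows are invented, and the optional systems filter is omitted.

```python
import pandas as pd

# Made-up long-format rows in the documented MMasseyOrdinals.csv column layout.
ordinals = pd.DataFrame(
    {
        "Season": [2023] * 6,
        "RankingDayNum": [100, 120, 130, 110, 120, 120],
        "SystemName": ["POM", "POM", "POM", "MOR", "POM", "MOR"],
        "TeamID": [1101, 1101, 1101, 1101, 1102, 1102],
        "OrdinalRank": [10, 8, 5, 12, 20, 25],
    }
)


def snapshot(df: pd.DataFrame, season: int, day_num: int) -> pd.DataFrame:
    """Latest rank per (SystemName, TeamID) with RankingDayNum <= day_num."""
    eligible = df[(df["Season"] == season) & (df["RankingDayNum"] <= day_num)]
    # After sorting by day, keep="last" retains the most recent eligible row
    # for each (system, team) pair.
    latest = eligible.sort_values("RankingDayNum").drop_duplicates(
        subset=["SystemName", "TeamID"], keep="last"
    )
    return latest.pivot(index="TeamID", columns="SystemName", values="OrdinalRank")


wide = snapshot(ordinals, season=2023, day_num=128)
```

With day_num=128 the day-130 POM row for team 1101 is excluded, so the snapshot reports the day-120 rank of 8 for that team. The same cutoff value is the one documented for pre_tournament_snapshot().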

normalize_percentile(season: int, day_num: int, system: str) → Series

Return per-season percentile rank for system bounded to [0, 1].

Computed as OrdinalRank / n_teams where n_teams is the number of teams with a rank in this season/system snapshot.

Parameters:

season – Season year.
day_num – Temporal cutoff (inclusive).
system – System name.

Returns:

Series indexed by TeamID with percentile values in [0, 1].
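
The arithmetic is small enough to show directly. This plain-Python sketch uses a hypothetical helper, percentile_ranks, and made-up ranks; it assumes the ranks within the snapshot run from 1 to n_teams without gaps.

```python
def percentile_ranks(ranks: dict[int, int]) -> dict[int, float]:
    """Divide each team's ordinal rank by the number of ranked teams."""
    n_teams = len(ranks)
    return {team_id: rank / n_teams for team_id, rank in ranks.items()}


# Made-up four-team snapshot for one system: TeamID -> OrdinalRank.
pct = percentile_ranks({1101: 1, 1102: 2, 1103: 3, 1104: 4})
```

Under that assumption the best-ranked team maps to 1 / n_teams (0.25 here) and the worst to 1.0, so smaller values still mean better teams.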

normalize_rank_delta(snapshot: DataFrame, team_a: int, team_b: int, system: str) → float

Return ordinal rank delta for a matchup between team_a and team_b.

A positive result means team_a is ranked worse (higher rank number = worse) than team_b in this system.

Parameters:

snapshot – Wide-format snapshot DataFrame (index=TeamID, columns=SystemName) from get_snapshot().
team_a – First team’s canonical TeamID.
team_b – Second team’s canonical TeamID.
system – System name column to use.

Returns:

snapshot.loc[team_a, system] - snapshot.loc[team_b, system]
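
A short usage sketch makes the sign convention concrete. The snapshot values and team IDs are made up, and rank_delta is a stand-alone re-statement of the documented return expression rather than the method itself.

```python
import pandas as pd

# Made-up wide-format snapshot: index=TeamID, one column per system.
snap = pd.DataFrame(
    {"POM": [3, 15]},
    index=pd.Index([1101, 1102], name="TeamID"),
)


def rank_delta(snapshot: pd.DataFrame, team_a: int, team_b: int, system: str) -> float:
    """Ordinal rank of team_a minus ordinal rank of team_b in one system."""
    return float(snapshot.loc[team_a, system] - snapshot.loc[team_b, system])


delta = rank_delta(snap, 1101, 1102, "POM")     # 3 - 15
reverse = rank_delta(snap, 1102, 1101, "POM")   # 15 - 3
```

A negative delta means team_a holds the better (lower) rank; swapping the arguments flips the sign.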

normalize_zscore(season: int, day_num: int, system: str) → Series

Return per-season z-score for system.

Computed as (rank - mean_rank) / std_rank across all teams in the snapshot.

Parameters:

season – Season year.
day_num – Temporal cutoff (inclusive).
system – System name.

Returns:

Series indexed by TeamID with z-score values (mean ≈ 0, std ≈ 1).
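
The z-score formula can be sketched with the standard library. The helper name zscores is hypothetical, and the choice of the population standard deviation is an assumption: the documentation does not say which estimator is used, and pandas' Series.std() defaults to the sample estimator (ddof=1), which would give slightly smaller magnitudes.

```python
import statistics


def zscores(ranks: dict[int, float]) -> dict[int, float]:
    """(rank - mean_rank) / std_rank for every team in one system snapshot."""
    mean_rank = statistics.fmean(ranks.values())
    # Assumption: population standard deviation (divide by n, not n - 1).
    std_rank = statistics.pstdev(ranks.values())
    return {team_id: (rank - mean_rank) / std_rank for team_id, rank in ranks.items()}


# Made-up five-team snapshot: TeamID -> OrdinalRank.
z = zscores({1101: 1, 1102: 2, 1103: 3, 1104: 4, 1105: 5})
```

Teams ranked better than average receive negative z-scores, since a lower ordinal rank is better.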

pre_tournament_snapshot(season: int, systems: list[str] | None = None) → DataFrame

Option D: pre-tournament snapshot using ordinals from RankingDayNum ≤ 128.

DayNum 128 corresponds approximately to Selection Sunday in the Kaggle calendar. Only ordinals available before the tournament begins are used.

Parameters:

season – Season year.
systems – If provided, only include these system names.

Returns:

Wide-format DataFrame in the same structure as get_snapshot().

run_coverage_gate() → CoverageGateResult

Check whether SAG and WLK cover all seasons 2003–2026.

If either system has gaps, the fallback composite (MOR, POM, DOL) is recommended instead of the primary composite (SAG, POM, MOR, WLK).

Returns: CoverageGateResult with coverage findings and the recommended system list.
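
The gate logic can be outlined with sets of seasons per system. Everything below is a sketch under stated assumptions: GateResult is a hypothetical stand-in for CoverageGateResult, coverage_gate takes a precomputed system → seasons mapping instead of reading the store's DataFrame, and the wording of the reason string is invented.

```python
from dataclasses import dataclass

PRIMARY = ("SAG", "POM", "MOR", "WLK")
FALLBACK = ("MOR", "POM", "DOL")
GATED = ("SAG", "WLK")  # the two systems whose coverage is checked


@dataclass(frozen=True)
class GateResult:  # hypothetical stand-in for CoverageGateResult
    primary_systems: tuple[str, ...]
    fallback_used: bool
    fallback_reason: str
    recommended_systems: tuple[str, ...]


def coverage_gate(
    seasons_by_system: dict[str, set[int]], first: int = 2003, last: int = 2026
) -> GateResult:
    """Recommend the fallback composite if SAG or WLK misses any season."""
    required = set(range(first, last + 1))
    gaps: dict[str, list[int]] = {}
    for system in GATED:
        missing = sorted(required - seasons_by_system.get(system, set()))
        if missing:
            gaps[system] = missing
    if not gaps:
        return GateResult(PRIMARY, False, "", PRIMARY)
    reason = "; ".join(f"{name} missing {len(m)} season(s)" for name, m in gaps.items())
    return GateResult(PRIMARY, True, reason, FALLBACK)


full = set(range(2003, 2027))
ok = coverage_gate({"SAG": full, "WLK": full})
gap = coverage_gate({"SAG": full, "WLK": full - {2024}})
```

Returning the recommendation together with the reason lets the caller log why the fallback was chosen without re-running the check.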

class ncaa_eval.transform.normalization.TeamNameNormalizer(spellings: dict[str, int])

Bases: object

Maps diverse team name spellings to canonical TeamID integers.

Wraps the MTeamSpellings.csv lookup table. Matching is case-insensitive. On a miss, a WARNING is logged with any close prefix matches and None is returned (no exception raised). The lookup is idempotent: calling normalize() twice with the same input returns the same result.

Parameters: spellings – Pre-lowercased mapping of team_name_spelling → team_id.

classmethod from_csv(path: Path) → TeamNameNormalizer

Construct from MTeamSpellings.csv.

Columns required: TeamNameSpelling, TeamID.

Parameters: path – Path to MTeamSpellings.csv.
Returns: Initialised TeamNameNormalizer.

normalize(name: str) → int | None

Look up name and return its canonical TeamID, or None on miss.

Parameters: name – Team name string (any case).
Returns: Canonical TeamID integer, or None if not found.
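
The miss-handling behaviour (log a warning, return None) can be sketched as follows. The function normalize_name, the logger name, the sample spellings, and the team IDs are all hypothetical. In particular, the documentation does not define "close prefix matches", so this sketch assumes it means known spellings that share the first three characters of the query.

```python
import logging
from typing import Optional

logger = logging.getLogger("normalizer_sketch")

# Made-up, pre-lowercased spellings -> TeamID mapping.
SPELLINGS = {"duke": 1181, "duke univ": 1181, "drake": 1179}


def normalize_name(name: str, spellings: dict[str, int]) -> Optional[int]:
    """Case-insensitive lookup; logs a WARNING and returns None on a miss."""
    key = name.lower()
    team_id = spellings.get(key)
    if team_id is None:
        # Assumption: a "close prefix match" shares the first three characters.
        candidates = sorted(s for s in spellings if s.startswith(key[:3]))
        logger.warning("No TeamID for %r; close prefix matches: %s", name, candidates)
    return team_id


hit = normalize_name("DUKE", SPELLINGS)
miss = normalize_name("Dukee", SPELLINGS)
```

Returning None instead of raising keeps bulk ingestion running while the warning log records which spellings still need an entry.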

class ncaa_eval.transform.normalization.TourneySeed(season: int, team_id: int, seed_str: str, region: str, seed_num: int, is_play_in: bool)

Bases: object

Structured representation of a single NCAA Tournament seed entry.

season

Season year (e.g., 2023).

Type: int

team_id

Canonical Kaggle TeamID integer.

Type: int

seed_str

Raw seed string as it appears in MNCAATourneySeeds.csv (e.g., "W01", "X11a").

Type: str

region

Single-character region code: W, X, Y, or Z.

Type: str

seed_num

Seed number 1–16.

Type: int

is_play_in

True when the seed has an 'a' or 'b' suffix, indicating a First Four play-in game.

Type: bool

class ncaa_eval.transform.normalization.TourneySeedTable(seeds: dict[tuple[int, int], TourneySeed])

Bases: object

Lookup table for NCAA Tournament seeds by (season, team_id).

Wraps MNCAATourneySeeds.csv into a dict-backed structure. Each seed is stored as a TourneySeed frozen dataclass.

Parameters: seeds – Mapping of (season, team_id) → TourneySeed.

all_seeds(season: int | None = None) → list[TourneySeed]

Return all stored seeds, optionally filtered to a single season.

Parameters: season – If provided, only seeds for this season are returned.
Returns: List of TourneySeed objects.

classmethod from_csv(path: Path) → TourneySeedTable

Construct from MNCAATourneySeeds.csv.

Columns required: Season, Seed, TeamID.

Uses itertuples (not iterrows) for per-row string parsing — acceptable because the per-row operation (parse_seed) contains branching logic that cannot be vectorized.

Parameters: path – Path to MNCAATourneySeeds.csv.
Returns: Initialised TourneySeedTable.

get(season: int, team_id: int) → TourneySeed | None

Return the TourneySeed for (season, team_id), or None.

Parameters:

season – Season year.
team_id – Canonical Kaggle TeamID.

Returns:

Matching TourneySeed, or None if not found.

ncaa_eval.transform.normalization.parse_seed(season: int, team_id: int, seed_str: str) → TourneySeed

Parse a raw tournament seed string into a structured TourneySeed.

Seed strings from MNCAATourneySeeds.csv follow the pattern [WXYZ][0-9]{2}[ab]?:

"W01" → region="W", seed_num=1, is_play_in=False
"X16a" → region="X", seed_num=16, is_play_in=True
"Y11b" → region="Y", seed_num=11, is_play_in=True

Parameters:

season – Season year.
team_id – Canonical Kaggle TeamID.
seed_str – Raw seed string (e.g., "W01", "X11a").

Returns:

Fully parsed TourneySeed.

Raises:

ValueError – If seed_str is shorter than 3 characters.
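
The parsing rule can be sketched with string slicing. SeedSketch and parse_seed_sketch are hypothetical stand-ins for TourneySeed and parse_seed. The sketch only enforces the documented length check and does not validate the region letter, which the real function may or may not do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedSketch:  # hypothetical stand-in for TourneySeed
    season: int
    team_id: int
    seed_str: str
    region: str
    seed_num: int
    is_play_in: bool


def parse_seed_sketch(season: int, team_id: int, seed_str: str) -> SeedSketch:
    """Split a seed string such as "X16a" into region, number, and play-in flag."""
    if len(seed_str) < 3:
        raise ValueError(f"seed string too short: {seed_str!r}")
    return SeedSketch(
        season=season,
        team_id=team_id,
        seed_str=seed_str,
        region=seed_str[0],
        seed_num=int(seed_str[1:3]),
        # The slice is "" for three-character seeds, so the flag is False there.
        is_play_in=seed_str[3:4] in ("a", "b"),
    )


plain = parse_seed_sketch(2023, 1101, "W01")
play_in = parse_seed_sketch(2023, 1102, "X16a")
```

Slicing with seed_str[3:4] avoids an IndexError on three-character seeds, so no separate length branch is needed for the suffix.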