ncaa_eval.transform.sequential module¶
Sequential feature transformations for NCAA basketball game data.
Provides rolling windows, EWMA, momentum, streak, per-possession, and Four Factor features computed from chronologically ordered game data.
- DetailedResultsLoader: loads box-score CSVs and provides per-team, per-season game views in long format.
- SequentialTransformer: orchestrates all sequential feature computation steps in temporal order without data leakage.
Design invariants:
- No imports from ncaa_eval.ingest — pure CSV-loading transform layer.
- No df.iterrows() — vectorized pandas operations throughout.
- mypy --strict compliant: all types fully annotated.
- No hardcoded data paths — accept Path parameters.
- class ncaa_eval.transform.sequential.DetailedResultsLoader(df: DataFrame)[source]¶
Bases: object
Loads detailed box-score results and provides per-team game views.
Reads MRegularSeasonDetailedResults.csv and MNCAATourneyDetailedResults.csv into a combined long-format DataFrame with one row per (team, game).
Box-score stats are only available from the 2003 season onwards. Pre-2003 seasons return empty DataFrames from get_team_season().
- classmethod from_csvs(regular_path: Path, tourney_path: Path) DetailedResultsLoader[source]¶
Construct a loader from the two Kaggle detailed-results CSV paths.
- Parameters:
regular_path – Path to MRegularSeasonDetailedResults.csv.
tourney_path – Path to MNCAATourneyDetailedResults.csv.
- Returns:
DetailedResultsLoader instance with combined data.
- get_season_long_format(season: int) DataFrame[source]¶
Return all games for a season in long format.
- Parameters:
season – Season year (e.g., 2023).
- Returns:
DataFrame sorted by (day_num, team_id), with the index reset.
- get_team_season(team_id: int, season: int) DataFrame[source]¶
Return all games for one team in one season, sorted by day_num.
- Parameters:
team_id – Canonical Kaggle TeamID integer.
season – Season year (e.g., 2023).
- Returns:
DataFrame sorted by day_num ascending, with the index reset. Returns an empty DataFrame if the team or season is not found.
- class ncaa_eval.transform.sequential.SequentialTransformer(windows: list[int] | None = None, alphas: list[float] | None = None, alpha_fast: float = 0.2, alpha_slow: float = 0.1, stats: tuple[str, ...] | None = None)[source]¶
Bases: object
Orchestrates all sequential feature computation steps.
Applies OT rescaling, time-decay weighting, rolling windows, EWMA, momentum, streak, per-possession normalization, and Four Factors to a per-team game history in chronological order.
All features respect temporal ordering — no feature for game N uses data from games N+1 or later.
- transform(team_games: DataFrame, reference_day_num: int | None = None) DataFrame[source]¶
Compute all sequential features for a team’s game history.
Input must be sorted by day_num ascending to ensure temporal integrity (no future data leakage).
Orchestration order (critical for correctness):
1. OT rescaling (before any aggregation)
2. Time-decay weights
3. Rolling stats (on OT-rescaled stats, with weights)
4. EWMA (on OT-rescaled stats)
5. Momentum
6. Streak (on original won column)
7. Possessions + per-possession stats
8. Four Factors
- Parameters:
team_games – Per-team game DataFrame sorted by day_num.
reference_day_num – Reference day for time-decay weights. Defaults to the last game’s day_num.
- Returns:
New DataFrame with all feature columns appended to originals. Preserves input row order.
- ncaa_eval.transform.sequential.apply_ot_rescaling(team_games: DataFrame, stats: tuple[str, ...] = ('fgm', 'fga', 'fgm3', 'fga3', 'ftm', 'fta', 'oreb', 'dreb', 'ast', 'to', 'stl', 'blk', 'pf', 'score', 'opp_score')) DataFrame[source]¶
Rescale all counting stats to 40-minute equivalent for OT games.
Applies: stat_adj = stat × 40 / (40 + 5 × num_ot)
Regulation games (num_ot=0) are unchanged (multiplier = 1.0).
Returns a copy; does not modify the input DataFrame in-place.
- Parameters:
team_games – Per-team game DataFrame containing a num_ot column.
stats – Tuple of stat column names to rescale.
- Returns:
Copy of team_games with rescaled stat columns.
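The rescaling formula can be sketched in plain Python for a single value. This is only an illustration of the arithmetic stated above, not the module's DataFrame implementation; the helper names `ot_multiplier` and `rescale_stat` are hypothetical.

```python
def ot_multiplier(num_ot: int) -> float:
    """Factor mapping a game with num_ot overtime periods to 40 minutes."""
    # Each overtime period adds 5 minutes to the 40-minute regulation game.
    return 40.0 / (40.0 + 5.0 * num_ot)


def rescale_stat(stat: float, num_ot: int) -> float:
    """Apply stat_adj = stat * 40 / (40 + 5 * num_ot) to one counting stat."""
    return stat * ot_multiplier(num_ot)


# Hypothetical values: a regulation game is unchanged,
# while a single-overtime game (45 minutes) is scaled by 40/45.
regulation = rescale_stat(72.0, 0)
single_ot = rescale_stat(90.0, 1)
```

Under these inputs, 90 points in a 45-minute game correspond to roughly 80 points per 40 minutes.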
- ncaa_eval.transform.sequential.compute_ewma_stats(team_games: DataFrame, alphas: list[float], stats: tuple[str, ...]) DataFrame[source]¶
Compute EWMA features for all specified alphas and stats.
Uses adjust=False for standard exponential smoothing: value_t = α × obs_t + (1−α) × value_{t−1}
- Parameters:
team_games – Per-team game DataFrame (sorted by day_num ascending).
alphas – List of smoothing factors (e.g., [0.15, 0.20]).
stats – Tuple of stat column names.
- Returns:
DataFrame with columns ewma_{alpha_str}_{stat} where alpha_str replaces the decimal point with 'p' (e.g., ewma_0p15_score).
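The smoothing recurrence can be written out in plain Python as a sketch. It assumes the series has no missing values and seeds the first smoothed value with the first observation, which is how pandas behaves with adjust=False; the function name `ewma` is hypothetical and this is not the module's vectorized code.

```python
def ewma(values: list[float], alpha: float) -> list[float]:
    """Exponential smoothing: value_t = alpha * obs_t + (1 - alpha) * value_{t-1}."""
    smoothed: list[float] = []
    for obs in values:
        if not smoothed:
            # The first smoothed value is the first observation itself.
            smoothed.append(obs)
        else:
            smoothed.append(alpha * obs + (1.0 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical scores in chronological order; alpha chosen for easy arithmetic.
scores = [70.0, 80.0, 60.0]
smoothed = ewma(scores, alpha=0.5)
```

Each output depends only on the current and earlier observations, so the value for game N never uses later games.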
- ncaa_eval.transform.sequential.compute_four_factors(team_games: DataFrame, possessions: Series) DataFrame[source]¶
Compute Dean Oliver’s Four Factors efficiency ratios.
- efg_pct: Effective field goal % = (FGM + 0.5 × FGM3) / FGA
- orb_pct: Offensive rebound % = OR / (OR + opp_DR)
- ftr: Free throw rate = FTA / FGA
- to_pct: Turnover % = TO / possessions
All denominators are guarded against zero (returns NaN when zero).
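As a rough illustration of the four ratios and the zero-denominator guard, here is a scalar sketch in plain Python. The function and argument names (`safe_ratio`, `four_factors`, `opp_dreb`) are hypothetical; the actual function operates on DataFrame columns.

```python
import math


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator != 0 else math.nan


def four_factors(fgm: float, fgm3: float, fga: float, fta: float,
                 oreb: float, opp_dreb: float, to: float,
                 possessions: float) -> dict[str, float]:
    """Compute the four ratios for a single game (hypothetical helper)."""
    return {
        "efg_pct": safe_ratio(fgm + 0.5 * fgm3, fga),
        "orb_pct": safe_ratio(oreb, oreb + opp_dreb),
        "ftr": safe_ratio(fta, fga),
        "to_pct": safe_ratio(to, possessions),
    }


# Hypothetical box-score line for one game.
game = four_factors(fgm=25, fgm3=8, fga=60, fta=20,
                    oreb=10, opp_dreb=25, to=12, possessions=70.0)
# All-zero inputs hit every guard and yield NaN for each ratio.
empty = four_factors(0, 0, 0, 0, 0, 0, 0, 0.0)
```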
- Parameters:
team_games – Per-team game DataFrame with box-score columns.
possessions – Series of possession counts (used for TO%).
- Returns:
DataFrame with columns ["efg_pct", "orb_pct", "ftr", "to_pct"].
- ncaa_eval.transform.sequential.compute_game_weights(day_nums: Series, reference_day_num: int | None = None) Series[source]¶
BartTorvik-style time-decay weights: a game loses 1% of its weight per day once it is more than 40 days old, with a floor of 60%.
Formula: weight = max(0.6, 1 − 0.01 × max(0, days_ago − 40))
- Parameters:
day_nums – Series of game day numbers (ascending order).
reference_day_num – Reference point for days_ago. Defaults to max(day_nums).
- Returns:
Series of weights in [0.6, 1.0] for each game.
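The weight formula can be checked by hand with a small plain-Python sketch. It assumes days_ago is the reference day minus the game's day number; the helper name `game_weight` and the sample day numbers are hypothetical.

```python
def game_weight(day_num: int, reference_day_num: int) -> float:
    """weight = max(0.6, 1 - 0.01 * max(0, days_ago - 40))."""
    # Assumption: days_ago is measured back from the reference day.
    days_ago = reference_day_num - day_num
    return max(0.6, 1.0 - 0.01 * max(0, days_ago - 40))


# Hypothetical day numbers in ascending order.
day_nums = [10, 60, 100, 130]
reference = max(day_nums)  # mirrors the documented default
weights = [game_weight(d, reference) for d in day_nums]
```

With these inputs the games are 120, 70, 30, and 0 days old, giving weights of about 0.6 (floored), 0.7, 1.0, and 1.0.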
- ncaa_eval.transform.sequential.compute_momentum(team_games: DataFrame, alpha_fast: float, alpha_slow: float, stats: tuple[str, ...]) DataFrame[source]¶
Compute ewma_fast − ewma_slow momentum for each stat.
Positive momentum means recent performance is above the longer-term trend (improving form into tournament).
- Parameters:
team_games – Per-team game DataFrame (sorted by day_num ascending).
alpha_fast – Fast EWMA smoothing factor (larger → more reactive).
alpha_slow – Slow EWMA smoothing factor (smaller → smoother baseline).
stats – Tuple of stat column names.
- Returns:
DataFrame with columns momentum_{stat}.
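A minimal plain-Python sketch of the fast-minus-slow difference follows. It assumes both EWMAs use the adjust=False recurrence described for compute_ewma_stats; the helper names are hypothetical, and the alphas are picked for tidy arithmetic rather than taken from the class defaults.

```python
def ewma(values: list[float], alpha: float) -> list[float]:
    """Exponential smoothing seeded with the first observation."""
    smoothed: list[float] = []
    for obs in values:
        prev = smoothed[-1] if smoothed else obs
        smoothed.append(alpha * obs + (1.0 - alpha) * prev)
    return smoothed


def momentum(values: list[float], alpha_fast: float, alpha_slow: float) -> list[float]:
    """Fast EWMA minus slow EWMA, game by game."""
    fast = ewma(values, alpha_fast)
    slow = ewma(values, alpha_slow)
    return [f - s for f, s in zip(fast, slow)]


# Hypothetical scores that rise over three games.
improving = momentum([60.0, 70.0, 80.0], alpha_fast=0.5, alpha_slow=0.25)
```

Because the fast average reacts sooner to the rising scores, the difference grows positive, matching the "improving form" reading above.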
- ncaa_eval.transform.sequential.compute_per_possession_stats(team_games: DataFrame, stats: tuple[str, ...], possessions: Series) DataFrame[source]¶
Normalize counting stats by possessions (per-100 possessions).
- Parameters:
team_games – Per-team game DataFrame.
stats – Tuple of stat column names to normalize.
possessions – Series of possession counts (NaN for guard rows).
- Returns:
DataFrame with columns {stat}_per100.
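The per-100 normalization is a single division and scale. The sketch below shows it for one value in plain Python, assuming a NaN possession count simply propagates to the result; the helper name `per100` is hypothetical.

```python
import math


def per100(stat: float, possessions: float) -> float:
    """Scale a counting stat to a per-100-possessions rate."""
    # Assumption: missing or non-positive possession counts yield NaN.
    if math.isnan(possessions) or possessions <= 0:
        return math.nan
    return 100.0 * stat / possessions


# Hypothetical game: 77 points on 70 possessions.
points_per100 = per100(77.0, 70.0)
guarded = per100(77.0, math.nan)
```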
- ncaa_eval.transform.sequential.compute_possessions(team_games: DataFrame) Series[source]¶
Compute possession count: FGA − OR + TO + 0.44 × FTA.
Zero or negative possession counts (rare in short fixtures) are replaced with NaN to prevent division-by-zero downstream.
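A scalar sketch of the possession estimate and its NaN guard, in plain Python. The function name `estimate_possessions` and the sample box-score values are hypothetical; the real function works on whole columns.

```python
import math


def estimate_possessions(fga: float, oreb: float, to: float, fta: float) -> float:
    """Possessions = FGA - OR + TO + 0.44 * FTA, with non-positive counts set to NaN."""
    count = fga - oreb + to + 0.44 * fta
    # Guard against division by zero in later per-possession ratios.
    return count if count > 0 else math.nan


# Hypothetical box-score line.
typical = estimate_possessions(fga=60, oreb=10, to=12, fta=20)
degenerate = estimate_possessions(fga=0, oreb=0, to=0, fta=0)
```

For the sample line this gives 60 − 10 + 12 + 8.8, or about 70.8 possessions.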
- Parameters:
team_games – Per-team game DataFrame with box-score columns.
- Returns:
Series named "possessions".
- ncaa_eval.transform.sequential.compute_rolling_stats(team_games: DataFrame, windows: list[int], stats: tuple[str, ...], weights: Series | None = None) DataFrame[source]¶
Compute rolling mean features for all specified windows and stats.
No future data leakage: rolling window at position i only uses rows at positions ≤ i (pandas rolling default closed='right').
- Parameters:
team_games – Per-team game DataFrame (sorted by day_num ascending).
windows – List of window sizes (e.g., [5, 10, 20]).
stats – Tuple of stat column names.
weights – Optional per-game weights for weighted rolling mean.
- Returns:
DataFrame with columns rolling_{w}_{stat} and rolling_full_{stat} (expanding mean).
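To illustrate the trailing-window idea, here is a plain-Python sketch. It makes two assumptions that the documentation above does not confirm: partial windows at the start of the season are averaged over the games available, and the weighted mean is sum(w·x) / sum(w) within each window. The function names are hypothetical.

```python
from typing import Optional


def rolling_mean(values: list[float], window: int,
                 weights: Optional[list[float]] = None) -> list[float]:
    """Trailing mean: position i uses only rows at positions <= i."""
    if weights is None:
        weights = [1.0] * len(values)
    out: list[float] = []
    for i in range(len(values)):
        start = max(0, i - window + 1)  # assumption: partial windows allowed
        xs = values[start:i + 1]
        ws = weights[start:i + 1]
        # Assumption: weights are normalized within each window.
        out.append(sum(w * x for w, x in zip(ws, xs)) / sum(ws))
    return out


def expanding_mean(values: list[float]) -> list[float]:
    """Mean of all games so far (a window as long as the whole series)."""
    return rolling_mean(values, window=max(len(values), 1))


# Hypothetical scores in chronological order.
scores = [60.0, 70.0, 80.0, 90.0]
rolling_2 = rolling_mean(scores, window=2)
rolling_full = expanding_mean(scores)
```

Changing the last score leaves every earlier output unchanged, which is the no-leakage property in practice.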
- ncaa_eval.transform.sequential.compute_streak(won: Series) Series[source]¶
Compute signed win/loss streak.
Returns +N for a winning streak of N games, −N for a losing streak. Vectorized using cumsum-based grouping; no iterrows.
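The cumsum-based grouping can be mimicked in plain Python with itertools.accumulate. This is a sketch of the idea, not the module's pandas code; the function name `signed_streak` is hypothetical.

```python
from itertools import accumulate


def signed_streak(won: list[bool]) -> list[int]:
    """Signed run length: +N for N straight wins, -N for N straight losses."""
    # Mark 1 wherever the outcome differs from the previous game.
    changed = [1 if i == 0 or won[i] != won[i - 1] else 0 for i in range(len(won))]
    # The cumulative sum of change markers gives each run its own id.
    run_ids = list(accumulate(changed))
    counts: dict[int, int] = {}
    streak: list[int] = []
    for outcome, run_id in zip(won, run_ids):
        # Running count within the run is the streak length so far.
        counts[run_id] = counts.get(run_id, 0) + 1
        streak.append(counts[run_id] if outcome else -counts[run_id])
    return streak


# Hypothetical outcomes in chronological order.
results = [True, True, False, True, True, True, False, False]
streaks = signed_streak(results)
```

In pandas the same pattern is usually written as a comparison with the shifted series, a cumsum to label runs, and a grouped cumulative count.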
- Parameters:
won – Boolean Series of game outcomes (True = win), sorted by day_num.
- Returns:
Integer Series named "streak".