ncaa_eval.transform.serving module

Chronological data serving layer for walk-forward model training.

Provides ChronologicalDataServer, which wraps a Repository and streams game data in strict date order with temporal boundary enforcement. Downstream consumers (walk-forward splitters, feature pipelines) use this layer to ensure no data from future games leaks into model training.

class ncaa_eval.transform.serving.ChronologicalDataServer(repository: Repository)

Bases: object

Serves game data in strict chronological order for walk-forward modeling.

Wraps a Repository and enforces temporal boundaries so that callers cannot accidentally access future game data during walk-forward validation.

Parameters:

repository – The data store from which games are retrieved.

Example:

from pathlib import Path

from ncaa_eval.ingest.repository import ParquetRepository
from ncaa_eval.transform.serving import ChronologicalDataServer

repo = ParquetRepository(Path("data/"))
server = ChronologicalDataServer(repo)
season = server.get_chronological_season(2023)  # SeasonGames, sorted by (date, game_id)
for daily_batch in server.iter_games_by_date(2023):
    process(daily_batch)  # process() is a placeholder for caller-supplied logic

get_chronological_season(year: int, cutoff_date: date | None = None) → SeasonGames

Return all games for year sorted ascending by (date, game_id).

Applies an optional temporal cutoff so that callers cannot retrieve games that had not yet been played as of a given date. This is the primary leakage-prevention mechanism for walk-forward model training.

Parameters:
  • year – Season year (e.g., 2023 for the 2022-23 season).

  • cutoff_date – If provided, only games on or before this date are returned. Must not be in the future.

Returns:

SeasonGames with games sorted by (date, game_id) and the has_tournament flag reflecting known tournament cancellations.

Raises:

ValueError – If cutoff_date is strictly after today’s date.
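
The cutoff and ordering rules described above can be sketched in isolation. This is a minimal illustration, not the library's implementation: GameStub, filter_and_sort, and the today parameter are hypothetical names introduced here, and a string game_id is assumed.

```python
import datetime as dt
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GameStub:
    """Hypothetical stand-in for ncaa_eval.ingest.schema.Game."""
    game_id: str  # assumption: the real id type may differ
    date: dt.date


def filter_and_sort(games, cutoff_date: Optional[dt.date] = None,
                    today: Optional[dt.date] = None):
    """Sketch of the documented semantics: validate cutoff, filter, sort."""
    today = today or dt.date.today()
    # A cutoff strictly after today is rejected, mirroring the documented ValueError.
    if cutoff_date is not None and cutoff_date > today:
        raise ValueError("cutoff_date must not be in the future")
    # Keep games on or before the cutoff (inclusive), or all games if no cutoff.
    kept = [g for g in games if cutoff_date is None or g.date <= cutoff_date]
    # Sort ascending by (date, game_id) so same-day games have a stable order.
    return sorted(kept, key=lambda g: (g.date, g.game_id))


games = [
    GameStub("g3", dt.date(2023, 1, 7)),
    GameStub("g2", dt.date(2022, 11, 8)),
    GameStub("g1", dt.date(2022, 11, 8)),
]
visible = filter_and_sort(
    games, cutoff_date=dt.date(2022, 12, 31), today=dt.date(2023, 4, 1)
)
print([g.game_id for g in visible])  # ['g1', 'g2']
```

Passing today explicitly is only a convenience for making the sketch deterministic; the documented method takes no such parameter.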

iter_games_by_date(year: int, cutoff_date: date | None = None) → Iterator[list[Game]]

Yield batches of games grouped by calendar date, in chronological order.

Each yielded list contains all games played on a single calendar date. Dates with no games are skipped. Applies the same cutoff_date semantics as get_chronological_season().

Parameters:
  • year – Season year.

  • cutoff_date – Optional temporal cutoff (must not be in the future).

Yields:

Non-empty list[Game] for each calendar date, in ascending order.
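
One plausible way to produce such per-date batches from an already sorted game list is itertools.groupby. The sketch below assumes that approach; GameStub and iter_by_date are hypothetical names, and the library may group games differently.

```python
import datetime as dt
from itertools import groupby
from typing import Iterator, NamedTuple


class GameStub(NamedTuple):
    """Hypothetical stand-in for ncaa_eval.ingest.schema.Game."""
    game_id: str
    date: dt.date


def iter_by_date(sorted_games: list[GameStub]) -> Iterator[list[GameStub]]:
    """Yield one non-empty list per calendar date (sketch, not the library code)."""
    # groupby merges only adjacent items, so the input must already be date-sorted.
    for _, same_day in groupby(sorted_games, key=lambda g: g.date):
        yield list(same_day)


games = [
    GameStub("g1", dt.date(2022, 11, 7)),
    GameStub("g2", dt.date(2022, 11, 7)),
    GameStub("g3", dt.date(2022, 11, 9)),  # no games on 11-08, so no batch for it
]
batches = list(iter_by_date(games))
print([[g.game_id for g in batch] for batch in batches])  # [['g1', 'g2'], ['g3']]
```

Because groups are built only from dates that actually occur in the input, dates without games never produce an empty batch.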

class ncaa_eval.transform.serving.SeasonGames(year: int, games: list[Game], has_tournament: bool)

Bases: object

Result of a chronological season query.

year

Season year (e.g., 2023 for the 2022-23 season).

Type:

int

games

All qualifying games sorted ascending by (date, game_id).

Type:

list[ncaa_eval.ingest.schema.Game]

has_tournament

False only for known no-tournament years (e.g., 2020 COVID cancellation). Signals to downstream walk-forward splitters that tournament evaluation should be skipped for this season.

Type:

bool
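
As an illustration of how a downstream splitter might honor has_tournament, the sketch below filters evaluation seasons on the flag. SeasonStub, NO_TOURNAMENT_YEARS, and evaluation_years are hypothetical names; only 2020 is taken from the description above, and how the library actually derives the flag is not shown here.

```python
from dataclasses import dataclass, field

# Assumption: a hard-coded set of cancelled-tournament years, for illustration only.
NO_TOURNAMENT_YEARS = frozenset({2020})


@dataclass
class SeasonStub:
    """Hypothetical stand-in with the same three fields as SeasonGames."""
    year: int
    games: list = field(default_factory=list)
    has_tournament: bool = True


def evaluation_years(seasons):
    """Return the years for which tournament evaluation should run."""
    return [s.year for s in seasons if s.has_tournament]


seasons = [
    SeasonStub(year=y, has_tournament=y not in NO_TOURNAMENT_YEARS)
    for y in (2019, 2020, 2021)
]
print(evaluation_years(seasons))  # [2019, 2021]
```

Regular-season games from a no-tournament year remain available for training in this sketch; only the tournament evaluation step is skipped.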

ncaa_eval.transform.serving.rescale_overtime(score: int, num_ot: int) → float

Rescale a game score to a 40-minute equivalent for OT normalization.

Overtime games inflate per-game scoring statistics because they involve more than 40 minutes of play. The standard correction (Edwards 2021) normalises every game to a 40-minute basis:

adjusted = raw_score × 40 / (40 + 5 × num_ot)

Parameters:
  • score – Raw final score (not adjusted).

  • num_ot – Number of overtime periods played (0 for regulation).

Returns:

Score normalised to a 40-minute equivalent.

Examples

>>> rescale_overtime(75, 0)   # Regulation: no change
75.0
>>> rescale_overtime(80, 1)   # 1 OT: 80 × 40 / 45 ≈ 71.11
71.11111111111111
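
The formula above can be checked by hand with a short sketch. rescale_overtime_sketch is a hypothetical re-implementation written directly from the stated formula, not the library function, and the 88-84 double-overtime score is invented for illustration.

```python
def rescale_overtime_sketch(score: int, num_ot: int) -> float:
    """Apply adjusted = raw_score * 40 / (40 + 5 * num_ot)."""
    minutes_played = 40 + 5 * num_ot  # each overtime period adds 5 minutes
    return score * 40 / minutes_played


# Hypothetical double-overtime game: 50 minutes played, final score 88-84.
home = rescale_overtime_sketch(88, 2)
away = rescale_overtime_sketch(84, 2)
print(home, away)  # 70.4 67.2
```

Both scores are multiplied by the same factor (40/50 here), so the ratio between them is unchanged while the point margin shrinks in proportion.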