ncaa_eval.ingest package

Module contents

Data ingestion module.

exception ncaa_eval.ingest.AuthenticationError[source]

Bases: ConnectorError

Credentials missing, invalid, or expired.

class ncaa_eval.ingest.Connector[source]

Bases: ABC

Abstract base class for NCAA data source connectors.

All connectors must implement fetch_games(), which is the universal capability. fetch_teams() and fetch_seasons() are optional capabilities — subclasses that do not support them inherit the default implementation, which raises NotImplementedError. Callers should use isinstance() checks or try/except NotImplementedError to probe optional capabilities before calling them.
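
The probing pattern described above can be sketched with minimal stdlib stand-ins (the `GamesOnlyConnector`, `FullConnector`, and `teams_or_empty` names are hypothetical, for illustration only):

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Minimal stand-in mirroring the capability pattern described above."""

    @abstractmethod
    def fetch_games(self, season: int) -> list:
        """Universal capability: every connector must implement this."""

    def fetch_teams(self) -> list:
        # Optional capability: the default implementation signals "unsupported".
        raise NotImplementedError("this connector does not provide team data")


class GamesOnlyConnector(Connector):
    """Hypothetical connector supporting only the universal capability."""

    def fetch_games(self, season: int) -> list:
        return []  # pretend fetch


class FullConnector(Connector):
    """Hypothetical connector that also supports the optional capability."""

    def fetch_games(self, season: int) -> list:
        return []

    def fetch_teams(self) -> list:
        return ["Team A"]


def teams_or_empty(conn: Connector) -> list:
    """Probe the optional capability via try/except, as the docs suggest."""
    try:
        return conn.fetch_teams()
    except NotImplementedError:
        return []
```

Catching NotImplementedError keeps callers decoupled from concrete connector types, at the cost of one failed call per probe.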

abstractmethod fetch_games(season: int) → list[Game][source]

Fetch game results for a given season year.

fetch_seasons() → list[Season][source]

Fetch available seasons from the source.

Optional capability — not all connectors provide season master data.

Raises:

NotImplementedError – If this connector does not support fetching seasons.

fetch_teams() → list[Team][source]

Fetch team data from the source.

Optional capability — not all connectors provide team master data.

Raises:

NotImplementedError – If this connector does not support fetching teams.

exception ncaa_eval.ingest.ConnectorError[source]

Bases: Exception

Base exception for all connector errors.

exception ncaa_eval.ingest.DataFormatError[source]

Bases: ConnectorError

Raw data (CSV / API response) does not match the expected schema.

class ncaa_eval.ingest.EspnConnector(team_name_to_id: dict[str, int], season_day_zeros: dict[int, date])[source]

Bases: Connector

Connector for ESPN game data via the cbbpy scraper.

Parameters:
  • team_name_to_id – Mapping from team name strings to Kaggle TeamIDs.

  • season_day_zeros – Mapping from season year to DayZero date.

fetch_games(season: int) → list[Game][source]

Fetch game results for season from ESPN via cbbpy.

Uses get_team_schedule() for each team in the mapping and deduplicates by ESPN game ID.
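
Per-team scraping naturally yields duplicates, since both participants' schedules contain their head-to-head games. A sketch of dedup-by-game-ID (the dict shape here is illustrative, not cbbpy's actual return type):

```python
def dedupe_by_game_id(rows: list[dict]) -> list[dict]:
    """Keep the first row seen for each ESPN game ID, preserving order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for row in rows:
        gid = row["game_id"]
        if gid not in seen:
            seen.add(gid)
            unique.append(row)
    return unique
```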

class ncaa_eval.ingest.Game(*, GameID: Annotated[str, MinLen(min_length=1)], Season: Annotated[int, Ge(ge=1985)], DayNum: Annotated[int, Ge(ge=0)], Date: date | None = None, WTeamID: Annotated[int, Ge(ge=1)], LTeamID: Annotated[int, Ge(ge=1)], WScore: Annotated[int, Ge(ge=0)], LScore: Annotated[int, Ge(ge=0)], Loc: Literal['H', 'A', 'N'], NumOT: Annotated[int, Ge(ge=0)] = 0, IsTournament: bool = False)[source]

Bases: BaseModel

A single NCAA basketball game result.

date: datetime.date | None
day_num: int
game_id: str
is_tournament: bool
l_score: int
l_team_id: int
loc: Literal['H', 'A', 'N']
model_config = {'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}

Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

num_ot: int
season: int
w_score: int
w_team_id: int
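The signature above accepts Kaggle-style column names as aliases, while the model exposes the snake_case fields listed; with populate_by_name both spellings validate. A stdlib sketch of that alias-to-field renaming (the helper itself is hypothetical, not part of the package):

```python
# Alias -> field mapping taken from the Game signature and field list above.
ALIAS_TO_FIELD = {
    "GameID": "game_id",
    "Season": "season",
    "DayNum": "day_num",
    "Date": "date",
    "WTeamID": "w_team_id",
    "LTeamID": "l_team_id",
    "WScore": "w_score",
    "LScore": "l_score",
    "Loc": "loc",
    "NumOT": "num_ot",
    "IsTournament": "is_tournament",
}


def to_field_names(row: dict) -> dict:
    """Rename Kaggle-style columns to model field names; pass others through."""
    return {ALIAS_TO_FIELD.get(k, k): v for k, v in row.items()}
```
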

class ncaa_eval.ingest.KaggleConnector(extract_dir: Path, competition: str = 'march-machine-learning-mania-2026')[source]

Bases: Connector

Connector for Kaggle March Machine Learning Mania competition data.

Parameters:
  • extract_dir – Local directory where CSV files are downloaded/extracted.

  • competition – Kaggle competition slug.

download(*, force: bool = False) → None[source]

Download and extract competition CSV files via the Kaggle API.

Parameters:

force – Re-download even if files already exist.

fetch_games(season: int) → list[Game][source]

Parse regular-season and tournament CSVs into Game models.

Games from MRegularSeasonCompactResults.csv have is_tournament=False; games from MNCAATourneyCompactResults.csv have is_tournament=True.
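
The two-file split can be sketched as parsing each CSV and tagging its rows (the inline CSV snippets are illustrative; the real files carry more columns):

```python
import csv
import io

# Illustrative snippets in the Kaggle compact-results shape.
REGULAR = "Season,WTeamID,LTeamID\n2025,1101,1102\n"
TOURNEY = "Season,WTeamID,LTeamID\n2025,1101,1103\n"


def parse_games(text: str, *, is_tournament: bool) -> list[dict]:
    """Parse one compact-results CSV and tag each row with the tournament flag."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["is_tournament"] = is_tournament
    return rows


games = parse_games(REGULAR, is_tournament=False) + parse_games(TOURNEY, is_tournament=True)
```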

fetch_seasons() → list[Season][source]

Parse MSeasons.csv into Season models.

Delegates to load_day_zeros() (which already reads and validates MSeasons.csv) to avoid a second disk read and Pandera validation pass.

fetch_team_spellings() → dict[str, int][source]

Parse MTeamSpellings.csv into a spelling → TeamID mapping.

Returns every alternate spelling (lower-cased) for each team, which provides much wider coverage than the canonical names in MTeams.csv when resolving ESPN team name strings to Kaggle IDs.

fetch_teams() → list[Team][source]

Parse MTeams.csv into Team models.

Reads MTeams.csv, validates required columns, then constructs Team models from each row’s TeamID and TeamName.

load_day_zeros() → dict[int, date][source]

Load and cache the season → DayZero mapping.

Returns:

Mapping of season year to the date of Day 0 for that season.
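
DayZero anchors Kaggle's DayNum convention: a game's DayNum is its offset in days from that season's Day 0. A sketch of the conversion in both directions (the DayZero date below is made up for illustration; the helper names are not part of the package):

```python
from datetime import date, timedelta

# Hypothetical season -> DayZero mapping in the shape load_day_zeros() returns.
day_zeros = {2025: date(2024, 11, 4)}


def day_num_for(game_date: date, season: int, zeros: dict[int, date]) -> int:
    """DayNum is the number of days since the season's Day 0."""
    return (game_date - zeros[season]).days


def date_for(day_num: int, season: int, zeros: dict[int, date]) -> date:
    """Inverse: Day 0 plus the offset recovers the calendar date."""
    return zeros[season] + timedelta(days=day_num)
```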

exception ncaa_eval.ingest.NetworkError[source]

Bases: ConnectorError

Connection failure, timeout, or HTTP error.

class ncaa_eval.ingest.ParquetRepository(base_path: Path)[source]

Bases: Repository

Repository implementation backed by Parquet files.

Directory layout:

{base_path}/
    teams.parquet
    seasons.parquet
    games/
        season={year}/
            data.parquet
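
The games directory follows Hive-style partitioning, so the per-season path is pure string arithmetic (a sketch with no Parquet I/O; the helper name is hypothetical):

```python
from pathlib import Path


def games_partition_path(base_path: Path, season: int) -> Path:
    """Path of the per-season games file in the layout shown above."""
    return base_path / "games" / f"season={season}" / "data.parquet"
```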

get_games(season: int) → list[Game][source]

Load games for a single season from hive-partitioned Parquet.

get_seasons() → list[Season][source]

Load all season records from the seasons Parquet file.

get_teams() → list[Team][source]

Load all teams from the teams Parquet file.

save_games(games: list[Game]) → None[source]

Persist game records to hive-partitioned Parquet by season.

save_seasons(seasons: list[Season]) → None[source]

Persist season records to a Parquet file.

save_teams(teams: list[Team]) → None[source]

Persist team records to a Parquet file.

class ncaa_eval.ingest.Repository[source]

Bases: ABC

Abstract base class for NCAA data persistence.

abstractmethod get_games(season: int) → list[Game][source]

Return all games for a given season year.

abstractmethod get_seasons() → list[Season][source]

Return all stored seasons.

abstractmethod get_teams() → list[Team][source]

Return all stored teams.

abstractmethod save_games(games: list[Game]) → None[source]

Persist a collection of games (overwrite per season partition).

abstractmethod save_seasons(seasons: list[Season]) → None[source]

Persist a collection of seasons (overwrite).

abstractmethod save_teams(teams: list[Team]) → None[source]

Persist a collection of teams (overwrite).

class ncaa_eval.ingest.Season(*, Year: Annotated[int, Ge(ge=1985)])[source]

Bases: BaseModel

A single NCAA basketball season (identified by calendar year).

model_config = {'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}

Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

year: int

class ncaa_eval.ingest.SyncEngine(repository: Repository, data_dir: Path)[source]

Bases: object

Orchestrates data sync from external sources into the local repository.

Parameters:
  • repository – Repository instance used for reading and writing data.

  • data_dir – Root directory for local Parquet files and cached CSVs.

sync_all(force_refresh: bool = False) → list[SyncResult][source]

Sync all configured sources: Kaggle first, then ESPN.

Parameters:

force_refresh – Bypass caches for all sources.

Returns:

List of SyncResult, one per source (kaggle, espn).

sync_espn(force_refresh: bool = False) → SyncResult[source]

Sync the most recent season’s games from ESPN.

Requires Kaggle data to be synced first (needs team and season mappings). Uses a marker-file cache: if .espn_synced_{year} exists, the season is considered up-to-date unless force_refresh is set.

ESPN games are merged with existing Kaggle games for the same season partition before saving (because save_games overwrites).
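
The marker-file check and the merge-before-save step can be sketched as follows. The marker file name follows the .espn_synced_{year} convention above; everything else (helper names, the dict rows, and letting fetched rows win on ID collisions) is an illustrative assumption, not necessarily the engine's policy:

```python
import tempfile
from pathlib import Path


def needs_espn_sync(data_dir: Path, year: int, *, force_refresh: bool = False) -> bool:
    """A season counts as up-to-date when its marker file exists, unless forced."""
    marker = data_dir / f".espn_synced_{year}"
    return force_refresh or not marker.exists()


def mark_espn_synced(data_dir: Path, year: int) -> None:
    """Record a successful sync by touching the marker file."""
    (data_dir / f".espn_synced_{year}").touch()


def merge_games(existing: list[dict], fetched: list[dict]) -> list[dict]:
    """Merge fetched rows into the existing season partition before an
    overwriting save; fetched rows win on ID collisions (assumed policy)."""
    by_id = {g["game_id"]: g for g in existing}
    by_id.update({g["game_id"]: g for g in fetched})
    return list(by_id.values())


# Demo in a throwaway directory.
demo_dir = Path(tempfile.mkdtemp())
first_check = needs_espn_sync(demo_dir, 2025)   # no marker yet
mark_espn_synced(demo_dir, 2025)
second_check = needs_espn_sync(demo_dir, 2025)  # marker present
```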

Parameters:

force_refresh – Bypass marker-file cache and re-fetch from ESPN.

Returns:

SyncResult summarising games written and seasons cached.

Raises:

RuntimeError – Kaggle data has not been synced yet.

sync_kaggle(force_refresh: bool = False) → SyncResult[source]

Sync NCAA data from Kaggle with Parquet-level caching.

Downloads CSVs (if not cached) and converts them to Parquet. Skips individual entities whose Parquet files already exist, unless force_refresh is True.

Parameters:

force_refresh – Bypass all caches and re-fetch everything.

Returns:

SyncResult summarising teams/seasons/games written and cached.

class ncaa_eval.ingest.SyncResult(source: str, teams_written: int = 0, seasons_written: int = 0, games_written: int = 0, seasons_cached: int = 0)[source]

Bases: object

Summary of a single source sync operation.

games_written: int = 0
seasons_cached: int = 0
seasons_written: int = 0
source: str
teams_written: int = 0

class ncaa_eval.ingest.Team(*, TeamID: Annotated[int, Ge(ge=1)], TeamName: Annotated[str, MinLen(min_length=1)], CanonicalName: str = '')[source]

Bases: BaseModel

A college basketball team.

canonical_name: str
model_config = {'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}

Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

team_id: int
team_name: str

class ncaa_eval.ingest.ValidationReport(*, results: list[ValidationResult])[source]

Bases: BaseModel

Aggregated results from all validation checks.

property all_passed: bool

Return True if every check passed.

model_config = {'frozen': True}

Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

results: list[ValidationResult]

ncaa_eval.ingest.validate_sync(repo: Repository) → ValidationReport[source]

Run all post-sync validation checks and return a report.

This function is non-fatal — it never raises on validation failures. Unexpected I/O errors (e.g., corrupt Parquet) may still propagate. The caller is responsible for logging results.
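
The non-fatal pattern can be sketched with stdlib stand-ins. The real classes are pydantic models and only results and all_passed are documented above; the name and passed fields on the result stand-in are assumptions:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationResult:
    """Stand-in for one check's outcome; field names are illustrative."""
    name: str
    passed: bool


@dataclass(frozen=True)
class ValidationReport:
    """Stand-in mirroring the documented results / all_passed surface."""
    results: list

    @property
    def all_passed(self) -> bool:
        # True only when every individual check passed.
        return all(r.passed for r in self.results)


# The caller inspects the report (and logs it) instead of catching exceptions.
report = ValidationReport(results=[
    ValidationResult("teams_nonempty", True),
    ValidationResult("games_have_valid_team_ids", False),
])
```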