ncaa_eval.ingest package

Module contents

Data ingestion module.

exception ncaa_eval.ingest.AuthenticationError[source]

Bases: ConnectorError

Credentials missing, invalid, or expired.

class ncaa_eval.ingest.Connector[source]

Bases: ABC

Abstract base class for NCAA data source connectors.

All connectors must implement fetch_games(), which is the universal capability. fetch_teams() and fetch_seasons() are optional capabilities — subclasses that do not support them inherit the default implementation, which raises NotImplementedError. Callers should use isinstance() checks or try/except NotImplementedError to probe optional capabilities before calling them.
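
The probing pattern described above can be sketched with minimal stdlib stand-ins (the `GamesOnlyConnector`, `FullConnector`, and `teams_or_empty` names are hypothetical, for illustration only):

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Minimal stand-in mirroring the capability pattern described above."""

    @abstractmethod
    def fetch_games(self, season: int) -> list:
        """Universal capability: every connector must implement this."""

    def fetch_teams(self) -> list:
        # Optional capability: the default implementation signals "unsupported".
        raise NotImplementedError("this connector does not provide team data")


class GamesOnlyConnector(Connector):
    """Hypothetical connector supporting only the universal capability."""

    def fetch_games(self, season: int) -> list:
        return []  # pretend fetch


class FullConnector(Connector):
    """Hypothetical connector that also supports the optional capability."""

    def fetch_games(self, season: int) -> list:
        return []

    def fetch_teams(self) -> list:
        return ["Team A"]


def teams_or_empty(conn: Connector) -> list:
    """Probe the optional capability via try/except, as the docs suggest."""
    try:
        return conn.fetch_teams()
    except NotImplementedError:
        return []
```

Catching NotImplementedError keeps callers decoupled from concrete connector types, at the cost of one failed call per probe.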

abstractmethod fetch_games(season: int) → list[Game][source]

Fetch game results for a given season year.

fetch_seasons() → list[Season][source]

Fetch available seasons from the source.

Optional capability — not all connectors provide season master data.

Raises:

NotImplementedError – If this connector does not support fetching seasons.

fetch_teams() → list[Team][source]

Fetch team data from the source.

Optional capability — not all connectors provide team master data.

Raises:

NotImplementedError – If this connector does not support fetching teams.

exception ncaa_eval.ingest.ConnectorError[source]

Bases: Exception

Base exception for all connector errors.

exception ncaa_eval.ingest.DataFormatError[source]

Bases: ConnectorError

Raw data (CSV / API response) does not match the expected schema.

class ncaa_eval.ingest.EspnConnector(team_name_to_id: dict[str, int], season_day_zeros: dict[int, date])[source]

Bases: Connector

Connector for ESPN game data via the cbbpy scraper.

Parameters:
  • team_name_to_id – Mapping from team name strings to Kaggle TeamIDs.

  • season_day_zeros – Mapping from season year to DayZero date.

fetch_games(season: int) → list[Game][source]

Fetch game results for season from ESPN via cbbpy.

Uses get_team_schedule() for each team in the mapping and deduplicates by ESPN game ID.
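
Per-team scraping naturally yields duplicates, since both participants' schedules contain their head-to-head games. A sketch of dedup-by-game-ID (the dict shape here is illustrative, not cbbpy's actual return type):

```python
def dedupe_by_game_id(rows: list[dict]) -> list[dict]:
    """Keep the first row seen for each ESPN game ID, preserving order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for row in rows:
        gid = row["game_id"]
        if gid not in seen:
            seen.add(gid)
            unique.append(row)
    return unique
```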

class ncaa_eval.ingest.Game(*, GameID: Annotated[str, MinLen(min_length=1)], Season: Annotated[int, Ge(ge=1985)], DayNum: Annotated[int, Ge(ge=0)], Date: date | None = None, WTeamID: Annotated[int, Ge(ge=1)], LTeamID: Annotated[int, Ge(ge=1)], WScore: Annotated[int, Ge(ge=0)], LScore: Annotated[int, Ge(ge=0)], Loc: Literal['H', 'A', 'N'], NumOT: Annotated[int, Ge(ge=0)] = 0, IsTournament: bool = False)[source]

Bases: BaseModel

A single NCAA basketball game result.

date: datetime.date | None
day_num: int
game_id: str
is_tournament: bool
l_score: int
l_team_id: int
loc: Literal['H', 'A', 'N']
model_config = {'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}

Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

num_ot: int
season: int
w_score: int
w_team_id: int
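The signature above accepts Kaggle-style column names as aliases, while the model exposes the snake_case fields listed; with populate_by_name both spellings validate. A stdlib sketch of that alias-to-field renaming (the helper itself is hypothetical, not part of the package):

```python
# Alias -> field mapping taken from the Game signature and field list above.
ALIAS_TO_FIELD = {
    "GameID": "game_id",
    "Season": "season",
    "DayNum": "day_num",
    "Date": "date",
    "WTeamID": "w_team_id",
    "LTeamID": "l_team_id",
    "WScore": "w_score",
    "LScore": "l_score",
    "Loc": "loc",
    "NumOT": "num_ot",
    "IsTournament": "is_tournament",
}


def to_field_names(row: dict) -> dict:
    """Rename Kaggle-style columns to model field names; pass others through."""
    return {ALIAS_TO_FIELD.get(k, k): v for k, v in row.items()}
```
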

class ncaa_eval.ingest.KaggleConnector(extract_dir: Path, competition: str = 'march-machine-learning-mania-2026')[source]

Bases: Connector

Connector for Kaggle March Machine Learning Mania competition data.

Parameters:
  • extract_dir – Local directory where CSV files are downloaded/extracted.

  • competition – Kaggle competition slug.

download(*, force: bool = False) → None[source]

Download and extract competition CSV files via the Kaggle API.

Parameters:

force – Re-download even if files already exist.

fetch_games(season: int) → list[Game][source]

Parse regular-season and tournament CSVs into Game models.

Games from MRegularSeasonCompactResults.csv have is_tournament=False; games from MNCAATourneyCompactResults.csv have is_tournament=True.
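
The two-file split can be sketched as parsing each CSV and tagging its rows (the inline CSV snippets are illustrative; the real files carry more columns):

```python
import csv
import io

# Illustrative snippets in the Kaggle compact-results shape.
REGULAR = "Season,WTeamID,LTeamID\n2025,1101,1102\n"
TOURNEY = "Season,WTeamID,LTeamID\n2025,1101,1103\n"


def parse_games(text: str, *, is_tournament: bool) -> list[dict]:
    """Parse one compact-results CSV and tag each row with the tournament flag."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["is_tournament"] = is_tournament
    return rows


games = parse_games(REGULAR, is_tournament=False) + parse_games(TOURNEY, is_tournament=True)
```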

fetch_seasons() → list[Season][source]

Parse MSeasons.csv into Season models.

Delegates to load_day_zeros() (which already reads and validates MSeasons.csv) to avoid a second disk read and Pandera validation pass.

fetch_team_spellings() → dict[str, int][source]

Parse MTeamSpellings.csv into a spelling → TeamID mapping.

Returns every alternate spelling (lower-cased) for each team, which provides much wider coverage than the canonical names in MTeams.csv when resolving ESPN team name strings to Kaggle IDs.

fetch_teams() → list[Team][source]

Parse MTeams.csv into Team models.

Reads MTeams.csv, validates required columns, then constructs Team models from each row’s TeamID and TeamName.

load_day_zeros() → dict[int, date][source]

Load and cache the season → DayZero mapping.

Returns:

Mapping of season year to the date of Day 0 for that season.
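
DayZero anchors Kaggle's DayNum convention: a game's DayNum is its offset in days from that season's Day 0. A sketch of the conversion in both directions (the DayZero date below is made up for illustration; the helper names are not part of the package):

```python
from datetime import date, timedelta

# Hypothetical season -> DayZero mapping in the shape load_day_zeros() returns.
day_zeros = {2025: date(2024, 11, 4)}


def day_num_for(game_date: date, season: int, zeros: dict[int, date]) -> int:
    """DayNum is the number of days since the season's Day 0."""
    return (game_date - zeros[season]).days


def date_for(day_num: int, season: int, zeros: dict[int, date]) -> date:
    """Inverse: Day 0 plus the offset recovers the calendar date."""
    return zeros[season] + timedelta(days=day_num)
```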

exception ncaa_eval.ingest.NetworkError[source]

Bases: ConnectorError

Connection failure, timeout, or HTTP error.

class ncaa_eval.ingest.ParquetRepository(base_path: Path)[source]

Bases: Repository

Repository implementation backed by Parquet files.

Directory layout:

{base_path}/
    teams.parquet
    seasons.parquet
    games/
        season={year}/
            data.parquet
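
The games directory follows Hive-style partitioning, so the per-season path is pure string arithmetic (a sketch with no Parquet I/O; the helper name is hypothetical):

```python
from pathlib import Path


def games_partition_path(base_path: Path, season: int) -> Path:
    """Path of the per-season games file in the layout shown above."""
    return base_path / "games" / f"season={season}" / "data.parquet"
```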

get_games(season: int) → list[Game][source]

Load games for a single season from hive-partitioned Parquet.

get_seasons() → list[Season][source]

Load all season records from the seasons Parquet file.

get_teams() → list[Team][source]

Load all teams from the teams Parquet file.

save_games(games: list[Game]) → None[source]

Persist game records to hive-partitioned Parquet by season.

save_seasons(seasons: list[Season]) → None[source]

Persist season records to a Parquet file.

save_teams(teams: list[Team]) → None[source]

Persist team records to a Parquet file.

class ncaa_eval.ingest.Repository[source]

Bases: ABC

Abstract base class for NCAA data persistence.

abstractmethod get_games(season: int) → list[Game][source]

Return all games for a given season year.

abstractmethod get_seasons() → list[Season][source]

Return all stored seasons.

abstractmethod get_teams() → list[Team][source]

Return all stored teams.

abstractmethod save_games(games: list[Game]) → None[source]

Persist a collection of games (overwrite per season partition).

abstractmethod save_seasons(seasons: list[Season]) → None[source]

Persist a collection of seasons (overwrite).

abstractmethod save_teams(teams: list[Team]) → None[source]

Persist a collection of teams (overwrite).

class ncaa_eval.ingest.Season(*, Year: Annotated[int, Ge(ge=1985)])[source]

Bases: BaseModel

A single NCAA basketball season (identified by calendar year).

model_config = {'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}

Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

year: int

class ncaa_eval.ingest.SyncEngine(repository: Repository, data_dir: Path)[source]

Bases: object

Orchestrates data sync from external sources into the local repository.

Parameters:
  • repository – Repository instance used for reading and writing data.

  • data_dir – Root directory for local Parquet files and cached CSVs.

sync_all(force_refresh: bool = False) → list[SyncResult][source]

Sync all configured sources: Kaggle first, then ESPN.

Parameters:

force_refresh – Bypass caches for all sources.

Returns:

List of SyncResult, one per source (kaggle, espn).

sync_espn(force_refresh: bool = False) → SyncResult[source]

Sync the most recent season’s games from ESPN.

Requires Kaggle data to be synced first (needs team and season mappings). Uses a marker-file cache: if .espn_synced_{year} exists, the season is considered up-to-date unless force_refresh is set.

ESPN games are merged with existing Kaggle games for the same season partition before saving (because save_games overwrites).
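
The marker-file check and the merge-before-save step can be sketched as follows. The marker file name follows the .espn_synced_{year} convention above; everything else (helper names, the dict rows, and letting fetched rows win on ID collisions) is an illustrative assumption, not necessarily the engine's policy:

```python
import tempfile
from pathlib import Path


def needs_espn_sync(data_dir: Path, year: int, *, force_refresh: bool = False) -> bool:
    """A season counts as up-to-date when its marker file exists, unless forced."""
    marker = data_dir / f".espn_synced_{year}"
    return force_refresh or not marker.exists()


def mark_espn_synced(data_dir: Path, year: int) -> None:
    """Record a successful sync by touching the marker file."""
    (data_dir / f".espn_synced_{year}").touch()


def merge_games(existing: list[dict], fetched: list[dict]) -> list[dict]:
    """Merge fetched rows into the existing season partition before an
    overwriting save; fetched rows win on ID collisions (assumed policy)."""
    by_id = {g["game_id"]: g for g in existing}
    by_id.update({g["game_id"]: g for g in fetched})
    return list(by_id.values())


# Demo in a throwaway directory.
demo_dir = Path(tempfile.mkdtemp())
first_check = needs_espn_sync(demo_dir, 2025)   # no marker yet
mark_espn_synced(demo_dir, 2025)
second_check = needs_espn_sync(demo_dir, 2025)  # marker present
```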

Parameters:

force_refresh – Bypass marker-file cache and re-fetch from ESPN.

Returns:

SyncResult summarising games written and seasons cached.

Raises:

RuntimeError – Kaggle data has not been synced yet.

sync_kaggle(force_refresh: bool = False) → SyncResult[source]

Sync NCAA data from Kaggle with Parquet-level caching.

Downloads CSVs (if not cached) and converts them to Parquet. Skips individual entities whose Parquet files already exist, unless force_refresh is True.

Parameters:

force_refresh – Bypass all caches and re-fetch everything.

Returns:

SyncResult summarising teams/seasons/games written and cached.

class ncaa_eval.ingest.SyncResult(source: str, teams_written: int = 0, seasons_written: int = 0, games_written: int = 0, seasons_cached: int = 0)[source]

Bases: object

Summary of a single source sync operation.

games_written: int = 0
seasons_cached: int = 0
seasons_written: int = 0
source: str
teams_written: int = 0

class ncaa_eval.ingest.Team(*, TeamID: Annotated[int, Ge(ge=1)], TeamName: Annotated[str, MinLen(min_length=1)], CanonicalName: str = '')[source]

Bases: BaseModel

A college basketball team.

canonical_name: str
model_config = {'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}

Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

team_id: int
team_name: str

class ncaa_eval.ingest.ValidationReport(*, results: list[ValidationResult])[source]

Bases: BaseModel

Aggregated results from all validation checks.

property all_passed: bool

Return True if every check passed.

model_config = {'frozen': True}

Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

results: list[ValidationResult]

ncaa_eval.ingest.validate_sync(repo: Repository) → ValidationReport[source]

Run all post-sync validation checks and return a report.

This function is non-fatal — it never raises on validation failures. Unexpected I/O errors (e.g., corrupt Parquet) may still propagate. The caller is responsible for logging results.
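
The non-fatal pattern can be sketched with stdlib stand-ins. The real classes are pydantic models and only results and all_passed are documented above; the name and passed fields on the result stand-in are assumptions:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationResult:
    """Stand-in for one check's outcome; field names are illustrative."""
    name: str
    passed: bool


@dataclass(frozen=True)
class ValidationReport:
    """Stand-in mirroring the documented results / all_passed surface."""
    results: list

    @property
    def all_passed(self) -> bool:
        # True only when every individual check passed.
        return all(r.passed for r in self.results)


# The caller inspects the report (and logs it) instead of catching exceptions.
report = ValidationReport(results=[
    ValidationResult("teams_nonempty", True),
    ValidationResult("games_have_valid_team_ids", False),
])
```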