ncaa_eval.ingest package
Module contents
Data ingestion module.
- exception ncaa_eval.ingest.AuthenticationError
Bases: ConnectorError
Credentials missing, invalid, or expired.
- class ncaa_eval.ingest.Connector
Bases: ABC
Abstract base class for NCAA data source connectors.
All connectors must implement fetch_games(), which is the universal capability. fetch_teams() and fetch_seasons() are optional capabilities: subclasses that do not support them inherit the default implementation, which raises NotImplementedError. Callers should use isinstance() checks or try/except NotImplementedError to detect whether a connector supports an optional capability.
- abstractmethod fetch_games(season: int) → list[Game]
Fetch game results for a given season year.
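The optional-capability contract described above can be sketched as follows. This is a minimal stand-in, not the package's actual code: the Connector class here only mirrors the documented behaviour, and GamesOnlyConnector and teams_or_empty are hypothetical names introduced for illustration.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Stand-in that mirrors the documented contract (not the real class)."""

    @abstractmethod
    def fetch_games(self, season: int) -> list:
        """Universal capability: every connector must implement this."""

    def fetch_teams(self) -> list:
        # Optional capability: the inherited default raises.
        raise NotImplementedError

    def fetch_seasons(self) -> list:
        # Optional capability: the inherited default raises.
        raise NotImplementedError


class GamesOnlyConnector(Connector):
    """Hypothetical connector that supports only fetch_games()."""

    def fetch_games(self, season: int) -> list:
        return [f"game-{season}-1"]


def teams_or_empty(connector: Connector) -> list:
    """Return teams if the connector supports them, else an empty list."""
    try:
        return connector.fetch_teams()
    except NotImplementedError:
        return []


connector = GamesOnlyConnector()
games = connector.fetch_games(2024)
teams = teams_or_empty(connector)
```

Catching NotImplementedError keeps the caller independent of concrete connector types; an isinstance() check against a known subclass is the alternative when the set of connectors is fixed.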
- exception ncaa_eval.ingest.ConnectorError
Bases: Exception
Base exception for all connector errors.
- exception ncaa_eval.ingest.DataFormatError
Bases: ConnectorError
Raw data (CSV / API response) does not match the expected schema.
- class ncaa_eval.ingest.EspnConnector(team_name_to_id: dict[str, int], season_day_zeros: dict[int, date])
Bases: Connector
Connector for ESPN game data via the cbbpy scraper.
- Parameters:
team_name_to_id – Mapping from team name strings to Kaggle TeamIDs.
season_day_zeros – Mapping from season year to DayZero date.
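In the Kaggle data, DayNum is the number of days elapsed since a season's DayZero, which is presumably why this connector needs season_day_zeros to place ESPN games on the same scale. A minimal sketch of that conversion, assuming plain date arithmetic: the helper names day_num_for and date_for are hypothetical, and the DayZero value below is an illustrative placeholder rather than a value read from MSeasons.csv.

```python
from datetime import date, timedelta

# Illustrative placeholder, not taken from MSeasons.csv.
season_day_zeros: dict[int, date] = {2024: date(2023, 11, 6)}


def day_num_for(season: int, game_date: date) -> int:
    """Convert a calendar date to a DayNum offset from the season's DayZero."""
    offset = (game_date - season_day_zeros[season]).days
    if offset < 0:
        raise ValueError(f"{game_date} precedes DayZero for season {season}")
    return offset


def date_for(season: int, day_num: int) -> date:
    """Inverse conversion: DayNum back to a calendar date."""
    return season_day_zeros[season] + timedelta(days=day_num)


day_num = day_num_for(2024, date(2023, 11, 16))
round_trip = date_for(2024, day_num)
```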
- class ncaa_eval.ingest.Game(*, GameID: Annotated[str, MinLen(min_length=1)], Season: Annotated[int, Ge(ge=1985)], DayNum: Annotated[int, Ge(ge=0)], Date: date | None = None, WTeamID: Annotated[int, Ge(ge=1)], LTeamID: Annotated[int, Ge(ge=1)], WScore: Annotated[int, Ge(ge=0)], LScore: Annotated[int, Ge(ge=0)], Loc: Literal['H', 'A', 'N'], NumOT: Annotated[int, Ge(ge=0)] = 0, IsTournament: bool = False)
Bases: BaseModel
A single NCAA basketball game result.
- date: datetime.date | None
- day_num: int
- game_id: str
- is_tournament: bool
- l_score: int
- l_team_id: int
- loc: Literal['H', 'A', 'N']
- model_config = {'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- num_ot: int
- season: int
- w_score: int
- w_team_id: int
- class ncaa_eval.ingest.KaggleConnector(extract_dir: Path, competition: str = 'march-machine-learning-mania-2026')
Bases: Connector
Connector for Kaggle March Machine Learning Mania competition data.
- Parameters:
extract_dir – Local directory where CSV files are downloaded/extracted.
competition – Kaggle competition slug.
- download(*, force: bool = False) → None
Download and extract competition CSV files via the Kaggle API.
- Parameters:
force – Re-download even if files already exist.
- Raises:
AuthenticationError – Credentials missing or invalid.
NetworkError – Download failed due to connection issues.
- fetch_games(season: int) → list[Game]
Parse regular-season and tournament CSVs into Game models.
Games from MRegularSeasonCompactResults.csv have is_tournament=False; games from MNCAATourneyCompactResults.csv have is_tournament=True.
- fetch_seasons() → list[Season]
Parse MSeasons.csv into Season models.
Delegates to load_day_zeros() (which already reads and validates MSeasons.csv) to avoid a second disk read and Pandera validation pass.
- fetch_team_spellings() → dict[str, int]
Parse MTeamSpellings.csv into a spelling → TeamID mapping.
Returns every alternate spelling (lower-cased) for each team, which provides much wider coverage than the canonical names in MTeams.csv when resolving ESPN team name strings to Kaggle IDs.
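Because the returned keys are lower-cased, a caller would need to normalise an incoming name the same way before looking it up. A minimal sketch under that assumption: resolve_team_id is a hypothetical helper, and the mapping below uses placeholder IDs rather than real Kaggle TeamIDs.

```python
from typing import Optional


def resolve_team_id(name: str, spellings: dict[str, int]) -> Optional[int]:
    """Look up a team name against a lower-cased spelling -> TeamID mapping."""
    return spellings.get(name.strip().lower())


# Placeholder IDs for illustration only.
spellings = {
    "st. john's": 1,
    "st johns": 1,
    "saint john's": 1,
    "duke": 2,
}

matched = resolve_team_id("  St Johns ", spellings)
unmatched = resolve_team_id("Unknown U", spellings)
```

Returning None for an unknown spelling leaves the decision (skip the game, log it, or raise) to the caller.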
- exception ncaa_eval.ingest.NetworkError
Bases: ConnectorError
Connection failure, timeout, or HTTP error.
- class ncaa_eval.ingest.ParquetRepository(base_path: Path)
Bases: Repository
Repository implementation backed by Parquet files.
Directory layout:
    {base_path}/
        teams.parquet
        seasons.parquet
        games/
            season={year}/
                data.parquet
- get_games(season: int) → list[Game]
Load games for a single season from hive-partitioned Parquet.
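The directory layout above maps each season to one hive-style partition directory. A short sketch of how those paths could be built and discovered with pathlib; season_partition_path and list_partitioned_seasons are hypothetical helpers, not part of the documented API, and the files created here are empty placeholders rather than real Parquet data.

```python
import tempfile
from pathlib import Path


def season_partition_path(base_path: Path, season: int) -> Path:
    """Path of the Parquet file holding one season's games."""
    return base_path / "games" / f"season={season}" / "data.parquet"


def list_partitioned_seasons(base_path: Path) -> list[int]:
    """Seasons that have a data.parquet file under games/season=<year>/."""
    seasons = []
    for child in (base_path / "games").glob("season=*"):
        if (child / "data.parquet").is_file():
            seasons.append(int(child.name.split("=", 1)[1]))
    return sorted(seasons)


example = season_partition_path(Path("data"), 2024).as_posix()

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for year in (2024, 2023):
        target = season_partition_path(base, year)
        target.parent.mkdir(parents=True)
        target.touch()  # empty placeholder file
    found = list_partitioned_seasons(base)
```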
- class ncaa_eval.ingest.Repository
Bases: ABC
Abstract base class for NCAA data persistence.
- abstractmethod save_games(games: list[Game]) → None
Persist a collection of games (overwrite per season partition).
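One way to read "overwrite per season partition" is that only the seasons present in the input are replaced, while other partitions stay untouched. The sketch below illustrates that reading with an in-memory dict standing in for the on-disk partitions; group_by_season and this save_games are hypothetical stand-ins, and games are plain dicts rather than Game models.

```python
from collections import defaultdict


def group_by_season(games: list[dict]) -> dict[int, list[dict]]:
    """Bucket games by their season so each bucket maps to one partition."""
    partitions: dict[int, list[dict]] = defaultdict(list)
    for game in games:
        partitions[game["season"]].append(game)
    return dict(partitions)


def save_games(store: dict[int, list[dict]], games: list[dict]) -> None:
    """Replace each affected season's partition; leave other seasons alone."""
    for season, season_games in group_by_season(games).items():
        store[season] = season_games


store = {
    2023: [{"game_id": "old-2023", "season": 2023}],
    2024: [{"game_id": "old-2024", "season": 2024}],
}
save_games(store, [{"game_id": "new-2024", "season": 2024}])
```

Under this reading, a caller that wants to add games to a season has to load the existing partition and pass the combined list.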
- class ncaa_eval.ingest.Season(*, Year: Annotated[int, Ge(ge=1985)])
Bases: BaseModel
A single NCAA basketball season (identified by calendar year).
- model_config = {'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- year: int
- class ncaa_eval.ingest.SyncEngine(repository: Repository, data_dir: Path)
Bases: object
Orchestrates data sync from external sources into the local repository.
- Parameters:
repository – Repository instance used for reading and writing data.
data_dir – Root directory for local Parquet files and cached CSVs.
- sync_all(force_refresh: bool = False) → list[SyncResult]
Sync all configured sources: Kaggle first, then ESPN.
- Parameters:
force_refresh – Bypass caches for all sources.
- Returns:
List of SyncResult, one per source (kaggle, espn).
- sync_espn(force_refresh: bool = False) → SyncResult
Sync the most recent season’s games from ESPN.
Requires Kaggle data to be synced first (needs team and season mappings). Uses a marker-file cache: if .espn_synced_{year} exists, the season is considered up-to-date unless force_refresh is True.
ESPN games are merged with existing Kaggle games for the same season partition before saving (because save_games overwrites).
- Parameters:
force_refresh – Bypass marker-file cache and re-fetch from ESPN.
- Returns:
SyncResult summarising games written and seasons cached.
- Raises:
RuntimeError – Kaggle data has not been synced yet.
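The marker-file cache and the merge-before-save step can be sketched roughly as below. Several details are assumptions not stated in the documentation: that the marker lives directly under data_dir, that games are deduplicated by game_id, and that a freshly fetched game replaces an existing one with the same ID. The function names are hypothetical, and games are plain dicts rather than Game models.

```python
import tempfile
from pathlib import Path


def needs_espn_sync(data_dir: Path, year: int, force_refresh: bool = False) -> bool:
    """True when the marker is absent or the caller forces a refresh."""
    marker = data_dir / f".espn_synced_{year}"  # assumed marker location
    return force_refresh or not marker.exists()


def mark_espn_synced(data_dir: Path, year: int) -> None:
    """Record that the season has been synced by creating the marker file."""
    (data_dir / f".espn_synced_{year}").touch()


def merge_games(existing: list[dict], fetched: list[dict]) -> list[dict]:
    """Union keyed by game_id; fetched entries win on conflict (assumed policy)."""
    merged = {game["game_id"]: game for game in existing}
    merged.update({game["game_id"]: game for game in fetched})
    return list(merged.values())


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    first = needs_espn_sync(data_dir, 2025)
    mark_espn_synced(data_dir, 2025)
    second = needs_espn_sync(data_dir, 2025)
    forced = needs_espn_sync(data_dir, 2025, force_refresh=True)

merged = merge_games(
    [{"game_id": "k1"}, {"game_id": "k2"}],
    [{"game_id": "e1"}, {"game_id": "k2", "src": "espn"}],
)
```

Merging first matters because an overwriting save of only the ESPN games would otherwise discard the Kaggle games already stored for that season.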
- sync_kaggle(force_refresh: bool = False) → SyncResult
Sync NCAA data from Kaggle with Parquet-level caching.
Downloads CSVs (if not cached) and converts them to Parquet. Skips individual entities whose Parquet files already exist, unless force_refresh is True.
- Parameters:
force_refresh – Bypass all caches and re-fetch everything.
- Returns:
SyncResult summarising teams/seasons/games written and cached.
- class ncaa_eval.ingest.SyncResult(source: str, teams_written: int = 0, seasons_written: int = 0, games_written: int = 0, seasons_cached: int = 0)
Bases: object
Summary of a single source sync operation.
- games_written: int = 0
- seasons_cached: int = 0
- seasons_written: int = 0
- source: str
- teams_written: int = 0
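A caller of sync_all() receives one SyncResult per source and may want to log or total them. The sketch below uses a dataclass stand-in with the documented fields; whether the real class is a dataclass is an assumption, the summarize helper is hypothetical, and the counts are made-up illustration values.

```python
from dataclasses import dataclass


@dataclass
class SyncResult:
    """Stand-in carrying the documented fields and defaults."""

    source: str
    teams_written: int = 0
    seasons_written: int = 0
    games_written: int = 0
    seasons_cached: int = 0


def summarize(results: list[SyncResult]) -> str:
    """One-line summary suitable for a log message."""
    return "; ".join(
        f"{r.source}: {r.games_written} games written, "
        f"{r.seasons_cached} seasons cached"
        for r in results
    )


# Made-up counts for illustration.
results = [
    SyncResult("kaggle", teams_written=3, seasons_written=2, games_written=10),
    SyncResult("espn", games_written=4, seasons_cached=1),
]
total_games = sum(r.games_written for r in results)
summary = summarize(results)
```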
- class ncaa_eval.ingest.Team(*, TeamID: Annotated[int, Ge(ge=1)], TeamName: Annotated[str, MinLen(min_length=1)], CanonicalName: str = '')
Bases: BaseModel
A college basketball team.
- canonical_name: str
- model_config = {'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- team_id: int
- team_name: str
- class ncaa_eval.ingest.ValidationReport(*, results: list[ValidationResult])
Bases: BaseModel
Aggregated results from all validation checks.
- property all_passed: bool
Return True if every check passed.
- model_config = {'frozen': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- results: list[ValidationResult]
- ncaa_eval.ingest.validate_sync(repo: Repository) → ValidationReport
Run all post-sync validation checks and return a report.
This function is non-fatal: it never raises on validation failures. Unexpected I/O errors (e.g., corrupt Parquet) may still propagate. The caller is responsible for logging results.
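Since failures are reported rather than raised, the caller inspects the report. The sketch below shows that pattern with frozen dataclass stand-ins in place of the pydantic models. ValidationResult's fields are not documented here, so check and passed are hypothetical, as are the check names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationResult:
    """Stand-in; both field names are hypothetical."""

    check: str
    passed: bool


@dataclass(frozen=True)
class ValidationReport:
    """Stand-in mirroring the documented results and all_passed members."""

    results: tuple

    @property
    def all_passed(self) -> bool:
        return all(result.passed for result in self.results)


# Hypothetical check names for illustration.
report = ValidationReport(
    results=(
        ValidationResult("teams_present", True),
        ValidationResult("no_duplicate_games", False),
    )
)

# Non-fatal handling: collect failures for logging instead of raising.
failed = [result.check for result in report.results if not result.passed]
```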