ncaa_eval.ingest.sync module

Sync engine for fetching NCAA data and persisting it with smart caching.

SyncEngine orchestrates data retrieval from configured connectors (Kaggle, ESPN) and stores results via a Repository. Parquet-level caching prevents redundant fetches on subsequent runs.

class ncaa_eval.ingest.sync.SyncEngine(repository: Repository, data_dir: Path)

Bases: object

Orchestrates data sync from external sources into the local repository.

Parameters:
  • repository – Repository instance used for reading and writing data.

  • data_dir – Root directory for local Parquet files and cached CSVs.

sync_all(force_refresh: bool = False) → list[SyncResult]

Sync all configured sources: Kaggle first, then ESPN.

Parameters:

force_refresh – Bypass caches for all sources.

Returns:

List of SyncResult, one per source (kaggle, espn).
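The ordering described above can be sketched with a stand-in class. StubEngine and StubResult are hypothetical names used only for illustration; they perform no network or disk access and are not the library's implementation. The sketch assumes that force_refresh is forwarded unchanged to each per-source sync.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StubResult:
    # Hypothetical stand-in carrying only the source name.
    source: str


class StubEngine:
    """Hypothetical stand-in for SyncEngine; records the call order only."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def sync_kaggle(self, force_refresh: bool = False) -> StubResult:
        self.calls.append("kaggle")
        return StubResult(source="kaggle")

    def sync_espn(self, force_refresh: bool = False) -> StubResult:
        self.calls.append("espn")
        return StubResult(source="espn")

    def sync_all(self, force_refresh: bool = False) -> list[StubResult]:
        # Kaggle goes first because the ESPN sync needs its team and
        # season mappings. Forwarding force_refresh is an assumption.
        return [
            self.sync_kaggle(force_refresh=force_refresh),
            self.sync_espn(force_refresh=force_refresh),
        ]


engine = StubEngine()
results = engine.sync_all(force_refresh=True)
sources = [r.source for r in results]
```

With the real engine, the returned list would presumably be consumed the same way: one result per source, in the order kaggle, espn.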

sync_espn(force_refresh: bool = False) → SyncResult

Sync the most recent season’s games from ESPN.

Requires Kaggle data to be synced first, because the ESPN sync needs the team and season mappings. Uses a marker-file cache: if .espn_synced_{year} exists, the season is considered up-to-date unless force_refresh is True.

ESPN games are merged with existing Kaggle games for the same season partition before saving (because save_games overwrites).
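Why the merge is needed can be shown with a small in-memory sketch. InMemoryStore, its method signatures, the game_id key, and merge_and_save are all hypothetical; the real Repository API is not documented on this page, and de-duplicating by game_id is an assumption made for the example.

```python
from __future__ import annotations


class InMemoryStore:
    """Hypothetical store whose save_games replaces a whole season partition."""

    def __init__(self) -> None:
        self._partitions: dict[int, list[dict]] = {}

    def get_games(self, season: int) -> list[dict]:
        return list(self._partitions.get(season, []))

    def save_games(self, season: int, games: list[dict]) -> None:
        # Overwrites whatever was stored for this season.
        self._partitions[season] = list(games)


def merge_and_save(store: InMemoryStore, season: int, new_games: list[dict]) -> list[dict]:
    """Combine existing and new games, then write the partition once."""
    existing = store.get_games(season)
    seen = {g["game_id"] for g in existing}
    # Keep existing rows and append only games not already present
    # (de-duplication by game_id is an assumption).
    merged = existing + [g for g in new_games if g["game_id"] not in seen]
    store.save_games(season, merged)
    return merged


store = InMemoryStore()
store.save_games(2024, [{"game_id": "k1", "source": "kaggle"}])
merge_and_save(store, 2024, [{"game_id": "e1", "source": "espn"}])
stored_ids = [g["game_id"] for g in store.get_games(2024)]
```

Calling save_games with only the ESPN rows would have dropped k1 from the partition; merging first keeps both.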

Parameters:

force_refresh – Bypass marker-file cache and re-fetch from ESPN.

Returns:

SyncResult summarising games written and seasons cached.

Raises:

RuntimeError – Kaggle data has not been synced yet.
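A minimal sketch of a marker-file cache of the kind described above, using only the standard library. The helper names are hypothetical, and placing the marker directly under data_dir is an assumption; only the .espn_synced_{year} file name comes from the documentation.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def espn_marker(data_dir: Path, year: int) -> Path:
    # Marker location directly under data_dir is an assumption.
    return data_dir / f".espn_synced_{year}"


def should_fetch_espn(data_dir: Path, year: int, force_refresh: bool = False) -> bool:
    """Fetch when forced, or when no marker exists for the season."""
    return force_refresh or not espn_marker(data_dir, year).exists()


def mark_espn_synced(data_dir: Path, year: int) -> None:
    # An empty file is enough; only its existence is checked.
    espn_marker(data_dir, year).touch()


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    first_run = should_fetch_espn(data_dir, 2025)   # no marker yet
    mark_espn_synced(data_dir, 2025)
    second_run = should_fetch_espn(data_dir, 2025)  # marker present
    forced_run = should_fetch_espn(data_dir, 2025, force_refresh=True)
```

A marker file records only that a sync happened, not what was fetched, so force_refresh is the way to pick up games added after the marker was written.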

sync_kaggle(force_refresh: bool = False) → SyncResult

Sync NCAA data from Kaggle with Parquet-level caching.

Downloads CSVs (if not cached) and converts them to Parquet. Skips individual entities whose Parquet files already exist, unless force_refresh is True.

Parameters:

force_refresh – Bypass all caches and re-fetch everything.

Returns:

SyncResult summarising teams/seasons/games written and cached.
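The skip-if-present behaviour can be sketched as follows. write_if_missing and the teams.parquet file name are hypothetical, and the payload is placeholder bytes rather than real Parquet, so the example runs without any third-party library.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Callable


def write_if_missing(path: Path, build: Callable[[], bytes], force_refresh: bool = False) -> bool:
    """Return True if the file was (re)written, False if the cached copy was kept."""
    if path.exists() and not force_refresh:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(build())
    return True


def build_payload() -> bytes:
    # Placeholder for the CSV-to-Parquet conversion step.
    return b"placeholder bytes, not real Parquet"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "teams.parquet"  # hypothetical file name
    wrote_first = write_if_missing(target, build_payload)
    wrote_second = write_if_missing(target, build_payload)
    wrote_forced = write_if_missing(target, build_payload, force_refresh=True)
```

Checking per file rather than per run is what lets a partially completed sync resume: entities already on disk are skipped and only the missing ones are rebuilt.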

class ncaa_eval.ingest.sync.SyncResult(source: str, teams_written: int = 0, seasons_written: int = 0, games_written: int = 0, seasons_cached: int = 0)

Bases: object

Summary of a single source sync operation.

games_written: int = 0
seasons_cached: int = 0
seasons_written: int = 0
source: str
teams_written: int = 0
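The field listing above is consistent with a dataclass, although this page does not confirm it. The sketch below re-declares the fields locally under that assumption and shows one way to aggregate results across sources; the counts are illustrative values, not real sync output.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass
class SyncResult:
    # Local re-declaration of the documented fields; the dataclass form
    # is an assumption, not taken from the library source.
    source: str
    teams_written: int = 0
    seasons_written: int = 0
    games_written: int = 0
    seasons_cached: int = 0


# Illustrative values only.
results = [
    SyncResult(source="kaggle", teams_written=10, seasons_written=2, games_written=300),
    SyncResult(source="espn", games_written=40, seasons_cached=1),
]

total_games = sum(r.games_written for r in results)
by_source = {r.source: asdict(r) for r in results}
```

Because every counter defaults to 0, a source that wrote nothing still yields a well-formed result that can be summed or logged without special cases.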