ncaa_eval.ingest.sync module
Sync engine for fetching NCAA data and persisting it with smart caching.
SyncEngine orchestrates data retrieval from configured connectors (Kaggle, ESPN) and stores results via a Repository. Parquet-level caching prevents redundant fetches on subsequent runs.
- class ncaa_eval.ingest.sync.SyncEngine(repository: Repository, data_dir: Path)
Bases: object
Orchestrates data sync from external sources into the local repository.
- Parameters:
repository – Repository instance used for reading and writing data.
data_dir – Root directory for local Parquet files and cached CSVs.
- sync_all(force_refresh: bool = False) → list[SyncResult]
Sync all configured sources: Kaggle first, then ESPN.
- Parameters:
force_refresh – Bypass caches for all sources.
- Returns:
List of SyncResult, one per source (kaggle, espn).
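The ordering described above can be sketched as follows. This is a minimal illustration rather than the library's implementation: SketchEngine and its calls attribute are hypothetical stand-ins, and the SyncResult here only mirrors the documented fields. Passing force_refresh through to both sources is inferred from the parameter description.

```python
from dataclasses import dataclass


@dataclass
class SyncResult:
    """Stand-in that mirrors the documented SyncResult fields."""

    source: str
    teams_written: int = 0
    seasons_written: int = 0
    games_written: int = 0
    seasons_cached: int = 0


class SketchEngine:
    """Hypothetical stand-in for SyncEngine; only the call order is modelled."""

    def __init__(self) -> None:
        # Records (source, force_refresh) for each sync call, for illustration.
        self.calls = []

    def sync_kaggle(self, force_refresh: bool = False) -> SyncResult:
        self.calls.append(("kaggle", force_refresh))
        return SyncResult(source="kaggle")

    def sync_espn(self, force_refresh: bool = False) -> SyncResult:
        self.calls.append(("espn", force_refresh))
        return SyncResult(source="espn")

    def sync_all(self, force_refresh: bool = False) -> list[SyncResult]:
        # Kaggle runs first because the ESPN sync needs its team and
        # season mappings.
        kaggle = self.sync_kaggle(force_refresh=force_refresh)
        espn = self.sync_espn(force_refresh=force_refresh)
        return [kaggle, espn]


engine = SketchEngine()
results = engine.sync_all(force_refresh=True)
print([r.source for r in results])  # ['kaggle', 'espn']
```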
- sync_espn(force_refresh: bool = False) → SyncResult
Sync the most recent season’s games from ESPN.
Requires Kaggle data to be synced first (needs team and season mappings). Uses a marker-file cache: if .espn_synced_{year} exists, the season is considered up-to-date unless force_refresh is True.
ESPN games are merged with existing Kaggle games for the same season partition before saving (because save_games overwrites).
- Parameters:
force_refresh – Bypass marker-file cache and re-fetch from ESPN.
- Returns:
SyncResult summarising games written and seasons cached.
- Raises:
RuntimeError – Kaggle data has not been synced yet.
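The marker-file cache and the merge-before-save step might look roughly like the sketch below. Only the marker name .espn_synced_{year}, the force_refresh flag, and the RuntimeError come from the description above. Everything else is an assumption made for illustration: the marker's location directly under data_dir, the in-memory store dictionary standing in for the repository, the fetch callable, and the game_id key used to de-duplicate games.

```python
import tempfile
from pathlib import Path
from typing import Callable


def marker_path(data_dir: Path, year: int) -> Path:
    # The file name follows the documented pattern; placing it directly
    # under data_dir is an assumption.
    return data_dir / f".espn_synced_{year}"


def sync_espn_sketch(
    data_dir: Path,
    year: int,
    store: dict[int, list[dict]],
    fetch: Callable[[int], list[dict]],
    kaggle_synced: bool,
    force_refresh: bool = False,
) -> bool:
    """Return True if ESPN was fetched, False if the marker cache was hit."""
    if not kaggle_synced:
        raise RuntimeError("Kaggle data has not been synced yet")

    marker = marker_path(data_dir, year)
    if marker.exists() and not force_refresh:
        return False  # season considered up-to-date

    fetched = fetch(year)
    # Saving overwrites the whole season partition, so merge the existing
    # games with the fetched ones first (hypothetical game_id key).
    merged = {g["game_id"]: g for g in store.get(year, [])}
    merged.update({g["game_id"]: g for g in fetched})
    store[year] = list(merged.values())

    marker.touch()  # record that this season has been synced
    return True


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    store = {2024: [{"game_id": "k1", "source": "kaggle"}]}
    fetch_calls = []

    def fake_fetch(year: int) -> list[dict]:
        fetch_calls.append(year)
        return [{"game_id": "e1", "source": "espn"}]

    first = sync_espn_sketch(data_dir, 2024, store, fake_fetch, kaggle_synced=True)
    second = sync_espn_sketch(data_dir, 2024, store, fake_fetch, kaggle_synced=True)
    forced = sync_espn_sketch(
        data_dir, 2024, store, fake_fetch, kaggle_synced=True, force_refresh=True
    )
    merged_ids = sorted(g["game_id"] for g in store[2024])

print(first, second, forced, merged_ids)  # True False True ['e1', 'k1']
```

The second call returns without fetching because the marker already exists; only force_refresh triggers another fetch. The Kaggle game survives each save because it is merged back in before the partition is overwritten.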
- sync_kaggle(force_refresh: bool = False) → SyncResult
Sync NCAA data from Kaggle with Parquet-level caching.
Downloads CSVs (if not cached) and converts them to Parquet. Skips individual entities whose Parquet files already exist, unless force_refresh is True.
- Parameters:
force_refresh – Bypass all caches and re-fetch everything.
- Returns:
SyncResult summarising teams/seasons/games written and cached.
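The per-entity skip logic can be illustrated with a small existence check. This is a sketch under stated assumptions, not the module's code: sync_entity, fake_convert, and the file names teams.parquet and seasons.parquet are hypothetical, and the converter writes placeholder bytes instead of real Parquet so the example needs only the standard library.

```python
import tempfile
from pathlib import Path
from typing import Callable


def sync_entity(
    parquet_path: Path,
    convert: Callable[[Path], None],
    force_refresh: bool = False,
) -> str:
    """Run convert for parquet_path unless the file is already present."""
    if parquet_path.exists() and not force_refresh:
        return "cached"
    parquet_path.parent.mkdir(parents=True, exist_ok=True)
    convert(parquet_path)
    return "written"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    conversions = []

    def fake_convert(target: Path) -> None:
        # Placeholder for a CSV-to-Parquet conversion; writes dummy bytes.
        conversions.append(target.name)
        target.write_bytes(b"placeholder")

    entities = ["teams.parquet", "seasons.parquet"]  # hypothetical layout
    first_run = [sync_entity(root / name, fake_convert) for name in entities]

    # Remove one file: only that entity is rebuilt on the next run.
    (root / "seasons.parquet").unlink()
    second_run = [sync_entity(root / name, fake_convert) for name in entities]

    forced_run = [
        sync_entity(root / name, fake_convert, force_refresh=True)
        for name in entities
    ]

print(first_run)   # ['written', 'written']
print(second_run)  # ['cached', 'written']
print(forced_run)  # ['written', 'written']
```

Checking each entity separately means a partially completed sync resumes where it left off instead of redoing every conversion.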
- class ncaa_eval.ingest.sync.SyncResult(source: str, teams_written: int = 0, seasons_written: int = 0, games_written: int = 0, seasons_cached: int = 0)
Bases: object
Summary of a single source sync operation.
- games_written: int = 0
- seasons_cached: int = 0
- seasons_written: int = 0
- source: str
- teams_written: int = 0
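As an illustration of how these fields behave, the sketch below re-creates the documented signature as a dataclass. The documentation does not say whether SyncResult is actually a dataclass, so that choice is an assumption, and the counts are made-up sample values rather than real sync output.

```python
from dataclasses import dataclass


@dataclass
class SyncResult:
    """Stand-in with the same fields and defaults as the documented class."""

    source: str
    teams_written: int = 0
    seasons_written: int = 0
    games_written: int = 0
    seasons_cached: int = 0


# Sample values for illustration only.
results = [
    SyncResult(source="kaggle", teams_written=3, seasons_written=2, games_written=10),
    SyncResult(source="espn", games_written=4, seasons_cached=1),
]

# Only source is required; unspecified counters default to 0.
total_games = sum(r.games_written for r in results)
games_by_source = {r.source: r.games_written for r in results}

print(total_games)      # 14
print(games_by_source)  # {'kaggle': 10, 'espn': 4}
```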