ncaa_eval.transform.graph module
Graph-based centrality feature engineering for NCAA team schedules.
This module builds directed NetworkX graphs from season game results and computes centrality-based features (PageRank, betweenness, HITS, clustering coefficient).
Graph semantics:
- Edges are directed loser → winner (“loser votes for winner quality”, the PageRank metaphor).
- Edge weight = min(margin, margin_cap) × optional recency multiplier.
- High PageRank ≡ transitively strong team (many high-quality wins flow toward you).
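The weighting rule above can be sketched as a small stand-alone function. This is an illustrative reading of the formula, not the module's source: the helper name `edge_weight` is hypothetical, the defaults mirror the documented `GraphTransformer` defaults, and treating the recency window as inclusive is an assumption.

```python
def edge_weight(
    margin: int,
    day_num: int,
    reference_day_num: int,
    margin_cap: int = 25,
    recency_window_days: int = 20,
    recency_multiplier: float = 1.5,
) -> float:
    """Hypothetical helper: capped margin times an optional recency boost."""
    # Cap the margin so a blowout cannot dominate the edge weight.
    capped = float(min(margin, margin_cap))
    # Assumption: a game is "recent" when it was played no more than
    # recency_window_days before reference_day_num (inclusive bound).
    is_recent = 0 <= reference_day_num - day_num <= recency_window_days
    return capped * recency_multiplier if is_recent else capped


# A 40-point win early in the season is capped at 25 and gets no boost.
blowout_early = edge_weight(margin=40, day_num=10, reference_day_num=100)
# A 4-point win five days before the reference day is boosted by 1.5.
close_recent = edge_weight(margin=4, day_num=95, reference_day_num=100)
```

Under these assumptions `blowout_early` is 25.0 and `close_recent` is 6.0.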
Architecture constraints:
- Pure transform-layer component: accepts pre-loaded DataFrames, does NOT load CSV files.
- No imports from ncaa_eval.ingest (Repository coupling is forbidden here).
- Caller is responsible for deduplicating games by (w_team_id, l_team_id, day_num) before calling graph functions for any season with ESPN+Kaggle overlap.
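The deduplication requirement in the last constraint could look like the sketch below. It uses plain dictionaries so that it runs without dependencies; with a pandas DataFrame the usual equivalent is `drop_duplicates(subset=["w_team_id", "l_team_id", "day_num"])`. The function name `dedupe_games` and the `source` field are hypothetical, and keeping the first occurrence is an assumption.

```python
def dedupe_games(games: list[dict]) -> list[dict]:
    """Hypothetical helper: keep the first row per (winner, loser, day)."""
    seen: set[tuple[int, int, int]] = set()
    unique: list[dict] = []
    for game in games:
        key = (game["w_team_id"], game["l_team_id"], game["day_num"])
        if key not in seen:  # later copies of the same game are dropped
            seen.add(key)
            unique.append(game)
    return unique


games = [
    {"w_team_id": 1101, "l_team_id": 1102, "day_num": 30, "source": "kaggle"},
    {"w_team_id": 1101, "l_team_id": 1102, "day_num": 30, "source": "espn"},
    {"w_team_id": 1102, "l_team_id": 1101, "day_num": 75, "source": "kaggle"},
]
unique_games = dedupe_games(games)
```

Without this step the duplicated game would contribute its margin to the edge weight twice.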
Walk-forward usage (Story 4.7):
- Start with an empty DiGraph at the beginning of each season.
- Add games one-by-one in chronological order via add_game_to_graph().
- Call compute_features(graph, pagerank_init=prev_pagerank) after each game.
- Store pagerank_init (result of previous compute_pagerank()) for warm-start efficiency.
- class ncaa_eval.transform.graph.GraphTransformer(margin_cap: int = 25, recency_window_days: int = 20, recency_multiplier: float = 1.5)
Bases: object
Transform game DataFrames into graph-based centrality features.
Provides both batch (build + compute in one call) and incremental (add_game_to_graph) update strategies for walk-forward backtesting efficiency.
- Typical walk-forward usage (Story 4.7):
    import networkx as nx
    from ncaa_eval.transform.graph import GraphTransformer

    transformer = GraphTransformer()
    graph = nx.DiGraph()
    prev_pagerank: dict[int, float] | None = None
    for game in chronological_games:
        transformer.add_game_to_graph(graph, …)
        features_df = transformer.compute_features(graph, pagerank_init=prev_pagerank)
        prev_pagerank = dict(zip(features_df["team_id"], features_df["pagerank"]))
- add_game_to_graph(graph: DiGraph, w_team_id: int, l_team_id: int, margin: int, day_num: int, reference_day_num: int) → None
Add a single game to an existing graph in-place.
Supports incremental walk-forward updates without rebuilding the full graph. Edge direction: l_team_id → w_team_id (loser votes for winner).
- Parameters:
graph – Existing nx.DiGraph to update in-place.
w_team_id – Winner team ID.
l_team_id – Loser team ID.
margin – Margin of victory (absolute score difference).
day_num – Day number of the game.
reference_day_num – Reference day for recency window evaluation.
- build_graph(games_df: DataFrame, reference_day_num: int | None = None) → DiGraph
Build a season graph from a games DataFrame.
- Parameters:
games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score, day_num.
reference_day_num – Reference day for recency weighting. Defaults to max day_num.
- Returns:
nx.DiGraph with loser→winner edges and “weight” attribute.
- compute_features(graph: DiGraph, pagerank_init: dict[int, float] | None = None) → DataFrame
Compute all centrality features for every team node in the graph.
- Parameters:
graph – nx.DiGraph with loser→winner edges.
pagerank_init – Optional warm-start dict (team_id → probability) from a previous compute_pagerank() call. Reduces PageRank iterations from ~30–50 to ~2–5.
- Returns:
pd.DataFrame with columns ["team_id", "pagerank", "betweenness_centrality", "hits_hub", "hits_authority", "clustering_coefficient"], one row per team node. Returns an empty DataFrame with the correct columns if the graph has no nodes.
- transform(games_df: DataFrame, reference_day_num: int | None = None, pagerank_init: dict[int, float] | None = None) → DataFrame
Convenience method: build graph then compute all centrality features.
- Parameters:
games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score, day_num.
reference_day_num – Reference day for recency weighting. Defaults to max day_num.
pagerank_init – Optional warm-start dict for PageRank.
- Returns:
pd.DataFrame with centrality features, one row per team. Returns empty DataFrame with correct columns if games_df is empty.
- ncaa_eval.transform.graph.build_season_graph(games_df: DataFrame, margin_cap: int = 25, reference_day_num: int | None = None, recency_window_days: int = 20, recency_multiplier: float = 1.5) → DiGraph
Build a directed graph from season game results.
Edges: loser → winner (loser ‘votes for’ winner quality, PageRank metaphor). Edge weight: min(margin, margin_cap) × optional_recency_multiplier.
Parallel edges (same team pair playing multiple times) are aggregated by summing their weights before passing to nx.from_pandas_edgelist().
- Parameters:
games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score, day_num.
margin_cap – Maximum margin-of-victory to use as edge weight (prevents blowout distortion).
reference_day_num – Day number used to compute recency window. Defaults to max day_num in games_df if None.
recency_window_days – Games within this many days of reference_day_num get recency boost.
recency_multiplier – Weight multiplier for recent games within recency_window_days.
- Returns:
nx.DiGraph with edges loser→winner and “weight” attribute on each edge. Returns empty DiGraph if games_df is empty.
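The parallel-edge aggregation described for build_season_graph can be sketched without pandas as follows. It illustrates the summing step only and is not the module's code: the helper name `aggregate_parallel_edges` is hypothetical, and the input is assumed to be (loser, winner, weight) tuples whose weights have already been capped and recency-adjusted.

```python
from collections import defaultdict


def aggregate_parallel_edges(
    weighted_games: list[tuple[int, int, float]],
) -> dict[tuple[int, int], float]:
    """Hypothetical helper: sum weights of games sharing a loser→winner pair."""
    totals: dict[tuple[int, int], float] = defaultdict(float)
    for l_team_id, w_team_id, weight in weighted_games:
        # Direction matters: (loser, winner) and (winner, loser) stay separate.
        totals[(l_team_id, w_team_id)] += weight
    return dict(totals)


weighted_games = [
    (1102, 1101, 12.0),  # 1101 beat 1102
    (1102, 1101, 4.5),   # 1101 beat 1102 again: same directed edge
    (1101, 1102, 3.0),   # 1102 beat 1101: the reverse edge
]
edges = aggregate_parallel_edges(weighted_games)
```

Here the two wins by team 1101 collapse into one edge of weight 16.5, while the single win by team 1102 remains its own edge of weight 3.0. A DiGraph holds at most one edge per ordered pair, which is why the sum has to happen before the graph is built.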
- ncaa_eval.transform.graph.compute_betweenness_centrality(G: DiGraph) → dict[int, float]
Compute betweenness centrality for each team in the graph.
Captures structural “bridge” position — distinct signal from PageRank (strength) and SoS (schedule quality).
- Parameters:
G – Directed graph.
- Returns:
dict mapping team_id → betweenness centrality score. Empty dict if graph has no nodes.
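To make the “bridge” idea concrete, here is a dependency-free sketch of Brandes' algorithm for directed betweenness. It is an assumption-laden illustration rather than the module's implementation: the function name `betweenness` is hypothetical, edge weights are ignored (the documentation does not say whether the module uses them), and scores are normalized by 1/((n−1)(n−2)), the convention networkx applies to directed graphs by default.

```python
from collections import deque


def betweenness(nodes: list[int], edges: list[tuple[int, int]]) -> dict[int, float]:
    """Hypothetical helper: normalized, unweighted directed betweenness."""
    adj: dict[int, list[int]] = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        order: list[int] = []
        preds: dict[int, list[int]] = {v: [] for v in nodes}
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies back from the farthest nodes.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(nodes)
    if n > 2:
        scale = 1.0 / ((n - 1) * (n - 2))
        score = {v: b * scale for v, b in score.items()}
    return score


# Team 3 lost to team 2, and team 2 lost to team 1: a chain 3 → 2 → 1.
chain_scores = betweenness([1, 2, 3], [(3, 2), (2, 1)])
```

In this chain only team 2 lies on a shortest path between two other teams, so it scores 0.5 and the endpoints score 0.0.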
- ncaa_eval.transform.graph.compute_clustering_coefficient(G: DiGraph) → dict[int, float]
Compute undirected clustering coefficient for each team.
Schedule diversity metric: low clustering = broad cross-conference scheduling. Uses undirected conversion so that mutual matchups count once (natural interpretation).
- Parameters:
G – Directed graph (converted to undirected internally).
- Returns:
dict mapping team_id → clustering coefficient. Empty dict if graph has no nodes.
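The effect of the undirected conversion can be seen in a small stand-alone sketch. It assumes the unweighted local clustering coefficient (closed neighbor pairs divided by possible neighbor pairs); the function name `undirected_clustering` is hypothetical and the code is not taken from the module.

```python
from itertools import combinations


def undirected_clustering(edges: list[tuple[int, int]]) -> dict[int, float]:
    """Hypothetical helper: local clustering after dropping edge direction."""
    neighbors: dict[int, set[int]] = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        # Sets make a mutual matchup (u→v and v→u) count as one link.
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    result: dict[int, float] = {}
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            result[node] = 0.0  # fewer than two opponents: no triangles possible
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        result[node] = 2.0 * links / (k * (k - 1))
    return result


# Teams 1 and 2 split a home-and-home series; team 3 lost to both; team 4 lost to 1.
clustering = undirected_clustering([(2, 1), (1, 2), (3, 1), (3, 2), (4, 1)])
```

Team 1 has three opponents of which only one pair (2 and 3) played each other, giving 1/3; teams 2 and 3 sit in a closed triangle and score 1.0; team 4 has a single opponent and scores 0.0.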
- ncaa_eval.transform.graph.compute_hits(G: DiGraph, max_iter: int = 100) → tuple[dict[int, float], dict[int, float]]
Compute HITS hub and authority scores for each team.
Authority ≈ PageRank (r≈0.908 correlation). Hub = “quality schedule despite losses” — distinct signal. Both are returned from a single nx.hits() call.
- Parameters:
G – Directed graph.
max_iter – Maximum iterations for HITS power iteration.
- Returns:
Tuple of (hub_dict, authority_dict), each mapping team_id → score. Returns uniform 0.0 scores for all nodes if graph has no edges. Returns uniform 1/n scores on convergence failure (with warning logged).
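The return contract above (zero scores with no edges, uniform 1/n on non-convergence) can be illustrated with a plain power-iteration sketch. This is not `nx.hits()` and not the module's code: the function name `hits_scores`, the tolerance, and the sum-to-one normalization are assumptions, and edge weights are assumed to be positive.

```python
def hits_scores(
    nodes: list[int],
    edges: dict[tuple[int, int], float],
    max_iter: int = 100,
    tol: float = 1e-8,
) -> tuple[dict[int, float], dict[int, float]]:
    """Hypothetical helper: weighted HITS returning (hubs, authorities)."""
    if not edges:
        # No games yet: every team gets 0.0 for both scores.
        return {n: 0.0 for n in nodes}, {n: 0.0 for n in nodes}
    hubs = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Authority: weighted sum of hub scores of teams that lost to you.
        auth = {n: 0.0 for n in nodes}
        for (loser, winner), w in edges.items():
            auth[winner] += hubs[loser] * w
        # Hub: weighted sum of authority scores of teams you lost to.
        new_hubs = {n: 0.0 for n in nodes}
        for (loser, winner), w in edges.items():
            new_hubs[loser] += auth[winner] * w
        a_total = sum(auth.values())
        h_total = sum(new_hubs.values())
        auth = {n: a / a_total for n, a in auth.items()}
        new_hubs = {n: h / h_total for n, h in new_hubs.items()}
        err = sum(abs(new_hubs[n] - hubs[n]) for n in nodes)
        hubs = new_hubs
        if err < tol:
            return hubs, auth
    # Convergence failure: fall back to uniform 1/n scores.
    uniform = 1.0 / len(nodes)
    return {n: uniform for n in nodes}, {n: uniform for n in nodes}


# Teams 2 and 3 each lost to team 1 by the same capped margin.
hub, authority = hits_scores([1, 2, 3], {(2, 1): 10.0, (3, 1): 10.0})
empty_hub, empty_authority = hits_scores([1, 2, 3], {})
```

In this example team 1 collects all of the authority mass, while the two losing teams split the hub mass evenly.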
- ncaa_eval.transform.graph.compute_pagerank(G: DiGraph, alpha: float = 0.85, nstart: dict[int, float] | None = None) → dict[int, float]
Compute PageRank for each team in the graph.
Captures transitive win-chain strength (2 hops vs. SoS 1 hop). Peer-reviewed NCAA validation: 71.6% vs. 64.2% naive win-ratio (Matthews et al. 2021).
- Parameters:
G – Directed graph with loser→winner edges and “weight” attribute.
alpha – Damping factor (teleportation probability = 1 - alpha).
nstart – Optional warm-start dictionary (team_id → initial probability). Initializing from the previous solution typically lets the iteration converge in 2–5 steps instead of 30–50.
- Returns:
dict mapping team_id → PageRank score. Empty dict if graph has no nodes.