ncaa_eval.transform.graph module

Graph-based centrality feature engineering for NCAA team schedules.

This module builds directed NetworkX graphs from season game results and computes centrality-based features (PageRank, betweenness, HITS, clustering coefficient).

Graph semantics:

- Edges are directed loser → winner (“loser votes for winner quality”, PageRank metaphor).
- Edge weight = min(margin, margin_cap) × optional recency multiplier.
- High PageRank ≡ transitively strong team (many high-quality wins flow toward you).
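The edge-weight rule can be sketched in a few lines. The helper name `edge_weight` is hypothetical, and the test for "recent" (a game played no more than `recency_window_days` before `reference_day_num`) is an assumption; the module's exact comparison is not shown in this page. Default values mirror the documented `GraphTransformer` defaults.

```python
def edge_weight(
    margin: int,
    day_num: int,
    reference_day_num: int,
    margin_cap: int = 25,
    recency_window_days: int = 20,
    recency_multiplier: float = 1.5,
) -> float:
    """Hypothetical helper: capped margin, boosted when the game is recent."""
    weight = float(min(margin, margin_cap))
    # Assumption: "recent" means within recency_window_days of the reference day.
    if reference_day_num - day_num <= recency_window_days:
        weight *= recency_multiplier
    return weight


# A 40-point blowout five days before the reference day: capped at 25, then boosted.
blowout_recent = edge_weight(margin=40, day_num=95, reference_day_num=100)
# A 3-point win early in the season: no cap, no boost.
close_old = edge_weight(margin=3, day_num=10, reference_day_num=100)
print(blowout_recent, close_old)
```

The cap keeps a single lopsided result from dominating the weight mass that flows to the winner.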

Architecture constraints:

- Pure transform-layer component: accepts pre-loaded DataFrames, does NOT load CSV files.
- No imports from ncaa_eval.ingest (Repository coupling is forbidden here).
- Caller is responsible for deduplicating games by (w_team_id, l_team_id, day_num) before calling graph functions for any season with ESPN+Kaggle overlap.
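One way a caller might satisfy the deduplication requirement is with pandas `drop_duplicates` on the three key columns. This is a sketch only: the sample rows are made up, and keeping the first occurrence (`keep="first"`) is an assumption about which source should win when both report the same game.

```python
import pandas as pd

# Toy season slice in which the first game appears twice (e.g. once per source).
games_df = pd.DataFrame(
    {
        "w_team_id": [1101, 1101, 1203],
        "l_team_id": [1202, 1202, 1101],
        "w_score": [70, 70, 65],
        "l_score": [60, 60, 61],
        "day_num": [10, 10, 17],
    }
)

# Keep one record per (winner, loser, day) key before building any graph.
deduped = games_df.drop_duplicates(
    subset=["w_team_id", "l_team_id", "day_num"], keep="first"
).reset_index(drop=True)
print(len(games_df), len(deduped))
```

Without this step a double-reported game would contribute its margin twice to the same loser → winner edge.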

Walk-forward usage (Story 4.7):

- Start with an empty DiGraph at the beginning of each season.
- Add games one by one in chronological order via add_game_to_graph().
- Call compute_features(graph, pagerank_init=prev_pagerank) after each game.
- Keep the PageRank scores from that call and pass them as pagerank_init on the next call for warm-start efficiency.

class ncaa_eval.transform.graph.GraphTransformer(margin_cap: int = 25, recency_window_days: int = 20, recency_multiplier: float = 1.5)

Bases: object

Transform game DataFrames into graph-based centrality features.

Provides both batch (build + compute in one call) and incremental (add_game_to_graph) update strategies for walk-forward backtesting efficiency.

Typical walk-forward usage (Story 4.7):

    transformer = GraphTransformer()
    graph = nx.DiGraph()
    prev_pagerank: dict[int, float] | None = None
    for game in chronological_games:
        transformer.add_game_to_graph(graph, ...)
        features_df = transformer.compute_features(graph, pagerank_init=prev_pagerank)
        prev_pagerank = dict(zip(features_df["team_id"], features_df["pagerank"]))

add_game_to_graph(graph: DiGraph, w_team_id: int, l_team_id: int, margin: int, day_num: int, reference_day_num: int) → None

Add a single game to an existing graph in-place.

Supports incremental walk-forward updates without rebuilding the full graph. Edge direction: l_team_id → w_team_id (loser votes for winner).

Parameters:

graph – Existing nx.DiGraph to update in-place.
w_team_id – Winner team ID.
l_team_id – Loser team ID.
margin – Margin of victory (absolute score difference).
day_num – Day number of the game.
reference_day_num – Reference day for recency window evaluation.
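An in-place update of this kind could look roughly like the sketch below. `add_game_sketch` is a hypothetical stand-in, not the method itself: it takes a precomputed weight instead of margin and day numbers, and it assumes that a rematch with the same winner adds to the existing edge weight (mirroring the summing that build_season_graph documents for parallel edges).

```python
import networkx as nx


def add_game_sketch(
    graph: nx.DiGraph, w_team_id: int, l_team_id: int, weight: float
) -> None:
    """Hypothetical stand-in for add_game_to_graph: mutates graph in place."""
    if graph.has_edge(l_team_id, w_team_id):
        # Assumption: a repeat result between the same pair accumulates weight.
        graph[l_team_id][w_team_id]["weight"] += weight
    else:
        # Edge points loser -> winner.
        graph.add_edge(l_team_id, w_team_id, weight=weight)


graph = nx.DiGraph()
add_game_sketch(graph, w_team_id=1101, l_team_id=1202, weight=8.0)
add_game_sketch(graph, w_team_id=1101, l_team_id=1202, weight=12.0)  # rematch, same winner
add_game_sketch(graph, w_team_id=1202, l_team_id=1101, weight=3.0)  # rematch, reversed
print(sorted(graph.edges(data="weight")))
```

Because the graph is mutated rather than rebuilt, each walk-forward step costs one edge update instead of a full pass over the season's games.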

build_graph(games_df: DataFrame, reference_day_num: int | None = None) → DiGraph

Build a season graph from a games DataFrame.

Parameters:

games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score, day_num.
reference_day_num – Reference day for recency weighting. Defaults to max day_num.

Returns:

nx.DiGraph with loser→winner edges and “weight” attribute.

compute_features(graph: DiGraph, pagerank_init: dict[int, float] | None = None) → DataFrame

Compute all centrality features for every team node in the graph.

Parameters:

graph – nx.DiGraph with loser→winner edges.
pagerank_init – Optional warm-start dict (team_id → probability) from a previous compute_pagerank() call. Reduces PageRank iterations from ~30–50 to ~2–5.

Returns:

pd.DataFrame with columns ["team_id", "pagerank", "betweenness_centrality", "hits_hub", "hits_authority", "clustering_coefficient"], one row per team node. Returns empty DataFrame with correct columns if graph has no nodes.

transform(games_df: DataFrame, reference_day_num: int | None = None, pagerank_init: dict[int, float] | None = None) → DataFrame

Convenience method: build graph then compute all centrality features.

Parameters:

games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score, day_num.
reference_day_num – Reference day for recency weighting. Defaults to max day_num.
pagerank_init – Optional warm-start dict for PageRank.

Returns:

pd.DataFrame with centrality features, one row per team. Returns empty DataFrame with correct columns if games_df is empty.

ncaa_eval.transform.graph.build_season_graph(games_df: DataFrame, margin_cap: int = 25, reference_day_num: int | None = None, recency_window_days: int = 20, recency_multiplier: float = 1.5) → DiGraph

Build a directed graph from season game results.

Edges: loser → winner (loser ‘votes for’ winner quality, PageRank metaphor). Edge weight: min(margin, margin_cap) × optional_recency_multiplier.

Parallel edges (same team pair playing multiple times) are aggregated by summing their weights before passing to nx.from_pandas_edgelist().

Parameters:

games_df – DataFrame with columns w_team_id, l_team_id, w_score, l_score, day_num.
margin_cap – Maximum margin-of-victory to use as edge weight (prevents blowout distortion).
reference_day_num – Day number used to compute recency window. Defaults to max day_num in games_df if None.
recency_window_days – Games within this many days of reference_day_num get recency boost.
recency_multiplier – Weight multiplier for recent games within recency_window_days.

Returns:

nx.DiGraph with edges loser→winner and “weight” attribute on each edge. Returns empty DiGraph if games_df is empty.
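The aggregation step matters because a DiGraph stores a single edge per ordered pair, so repeated matchups have to be combined before the edge list is handed to NetworkX. The following is a minimal sketch of that idea with made-up games; it leaves out the recency multiplier for brevity and is not the module's actual code.

```python
import networkx as nx
import pandas as pd

games_df = pd.DataFrame(
    {
        "w_team_id": [1101, 1101, 1203],
        "l_team_id": [1202, 1202, 1101],
        "w_score": [80, 71, 66],
        "l_score": [50, 68, 61],
        "day_num": [10, 40, 45],
    }
)
margin_cap = 25

# Per-game weight: margin of victory, capped at margin_cap.
margins = games_df["w_score"] - games_df["l_score"]
edges = games_df.assign(weight=margins.clip(upper=margin_cap).astype(float))

# Sum the weights of repeated (loser, winner) pairs into one row per edge.
edges = edges.groupby(["l_team_id", "w_team_id"], as_index=False)["weight"].sum()

graph = nx.from_pandas_edgelist(
    edges,
    source="l_team_id",  # loser votes ...
    target="w_team_id",  # ... for the winner
    edge_attr="weight",
    create_using=nx.DiGraph,
)
print(sorted(graph.edges(data="weight")))
```

Here team 1202 lost to 1101 twice (by 30, capped to 25, and by 3), so the single 1202 → 1101 edge carries the combined weight.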

ncaa_eval.transform.graph.compute_betweenness_centrality(G: DiGraph) → dict[int, float]

Compute betweenness centrality for each team in the graph.

Captures structural “bridge” position, a signal distinct from PageRank (strength) and strength of schedule (SoS, schedule quality).

Parameters:

G – Directed graph.

Returns:

dict mapping team_id → betweenness centrality score. Empty dict if graph has no nodes.
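To see what a "bridge" position means on a loser → winner graph, consider this toy example. It assumes the plain `nx.betweenness_centrality` call with default arguments (unweighted, normalized); whether the module passes edge weights is not stated here.

```python
import networkx as nx

# Loser -> winner edges: 1 lost to 2, 2 lost to 3, and 4 lost to 3.
graph = nx.DiGraph()
graph.add_edges_from([(1, 2), (2, 3), (4, 3)])

# Assumption: default arguments (unweighted shortest paths, normalized scores).
betweenness = nx.betweenness_centrality(graph)
bridge_team = max(betweenness, key=betweenness.get)
print(bridge_team, betweenness)
```

Team 2 is the only node that sits on a shortest path between two other teams (1 → 2 → 3), so it is the only one with a non-zero score, even though team 3 has the most wins.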

ncaa_eval.transform.graph.compute_clustering_coefficient(G: DiGraph) → dict[int, float]

Compute undirected clustering coefficient for each team.

Schedule diversity metric: low clustering = broad cross-conference scheduling. Uses undirected conversion so that mutual matchups count once (natural interpretation).

Parameters:

G – Directed graph (converted to undirected internally).

Returns:

dict mapping team_id → clustering coefficient. Empty dict if graph has no nodes.
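The effect of the undirected conversion can be illustrated with a small graph. This is a sketch using `DiGraph.to_undirected()` and `nx.clustering` on invented teams; it assumes the unweighted coefficient, which this page does not confirm.

```python
import networkx as nx

graph = nx.DiGraph()
# Teams 1, 2 and 3 all played each other (1 and 2 split two games); 4 only played 3.
graph.add_edges_from([(1, 2), (2, 1), (2, 3), (1, 3), (3, 4)])

# The two opposite edges between 1 and 2 collapse into a single undirected edge.
undirected = graph.to_undirected()
clustering = nx.clustering(undirected)
print(undirected.number_of_edges(), clustering)
```

Teams 1 and 2 score 1.0 because all of their opponents also played each other, team 3 scores 1/3 because only one of its three opponent pairs met, and team 4 scores 0 with a single opponent.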

ncaa_eval.transform.graph.compute_hits(G: DiGraph, max_iter: int = 100) → tuple[dict[int, float], dict[int, float]]

Compute HITS hub and authority scores for each team.

Authority ≈ PageRank (r≈0.908 correlation). Hub = “quality schedule despite losses” — distinct signal. Both are returned from a single nx.hits() call.

Parameters:

G – Directed graph.
max_iter – Maximum iterations for HITS power iteration.

Returns:

Tuple of (hub_dict, authority_dict), each mapping team_id → score. Returns uniform 0.0 scores for all nodes if graph has no edges. Returns uniform 1/n scores on convergence failure (with warning logged).

ncaa_eval.transform.graph.compute_pagerank(G: DiGraph, alpha: float = 0.85, nstart: dict[int, float] | None = None) → dict[int, float]

Compute PageRank for each team in the graph.

Captures transitive win-chain strength (2 hops vs. SoS 1 hop). Peer-reviewed NCAA validation: 71.6% vs. 64.2% naive win-ratio (Matthews et al. 2021).

Parameters:

G – Directed graph with loser→winner edges and “weight” attribute.
alpha – Damping factor (teleportation probability = 1 - alpha).
nstart – Optional warm-start dictionary (team_id → initial probability). Initializing with the previous solution typically converges in about 2–5 iterations instead of 30–50.

Returns:

dict mapping team_id → PageRank score. Empty dict if graph has no nodes.
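Why a warm start saves iterations is easiest to see with a hand-rolled power iteration. The sketch below is not the module's code and not NetworkX's implementation (`nx.pagerank` accepts an `nstart` argument of its own); `pagerank_power_iteration`, its stopping rule and the toy edges are illustrative assumptions. The savings on a real season graph depend on how much the graph changed between calls.

```python
from typing import Optional


def pagerank_power_iteration(
    edges: dict[tuple[int, int], float],
    alpha: float = 0.85,
    nstart: Optional[dict[int, float]] = None,
    tol: float = 1e-6,
    max_iter: int = 200,
) -> tuple[dict[int, float], int]:
    """Weighted PageRank by power iteration; returns (scores, iterations used)."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    # Start from the supplied vector (normalized) or from the uniform distribution.
    start = {node: (nstart or {}).get(node, 0.0) for node in nodes}
    total = sum(start.values())
    if total > 0.0:
        rank = {node: value / total for node, value in start.items()}
    else:
        rank = {node: 1.0 / n for node in nodes}

    for iteration in range(1, max_iter + 1):
        # Rank held by teams with no outgoing edge (no losses) is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0.0)
        new_rank = {node: (1.0 - alpha + alpha * dangling) / n for node in nodes}
        for (src, dst), weight in edges.items():
            new_rank[dst] += alpha * rank[src] * weight / out_weight[src]
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < n * tol:
            return rank, iteration
    raise RuntimeError("power iteration did not converge")


# Loser -> winner edges with weights, as in the graph semantics above.
edges = {(1, 2): 10.0, (2, 3): 5.0, (3, 1): 2.0, (4, 3): 8.0}
prev_scores, _ = pagerank_power_iteration(edges)

# Re-solving the unchanged graph from its own solution stops after one step.
_, repeat_iters = pagerank_power_iteration(edges, nstart=prev_scores)

edges[(4, 1)] = 3.0  # one more game: team 4 lost to team 1
_, cold_iters = pagerank_power_iteration(edges)
_, warm_iters = pagerank_power_iteration(edges, nstart=prev_scores)
print(repeat_iters, cold_iters, warm_iters)
```

Adding one game moves the stationary distribution only slightly, so the previous scores already sit close to the new solution and the warm run needs fewer steps than the uniform cold start.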