ncaa_eval.model.xgboost_model module

XGBoost gradient-boosting model — reference stateless model.

Wraps xgboost.XGBClassifier behind the Model ABC, providing fit / predict_proba / save / load with XGBoost’s native UBJSON persistence format.

class ncaa_eval.model.xgboost_model.XGBoostModel(config: XGBoostModelConfig | None = None, *, batch_rating_types: tuple[Literal['srs', 'ridge', 'colley'], ...] = ('srs',), graph_features_enabled: bool = False, ordinal_composite: Literal['simple_average', 'weighted', 'pca'] | None = None)[source]

Bases: Model

XGBoost binary classifier wrapping XGBClassifier.

This is a stateless model — it implements Model directly (no StatefulModel lifecycle hooks).

Label balance convention: The feature server typically assigns team_a = w_team_id (the winner), so y may be heavily biased toward 1. Callers should either randomise team assignment before training (recommended) or set scale_pos_weight in the config to count(y==0) / count(y==1). The default scale_pos_weight is None (XGBoost default = 1.0), appropriate when team assignment is randomised.
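The label-balance convention above can be sketched with plain pandas. Everything here is illustrative — `games`, `flip`, and the column names are hypothetical stand-ins, not part of the ncaa_eval API:

```python
import numpy as np
import pandas as pd

# Hypothetical raw schema: the feature server sets team_a = w_team_id,
# so every label starts out as 1 (team_a won).
games = pd.DataFrame({"team_a": [101, 102, 103, 104], "team_b": [201, 202, 203, 204]})
games["y"] = 1

# Option 1 (recommended): randomise team assignment and flip labels to match.
rng = np.random.default_rng(0)
flip = rng.random(len(games)) < 0.5
games.loc[flip, ["team_a", "team_b"]] = games.loc[flip, ["team_b", "team_a"]].to_numpy()
games.loc[flip, "y"] = 0  # team_a is now the loser on flipped rows

# Option 2: keep the biased labels and reweight instead.
y = pd.Series([1] * 8 + [0] * 2)
scale_pos_weight = (y == 0).sum() / (y == 1).sum()  # count(y==0) / count(y==1) -> 0.25
```

Option 1 keeps the classes balanced in expectation, so the default scale_pos_weight of None (XGBoost default = 1.0) remains appropriate.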

fit(X: DataFrame, y: Series) None[source]

Train on feature matrix X and binary labels y.

Automatically splits X into train/validation sets using validation_fraction from the config. The validation set is used for early stopping via eval_set.

Label balance convention: team_a assignment in the feature server is typically w_team_id (the winner). If labels are imbalanced, either randomise team assignment upstream or set scale_pos_weight = count(y==0) / count(y==1) in the XGBoostModelConfig.

Raises:

ValueError – If X is empty.
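The automatic split can be pictured as a simple tail holdout. The helper below is a hypothetical sketch of that behaviour, not the actual implementation inside fit(), whose split strategy may differ:

```python
import pandas as pd

def split_validation(X: pd.DataFrame, y: pd.Series, validation_fraction: float = 0.1):
    """Hold out the last validation_fraction of rows for early stopping."""
    n_val = max(1, int(len(X) * validation_fraction))
    return X.iloc[:-n_val], y.iloc[:-n_val], X.iloc[-n_val:], y.iloc[-n_val:]

X = pd.DataFrame({"f": range(20)})
y = pd.Series([0, 1] * 10)
X_train, y_train, X_val, y_val = split_validation(X, y, 0.1)
# 20 rows at validation_fraction=0.1 -> 2 validation rows, 18 training rows
```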

get_config() XGBoostModelConfig[source]

Return the Pydantic-validated configuration for this model.

get_feature_importances() list[tuple[str, float]] | None[source]

Return feature name/importance pairs from the fitted classifier.
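The returned pairs are convenient to rank by importance. The `importances` list below is hypothetical sample output, not real model results:

```python
# Hypothetical (feature_name, importance) pairs as returned by the method.
importances = [("seed_diff", 0.12), ("srs_diff", 0.41), ("win_pct_diff", 0.27)]
top = sorted(importances, key=lambda pair: pair[1], reverse=True)
# top[0] is the most important feature
```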

classmethod load(path: Path) Self[source]

Load a previously-saved XGBoost model from path.

Raises:

FileNotFoundError – If either config.json or model.ubj is missing.

predict_proba(X: DataFrame) Series[source]

Return P(team_a wins) for each row of X.

Raises:

RuntimeError – If called before fit().

save(path: Path) None[source]

Persist the trained model to the path directory.

Writes four files:

model.ubj — XGBoost native UBJSON format (stable across versions)
config.json — Pydantic-serialised hyperparameter config
feature_names.json — JSON array of feature column names
feature_config.json — FeatureConfig sidecar

Raises:

RuntimeError – If called before fit().
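The saved layout can be verified before calling load(). The check_artifacts helper below is hypothetical — a sketch mirroring load()'s documented FileNotFoundError behaviour, not part of the ncaa_eval API:

```python
import json
import tempfile
from pathlib import Path

REQUIRED = ("model.ubj", "config.json")  # load() raises if either is missing

def check_artifacts(path: Path) -> None:
    """Raise FileNotFoundError if a required save() artifact is absent."""
    for name in REQUIRED:
        if not (path / name).exists():
            raise FileNotFoundError(f"missing {name} in {path}")

# Simulate a save() output directory with stand-in contents.
save_dir = Path(tempfile.mkdtemp())
(save_dir / "model.ubj").write_bytes(b"")  # placeholder for the UBJSON booster
(save_dir / "config.json").write_text(json.dumps({"model_name": "xgboost"}))
(save_dir / "feature_names.json").write_text(json.dumps(["srs_diff"]))
check_artifacts(save_dir)  # passes; deleting model.ubj would make it raise
```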

class ncaa_eval.model.xgboost_model.XGBoostModelConfig(*, model_name: Literal['xgboost'] = 'xgboost', calibration_method: Literal['isotonic', 'sigmoid'] | None = None, n_estimators: int = 500, max_depth: int = 5, learning_rate: float = 0.05, subsample: float = 0.8, colsample_bytree: float = 0.8, min_child_weight: int = 3, reg_alpha: float = 0.0, reg_lambda: float = 1.0, early_stopping_rounds: int = 50, validation_fraction: Annotated[float, Gt(gt=0.0), Lt(lt=1.0)] = 0.1, scale_pos_weight: float | None = None)[source]

Bases: ModelConfig

Hyperparameters for the XGBoost gradient-boosting model.

Defaults from specs/research/modeling-approaches.md §5.5 and §6.4.

Label balance: Set scale_pos_weight = count(y==0) / count(y==1) when training labels are imbalanced (e.g. team_a is always the winner). Leave as None (XGBoost default = 1.0) when team assignment is randomised before training.

colsample_bytree: float
early_stopping_rounds: int
learning_rate: float
max_depth: int
min_child_weight: int
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_name: Literal['xgboost']
n_estimators: int
reg_alpha: float
reg_lambda: float
scale_pos_weight: float | None
subsample: float
validation_fraction: Annotated[float, Field(gt=0.0, lt=1.0)]
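Putting the label-balance note into practice, a configuration for imbalanced labels might be built as follows — a sketch assuming the ncaa_eval package is importable; the 0.25 value stands in for count(y==0) / count(y==1) computed from the actual training labels:

```python
from ncaa_eval.model.xgboost_model import XGBoostModel, XGBoostModelConfig

# scale_pos_weight = count(y==0) / count(y==1), computed upstream from y
config = XGBoostModelConfig(scale_pos_weight=0.25, validation_fraction=0.1)
model = XGBoostModel(config)
```

When team assignment is randomised before training, scale_pos_weight should instead be left as None.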