ncaa_eval.model.xgboost_model module

XGBoost gradient-boosting model — reference stateless model.

Wraps xgboost.XGBClassifier behind the Model ABC, providing fit / predict_proba / save / load with XGBoost’s native UBJSON persistence format.

class ncaa_eval.model.xgboost_model.XGBoostModel(config: XGBoostModelConfig | None = None, *, batch_rating_types: tuple[Literal['srs', 'ridge', 'colley'], ...] = ('srs',), graph_features_enabled: bool = False, ordinal_composite: Literal['simple_average', 'weighted', 'pca'] | None = None)[source]

Bases: Model

XGBoost binary classifier wrapping XGBClassifier.

This is a stateless model — it implements Model directly (no StatefulModel lifecycle hooks).

Label balance convention: The feature server typically assigns team_a = w_team_id (the winner), so y may be heavily biased toward 1. Callers should either randomise team assignment before training (recommended) or set scale_pos_weight in the config to count(y==0) / count(y==1). The default scale_pos_weight is None (XGBoost default = 1.0), appropriate when team assignment is randomised.
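The label-balance convention above can be sketched with plain pandas. Everything here is illustrative — `games`, `flip`, and the column names are hypothetical stand-ins, not part of the ncaa_eval API:

```python
import numpy as np
import pandas as pd

# Hypothetical raw schema: the feature server sets team_a = w_team_id,
# so every label starts out as 1 (team_a won).
games = pd.DataFrame({"team_a": [101, 102, 103, 104], "team_b": [201, 202, 203, 204]})
games["y"] = 1

# Option 1 (recommended): randomise team assignment and flip labels to match.
rng = np.random.default_rng(0)
flip = rng.random(len(games)) < 0.5
games.loc[flip, ["team_a", "team_b"]] = games.loc[flip, ["team_b", "team_a"]].to_numpy()
games.loc[flip, "y"] = 0  # team_a is now the loser on flipped rows

# Option 2: keep the biased labels and reweight instead.
y = pd.Series([1] * 8 + [0] * 2)
scale_pos_weight = (y == 0).sum() / (y == 1).sum()  # count(y==0) / count(y==1) -> 0.25
```

Option 1 keeps the classes balanced in expectation, so the default scale_pos_weight of None (XGBoost default = 1.0) remains appropriate.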

fit(X: DataFrame, y: Series) None[source]

Train on feature matrix X and binary labels y.

Automatically splits X into train/validation sets using validation_fraction from the config. The validation set is used for early stopping via eval_set.

Label balance convention: team_a assignment in the feature server is typically w_team_id (the winner). If labels are imbalanced, either randomise team assignment upstream or set scale_pos_weight = count(y==0) / count(y==1) in the XGBoostModelConfig.

Raises:

ValueError – If X is empty.
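The automatic split can be pictured as a simple tail holdout. The helper below is a hypothetical sketch of that behaviour, not the actual implementation inside fit(), whose split strategy may differ:

```python
import pandas as pd

def split_validation(X: pd.DataFrame, y: pd.Series, validation_fraction: float = 0.1):
    """Hold out the last validation_fraction of rows for early stopping."""
    n_val = max(1, int(len(X) * validation_fraction))
    return X.iloc[:-n_val], y.iloc[:-n_val], X.iloc[-n_val:], y.iloc[-n_val:]

X = pd.DataFrame({"f": range(20)})
y = pd.Series([0, 1] * 10)
X_train, y_train, X_val, y_val = split_validation(X, y, 0.1)
# 20 rows at validation_fraction=0.1 -> 2 validation rows, 18 training rows
```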

get_config() XGBoostModelConfig[source]

Return the Pydantic-validated configuration for this model.

get_feature_importances() list[tuple[str, float]] | None[source]

Return feature name/importance pairs from the fitted classifier.
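The returned pairs are convenient to rank by importance. The `importances` list below is hypothetical sample output, not real model results:

```python
# Hypothetical (feature_name, importance) pairs as returned by the method.
importances = [("seed_diff", 0.12), ("srs_diff", 0.41), ("win_pct_diff", 0.27)]
top = sorted(importances, key=lambda pair: pair[1], reverse=True)
# top[0] is the most important feature
```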

classmethod load(path: Path) Self[source]

Load a previously-saved XGBoost model from path.

Raises:

FileNotFoundError – If either config.json or model.ubj is missing.

predict_proba(X: DataFrame) Series[source]

Return P(team_a wins) for each row of X.

Raises:

RuntimeError – If called before fit().

save(path: Path) None[source]

Persist the trained model to the path directory.

Writes four files:

model.ubj — XGBoost native UBJSON format (stable across versions)
config.json — Pydantic-serialised hyperparameter config
feature_names.json — JSON array of feature column names
feature_config.json — FeatureConfig sidecar

Raises:

RuntimeError – If called before fit().
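The saved layout can be verified before calling load(). The check_artifacts helper below is hypothetical — a sketch mirroring load()'s documented FileNotFoundError behaviour, not part of the ncaa_eval API:

```python
import json
import tempfile
from pathlib import Path

REQUIRED = ("model.ubj", "config.json")  # load() raises if either is missing

def check_artifacts(path: Path) -> None:
    """Raise FileNotFoundError if a required save() artifact is absent."""
    for name in REQUIRED:
        if not (path / name).exists():
            raise FileNotFoundError(f"missing {name} in {path}")

# Simulate a save() output directory with stand-in contents.
save_dir = Path(tempfile.mkdtemp())
(save_dir / "model.ubj").write_bytes(b"")  # placeholder for the UBJSON booster
(save_dir / "config.json").write_text(json.dumps({"model_name": "xgboost"}))
(save_dir / "feature_names.json").write_text(json.dumps(["srs_diff"]))
check_artifacts(save_dir)  # passes; deleting model.ubj would make it raise
```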

class ncaa_eval.model.xgboost_model.XGBoostModelConfig(*, model_name: Literal['xgboost'] = 'xgboost', calibration_method: Literal['isotonic', 'sigmoid'] | None = None, n_estimators: int = 500, max_depth: int = 5, learning_rate: float = 0.05, subsample: float = 0.8, colsample_bytree: float = 0.8, min_child_weight: int = 3, reg_alpha: float = 0.0, reg_lambda: float = 1.0, early_stopping_rounds: int = 50, validation_fraction: Annotated[float, Gt(gt=0.0), Lt(lt=1.0)] = 0.1, scale_pos_weight: float | None = None)[source]

Bases: ModelConfig

Hyperparameters for the XGBoost gradient-boosting model.

Defaults from specs/research/modeling-approaches.md §5.5 and §6.4.

Label balance: Set scale_pos_weight = count(y==0) / count(y==1) when training labels are imbalanced (e.g. team_a is always the winner). Leave as None (XGBoost default = 1.0) when team assignment is randomised before training.

colsample_bytree: float
early_stopping_rounds: int
learning_rate: float
max_depth: int
min_child_weight: int
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_name: Literal['xgboost']
n_estimators: int
reg_alpha: float
reg_lambda: float
scale_pos_weight: float | None
subsample: float
validation_fraction: Annotated[float, Field(gt=0.0, lt=1.0)]
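Putting the label-balance note into practice, a configuration for imbalanced labels might be built as follows — a sketch assuming the ncaa_eval package is importable; the 0.25 value stands in for count(y==0) / count(y==1) computed from the actual training labels:

```python
from ncaa_eval.model.xgboost_model import XGBoostModel, XGBoostModelConfig

# scale_pos_weight = count(y==0) / count(y==1), computed upstream from y
config = XGBoostModelConfig(scale_pos_weight=0.25, validation_fraction=0.1)
model = XGBoostModel(config)
```

When team assignment is randomised before training, scale_pos_weight should instead be left as None.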