NCAA Eval Style Guide

This guide documents the coding standards and conventions for the ncaa_eval project. All rules here reflect decisions already configured in pyproject.toml. Tooling enforcement is handled by separate stories (1.4-1.6); this document captures the what and why.


1. Docstring Convention (Google Style)

Choice: Google-style docstrings. Configured in: [tool.ruff.lint.pydocstyle] convention = "google"

Rationale

  • Google style reads naturally in source code and renders well in Sphinx via napoleon.

  • With PEP 484 type annotations present everywhere (mypy --strict), repeating types inside docstrings is redundant. Google style keeps the prose concise.

Rules

  1. Do not duplicate types in Args: / Returns: sections. Type information lives in annotations only.

  2. Include Args:, Returns: (unless None), and Raises: (if applicable).

  3. Add an Examples: section for complex public functions.

  4. Docstrings are encouraged but not enforced on every entity. When pydocstyle (D) rules are enabled (Story 1.4), D1 (missing docstring) and D415 (first-line punctuation) will be suppressed via the Ruff ignore list. Note: D rules are not currently active — the convention = "google" setting and ignore entries in pyproject.toml are preparatory config that takes effect once D is added to extend-select in Story 1.4. At minimum, every public module and every public class should have a docstring.

  5. Detailed description required when a function performs 3 or more operations. The description paragraph (after the summary line) must explain how the function implements its purpose — not just restate the summary. This is enforced during code review (PR checklist).

Example

from __future__ import annotations

def rolling_efficiency(
    scores: pd.Series,
    window: int,
    min_periods: int = 1,
) -> pd.Series:
    """Compute rolling offensive efficiency over a sliding window.

    Applies a simple rolling mean to per-game point totals over the
    sliding window.

    Args:
        scores: Raw per-game point totals.
        window: Number of games in the rolling window.
        min_periods: Minimum observations required for a valid result.

    Returns:
        Rolling efficiency values aligned to the input index.

    Raises:
        ValueError: If *window* is less than 1.

    Examples:
        >>> rolling_efficiency(pd.Series([70, 80, 90]), window=2)
        0    70.0
        1    75.0
        2    85.0
        dtype: float64
    """

2. Naming Conventions

| Entity | Convention | Example |
| --- | --- | --- |
| Modules / packages | snake_case | feature_pipeline.py, ncaa_eval/ |
| Classes | PascalCase | EloModel, TeamStats |
| Functions / methods | snake_case | compute_margin, get_team_id |
| Variables | snake_case | home_score, season_df |
| Constants | UPPER_SNAKE_CASE | DEFAULT_K_FACTOR, MIN_GAMES |
| Type aliases | PascalCase (use type statement) | type TeamId = int |
| Private members | Single leading underscore | _cache, _validate_input() |
| Test files | test_<module>.py | test_elo_model.py |
| Test functions | test_<behaviour> | test_margin_positive_when_home_wins |

Additional Rules

  • No abbreviations in public APIs unless universally understood (id, df, url).

  • Boolean names should read as predicates: is_valid, has_played, should_normalize.

  • DataFrame variables should indicate their grain: game_df, season_stats_df, team_features_df.
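
A short sketch showing several of these conventions together (the names here are hypothetical, chosen only to illustrate the rules):

```python
from __future__ import annotations

DEFAULT_K_FACTOR = 32  # constant: UPPER_SNAKE_CASE


class EloModel:  # class: PascalCase
    """Minimal illustration of naming conventions (not the real model)."""

    def __init__(self) -> None:
        self._ratings: dict[int, float] = {}  # private member: leading underscore

    def get_rating(self, team_id: int) -> float:  # method: snake_case
        return self._ratings.get(team_id, 1500.0)


def is_valid_score(home_score: int) -> bool:  # boolean reads as a predicate
    return home_score >= 0
```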


3. Import Ordering

Configured in: [tool.ruff.lint.isort]

Imports are ordered in three groups separated by blank lines:

  1. Standard library (os, pathlib, typing, …)

  2. Third-party (pandas, numpy, xgboost, …)

  3. Local / first-party (ncaa_eval, tests)

Mandatory Rules

| Rule | Detail | Config Reference |
| --- | --- | --- |
| Future annotations | Every file must start with from __future__ import annotations | required-imports |
| Combine as-imports | from foo import bar, baz as B on one line | combine-as-imports = true |
| First-party detection | tests is known first-party | known-first-party = ["tests"] |

Example

from __future__ import annotations

import logging
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

from ncaa_eval.model import Model
from ncaa_eval.transform import compute_rolling_stats

Active Ruff Rules

Beyond isort, pyproject.toml extends the default Ruff rules (E, F) with:

| Rule | Category | What It Enforces |
| --- | --- | --- |
| I | isort | Import ordering (see above) |
| UP | pyupgrade | Modern Python syntax — e.g., list[int] not List[int], X \| None not Optional[X] |
| PT | flake8-pytest-style | Pytest best practices — e.g., @pytest.fixture conventions, parametrize style |
| TID25 | tidy-imports | Import hygiene — bans relative imports from parent packages |
| C90 | mccabe | McCabe cyclomatic complexity ≤ 10 (see Section 6.1) |
| PLR0911 | pylint-refactor | Max 6 return statements per function |
| PLR0912 | pylint-refactor | Max 12 branches per function |
| PLR0913 | pylint-refactor | Max 5 function arguments |

Suppressed Rules

| Rule | Why Suppressed |
| --- | --- |
| E501 | Line length is enforced by the Ruff formatter (110 chars), not the linter. Suppressing E501 avoids double-reporting. |
| D1 | Missing-docstring warnings. Not yet active (see Section 1 note). |
| D415 | First-line punctuation. Not yet active (see Section 1 note). |
| PLR2004 | Magic value comparison — too aggressive for data science code with domain-specific constants (e.g., tournament field size 64, coin-flip probability 0.5). See PEP 20 “Practicality Beats Purity” in Section 6. |

Lint Suppression Policy

When # noqa or # type: ignore is unavoidable, follow these rules:

  1. Always include the specific error code. Never use bare # noqa or bare # type: ignore — always specify which rule is being suppressed.

    # GOOD: Specific suppression with rationale
    import joblib  # type: ignore[import-untyped] — no type stubs available
    
    # BAD: Bare suppression — which rule? why?
    import joblib  # type: ignore
    
  2. Prefer refactoring over suppression. The escalation path:

    • First: Refactor to eliminate the violation (extract function, split class, etc.)

    • Second: If refactoring is impractical, suppress with specific code and comment

    • Never: Suppress because “it’s faster than fixing”

  3. Complexity-class codes require PO approval. Inline # noqa for the following codes is allowed only with explicit PO approval:

    • PLR0911 — Too many return statements (max 6)

    • PLR0912 — Too many branches (max 12)

    • PLR0913 — Too many arguments (max 5)

    • C901 — McCabe cyclomatic complexity (max 10)

    When a suppression is approved, the inline comment must include a rationale:

    # Approved — rationale
    def update_game(  # noqa: PLR0913 — game data has inherent dimensionality
    # Pending refactoring — tracked to a story
    def run_training(  # noqa: PLR0913, C901, PLR0912 — REFACTOR Story 8.1
    
  4. Acceptable suppression scenarios:

    • # type: ignore[import-untyped] — Third-party libraries without type stubs

    • # type: ignore[no-untyped-def] — Callback signatures imposed by frameworks (e.g., Streamlit event handlers)

    • # noqa: BLE001 — Broad exception handling with documented rationale

    • # noqa: PLR0913 (etc.) — With PO approval and rationale (see rule 3)

  5. Unacceptable suppression scenarios:

    • Any # noqa for PLR0911, PLR0912, PLR0913, or C901 without PO approval

    • # type: ignore on return types — Fix the return type annotation instead

    • Any suppression without the specific error code


4. Type Annotation Standards

Configured in: [tool.mypy] strict = true

What --strict Means for Developers

Every function signature, variable, and return type must be annotated. The strict flag enables all of the following (as of mypy 1.x — the exact set may change with future versions; run mypy --help to see the current list):

  • --disallow-untyped-defs — every function needs annotations

  • --disallow-any-generics — no bare list, dict; use list[int], dict[str, float]

  • --warn-return-any — functions must not silently return Any

  • --no-implicit-reexport — explicit __all__ required for re-exports

  • --strict-bytes — distinguishes bytes from str

Practical Guidelines

  1. Use from __future__ import annotations (already enforced) so all annotations are strings and forward references work.

  2. Use modern syntax: list[int] not List[int], X | None not Optional[X].

  3. Use the type statement for aliases: type TeamId = int.

  4. For third-party libraries with incomplete stubs (numpy, pandas, xgboost), the config uses follow_imports = "silent" to suppress errors from untyped dependencies. Add # type: ignore[<code>] only when truly necessary and always include the specific error code.

  5. The py.typed marker at src/ncaa_eval/py.typed signals PEP 561 compliance.
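
A minimal sketch of guidelines 1 and 2 in practice (the functions are hypothetical; under --strict every parameter and return value carries an annotation):

```python
from __future__ import annotations


def expected_score(rating: float, opponent_rating: float) -> float:
    """Every parameter and the return value are annotated (no implicit Any)."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))


def best_team(ratings: dict[int, float]) -> int | None:
    """Modern syntax: dict[int, float] and int | None, not Dict/Optional."""
    if not ratings:
        return None
    return max(ratings, key=ratings.__getitem__)
```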

Pydantic Integration

Configured in: [tool.mypy] plugins and [tool.pydantic-mypy]

Pydantic is the project’s primary data modeling library (shared types between Logic and UI layers). The mypy plugin provides enhanced type checking for Pydantic models:

[tool.mypy]
plugins = ["pydantic.mypy"]

[tool.pydantic-mypy]
init_typed = true                    # __init__ params get field types
init_forbid_extra = true             # error on unknown fields in __init__
warn_required_dynamic_aliases = true # warn if alias without default

What this means for developers:

  • init_typed = true — mypy understands Model(field=value) constructor calls and validates argument types against field definitions.

  • init_forbid_extra = true — passing an unknown field name to a model constructor is a type error, not a silent runtime discard.

  • warn_required_dynamic_aliases = true — alerts when a field with alias= has no default value, which can cause confusing ValidationError messages.
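
For illustration, a hypothetical model showing what the plugin checks (the field names here are invented, not project types):

```python
from __future__ import annotations

from pydantic import BaseModel


class TeamRecord(BaseModel):
    """Hypothetical shared type between Logic and UI layers."""

    team_id: int
    name: str
    rating: float = 1500.0


# init_typed: mypy validates these keyword arguments against the field types.
record = TeamRecord(team_id=1, name="Duke")

# init_forbid_extra: TeamRecord(team_id=1, name="Duke", seed=3) would be a
# mypy error — the unknown field is caught at type-check time, not silently
# discarded at runtime.
```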


5. Vectorization First

Reject PRs that use for loops over Pandas DataFrames for metric calculations.

— Architecture Section 12

This is a hard rule. Scientific computation in this project must use vectorized operations.

Why

  • Vectorized pandas/numpy operations are 10-100x faster than Python loops.

  • They are easier to reason about for correctness.

  • They compose cleanly with .pipe() chains.

Correct Pattern

# Vectorized — computes margin for every game in one shot
game_df["margin"] = game_df["home_score"] - game_df["away_score"]

# Vectorized rolling mean
game_df["rolling_ppg"] = game_df.groupby("team_id")["points"].transform(
    lambda s: s.rolling(window=5, min_periods=1).mean()
)

Incorrect Pattern

# WRONG — for loop over DataFrame rows
margins = []
for _, row in game_df.iterrows():
    margins.append(row["home_score"] - row["away_score"])
game_df["margin"] = margins

Exceptions

Explicit for loops are acceptable only when:

  1. Operating on a small, fixed collection (e.g., iterating over 5 model configs, not 10,000 game rows).

  2. Performing graph traversal (e.g., NetworkX operations that have no vectorized equivalent).

  3. The loop body involves side effects that cannot be vectorized (e.g., writing files per team).

  4. Ingest-layer one-time-per-sync operations. src/ncaa_eval/ingest/connectors/kaggle.py uses iterrows() in load_day_zeros, fetch_teams, and _parse_games_csv for per-row data transformation: constructing Pydantic models (Game, Team) in fetch_teams and _parse_games_csv, and parsing dates with per-row error handling in load_day_zeros. These run exactly once per sync operation and are not in any hot path. Pandera schema validation (Story 9.14) guards the DataFrame before iteration begins. (PO Decision C — Audit item 2.4, 2026-03-11)

In these cases, add a brief comment explaining why the loop is necessary.
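
For example, a sketch of exception 3 (a hypothetical helper, not project code):

```python
from __future__ import annotations

from pathlib import Path

import pandas as pd


def export_team_reports(team_stats_df: pd.DataFrame, out_dir: Path) -> None:
    """Write one CSV per team (hypothetical side-effect helper)."""
    # Explicit loop is acceptable here: the body writes one file per team,
    # a side effect that cannot be vectorized (Section 5, exception 3).
    for team_id, team_df in team_stats_df.groupby("team_id"):
        team_df.to_csv(out_dir / f"team_{team_id}.csv", index=False)
```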


6. Design Philosophy (PEP 20 - The Zen of Python)

“Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex.”

— PEP 20

While PEP 20 can’t be fully automated, we enforce its core principles through code review and tooling where possible.

Enforced Principles

Simple is Better Than Complex

  • Tooling: Ruff complexity checks (see Section 6.1)

  • Review: Max cyclomatic complexity = 10, max function length = 50 lines

  • Guideline: If a function is hard to name, it’s doing too much — split it

# GOOD: Simple, single-purpose function
def calculate_margin(home_score: int, away_score: int) -> int:
    """Calculate point margin (positive = home win)."""
    return home_score - away_score

# BAD: Too complex, doing multiple things
def process_game(game_data: dict) -> dict:
    """Process game... but what does this actually do?"""
    # 80 lines of nested logic with unclear responsibilities
    ...

Explicit is Better Than Implicit

  • Tooling: Mypy strict mode enforces explicit types

  • Review: No magic numbers, no implicit state changes, clear function names

  • Guideline: Code should read like prose — the reader shouldn’t have to guess

# GOOD: Explicit parameters and behavior
def calculate_elo_change(
    rating: int,
    opponent_rating: int,
    won: bool,
    k_factor: int = 32,
) -> int:
    """Calculate Elo rating change after a game."""
    expected = 1 / (1 + 10 ** ((opponent_rating - rating) / 400))
    actual = 1 if won else 0
    return int(k_factor * (actual - expected))

# BAD: Implicit behavior, magic numbers
def adjust_rating(r: int, o: int, w: bool) -> int:
    return int(32 * (w - 1 / (1 + 10 ** ((o - r) / 400))))

Readability Counts

  • Tooling: Ruff formatting (110 char lines), naming conventions

  • Review: Variable names reflect domain concepts, not abbreviations

  • Guideline: Write for humans first, computers second

# GOOD: Clear domain language
team_offensive_efficiency = total_points / total_possessions

# BAD: Abbreviations require mental translation
off_eff = pts / poss

Flat is Better Than Nested

  • Tooling: Ruff detects excessive nesting

  • Review: Max nesting depth = 3

  • Guideline: Use early returns to reduce nesting

# GOOD: Flat structure with early returns
def validate_game(game: Game) -> None:
    """Validate game data."""
    if game.home_score < 0:
        raise ValueError("Home score cannot be negative")
    if game.away_score < 0:
        raise ValueError("Away score cannot be negative")
    if game.date > datetime.now():
        raise ValueError("Game cannot be in the future")

# BAD: Nested structure
def validate_game(game: Game) -> None:
    if game.home_score >= 0:
        if game.away_score >= 0:
            if game.date <= datetime.now():
                return
            else:
                raise ValueError("Game cannot be in the future")
        else:
            raise ValueError("Away score cannot be negative")
    else:
        raise ValueError("Home score cannot be negative")

There Should Be One Obvious Way to Do It

  • Review: Follow established project patterns (e.g., vectorization for calculations)

  • Guideline: Check existing code before inventing new approaches

# GOOD: Follows project pattern (vectorized operations)
game_df["margin"] = game_df["home_score"] - game_df["away_score"]

# BAD: Invents custom approach (loops)
for idx in range(len(game_df)):
    game_df.loc[idx, "margin"] = game_df.loc[idx, "home_score"] - game_df.loc[idx, "away_score"]

Complex is Better Than Complicated

  • Review: Accept genuine complexity; reject unnecessary complication

  • Guideline: Some systems are inherently complex (simulation engines, Elo rating chains, cross-validation orchestrators). The goal is managed complexity — complex internals behind simple interfaces — not complication from tangled logic.

# GOOD: Complex but not complicated — clear interface, complex internals
class MonteCarloSimulator:
    """Simulate tournament outcomes via Monte Carlo sampling.

    The algorithm is inherently complex (bracket traversal, probability
    propagation, seeded randomness), but the interface is simple.
    """
    def simulate(self, bracket: Bracket, n_simulations: int = 10_000) -> SimulationResult:
        ...

# BAD: Complicated — unnecessary indirection, unclear purpose
def run_sim(data, opts, flags, ctx, mode="default", **kwargs):
    if mode == "default":
        ...  # 200 lines of nested conditionals

Sparse is Better Than Dense

  • Tooling: 110-character line length limit, Ruff formatting

  • Review: One operation per line, generous whitespace, no clever one-liners

  • Guideline: When a line does too much, split it. When a function packs too many operations together, extract named intermediates.

# GOOD: Sparse — each step is clear
expected_win_rate = 1 / (1 + 10 ** ((opponent_rating - team_rating) / 400))
actual_outcome = 1.0 if won else 0.0
rating_change = k_factor * (actual_outcome - expected_win_rate)

# BAD: Dense — everything crammed together
return int(32*(int(w)-1/(1+10**((o-r)/400))))

Special Cases Aren’t Special Enough / Practicality Beats Purity

  • Review: Allow pragmatic exceptions when properly documented

  • Guideline: The vectorization-first rule (Section 5) has legitimate exceptions: graph traversal, small fixed collections, side-effect loops. Data science code has domain-specific constants (0.5 for coin-flip probability, 64 for tournament teams) that PLR2004 would flag — we suppress PLR2004 because practicality beats purity for these cases. The key: every exception must be documented and intentional, never silent.

# GOOD: Pragmatic exception with justification
# PLR2004 suppressed project-wide — domain constants like tournament size
# are clearer inline than as named constants
def build_bracket(teams: list[Team]) -> Bracket:
    if len(teams) != 64:  # NCAA tournament field size
        raise ValueError(f"Expected 64 teams, got {len(teams)}")
    ...

# BAD: Bare magic number with no context
def process(data):
    if len(data) > 42:  # What is 42?
        data = data[:42]

Errors Should Never Pass Silently / Unless Explicitly Silenced

  • Review: Every exception handler must log or re-raise; no bare except: pass

  • Guideline: This is the project’s most-violated PEP 20 principle (see Pattern D from codebase audit). The convention:

    1. Default: Log at WARNING level with structured context, then re-raise or return a sentinel that callers check.

    2. Explicitly silenced: Log at DEBUG level with a code comment explaining why silence is acceptable. Use # noqa or # type: ignore with specific codes.

    3. Never: Bare except Exception: pass or except Exception that substitutes a value without logging.

# GOOD: Error surfaced with context
try:
    response = fetch_team_schedule(team_id)
except RequestError as exc:
    logger.warning("Failed to fetch schedule for team %d: %s", team_id, exc)
    raise

# GOOD: Explicitly silenced with rationale
try:
    optional_enrichment = fetch_extra_stats(team_id)
except RequestError:
    # Enrichment is non-critical; base stats are sufficient for predictions.
    logger.debug("Optional enrichment unavailable for team %d", team_id)
    optional_enrichment = None

# BAD: Silent swallowing — Pattern D violation
try:
    data = fetch_team_schedule(team_id)
except Exception:
    data = pd.DataFrame()  # Caller has no idea data is missing

Refuse the Temptation to Guess

  • Tooling: mypy --strict — forces explicit types everywhere

  • Review: No implicit type coercions, no unvalidated assumptions

  • Guideline: This aphorism is the philosophical foundation for mypy --strict. When types are explicit, the compiler catches wrong assumptions before runtime. When types are Any or missing, the code guesses — and guesses wrong at the worst possible time.

# GOOD: Explicit types — no guessing
def get_team_rating(team_id: int, ratings: dict[int, float]) -> float | None:
    return ratings.get(team_id)

# BAD: Implicit types — code guesses about structure
def get_team_rating(team_id, ratings):
    return ratings[team_id]  # KeyError? ratings is a list? Who knows.

If the Implementation Is Hard to Explain, It’s a Bad Idea / If It’s Easy to Explain, It May Be a Good Idea

  • Tooling: McCabe complexity ≤ 10, max 50 lines per function

  • Review: If you can’t describe a function in one sentence, it needs splitting

  • Guideline: This reinforces the complexity limits. If a code reviewer asks “what does this do?” and the answer requires a paragraph, the function is doing too much. The run_training() function in cli/train.py (70+ statements, 3 noqa suppressions) is the canonical example of code that is hard to explain.

# GOOD: Easy to explain — "calculates Brier score for probability predictions"
def brier_score(probabilities: np.ndarray, outcomes: np.ndarray) -> float:
    return float(np.mean((probabilities - outcomes) ** 2))

# BAD: Hard to explain — "it processes... things... with conditions..."
def process_and_validate_then_transform_with_fallback(data, config, mode, flags):
    ...  # 80 lines that require 5 minutes to explain

Code Review Checklist for PEP 20

During code review, verify:

  • [ ] Simplicity: Functions have single, clear responsibilities (McCabe complexity ≤ 10)

  • [ ] Single purpose: Each function does exactly one thing; max 50 lines, nesting ≤ 3

  • [ ] Explicitness: No magic numbers, parameters have clear names, behavior is obvious

  • [ ] Readability: Domain concepts use full words, not abbreviations

  • [ ] Flatness: Nesting depth ≤ 3, early returns preferred

  • [ ] Consistency: Follows existing project patterns (vectorization, type sharing, etc.)

  • [ ] Complexity vs complication: Inherent complexity is managed behind simple interfaces

  • [ ] Sparseness: One operation per line, no clever one-liners, named intermediates

  • [ ] Pragmatic exceptions: Deviations from rules are documented and intentional

  • [ ] Error handling: No silent exception swallowing; handlers log or re-raise

  • [ ] Explicit types: No Any, no missing annotations, no implicit coercions

  • [ ] Explainability: Every function describable in one sentence

  • [ ] No complexity-code overrides: No # noqa for PLR0911/PLR0912/PLR0913/C901 without PO approval; each includes a rationale

  • [ ] Docstring detail: Functions with 3+ operations have detailed description explaining how


6.1 Complexity Gates (Ruff Configuration)

Configured in: pyproject.toml [tool.ruff.lint.mccabe]

| Metric | Limit | Enforced By |
| --- | --- | --- |
| McCabe Cyclomatic Complexity | 10 | Ruff C901 (pre-commit) |
| Max Function Length | 50 lines | Manual review |
| Max Nesting Depth | 3 | Manual review |
| Max Arguments | 5 | Ruff PLR0913 (pre-commit) |

See pyproject.toml for exact configuration.
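
The tool-enforced limits above might be expressed like this (a sketch using Ruff's documented setting names; pyproject.toml remains the authoritative source):

```toml
[tool.ruff.lint.mccabe]
max-complexity = 10   # C901 fires above this

[tool.ruff.lint.pylint]
max-args = 5          # PLR0913 fires above this
```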


6.2 Pure Functions vs Side Effects

Pure functions are easier to test, faster to run, and easier to reason about.

Design guideline: Keep business logic pure, push side effects to edges.

Pure Functions (Preferred)

Definition: Same input always produces same output, no side effects.

Characteristics:

  • Deterministic (predictable)

  • No external dependencies (no I/O, no database, no network, no time/randomness)

  • No state mutation

  • Easy to test (just input → output, no mocking)

  • Perfect for property-based testing (Hypothesis)

  • Fast (no I/O)

# PURE: Always deterministic, easy to test
def calculate_win_probability(rating_diff: int) -> float:
    """Calculate win probability from rating difference."""
    return 1 / (1 + 10 ** (-rating_diff / 400))

# PURE: Data transformation, no side effects
def normalize_team_names(names: pd.Series) -> pd.Series:
    """Normalize team names to standard format (vectorized)."""
    return names.str.strip().str.title().str.replace("St.", "Saint")

# PURE: Mathematical calculation (vectorized)
def calculate_margins(home_scores: np.ndarray, away_scores: np.ndarray) -> np.ndarray:
    """Calculate point margins (vectorized, pure)."""
    return home_scores - away_scores

Testing pure functions:

# Unit test: Simple input → output
def test_win_probability_equal_ratings():
    """Verify win probability is 50% for equal ratings."""
    assert calculate_win_probability(rating_diff=0) == 0.5

# Property test: Perfect for pure functions
@pytest.mark.property
@given(rating_diff=st.integers(-1000, 1000))
def test_win_probability_always_bounded(rating_diff):
    """Verify win probability always in [0, 1] (invariant)."""
    prob = calculate_win_probability(rating_diff)
    assert 0 <= prob <= 1

Side-Effect Functions (Push to Edges)

Definition: Functions that interact with the outside world or modify state.

Characteristics:

  • Non-deterministic (may vary based on external state)

  • External dependencies (files, database, network, time, randomness)

  • Modifies state (mutates objects, writes files, updates database)

  • Harder to test (requires mocks, stubs, fixtures)

  • Requires integration tests

# SIDE-EFFECT: Reads from file system
def load_games(path: Path) -> pd.DataFrame:
    """Load games from CSV file (I/O operation)."""
    return pd.read_csv(path)

# SIDE-EFFECT: Depends on current time (non-deterministic)
def is_game_started(game_start: datetime) -> bool:
    """Check if game has started."""
    return datetime.now() > game_start  # Changes over time

# SIDE-EFFECT: Mutates external state
def update_team_rating(team_id: int, new_rating: int) -> None:
    """Update team rating in database."""
    db.execute("UPDATE teams SET rating = ? WHERE id = ?", new_rating, team_id)

Testing side-effect functions:

# Integration test: Requires fixtures
@pytest.mark.integration
def test_load_games_returns_valid_dataframe(temp_data_dir):
    """Verify games can be loaded from CSV."""
    # Setup: Create test file
    test_file = temp_data_dir / "games.csv"
    test_file.write_text("game_id,home_team,away_team\n1,Duke,UNC\n")

    # Execute
    games = load_games(test_file)

    # Assert
    assert len(games) == 1
    assert "game_id" in games.columns

Good Separation: Pure Core + Side-Effect Shell

Pattern: Keep calculations pure, orchestrate I/O at edges.

# PURE: Core business logic (easy to test, vectorized)
def calculate_win_probabilities(
    home_ratings: np.ndarray,
    away_ratings: np.ndarray,
) -> np.ndarray:
    """Calculate win probabilities (pure, vectorized)."""
    rating_diff = home_ratings - away_ratings
    return 1 / (1 + 10 ** (-rating_diff / 400))

# SIDE-EFFECT: Orchestration at the edge
def simulate_tournament(games_path: Path, ratings_path: Path) -> pd.DataFrame:
    """Simulate tournament (orchestrates pure logic + I/O)."""
    # Side effects: Load data
    games = pd.read_csv(games_path)
    ratings = pd.read_csv(ratings_path, index_col="team")

    # Side effects: Data prep
    games = games.merge(ratings, left_on="home_team", right_index=True)
    games = games.merge(ratings, left_on="away_team", right_index=True,
                       suffixes=("_home", "_away"))

    # PURE: Core calculation (vectorized, easy to test separately)
    games["win_prob"] = calculate_win_probabilities(
        games["rating_home"].values,
        games["rating_away"].values,
    )

    return games

Why this is better:

  1. Pure function (calculate_win_probabilities):

    • Fast unit tests (no I/O)

    • Property-based tests (invariants)

    • Reusable in different contexts

    • Vectorized (meets NFR1)

  2. Side-effect function (simulate_tournament):

    • Thin orchestration layer

    • Easy to see where I/O happens

    • Core logic testable independently


Bad: Mixing Pure Logic with Side Effects

# BAD: Pure logic buried inside side effects
def simulate_tournament_bad() -> pd.DataFrame:
    """Simulate tournament (mixed design - hard to test)."""
    games = pd.read_csv("data/games.csv")  # Side effect

    results = []
    for _, game in games.iterrows():  # Side effect + non-vectorized!
        # Side effects: Database calls
        home_rating = db.query("SELECT rating FROM teams WHERE id = ?", game.home_team)
        away_rating = db.query("SELECT rating FROM teams WHERE id = ?", game.away_team)

        # Pure logic buried inside (can't test without database!)
        rating_diff = home_rating - away_rating
        prob = 1 / (1 + 10 ** (-rating_diff / 400))

        results.append(prob)

    return pd.DataFrame(results)

Problems:

  • Can’t test calculation logic without database

  • Can’t use property-based tests

  • Slow (I/O in loop)

  • Violates vectorization requirement

  • Hard to debug (mixed concerns)


Testing Strategy by Function Type

| Function Type | Test Type | Characteristics | Example |
| --- | --- | --- | --- |
| Pure | Unit test, Property-based | Fast, no mocking, deterministic | calculate_win_probability() |
| Side-effect | Integration test | Slower, requires fixtures/mocks | load_games(), update_team_rating() |
| Mixed | ❌ AVOID | Hard to test, refactor! | simulate_tournament_bad() |


Code Review Checklist for Pure vs Side-Effect

During code review, verify:

  • [ ] Pure logic separated: Business calculations are pure functions

  • [ ] Side effects at edges: I/O, database, network calls in orchestration layer

  • [ ] No mixing: Pure functions don’t contain I/O operations

  • [ ] Vectorization: Pure functions use numpy/pandas operations (not loops)

  • [ ] Property tests: Pure functions have property-based tests for invariants

  • [ ] Integration tests: Side-effect functions have integration tests with fixtures


7. PR Checklist (Summary)

Every pull request must pass the following gates. The actual PR template is at .github/pull_request_template.md. For the philosophy behind this two-tier approach, see docs/TESTING_STRATEGY.md (Section: Quality Gates and Execution Tiers).

| Gate | Tool | Timing |
| --- | --- | --- |
| Lint pass | Ruff | Pre-commit (fast) |
| Type-check pass | Mypy (--strict) | Pre-commit (fast) |
| Test pass | Pytest | PR / CI |
| Docstring coverage | Manual review | PR review |
| Docstring detail | Manual review | PR review |
| No vectorization violations | Manual review | PR review |
| Conventional commit messages | Commitizen | Pre-commit |
| Function complexity | Manual review | PR review |
| No complexity-code overrides | Manual review | PR review |
| PEP 20 compliance | Manual review | PR review |
| SOLID principles | Manual review | PR review |
| Pure function design | Manual review | PR review |

Review Criteria

  • Code follows naming conventions (Section 2).

  • Imports are ordered correctly (Section 3).

  • New public APIs have docstrings (Section 1).

  • Functions with 3+ operations include a detailed docstring description explaining how (not just what) — see Section 1.

  • No for loops over DataFrames for calculations (Section 5).

  • Type annotations are complete (Section 4).

  • Each function does one thing; manually verify single responsibility, max 50 lines, nesting ≤ 3.

  • No # noqa for PLR0911, PLR0912, PLR0913, or C901 without PO approval; each must include a rationale (Section 3, Lint Suppression Policy).

  • PEP 20 design principles respected (Section 6).

  • Pure functions used for business logic, side effects at edges (Section 6.2).

  • SOLID principles applied for testability (Section 10).

  • Data structures shared between Logic and UI use Pydantic models or TypedDicts.

  • Dashboard code never reads files directly — it calls ncaa_eval functions.


8. File & Module Organization

Project Layout

src/
  ncaa_eval/
    __init__.py          # Package root
    py.typed             # PEP 561 marker
    cli/                 # CLI entry points (main, train)
    ingest/              # Data source connectors
    transform/           # Feature engineering
    model/               # Model ABC and implementations (singular, not models/)
    evaluation/          # Metrics, CV, simulation
    utils/               # Shared helpers (logger.py)
tests/
  __init__.py
  conftest.py            # Shared fixtures
  unit/                  # Unit tests organized by domain
  integration/           # Integration tests
  fixtures/              # Test data fixtures
dashboard/               # Streamlit UI (imports ncaa_eval — no direct IO)
data/                    # Local data store (git-ignored)

Rules

  1. One concept per module. A module should do one thing. If it grows beyond ~300 lines, consider splitting.

  2. Mirror src/ in tests/. src/ncaa_eval/model/elo.py is tested by tests/unit/test_elo.py (or tests/integration/test_elo_integration.py).

  3. __init__.py re-exports. Public symbols should be importable from the submodule level (e.g., from ncaa_eval.model import Model, get_model). The root ncaa_eval package does not re-export anything. Concrete model classes self-register via @register_model and are accessed through the registry (get_model("elo")) or direct import (from ncaa_eval.model.elo import EloModel).

  4. No circular imports. If two modules need each other, extract shared types into a third module.

  5. Configuration lives in pyproject.toml. Do not create separate config files for tools that support pyproject.toml.

  6. Line length is 110. Not the default 88. Configured in [tool.ruff] line-length = 110.
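The self-registration pattern described in rule 3 can be sketched as follows. This is an illustrative minimal version, not the project's actual implementation; the `Model` interface is deliberately simplified here.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from collections.abc import Callable

_REGISTRY: dict[str, type[Model]] = {}


class Model(ABC):
    """Base class for all prediction models."""

    @abstractmethod
    def predict(self) -> float: ...


def register_model(name: str) -> Callable[[type[Model]], type[Model]]:
    """Class decorator that records a concrete Model under ``name``."""

    def decorator(cls: type[Model]) -> type[Model]:
        _REGISTRY[name] = cls
        return cls

    return decorator


def get_model(name: str) -> type[Model]:
    """Look up a registered model class by its registry name."""
    return _REGISTRY[name]


@register_model("elo")
class EloModel(Model):
    def predict(self) -> float:
        return 0.5  # placeholder probability
```

With this in place, get_model("elo") and a direct import of EloModel refer to the same class, so both access paths named in rule 3 stay consistent.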


9. Additional Architecture Rules

These rules come from the project architecture and apply across all code:

| Rule | Detail |
| --- | --- |
| Type sharing | All data structures shared between Logic and UI must use Pydantic models or TypedDicts. |
| No direct IO in UI | The Streamlit dashboard must call ncaa_eval library functions — never read files directly. |
| Commit messages | Use conventional commits format (feat:, fix:, docs:, etc.) enforced by Commitizen. |
| Python version | >=3.12,<4.0. Use modern syntax (match, type statement, X \| None). |


10. SOLID Principles for Testability

SOLID principles make code testable; violating them produces hard-to-test code.

These five object-oriented design principles improve maintainability and testability. While not automatically enforced, they’re checked during code review.

S - Single Responsibility Principle (SRP)

“A class should have only one reason to change.”

What it means: Each class/function does ONE thing well.

Already enforced by: PEP 20 complexity checks (Section 6)

# GOOD: Single responsibility
class GameLoader:
    """Loads games from CSV files."""
    def load(self, path: Path) -> pd.DataFrame:
        return pd.read_csv(path)

class GameValidator:
    """Validates game data."""
    def validate(self, games: pd.DataFrame) -> None:
        if games["home_score"].min() < 0:
            raise ValueError("Scores cannot be negative")

# BAD: Multiple responsibilities
class GameManager:
    """Does everything - loading, validating, processing, saving."""
    def do_everything(self, path: Path) -> None:
        # Too many reasons to change!
        pass

Testing impact: Easy to test (one thing = one test class)


O - Open/Closed Principle (OCP)

“Open for extension, closed for modification.”

What it means: Add new features without changing existing code.

# GOOD: Open for extension via inheritance
class RatingModel(ABC):
    @abstractmethod
    def predict(self, game: Game) -> float:
        pass

class EloModel(RatingModel):
    def predict(self, game: Game) -> float:
        return self._calculate_elo(game)

class GlickoModel(RatingModel):  # New model, no changes to existing code
    def predict(self, game: Game) -> float:
        return self._calculate_glicko(game)

# BAD: Must modify existing code for each new model
def predict(game: Game, model_type: str) -> float:
    if model_type == "elo":
        return calculate_elo(game)
    elif model_type == "glicko":  # Must modify this function
        return calculate_glicko(game)

Testing impact: Easy to mock/stub new implementations


L - Liskov Substitution Principle (LSP)

“Subtypes must be substitutable for their base types.”

What it means: If Dog inherits from Animal, you can use Dog anywhere you use Animal without breaking things.

Already enforced by: Property-based tests verify contracts (Section: Test Purpose Guide)

# GOOD: Subclass honors parent contract
class RatingModel(ABC):
    @abstractmethod
    def predict(self, game: Game) -> float:
        """Return probability in [0, 1]."""
        pass

class EloModel(RatingModel):
    def predict(self, game: Game) -> float:
        # rating / opp_rating come from the model's rating table (omitted);
        # the logistic form always returns a value in [0, 1] as promised
        return 1 / (1 + 10 ** ((opp_rating - rating) / 400))

# BAD: Subclass violates parent contract
class BrokenModel(RatingModel):
    def predict(self, game: Game) -> float:
        # Returns values > 1.0, breaking the contract!
        return rating_difference * 100

Testing impact: Property tests verify contracts hold across all implementations
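The contract check that such property tests automate can be sketched without dependencies (the real suite uses property-based testing; the helper names here are illustrative):

```python
from collections.abc import Callable, Iterable


def elo_predict(rating: float, opp_rating: float) -> float:
    """Logistic Elo win probability, mirroring the example above."""
    return 1 / (1 + 10 ** ((opp_rating - rating) / 400))


def honors_contract(
    predict: Callable[[float, float], float],
    rating_pairs: Iterable[tuple[float, float]],
) -> bool:
    """True if every prediction lands in [0, 1] — the LSP contract."""
    return all(0.0 <= predict(r, o) <= 1.0 for r, o in rating_pairs)
```

honors_contract(elo_predict, pairs) holds for any rating pairs, while a BrokenModel-style predict that returns scaled rating differences fails it immediately.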


I - Interface Segregation Principle (ISP)

“Many specific interfaces are better than one general-purpose interface.”

What it means: Don’t force classes to implement methods they don’t need.

Already enforced by: MyPy strict mode with ABCs and Protocols (Section 4)

This project uses ABCs as the primary pattern for interface segregation (Model ABC, Repository ABC, Connector ABC) and Protocols as a complement for structural typing where duck typing is preferred (e.g., ProbabilityProvider, ScoringRule in the simulation engine). Choose based on context:

  • ABCs: When implementations must be explicitly registered and share behavior (e.g., all models inherit from Model and get common infrastructure).

  • Protocols: When you need structural typing — any object with matching methods qualifies, without inheriting from anything (e.g., ProbabilityProvider can be satisfied by any class with a predict_proba method).

# PRIMARY: ABCs for explicit interface contracts
class Model(ABC):
    """Base class for all prediction models."""
    @abstractmethod
    def predict(self, game: Game) -> float: ...

class EloModel(Model):
    def predict(self, game: Game) -> float: ...

# COMPLEMENT: Protocols for structural typing (duck typing)
class ProbabilityProvider(Protocol):
    def predict_proba(self, matchup: Matchup) -> float: ...

# Any class with predict_proba() satisfies this — no inheritance needed
class EloProvider:
    def predict_proba(self, matchup: Matchup) -> float: ...

# BAD: Fat interface forces unused methods
class DoEverything(ABC):
    @abstractmethod
    def predict(self, game: Game) -> float: ...
    @abstractmethod
    def fit(self, games: pd.DataFrame) -> None: ...
    @abstractmethod
    def cross_validate(self, games: pd.DataFrame) -> dict[str, float]: ...
    # Simple models forced to implement methods they don't need!

Testing impact: Less mocking needed (small interfaces)
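The structural-typing claim can be checked at runtime with `@runtime_checkable`. This is a minimal self-contained sketch: `Matchup` is simplified to a plain string, and the class names are illustrative.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ProbabilityProvider(Protocol):
    def predict_proba(self, matchup: str) -> float: ...


class EloProvider:  # note: no inheritance from ProbabilityProvider
    def predict_proba(self, matchup: str) -> float:
        return 0.5  # placeholder probability


# Duck typing: EloProvider qualifies purely by having the right method.
assert isinstance(EloProvider(), ProbabilityProvider)
```

(A runtime_checkable isinstance check verifies method presence only, not signatures; mypy strict verifies the full signature statically.)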


D - Dependency Inversion Principle (DIP)

“Depend on abstractions, not concretions.”

What it means: High-level code shouldn’t depend on low-level details.

Already enforced by: Architecture rule “Type sharing: use Pydantic models or TypedDicts” (Section 9)

# GOOD: Depends on abstraction (Protocol)
class Predictable(Protocol):
    def predict(self, game: Game) -> float: ...

class Simulator:
    def __init__(self, model: Predictable) -> None:  # Any model works
        self.model = model

    def simulate_tournament(self, games: list[Game]) -> pd.DataFrame:
        for game in games:
            pred = self.model.predict(game)  # Works with any Predictable
            ...

# BAD: Depends on concrete implementation
class Simulator:
    def __init__(self, elo_model: EloModel):  # Locked to EloModel
        self.model = elo_model

    def simulate_tournament(self, games: list[Game]) -> pd.DataFrame:
        # Can only use EloModel, not flexible
        pass

Testing impact: Easy to inject test doubles
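That injection benefit can be demonstrated with a trivial stub (all names here are illustrative; `Game` is simplified to a string):

```python
from typing import Protocol


class Predictable(Protocol):
    def predict(self, game: str) -> float: ...


class Simulator:
    def __init__(self, model: Predictable) -> None:
        self.model = model  # depends only on the Protocol

    def simulate(self, games: list[str]) -> list[float]:
        return [self.model.predict(g) for g in games]


class StubModel:
    """Test double: satisfies Predictable structurally, no inheritance."""

    def predict(self, game: str) -> float:
        return 0.75
```

Simulator(StubModel()).simulate(["a", "b"]) returns [0.75, 0.75]; no real model, file IO, or mocking library is required.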


SOLID Review Checklist

During code review, verify:

  • [ ] SRP: Classes/functions have single, clear responsibility (covered by PEP 20 complexity)

  • [ ] OCP: New features added via extension (inheritance, composition), not modification

  • [ ] LSP: Subtypes honor parent contracts (property tests verify this)

  • [ ] ISP: Interfaces are small and focused (ABCs primary, Protocols for structural typing)

  • [ ] DIP: Depends on abstractions (Protocols, TypedDicts), not concrete classes

Summary: What’s Already Covered

| SOLID Principle | Already Enforced By |
| --- | --- |
| SRP | PEP 20: “Simple is better than complex” (complexity ≤ 10) |
| OCP | Manual review (can’t automate) |
| LSP | Property tests: “probabilities in [0, 1]” invariants |
| ISP | MyPy strict mode: ABCs primary (Model, Repository, Connector), Protocols complement (ProbabilityProvider, ScoringRule) |
| DIP | Architecture: “Type sharing via Pydantic/TypedDicts” |

Result: SOLID compliance is mostly automated or already covered by existing standards. Code review adds a final check for OCP and overall SOLID adherence.


11. Quality Gate Architecture

This project has two complementary quality gate systems. Pre-commit is canonical; nox is the local convenience runner.

Pre-commit (Canonical)

Pre-commit hooks (.pre-commit-config.yaml) run automatically on every commit and are enforced in CI. They are the single source of truth for code quality enforcement.

| Hook | Stage | What It Does |
| --- | --- | --- |
| ruff-lint | commit | Ruff linting with --fix (auto-corrects violations) |
| ruff-format | commit | Ruff formatting (auto-formats code) |
| mypy-strict | commit | Mypy strict type checking (src/, tests/, noxfile.py, sync.py) |
| pytest-smoke | commit | Pytest smoke tests (-m smoke) |
| commitizen | commit-msg | Conventional commit message enforcement |
| actionlint | commit | GitHub Actions workflow linting |

Pre-commit also includes housekeeping hooks: end-of-file-fixer, trailing-whitespace, check-yaml, check-toml, detect-private-key, no-commit-to-branch, check-merge-conflict, debug-statements.

Nox (Local Convenience Runner)

Nox (noxfile.py) provides a single command (nox) that runs the full quality pipeline locally. It is useful for manual full-suite runs before pushing but is NOT used in CI.

| Session | What It Does |
| --- | --- |
| lint | ruff check . --fix + ruff format --check . |
| typecheck | mypy --strict on src/ncaa_eval, tests, noxfile.py, sync.py |
| tests | Full pytest suite |
| docs | sphinx-apidoc + sphinx-build |

Known Divergences (Intentional)

| Area | Pre-commit | Nox | Rationale |
| --- | --- | --- | --- |
| ruff format | Auto-fix (rewrites files) | Check-only (--check) | Pre-commit should fix on commit; nox should only report for manual inspection |
| CI usage | Runs in CI (.github/workflows/) | Not used in CI | Pre-commit + standalone pytest covers lint + typecheck + full tests in CI |

CI Pipeline

Two CI workflows run in GitHub Actions:

.github/workflows/python-check.yaml (runs on every pull request):

  1. Pre-commit — pre-commit run --all-files (all hooks: ruff, mypy, actionlint, etc.)

  2. Full test suite — pytest --cov=src/ncaa_eval with coverage reporting

.github/workflows/main-updated.yaml (runs on every push to main):

  1. Commitizen — version bump and changelog (skipped for bump: commits)

  2. Sphinx docs — sphinx-apidoc + sphinx-build → GitHub Pages deployment

Quality gates are enforced twice: by the developer’s local pre-commit hooks on every commit/push, and again by python-check.yaml on every pull request before merge.


References

  • pyproject.toml — Single source of truth for all tool configurations

  • specs/05-architecture-fullstack.md Section 12 — Coding standards

  • specs/05-architecture-fullstack.md Section 10 — Development workflow

  • specs/03-prd.md Section 4 — Technical assumptions & constraints