NCAA Eval Style Guide

This guide documents the coding standards and conventions for the ncaa_eval project. All rules here reflect decisions already configured in pyproject.toml. Tooling enforcement is handled by separate stories (1.4-1.6); this document captures the what and why.


1. Docstring Convention (Google Style)

Choice: Google-style docstrings. Configured in: [tool.ruff.lint.pydocstyle] convention = "google"

Rationale

  • Google style reads naturally in source code and renders well in Sphinx via napoleon.

  • With PEP 484 type annotations present everywhere (mypy --strict), repeating types inside docstrings is redundant. Google style keeps the prose concise.

Rules

  1. Do not duplicate types in Args: / Returns: sections. Type information lives in annotations only.

  2. Include Args:, Returns: (unless None), and Raises: (if applicable).

  3. Add an Examples: section for complex public functions.

  4. Docstrings are encouraged but not enforced on every entity. When pydocstyle (D) rules are enabled (Story 1.4), D1 (missing docstring) and D415 (first-line punctuation) will be suppressed via the Ruff ignore list. Note: D rules are not currently active — the convention = "google" setting and ignore entries in pyproject.toml are preparatory config that takes effect once D is added to extend-select in Story 1.4. At minimum, every public module and every public class should have a docstring.

  5. Detailed description required when a function performs 3 or more operations. The description paragraph (after the summary line) must explain how the function implements its purpose — not just restate the summary. This is enforced during code review (PR checklist).

Example

from __future__ import annotations

def rolling_efficiency(
    scores: pd.Series,
    window: int,
    min_periods: int = 1,
) -> pd.Series:
    """Compute rolling offensive efficiency over a sliding window.

    Applies a simple rolling mean to per-game point totals over the
    sliding window.

    Args:
        scores: Raw per-game point totals.
        window: Number of games in the rolling window.
        min_periods: Minimum observations required for a valid result.

    Returns:
        Rolling efficiency values aligned to the input index.

    Raises:
        ValueError: If *window* is less than 1.

    Examples:
        >>> rolling_efficiency(pd.Series([70, 80, 90]), window=2)
        0    70.0
        1    75.0
        2    85.0
        dtype: float64
    """

2. Naming Conventions

| Entity | Convention | Example |
| --- | --- | --- |
| Modules / packages | snake_case | feature_pipeline.py, ncaa_eval/ |
| Classes | PascalCase | EloModel, TeamStats |
| Functions / methods | snake_case | compute_margin, get_team_id |
| Variables | snake_case | home_score, season_df |
| Constants | UPPER_SNAKE_CASE | DEFAULT_K_FACTOR, MIN_GAMES |
| Type aliases | PascalCase (use type statement) | type TeamId = int |
| Private members | Single leading underscore | _cache, _validate_input() |
| Test files | test_<module>.py | test_elo_model.py |
| Test functions | test_<behaviour> | test_margin_positive_when_home_wins |

Additional Rules

  • No abbreviations in public APIs unless universally understood (id, df, url).

  • Boolean names should read as predicates: is_valid, has_played, should_normalize.

  • DataFrame variables should indicate their grain: game_df, season_stats_df, team_features_df.
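
A short sketch showing several of these conventions together (the names here are hypothetical, chosen only to illustrate the rules):

```python
from __future__ import annotations

DEFAULT_K_FACTOR = 32  # constant: UPPER_SNAKE_CASE


class EloModel:  # class: PascalCase
    """Minimal illustration of naming conventions (not the real model)."""

    def __init__(self) -> None:
        self._ratings: dict[int, float] = {}  # private member: leading underscore

    def get_rating(self, team_id: int) -> float:  # method: snake_case
        return self._ratings.get(team_id, 1500.0)


def is_valid_score(home_score: int) -> bool:  # boolean reads as a predicate
    return home_score >= 0
```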


3. Import Ordering

Configured in: [tool.ruff.lint.isort]

Imports are ordered in three groups separated by blank lines:

  1. Standard library (os, pathlib, typing, …)

  2. Third-party (pandas, numpy, xgboost, …)

  3. Local / first-party (ncaa_eval, tests)

Mandatory Rules

| Rule | Detail | Config Reference |
| --- | --- | --- |
| Future annotations | Every file must start with from __future__ import annotations | required-imports |
| Combine as-imports | from foo import bar, baz as B on one line | combine-as-imports = true |
| First-party detection | tests is known first-party | known-first-party = ["tests"] |

Example

from __future__ import annotations

import logging
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

from ncaa_eval.model import Model
from ncaa_eval.transform import compute_rolling_stats

Active Ruff Rules

Beyond isort, pyproject.toml extends the default Ruff rules (E, F) with:

| Rule | Category | What It Enforces |
| --- | --- | --- |
| I | isort | Import ordering (see above) |
| UP | pyupgrade | Modern Python syntax — e.g., list[int] not List[int], X \| None not Optional[X] |
| PT | flake8-pytest-style | Pytest best practices — e.g., @pytest.fixture conventions, parametrize style |
| TID25 | tidy-imports | Import hygiene — bans relative imports from parent packages |
| C90 | mccabe | McCabe cyclomatic complexity ≤ 10 (see Section 6.1) |
| PLR0911 | pylint-refactor | Max 6 return statements per function |
| PLR0912 | pylint-refactor | Max 12 branches per function |
| PLR0913 | pylint-refactor | Max 5 function arguments |

Suppressed Rules

| Rule | Why Suppressed |
| --- | --- |
| E501 | Line length is enforced by the Ruff formatter (110 chars), not the linter. Suppressing E501 avoids double-reporting. |
| D1 | Missing-docstring warnings. Not yet active (see Section 1 note). |
| D415 | First-line punctuation. Not yet active (see Section 1 note). |
| PLR2004 | Magic value comparison — too aggressive for data science code with domain-specific constants (e.g., tournament field size 64, coin-flip probability 0.5). See PEP 20 “Practicality Beats Purity” in Section 6. |

Lint Suppression Policy

When # noqa or # type: ignore is unavoidable, follow these rules:

  1. Always include the specific error code. Never use bare # noqa or bare # type: ignore — always specify which rule is being suppressed.

    # GOOD: Specific suppression with rationale
    import joblib  # type: ignore[import-untyped] — no type stubs available
    
    # BAD: Bare suppression — which rule? why?
    import joblib  # type: ignore
    
  2. Prefer refactoring over suppression. The escalation path:

    • First: Refactor to eliminate the violation (extract function, split class, etc.)

    • Second: If refactoring is impractical, suppress with specific code and comment

    • Never: Suppress because “it’s faster than fixing”

  3. Complexity-class codes require PO approval. Inline # noqa for the following codes is allowed only with explicit PO approval:

    • PLR0911 — Too many return statements (max 6)

    • PLR0912 — Too many branches (max 12)

    • PLR0913 — Too many arguments (max 5)

    • C901 — McCabe cyclomatic complexity (max 10)

    When a suppression is approved, the inline comment must include a rationale:

    # Approved — rationale
    def update_game(  # noqa: PLR0913 — game data has inherent dimensionality
    # Pending refactoring — tracked to a story
    def run_training(  # noqa: PLR0913, C901, PLR0912 — REFACTOR Story 8.1
    
  4. Acceptable suppression scenarios:

    • # type: ignore[import-untyped] — Third-party libraries without type stubs

    • # type: ignore[no-untyped-def] — Callback signatures imposed by frameworks (e.g., Streamlit event handlers)

    • # noqa: BLE001 — Broad exception handling with documented rationale

    • # noqa: PLR0913 (etc.) — With PO approval and rationale (see rule 3)

  5. Unacceptable suppression scenarios:

    • Any # noqa for PLR0911, PLR0912, PLR0913, or C901 without PO approval

    • # type: ignore on return types — Fix the return type annotation instead

    • Any suppression without the specific error code


4. Type Annotation Standards

Configured in: [tool.mypy] strict = true

What --strict Means for Developers

Every function signature, variable, and return type must be annotated. The strict flag enables all of the following (as of mypy 1.x — the exact set may change with future versions; run mypy --help to see the current list):

  • --disallow-untyped-defs — every function needs annotations

  • --disallow-any-generics — no bare list, dict; use list[int], dict[str, float]

  • --warn-return-any — functions must not silently return Any

  • --no-implicit-reexport — explicit __all__ required for re-exports

  • --strict-bytes — distinguishes bytes from str

Practical Guidelines

  1. Use from __future__ import annotations (already enforced) so all annotations are strings and forward references work.

  2. Use modern syntax: list[int] not List[int], X | None not Optional[X].

  3. Use the type statement for aliases: type TeamId = int.

  4. For third-party libraries with incomplete stubs (numpy, pandas, xgboost), the config uses follow_imports = "silent" to suppress errors from untyped dependencies. Add # type: ignore[<code>] only when truly necessary and always include the specific error code.

  5. The py.typed marker at src/ncaa_eval/py.typed signals PEP 561 compliance.
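
A minimal sketch of guidelines 1 and 2 in practice (the functions are hypothetical; under --strict every parameter and return value carries an annotation):

```python
from __future__ import annotations


def expected_score(rating: float, opponent_rating: float) -> float:
    """Every parameter and the return value are annotated (no implicit Any)."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))


def best_team(ratings: dict[int, float]) -> int | None:
    """Modern syntax: dict[int, float] and int | None, not Dict/Optional."""
    if not ratings:
        return None
    return max(ratings, key=ratings.__getitem__)
```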

Pydantic Integration

Configured in: [tool.mypy] plugins and [tool.pydantic-mypy]

Pydantic is the project’s primary data modeling library (shared types between Logic and UI layers). The mypy plugin provides enhanced type checking for Pydantic models:

[tool.mypy]
plugins = ["pydantic.mypy"]

[tool.pydantic-mypy]
init_typed = true                    # __init__ params get field types
init_forbid_extra = true             # error on unknown fields in __init__
warn_required_dynamic_aliases = true # warn if alias without default

What this means for developers:

  • init_typed = true — mypy understands Model(field=value) constructor calls and validates argument types against field definitions.

  • init_forbid_extra = true — passing an unknown field name to a model constructor is a type error, not a silent runtime discard.

  • warn_required_dynamic_aliases = true — alerts when a field with alias= has no default value, which can cause confusing ValidationError messages.
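
For illustration, a hypothetical model showing what the plugin checks (the field names here are invented, not project types):

```python
from __future__ import annotations

from pydantic import BaseModel


class TeamRecord(BaseModel):
    """Hypothetical shared type between Logic and UI layers."""

    team_id: int
    name: str
    rating: float = 1500.0


# init_typed: mypy validates these keyword arguments against the field types.
record = TeamRecord(team_id=1, name="Duke")

# init_forbid_extra: TeamRecord(team_id=1, name="Duke", seed=3) would be a
# mypy error — the unknown field is caught at type-check time, not silently
# discarded at runtime.
```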


5. Vectorization First

Reject PRs that use for loops over Pandas DataFrames for metric calculations.

— Architecture Section 12

This is a hard rule. Scientific computation in this project must use vectorized operations.

Why

  • Vectorized pandas/numpy operations are 10-100x faster than Python loops.

  • They are easier to reason about for correctness.

  • They compose cleanly with .pipe() chains.

Correct Pattern

# Vectorized — computes margin for every game in one shot
game_df["margin"] = game_df["home_score"] - game_df["away_score"]

# Vectorized rolling mean
game_df["rolling_ppg"] = game_df.groupby("team_id")["points"].transform(
    lambda s: s.rolling(window=5, min_periods=1).mean()
)

Incorrect Pattern

# WRONG — for loop over DataFrame rows
margins = []
for _, row in game_df.iterrows():
    margins.append(row["home_score"] - row["away_score"])
game_df["margin"] = margins

Exceptions

Explicit for loops are acceptable only when:

  1. Operating on a small, fixed collection (e.g., iterating over 5 model configs, not 10,000 game rows).

  2. Performing graph traversal (e.g., NetworkX operations that have no vectorized equivalent).

  3. The loop body involves side effects that cannot be vectorized (e.g., writing files per team).

  4. Ingest-layer one-time-per-sync operations. src/ncaa_eval/ingest/connectors/kaggle.py uses iterrows() in load_day_zeros, fetch_teams, and _parse_games_csv for per-row data transformation: constructing Pydantic models (Game, Team) in fetch_teams and _parse_games_csv, and parsing dates with per-row error handling in load_day_zeros. These run exactly once per sync operation and are not in any hot path. Pandera schema validation (Story 9.14) guards the DataFrame before iteration begins. (PO Decision C — Audit item 2.4, 2026-03-11)

In these cases, add a brief comment explaining why the loop is necessary.
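
For example, a sketch of exception 3 (a hypothetical helper, not project code):

```python
from __future__ import annotations

from pathlib import Path

import pandas as pd


def export_team_reports(team_stats_df: pd.DataFrame, out_dir: Path) -> None:
    """Write one CSV per team (hypothetical side-effect helper)."""
    # Explicit loop is acceptable here: the body writes one file per team,
    # a side effect that cannot be vectorized (Section 5, exception 3).
    for team_id, team_df in team_stats_df.groupby("team_id"):
        team_df.to_csv(out_dir / f"team_{team_id}.csv", index=False)
```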


6. Design Philosophy (PEP 20 - The Zen of Python)

“Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex.”

— PEP 20

While PEP 20 can’t be fully automated, we enforce its core principles through code review and tooling where possible.

Enforced Principles

Simple is Better Than Complex

  • Tooling: Ruff complexity checks (see Section 6.1)

  • Review: Max cyclomatic complexity = 10, max function length = 50 lines

  • Guideline: If a function is hard to name, it’s doing too much — split it

# GOOD: Simple, single-purpose function
def calculate_margin(home_score: int, away_score: int) -> int:
    """Calculate point margin (positive = home win)."""
    return home_score - away_score

# BAD: Too complex, doing multiple things
def process_game(game_data: dict) -> dict:
    """Process game... but what does this actually do?"""
    # 80 lines of nested logic with unclear responsibilities
    ...

Explicit is Better Than Implicit

  • Tooling: Mypy strict mode enforces explicit types

  • Review: No magic numbers, no implicit state changes, clear function names

  • Guideline: Code should read like prose — the reader shouldn’t have to guess

# GOOD: Explicit parameters and behavior
def calculate_elo_change(
    rating: int,
    opponent_rating: int,
    won: bool,
    k_factor: int = 32,
) -> int:
    """Calculate Elo rating change after a game."""
    expected = 1 / (1 + 10 ** ((opponent_rating - rating) / 400))
    actual = 1 if won else 0
    return int(k_factor * (actual - expected))

# BAD: Implicit behavior, magic numbers
def adjust_rating(r: int, o: int, w: bool) -> int:
    return int(32 * (w - 1 / (1 + 10 ** ((o - r) / 400))))

Readability Counts

  • Tooling: Ruff formatting (110 char lines), naming conventions

  • Review: Variable names reflect domain concepts, not abbreviations

  • Guideline: Write for humans first, computers second

# GOOD: Clear domain language
team_offensive_efficiency = total_points / total_possessions

# BAD: Abbreviations require mental translation
off_eff = pts / poss

Flat is Better Than Nested

  • Tooling: Ruff detects excessive nesting

  • Review: Max nesting depth = 3

  • Guideline: Use early returns to reduce nesting

# GOOD: Flat structure with early returns
def validate_game(game: Game) -> None:
    """Validate game data."""
    if game.home_score < 0:
        raise ValueError("Home score cannot be negative")
    if game.away_score < 0:
        raise ValueError("Away score cannot be negative")
    if game.date > datetime.now():
        raise ValueError("Game cannot be in the future")

# BAD: Nested structure
def validate_game(game: Game) -> None:
    if game.home_score >= 0:
        if game.away_score >= 0:
            if game.date <= datetime.now():
                return
            else:
                raise ValueError("Game cannot be in the future")
        else:
            raise ValueError("Away score cannot be negative")
    else:
        raise ValueError("Home score cannot be negative")

There Should Be One Obvious Way to Do It

  • Review: Follow established project patterns (e.g., vectorization for calculations)

  • Guideline: Check existing code before inventing new approaches

# GOOD: Follows project pattern (vectorized operations)
game_df["margin"] = game_df["home_score"] - game_df["away_score"]

# BAD: Invents custom approach (loops)
for idx in range(len(game_df)):
    game_df.loc[idx, "margin"] = game_df.loc[idx, "home_score"] - game_df.loc[idx, "away_score"]

Complex is Better Than Complicated

  • Review: Accept genuine complexity; reject unnecessary complication

  • Guideline: Some systems are inherently complex (simulation engines, Elo rating chains, cross-validation orchestrators). The goal is managed complexity — complex internals behind simple interfaces — not complication from tangled logic.

# GOOD: Complex but not complicated — clear interface, complex internals
class MonteCarloSimulator:
    """Simulate tournament outcomes via Monte Carlo sampling.

    The algorithm is inherently complex (bracket traversal, probability
    propagation, seeded randomness), but the interface is simple.
    """
    def simulate(self, bracket: Bracket, n_simulations: int = 10_000) -> SimulationResult:
        ...

# BAD: Complicated — unnecessary indirection, unclear purpose
def run_sim(data, opts, flags, ctx, mode="default", **kwargs):
    if mode == "default":
        ...  # 200 lines of nested conditionals

Sparse is Better Than Dense

  • Tooling: 110-character line length limit, Ruff formatting

  • Review: One operation per line, generous whitespace, no clever one-liners

  • Guideline: When a line does too much, split it. When a function packs too many operations together, extract named intermediates.

# GOOD: Sparse — each step is clear
expected_win_rate = 1 / (1 + 10 ** ((opponent_rating - team_rating) / 400))
actual_outcome = 1.0 if won else 0.0
rating_change = k_factor * (actual_outcome - expected_win_rate)

# BAD: Dense — everything crammed together
return int(32*(int(w)-1/(1+10**((o-r)/400))))

Special Cases Aren’t Special Enough / Practicality Beats Purity

  • Review: Allow pragmatic exceptions when properly documented

  • Guideline: The vectorization-first rule (Section 5) has legitimate exceptions: graph traversal, small fixed collections, side-effect loops. Data science code has domain-specific constants (0.5 for coin-flip probability, 64 for tournament teams) that PLR2004 would flag — we suppress PLR2004 because practicality beats purity for these cases. The key: every exception must be documented and intentional, never silent.

# GOOD: Pragmatic exception with justification
# PLR2004 suppressed project-wide — domain constants like tournament size
# are clearer inline than as named constants
def build_bracket(teams: list[Team]) -> Bracket:
    if len(teams) != 64:  # NCAA tournament field size
        raise ValueError(f"Expected 64 teams, got {len(teams)}")
    ...

# BAD: Bare magic number with no context
def process(data):
    if len(data) > 42:  # What is 42?
        data = data[:42]

Errors Should Never Pass Silently / Unless Explicitly Silenced

  • Review: Every exception handler must log or re-raise; no bare except: pass

  • Guideline: This is the project’s most-violated PEP 20 principle (see Pattern D from codebase audit). The convention:

    1. Default: Log at WARNING level with structured context, then re-raise or return a sentinel that callers check.

    2. Explicitly silenced: Log at DEBUG level with a code comment explaining why silence is acceptable. Use # noqa or # type: ignore with specific codes.

    3. Never: Bare except Exception: pass or except Exception that substitutes a value without logging.

# GOOD: Error surfaced with context
try:
    response = fetch_team_schedule(team_id)
except RequestError as exc:
    logger.warning("Failed to fetch schedule for team %d: %s", team_id, exc)
    raise

# GOOD: Explicitly silenced with rationale
try:
    optional_enrichment = fetch_extra_stats(team_id)
except RequestError:
    # Enrichment is non-critical; base stats are sufficient for predictions.
    logger.debug("Optional enrichment unavailable for team %d", team_id)
    optional_enrichment = None

# BAD: Silent swallowing — Pattern D violation
try:
    data = fetch_team_schedule(team_id)
except Exception:
    data = pd.DataFrame()  # Caller has no idea data is missing

Refuse the Temptation to Guess

  • Tooling: mypy --strict — forces explicit types everywhere

  • Review: No implicit type coercions, no unvalidated assumptions

  • Guideline: This aphorism is the philosophical foundation for mypy --strict. When types are explicit, the compiler catches wrong assumptions before runtime. When types are Any or missing, the code guesses — and guesses wrong at the worst possible time.

# GOOD: Explicit types — no guessing
def get_team_rating(team_id: int, ratings: dict[int, float]) -> float | None:
    return ratings.get(team_id)

# BAD: Implicit types — code guesses about structure
def get_team_rating(team_id, ratings):
    return ratings[team_id]  # KeyError? ratings is a list? Who knows.

If the Implementation Is Hard to Explain, It’s a Bad Idea / If It’s Easy to Explain, It May Be a Good Idea

  • Tooling: McCabe complexity ≤ 10, max 50 lines per function

  • Review: If you can’t describe a function in one sentence, it needs splitting

  • Guideline: This reinforces the complexity limits. If a code reviewer asks “what does this do?” and the answer requires a paragraph, the function is doing too much. The run_training() function in cli/train.py (70+ statements, 3 noqa suppressions) is the canonical example of code that is hard to explain.

# GOOD: Easy to explain — "calculates Brier score for probability predictions"
def brier_score(probabilities: np.ndarray, outcomes: np.ndarray) -> float:
    return float(np.mean((probabilities - outcomes) ** 2))

# BAD: Hard to explain — "it processes... things... with conditions..."
def process_and_validate_then_transform_with_fallback(data, config, mode, flags):
    ...  # 80 lines that require 5 minutes to explain

Code Review Checklist for PEP 20

During code review, verify:

  • [ ] Simplicity: Functions have single, clear responsibilities (McCabe complexity ≤ 10)

  • [ ] Single purpose: Each function does exactly one thing; max 50 lines, nesting ≤ 3

  • [ ] Explicitness: No magic numbers, parameters have clear names, behavior is obvious

  • [ ] Readability: Domain concepts use full words, not abbreviations

  • [ ] Flatness: Nesting depth ≤ 3, early returns preferred

  • [ ] Consistency: Follows existing project patterns (vectorization, type sharing, etc.)

  • [ ] Complexity vs complication: Inherent complexity is managed behind simple interfaces

  • [ ] Sparseness: One operation per line, no clever one-liners, named intermediates

  • [ ] Pragmatic exceptions: Deviations from rules are documented and intentional

  • [ ] Error handling: No silent exception swallowing; handlers log or re-raise

  • [ ] Explicit types: No Any, no missing annotations, no implicit coercions

  • [ ] Explainability: Every function describable in one sentence

  • [ ] No complexity-code overrides: No # noqa for PLR0911/PLR0912/PLR0913/C901 without PO approval; each includes a rationale

  • [ ] Docstring detail: Functions with 3+ operations have detailed description explaining how


6.1 Complexity Gates (Ruff Configuration)

Configured in: pyproject.toml [tool.ruff.lint.mccabe]

| Metric | Limit | Enforced By |
| --- | --- | --- |
| McCabe Cyclomatic Complexity | 10 | Ruff C901 (pre-commit) |
| Max Function Length | 50 lines | Manual review |
| Max Nesting Depth | 3 | Manual review |
| Max Arguments | 5 | Ruff PLR0913 (pre-commit) |

See pyproject.toml for exact configuration.
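
The tool-enforced limits above might be expressed like this (a sketch using Ruff's documented setting names; pyproject.toml remains the authoritative source):

```toml
[tool.ruff.lint.mccabe]
max-complexity = 10   # C901 fires above this

[tool.ruff.lint.pylint]
max-args = 5          # PLR0913 fires above this
```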


6.2 Pure Functions vs Side Effects

Pure functions are easier to test, faster to run, and easier to reason about.

Design guideline: Keep business logic pure, push side effects to edges.

Pure Functions (Preferred)

Definition: Same input always produces same output, no side effects.

Characteristics:

  • Deterministic (predictable)

  • No external dependencies (no I/O, no database, no network, no time/randomness)

  • No state mutation

  • Easy to test (just input → output, no mocking)

  • Perfect for property-based testing (Hypothesis)

  • Fast (no I/O)

# PURE: Always deterministic, easy to test
def calculate_win_probability(rating_diff: int) -> float:
    """Calculate win probability from rating difference."""
    return 1 / (1 + 10 ** (-rating_diff / 400))

# PURE: Data transformation, no side effects
def normalize_team_names(names: pd.Series) -> pd.Series:
    """Normalize team names to standard format (vectorized)."""
    return names.str.strip().str.title().str.replace("St.", "Saint")

# PURE: Mathematical calculation (vectorized)
def calculate_margins(home_scores: np.ndarray, away_scores: np.ndarray) -> np.ndarray:
    """Calculate point margins (vectorized, pure)."""
    return home_scores - away_scores

Testing pure functions:

# Unit test: Simple input → output
def test_win_probability_equal_ratings():
    """Verify win probability is 50% for equal ratings."""
    assert calculate_win_probability(rating_diff=0) == 0.5

# Property test: Perfect for pure functions
@pytest.mark.property
@given(rating_diff=st.integers(-1000, 1000))
def test_win_probability_always_bounded(rating_diff):
    """Verify win probability always in [0, 1] (invariant)."""
    prob = calculate_win_probability(rating_diff)
    assert 0 <= prob <= 1

Side-Effect Functions (Push to Edges)

Definition: Functions that interact with the outside world or modify state.

Characteristics:

  • Non-deterministic (may vary based on external state)

  • External dependencies (files, database, network, time, randomness)

  • Modifies state (mutates objects, writes files, updates database)

  • Harder to test (requires mocks, stubs, fixtures)

  • Requires integration tests

# SIDE-EFFECT: Reads from file system
def load_games(path: Path) -> pd.DataFrame:
    """Load games from CSV file (I/O operation)."""
    return pd.read_csv(path)

# SIDE-EFFECT: Depends on current time (non-deterministic)
def is_game_started(game_start: datetime) -> bool:
    """Check if game has started."""
    return datetime.now() > game_start  # Changes over time

# SIDE-EFFECT: Mutates external state
def update_team_rating(team_id: int, new_rating: int) -> None:
    """Update team rating in database."""
    db.execute("UPDATE teams SET rating = ? WHERE id = ?", new_rating, team_id)

Testing side-effect functions:

# Integration test: Requires fixtures
@pytest.mark.integration
def test_load_games_returns_valid_dataframe(temp_data_dir):
    """Verify games can be loaded from CSV."""
    # Setup: Create test file
    test_file = temp_data_dir / "games.csv"
    test_file.write_text("game_id,home_team,away_team\n1,Duke,UNC\n")

    # Execute
    games = load_games(test_file)

    # Assert
    assert len(games) == 1
    assert "game_id" in games.columns

Good Separation: Pure Core + Side-Effect Shell

Pattern: Keep calculations pure, orchestrate I/O at edges.

# PURE: Core business logic (easy to test, vectorized)
def calculate_win_probabilities(
    home_ratings: np.ndarray,
    away_ratings: np.ndarray,
) -> np.ndarray:
    """Calculate win probabilities (pure, vectorized)."""
    rating_diff = home_ratings - away_ratings
    return 1 / (1 + 10 ** (-rating_diff / 400))

# SIDE-EFFECT: Orchestration at the edge
def simulate_tournament(games_path: Path, ratings_path: Path) -> pd.DataFrame:
    """Simulate tournament (orchestrates pure logic + I/O)."""
    # Side effects: Load data
    games = pd.read_csv(games_path)
    ratings = pd.read_csv(ratings_path, index_col="team")

    # Side effects: Data prep
    games = games.merge(ratings, left_on="home_team", right_index=True)
    games = games.merge(ratings, left_on="away_team", right_index=True,
                       suffixes=("_home", "_away"))

    # PURE: Core calculation (vectorized, easy to test separately)
    games["win_prob"] = calculate_win_probabilities(
        games["rating_home"].values,
        games["rating_away"].values,
    )

    return games

Why this is better:

  1. Pure function (calculate_win_probabilities):

    • Fast unit tests (no I/O)

    • Property-based tests (invariants)

    • Reusable in different contexts

    • Vectorized (meets NFR1)

  2. Side-effect function (simulate_tournament):

    • Thin orchestration layer

    • Easy to see where I/O happens

    • Core logic testable independently


Bad: Mixing Pure Logic with Side Effects

# BAD: Pure logic buried inside side effects
def simulate_tournament_bad() -> pd.DataFrame:
    """Simulate tournament (mixed design - hard to test)."""
    games = pd.read_csv("data/games.csv")  # Side effect

    results = []
    for _, game in games.iterrows():  # Side effect + non-vectorized!
        # Side effects: Database calls
        home_rating = db.query("SELECT rating FROM teams WHERE id = ?", game.home_team)
        away_rating = db.query("SELECT rating FROM teams WHERE id = ?", game.away_team)

        # Pure logic buried inside (can't test without database!)
        rating_diff = home_rating - away_rating
        prob = 1 / (1 + 10 ** (-rating_diff / 400))

        results.append(prob)

    return pd.DataFrame(results)

Problems:

  • Can’t test calculation logic without database

  • Can’t use property-based tests

  • Slow (I/O in loop)

  • Violates vectorization requirement

  • Hard to debug (mixed concerns)


Testing Strategy by Function Type

| Function Type | Test Type | Characteristics | Example |
| --- | --- | --- | --- |
| Pure | Unit test, Property-based | Fast, no mocking, deterministic | calculate_win_probability() |
| Side-effect | Integration test | Slower, requires fixtures/mocks | load_games(), update_team_rating() |
| Mixed | ❌ AVOID | Hard to test, refactor! | simulate_tournament_bad() |


Code Review Checklist for Pure vs Side-Effect

During code review, verify:

  • [ ] Pure logic separated: Business calculations are pure functions

  • [ ] Side effects at edges: I/O, database, network calls in orchestration layer

  • [ ] No mixing: Pure functions don’t contain I/O operations

  • [ ] Vectorization: Pure functions use numpy/pandas operations (not loops)

  • [ ] Property tests: Pure functions have property-based tests for invariants

  • [ ] Integration tests: Side-effect functions have integration tests with fixtures


7. PR Checklist (Summary)

Every pull request must pass the following gates. The actual PR template is at .github/pull_request_template.md. For the philosophy behind this two-tier approach, see docs/TESTING_STRATEGY.md (Section: Quality Gates and Execution Tiers).

| Gate | Tool | Timing |
| --- | --- | --- |
| Lint pass | Ruff | Pre-commit (fast) |
| Type-check pass | Mypy (--strict) | Pre-commit (fast) |
| Test pass | Pytest | PR / CI |
| Docstring coverage | Manual review | PR review |
| Docstring detail | Manual review | PR review |
| No vectorization violations | Manual review | PR review |
| Conventional commit messages | Commitizen | Pre-commit |
| Function complexity | Manual review | PR review |
| No complexity-code overrides | Manual review | PR review |
| PEP 20 compliance | Manual review | PR review |
| SOLID principles | Manual review | PR review |
| Pure function design | Manual review | PR review |

Review Criteria

  • Code follows naming conventions (Section 2).

  • Imports are ordered correctly (Section 3).

  • New public APIs have docstrings (Section 1).

  • Functions with 3+ operations include a detailed docstring description explaining how (not just what) — see Section 1.

  • No for loops over DataFrames for calculations (Section 5).

  • Type annotations are complete (Section 4).

  • Each function does one thing; manually verify single responsibility, max 50 lines, nesting ≤ 3.

  • No # noqa for PLR0911, PLR0912, PLR0913, or C901 without PO approval; each must include a rationale (Section 3, Lint Suppression Policy).

  • PEP 20 design principles respected (Section 6).

  • Pure functions used for business logic, side effects at edges (Section 6.2).

  • SOLID principles applied for testability (Section 10).

  • Data structures shared between Logic and UI use Pydantic models or TypedDicts.

  • Dashboard code never reads files directly — it calls ncaa_eval functions.


8. File & Module Organization

Project Layout

src/
  ncaa_eval/
    __init__.py          # Package root
    py.typed             # PEP 561 marker
    cli/                 # CLI entry points (main, train)
    ingest/              # Data source connectors
    transform/           # Feature engineering
    model/               # Model ABC and implementations (singular, not models/)
    evaluation/          # Metrics, CV, simulation
    utils/               # Shared helpers (logger.py)
tests/
  __init__.py
  conftest.py            # Shared fixtures
  unit/                  # Unit tests organized by domain
  integration/           # Integration tests
  fixtures/              # Test data fixtures
dashboard/               # Streamlit UI (imports ncaa_eval — no direct IO)
data/                    # Local data store (git-ignored)

Rules

  1. One concept per module. A module should do one thing. If it grows beyond ~300 lines, consider splitting.

  2. Mirror src/ in tests/. src/ncaa_eval/model/elo.py is tested by tests/unit/test_elo.py (or tests/integration/test_elo_integration.py).

  3. __init__.py re-exports. Public symbols should be importable from the submodule level (e.g., from ncaa_eval.model import Model, get_model). The root ncaa_eval package does not re-export anything. Concrete model classes self-register via @register_model and are accessed through the registry (get_model("elo")) or direct import (from ncaa_eval.model.elo import EloModel).

  4. No circular imports. If two modules need each other, extract shared types into a third module.

  5. Configuration lives in pyproject.toml. Do not create separate config files for tools that support pyproject.toml.

  6. Line length is 110. Not the default 88. Configured in [tool.ruff] line-length = 110.
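The self-registration pattern described in rule 3 can be sketched as follows. This is an illustrative minimal version, not the project's actual implementation; the `Model` interface is deliberately simplified here.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from collections.abc import Callable

_REGISTRY: dict[str, type[Model]] = {}


class Model(ABC):
    """Base class for all prediction models."""

    @abstractmethod
    def predict(self) -> float: ...


def register_model(name: str) -> Callable[[type[Model]], type[Model]]:
    """Class decorator that records a concrete Model under ``name``."""

    def decorator(cls: type[Model]) -> type[Model]:
        _REGISTRY[name] = cls
        return cls

    return decorator


def get_model(name: str) -> type[Model]:
    """Look up a registered model class by its registry name."""
    return _REGISTRY[name]


@register_model("elo")
class EloModel(Model):
    def predict(self) -> float:
        return 0.5  # placeholder probability
```

With this in place, get_model("elo") and a direct import of EloModel refer to the same class, so both access paths named in rule 3 stay consistent.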


9. Additional Architecture Rules

These rules come from the project architecture and apply across all code:

| Rule | Detail |
| --- | --- |
| Type sharing | All data structures shared between Logic and UI must use Pydantic models or TypedDicts. |
| No direct IO in UI | The Streamlit dashboard must call ncaa_eval library functions — never read files directly. |
| Commit messages | Use conventional commits format (feat:, fix:, docs:, etc.) enforced by Commitizen. |
| Python version | >=3.12,<4.0. Use modern syntax (match, type statement, X \| None). |


10. SOLID Principles for Testability

SOLID principles make code testable; violating them produces hard-to-test code.

These five object-oriented design principles improve maintainability and testability. While not automatically enforced, they’re checked during code review.

S - Single Responsibility Principle (SRP)

“A class should have only one reason to change.”

What it means: Each class/function does ONE thing well.

Already enforced by: PEP 20 complexity checks (Section 6)

# GOOD: Single responsibility
class GameLoader:
    """Loads games from CSV files."""
    def load(self, path: Path) -> pd.DataFrame:
        return pd.read_csv(path)

class GameValidator:
    """Validates game data."""
    def validate(self, games: pd.DataFrame) -> None:
        if games["home_score"].min() < 0:
            raise ValueError("Scores cannot be negative")

# BAD: Multiple responsibilities
class GameManager:
    """Does everything - loading, validating, processing, saving."""
    def do_everything(self, path: Path) -> None:
        # Too many reasons to change!
        pass

Testing impact: Easy to test (one thing = one test class)


O - Open/Closed Principle (OCP)

“Open for extension, closed for modification.”

What it means: Add new features without changing existing code.

# GOOD: Open for extension via inheritance
class RatingModel(ABC):
    @abstractmethod
    def predict(self, game: Game) -> float:
        pass

class EloModel(RatingModel):
    def predict(self, game: Game) -> float:
        return self._calculate_elo(game)

class GlickoModel(RatingModel):  # New model, no changes to existing code
    def predict(self, game: Game) -> float:
        return self._calculate_glicko(game)

# BAD: Must modify existing code for each new model
def predict(game: Game, model_type: str) -> float:
    if model_type == "elo":
        return calculate_elo(game)
    elif model_type == "glicko":  # Must modify this function
        return calculate_glicko(game)

Testing impact: Easy to mock/stub new implementations


L - Liskov Substitution Principle (LSP)

“Subtypes must be substitutable for their base types.”

What it means: If Dog inherits from Animal, you can use Dog anywhere you use Animal without breaking things.

Already enforced by: Property-based tests verify contracts (Section: Test Purpose Guide)

# GOOD: Subclass honors parent contract
class RatingModel(ABC):
    @abstractmethod
    def predict(self, game: Game) -> float:
        """Return probability in [0, 1]."""
        pass

class EloModel(RatingModel):
    def predict(self, game: Game) -> float:
        # rating / opp_rating come from the model's rating table (omitted);
        # the logistic form always returns a value in [0, 1] as promised
        return 1 / (1 + 10 ** ((opp_rating - rating) / 400))

# BAD: Subclass violates parent contract
class BrokenModel(RatingModel):
    def predict(self, game: Game) -> float:
        # Returns values > 1.0, breaking the contract!
        return rating_difference * 100

Testing impact: Property tests verify contracts hold across all implementations
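The contract check that such property tests automate can be sketched without dependencies (the real suite uses property-based testing; the helper names here are illustrative):

```python
from collections.abc import Callable, Iterable


def elo_predict(rating: float, opp_rating: float) -> float:
    """Logistic Elo win probability, mirroring the example above."""
    return 1 / (1 + 10 ** ((opp_rating - rating) / 400))


def honors_contract(
    predict: Callable[[float, float], float],
    rating_pairs: Iterable[tuple[float, float]],
) -> bool:
    """True if every prediction lands in [0, 1] — the LSP contract."""
    return all(0.0 <= predict(r, o) <= 1.0 for r, o in rating_pairs)
```

honors_contract(elo_predict, pairs) holds for any rating pairs, while a BrokenModel-style predict that returns scaled rating differences fails it immediately.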


I - Interface Segregation Principle (ISP)

“Many specific interfaces are better than one general-purpose interface.”

What it means: Don’t force classes to implement methods they don’t need.

Already enforced by: MyPy strict mode with ABCs and Protocols (Section 4)

This project uses ABCs as the primary pattern for interface segregation (Model ABC, Repository ABC, Connector ABC) and Protocols as a complement for structural typing where duck typing is preferred (e.g., ProbabilityProvider, ScoringRule in the simulation engine). Choose based on context:

  • ABCs: When implementations must be explicitly registered and share behavior (e.g., all models inherit from Model and get common infrastructure).

  • Protocols: When you need structural typing — any object with matching methods qualifies, without inheriting from anything (e.g., ProbabilityProvider can be satisfied by any class with a predict_proba method).

# PRIMARY: ABCs for explicit interface contracts
class Model(ABC):
    """Base class for all prediction models."""
    @abstractmethod
    def predict(self, game: Game) -> float: ...

class EloModel(Model):
    def predict(self, game: Game) -> float: ...

# COMPLEMENT: Protocols for structural typing (duck typing)
class ProbabilityProvider(Protocol):
    def predict_proba(self, matchup: Matchup) -> float: ...

# Any class with predict_proba() satisfies this — no inheritance needed
class EloProvider:
    def predict_proba(self, matchup: Matchup) -> float: ...

# BAD: Fat interface forces unused methods
class DoEverything(ABC):
    @abstractmethod
    def predict(self, game: Game) -> float: ...
    @abstractmethod
    def fit(self, games: pd.DataFrame) -> None: ...
    @abstractmethod
    def cross_validate(self, games: pd.DataFrame) -> dict[str, float]: ...
    # Simple models forced to implement methods they don't need!

Testing impact: Less mocking needed (small interfaces)
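The structural-typing claim can be checked at runtime with `@runtime_checkable`. This is a minimal self-contained sketch: `Matchup` is simplified to a plain string, and the class names are illustrative.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ProbabilityProvider(Protocol):
    def predict_proba(self, matchup: str) -> float: ...


class EloProvider:  # note: no inheritance from ProbabilityProvider
    def predict_proba(self, matchup: str) -> float:
        return 0.5  # placeholder probability


# Duck typing: EloProvider qualifies purely by having the right method.
assert isinstance(EloProvider(), ProbabilityProvider)
```

(A runtime_checkable isinstance check verifies method presence only, not signatures; mypy strict verifies the full signature statically.)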


D - Dependency Inversion Principle (DIP)

“Depend on abstractions, not concretions.”

What it means: High-level code shouldn’t depend on low-level details.

Already enforced by: Architecture rule “Type sharing: use Pydantic models or TypedDicts” (Section 9)

# GOOD: Depends on abstraction (Protocol)
class Predictable(Protocol):
    def predict(self, game: Game) -> float: ...

class Simulator:
    def __init__(self, model: Predictable) -> None:  # Any model works
        self.model = model

    def simulate_tournament(self, games: list[Game]) -> pd.DataFrame:
        for game in games:
            pred = self.model.predict(game)  # Works with any Predictable
            ...

# BAD: Depends on concrete implementation
class Simulator:
    def __init__(self, elo_model: EloModel):  # Locked to EloModel
        self.model = elo_model

    def simulate_tournament(self, games: list[Game]) -> pd.DataFrame:
        # Can only use EloModel, not flexible
        pass

Testing impact: Easy to inject test doubles
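That injection benefit can be demonstrated with a trivial stub (all names here are illustrative; `Game` is simplified to a string):

```python
from typing import Protocol


class Predictable(Protocol):
    def predict(self, game: str) -> float: ...


class Simulator:
    def __init__(self, model: Predictable) -> None:
        self.model = model  # depends only on the Protocol

    def simulate(self, games: list[str]) -> list[float]:
        return [self.model.predict(g) for g in games]


class StubModel:
    """Test double: satisfies Predictable structurally, no inheritance."""

    def predict(self, game: str) -> float:
        return 0.75
```

Simulator(StubModel()).simulate(["a", "b"]) returns [0.75, 0.75]; no real model, file IO, or mocking library is required.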


SOLID Review Checklist

During code review, verify:

  • [ ] SRP: Classes/functions have single, clear responsibility (covered by PEP 20 complexity)

  • [ ] OCP: New features added via extension (inheritance, composition), not modification

  • [ ] LSP: Subtypes honor parent contracts (property tests verify this)

  • [ ] ISP: Interfaces are small and focused (ABCs primary, Protocols for structural typing)

  • [ ] DIP: Depends on abstractions (Protocols, TypedDicts), not concrete classes

Summary: What’s Already Covered

| SOLID Principle | Already Enforced By |
| --- | --- |
| SRP | PEP 20: “Simple is better than complex” (complexity ≤ 10) |
| OCP | Manual review (can’t automate) |
| LSP | Property tests: “probabilities in [0, 1]” invariants |
| ISP | MyPy strict mode: ABCs primary (Model, Repository, Connector), Protocols complement (ProbabilityProvider, ScoringRule) |
| DIP | Architecture: “Type sharing via Pydantic/TypedDicts” |

Result: SOLID compliance is mostly automated or already covered by existing standards. Code review adds a final check for OCP and overall SOLID adherence.


11. Quality Gate Architecture

This project has two complementary quality gate systems. Pre-commit is canonical; nox is the local convenience runner.

Pre-commit (Canonical)

Pre-commit hooks (.pre-commit-config.yaml) run automatically on every commit and are enforced in CI. They are the single source of truth for code quality enforcement.

| Hook | Stage | What It Does |
| --- | --- | --- |
| ruff-lint | commit | Ruff linting with --fix (auto-corrects violations) |
| ruff-format | commit | Ruff formatting (auto-formats code) |
| mypy-strict | commit | Mypy strict type checking (src/, tests/, noxfile.py, sync.py) |
| pytest-smoke | commit | Pytest smoke tests (-m smoke) |
| commitizen | commit-msg | Conventional commit message enforcement |
| actionlint | commit | GitHub Actions workflow linting |

Pre-commit also includes housekeeping hooks: end-of-file-fixer, trailing-whitespace, check-yaml, check-toml, detect-private-key, no-commit-to-branch, check-merge-conflict, debug-statements.

Nox (Local Convenience Runner)

Nox (noxfile.py) provides a single command (nox) that runs the full quality pipeline locally. It is useful for manual full-suite runs before pushing but is NOT used in CI.

| Session | What It Does |
| --- | --- |
| lint | ruff check . --fix + ruff format --check . |
| typecheck | mypy --strict on src/ncaa_eval, tests, noxfile.py, sync.py |
| tests | Full pytest suite |
| docs | sphinx-apidoc + sphinx-build |

Known Divergences (Intentional)

| Area | Pre-commit | Nox | Rationale |
| --- | --- | --- | --- |
| ruff format | Auto-fix (rewrites files) | Check-only (--check) | Pre-commit should fix on commit; nox should only report for manual inspection |
| CI usage | Runs in CI (.github/workflows/) | Not used in CI | Pre-commit + standalone pytest covers lint + typecheck + full tests in CI |

CI Pipeline

Two CI workflows run in GitHub Actions:

.github/workflows/python-check.yaml (runs on every pull request):

  1. Pre-commit — pre-commit run --all-files (all hooks: ruff, mypy, actionlint, etc.)

  2. Full test suite — pytest --cov=src/ncaa_eval with coverage reporting

.github/workflows/main-updated.yaml (runs on every push to main):

  1. Commitizen — version bump and changelog (skipped for bump: commits)

  2. Sphinx docs — sphinx-apidoc + sphinx-build → GitHub Pages deployment

Quality gates are enforced twice: by the developer’s local pre-commit hooks on every commit/push, and again by python-check.yaml on every pull request before merge.


References

  • pyproject.toml — Single source of truth for all tool configurations

  • specs/05-architecture-fullstack.md Section 12 — Coding standards

  • specs/05-architecture-fullstack.md Section 10 — Development workflow

  • specs/03-prd.md Section 4 — Technical assumptions & constraints