NCAA Eval Style Guide¶
This guide documents the coding standards and conventions for the ncaa_eval project.
All rules here reflect decisions already configured in pyproject.toml. Tooling
enforcement is handled by separate stories (1.4-1.6); this document captures the
what and why.
1. Docstring Convention (Google Style)¶
Choice: Google-style docstrings.
Configured in: [tool.ruff.lint.pydocstyle] convention = "google"
Rationale¶
Google style reads naturally in source code and renders well in Sphinx via
napoleon. With PEP 484 type annotations present everywhere (mypy --strict), repeating types inside docstrings is redundant; Google style keeps the prose concise.
Rules¶
- Do not duplicate types in `Args:`/`Returns:` sections. Type information lives in annotations only.
- Include `Args:`, `Returns:` (unless `None`), and `Raises:` (if applicable).
- Add an `Examples:` section for complex public functions.
- Docstrings are encouraged but not enforced on every entity. When pydocstyle (`D`) rules are enabled (Story 1.4), `D1` (missing docstring) and `D415` (first-line punctuation) will be suppressed via the Ruff `ignore` list. Note: `D` rules are not currently active — the `convention = "google"` setting and ignore entries in `pyproject.toml` are preparatory config that takes effect once `D` is added to `extend-select` in Story 1.4. At minimum, every public module and every public class should have a docstring.
- Detailed description required when a function performs 3 or more operations. The description paragraph (after the summary line) must explain how the function implements its purpose — not just restate the summary. This is enforced during code review (PR checklist).
Example¶
from __future__ import annotations

import pandas as pd


def rolling_efficiency(
    scores: pd.Series,
    window: int,
    min_periods: int = 1,
) -> pd.Series:
    """Compute rolling offensive efficiency over a sliding window.

    Applies an exponentially-weighted mean to per-game point totals,
    normalised by possessions.

    Args:
        scores: Raw per-game point totals.
        window: Number of games in the rolling window.
        min_periods: Minimum observations required for a valid result.

    Returns:
        Rolling efficiency values aligned to the input index.

    Raises:
        ValueError: If *window* is less than 1.

    Examples:
        >>> rolling_efficiency(pd.Series([70, 80, 90]), window=2)
        0     NaN
        1    75.0
        2    85.0
        dtype: float64
    """
2. Naming Conventions¶
| Entity | Convention | Example |
|---|---|---|
| Modules / packages | `snake_case` | `season_stats.py` |
| Classes | `PascalCase` | `EloModel` |
| Functions / methods | `snake_case` | `compute_rolling_stats` |
| Variables | `snake_case` | `team_rating` |
| Constants | `UPPER_SNAKE_CASE` | `TOURNAMENT_FIELD_SIZE` |
| Type aliases | `PascalCase` | `TeamId` |
| Private members | Single leading underscore | `_parse_games_csv` |
| Test files | `test_*.py` | `test_elo.py` |
| Test functions | `test_*` | `test_win_probability_equal_ratings` |
Additional Rules¶
- No abbreviations in public APIs unless universally understood (`id`, `df`, `url`).
- Boolean names should read as predicates: `is_valid`, `has_played`, `should_normalize`.
- DataFrame variables should indicate their grain: `game_df`, `season_stats_df`, `team_features_df`.
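Taken together, these rules look like the following minimal sketch (the column names `home_id`/`away_id` are hypothetical, chosen only to illustrate the conventions):

```python
from __future__ import annotations

import pandas as pd

# Constant: UPPER_SNAKE_CASE
TOURNAMENT_FIELD_SIZE = 64

# DataFrame variable indicates its grain: one row per game.
game_df = pd.DataFrame({"home_id": [1, 2], "away_id": [3, 4]})


def has_played(games_df: pd.DataFrame, team_id: int) -> bool:
    """Boolean name reads as a predicate; no abbreviations in the public API."""
    played = (games_df["home_id"] == team_id) | (games_df["away_id"] == team_id)
    return bool(played.any())
```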
3. Import Ordering¶
Configured in: [tool.ruff.lint.isort]
Imports are ordered in three groups separated by blank lines:
1. Standard library (`os`, `pathlib`, `typing`, …)
2. Third-party (`pandas`, `numpy`, `xgboost`, …)
3. Local / first-party (`ncaa_eval`, `tests`)
Mandatory Rules¶
| Rule | Detail | Config Reference |
|---|---|---|
| Future annotations | Every file must start with `from __future__ import annotations`. | `required-imports` |
| Combine as-imports | Multiple `as` imports from the same module are combined onto one line. | `combine-as-imports = true` |
| First-party detection | `ncaa_eval` and `tests` are treated as first-party. | `known-first-party` |
Example¶
from __future__ import annotations

import logging
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

from ncaa_eval.model import Model
from ncaa_eval.transform import compute_rolling_stats
Active Ruff Rules¶
Beyond isort, pyproject.toml extends the default Ruff rules (E, F) with:
| Rule | Category | What It Enforces |
|---|---|---|
| `I` | isort | Import ordering (see above) |
| `UP` | pyupgrade | Modern Python syntax — e.g., `list[int]` instead of `List[int]` |
| `PT` | flake8-pytest-style | Pytest best practices — e.g., fixture and assertion style |
| `TID` | tidy-imports | Import hygiene — bans relative imports from parent packages |
| `C901` | mccabe | McCabe cyclomatic complexity ≤ 10 (see Section 6.1) |
| `PLR0911` | pylint-refactor | Max 6 return statements per function |
| `PLR0912` | pylint-refactor | Max 12 branches per function |
| `PLR0913` | pylint-refactor | Max 5 function arguments |
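In `pyproject.toml` terms, the categories above correspond roughly to an `extend-select` block like the following (an illustrative sketch only — the authoritative configuration lives in the project's `pyproject.toml`):

```toml
[tool.ruff.lint]
extend-select = [
    "I",    # isort
    "UP",   # pyupgrade
    "PT",   # flake8-pytest-style
    "TID",  # flake8-tidy-imports
    "C90",  # mccabe (reported as C901)
    "PLR",  # pylint-refactor (PLR0911/PLR0912/PLR0913)
]
```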
Suppressed Rules¶
| Rule | Why Suppressed |
|---|---|
| `E501` | Line length is enforced by the Ruff formatter (110 chars), not the linter. Suppressing `E501` avoids conflicts between the formatter and the linter. |
| `D1` | Missing-docstring warnings. Not yet active (see Section 1 note). |
| `D415` | First-line punctuation. Not yet active (see Section 1 note). |
| `PLR2004` | Magic value comparison — too aggressive for data science code with domain-specific constants (e.g., tournament field size `64`). |
Lint Suppression Policy¶
When # noqa or # type: ignore is unavoidable, follow these rules:
1. Always include the specific error code. Never use bare `# noqa` or bare `# type: ignore` — always specify which rule is being suppressed.

   # GOOD: Specific suppression with rationale
   import joblib  # type: ignore[import-untyped] — no type stubs available

   # BAD: Bare suppression — which rule? why?
   import joblib  # type: ignore

2. Prefer refactoring over suppression. The escalation path:
   - First: Refactor to eliminate the violation (extract function, split class, etc.)
   - Second: If refactoring is impractical, suppress with specific code and comment
   - Never: Suppress because "it's faster than fixing"

3. Complexity-class codes require PO approval. Inline `# noqa` for the following codes is allowed only with explicit PO approval:
   - `PLR0911` — Too many return statements (max 6)
   - `PLR0912` — Too many branches (max 12)
   - `PLR0913` — Too many arguments (max 5)
   - `C901` — McCabe cyclomatic complexity (max 10)

   When a suppression is approved, the inline comment must include a rationale:

   # Approved — rationale
   def update_game(  # noqa: PLR0913 — game data has inherent dimensionality

   # Pending refactoring — tracked to a story
   def run_training(  # noqa: PLR0913, C901, PLR0912 — REFACTOR Story 8.1
Acceptable suppression scenarios:
- `# type: ignore[import-untyped]` — Third-party libraries without type stubs
- `# type: ignore[no-untyped-def]` — Callback signatures imposed by frameworks (e.g., Streamlit event handlers)
- `# noqa: BLE001` — Broad exception handling with documented rationale
- `# noqa: PLR0913` (etc.) — With PO approval and rationale (see rule 3)
Unacceptable suppression scenarios:
- Any `# noqa` for PLR0911, PLR0912, PLR0913, or C901 without PO approval
- `# type: ignore` on return types — Fix the return type annotation instead
- Any suppression without the specific error code
4. Type Annotation Standards¶
Configured in: [tool.mypy] strict = true
What --strict Means for Developers¶
Every function signature, variable, and return type must be annotated. The strict
flag enables all of the following (as of mypy 1.x — the exact set may change with
future versions; run mypy --help to see the current list):
- `--disallow-untyped-defs` — every function needs annotations
- `--disallow-any-generics` — no bare `list`, `dict`; use `list[int]`, `dict[str, float]`
- `--warn-return-any` — functions must not silently return `Any`
- `--no-implicit-reexport` — explicit `__all__` required for re-exports
- `--strict-bytes` — distinguishes `bytes` from `str`
Practical Guidelines¶
- Use `from __future__ import annotations` (already enforced) so all annotations are strings and forward references work.
- Use modern syntax: `list[int]` not `List[int]`, `X | None` not `Optional[X]`.
- Use the `type` statement for aliases: `type TeamId = int`.
- For third-party libraries with incomplete stubs (numpy, pandas, xgboost), the config uses `follow_imports = "silent"` to suppress errors from untyped dependencies. Add `# type: ignore[<code>]` only when truly necessary and always include the specific error code.
- The `py.typed` marker at `src/ncaa_eval/py.typed` signals PEP 561 compliance.
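The modern-syntax rules can be sketched in one small function (`mean_rating` is a hypothetical example, not project code):

```python
from __future__ import annotations

# Builtin generics (list[int], dict[int, float]) and the union syntax
# float | None replace List, Dict, and Optional under mypy --strict.
# On Python 3.12+, aliases would use the `type` statement: type TeamId = int.


def mean_rating(ratings: dict[int, float], team_ids: list[int]) -> float | None:
    """Return the mean rating of the given teams, or None if none are rated."""
    values = [ratings[t] for t in team_ids if t in ratings]
    if not values:
        return None
    return sum(values) / len(values)
```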
Pydantic Integration¶
Configured in: [tool.mypy] plugins and [tool.pydantic-mypy]
Pydantic is the project’s primary data modeling library (shared types between Logic and UI layers). The mypy plugin provides enhanced type checking for Pydantic models:
[tool.mypy]
plugins = ["pydantic.mypy"]
[tool.pydantic-mypy]
init_typed = true # __init__ params get field types
init_forbid_extra = true # error on unknown fields in __init__
warn_required_dynamic_aliases = true # warn if alias without default
What this means for developers:
- `init_typed = true` — mypy understands `Model(field=value)` constructor calls and validates argument types against field definitions.
- `init_forbid_extra = true` — passing an unknown field name to a model constructor is a type error, not a silent runtime discard.
- `warn_required_dynamic_aliases = true` — alerts when a field with `alias=` has no default value, which can cause confusing `ValidationError` messages.
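As a minimal illustration of what the plugin checks (`TeamRating` is a hypothetical model, not part of the codebase):

```python
from __future__ import annotations

from pydantic import BaseModel


class TeamRating(BaseModel):
    """Hypothetical model used only to illustrate the plugin's checks."""

    team_id: int
    rating: float = 1500.0


# init_typed: mypy checks constructor calls against field types, so
# TeamRating(team_id="seven") is a static type error before runtime.
# init_forbid_extra: TeamRating(team_id=7, ratting=1600.0) — note the typo —
# is flagged at type-check time instead of being silently discarded.
rated = TeamRating(team_id=7)
```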
5. Vectorization First¶
> Reject PRs that use `for` loops over Pandas DataFrames for metric calculations.
> — Architecture Section 12
This is a hard rule. Scientific computation in this project must use vectorized operations.
Why¶
- Vectorized pandas/numpy operations are 10-100x faster than Python loops.
- They are easier to reason about for correctness.
- They compose cleanly with `.pipe()` chains.
Correct Pattern¶
# Vectorized — computes margin for every game in one shot
game_df["margin"] = game_df["home_score"] - game_df["away_score"]

# Vectorized rolling mean
game_df["rolling_ppg"] = game_df.groupby("team_id")["points"].transform(
    lambda s: s.rolling(window=5, min_periods=1).mean()
)
Incorrect Pattern¶
# WRONG — for loop over DataFrame rows
margins = []
for _, row in game_df.iterrows():
    margins.append(row["home_score"] - row["away_score"])
game_df["margin"] = margins
Exceptions¶
Explicit for loops are acceptable only when:
1. Operating on a small, fixed collection (e.g., iterating over 5 model configs, not 10,000 game rows).
2. Performing graph traversal (e.g., NetworkX operations that have no vectorized equivalent).
3. The loop body involves side effects that cannot be vectorized (e.g., writing files per team).
4. Ingest-layer one-time-per-sync operations — `src/ncaa_eval/ingest/connectors/kaggle.py` uses `iterrows()` in `load_day_zeros`, `fetch_teams`, and `_parse_games_csv` for per-row data transformation: constructing Pydantic models (`Game`, `Team`) in `fetch_teams` and `_parse_games_csv`, and parsing dates with per-row error handling in `load_day_zeros`. These run exactly once per `sync` operation and are not in any hot path. Pandera schema validation (Story 9.14) guards the DataFrame before iteration begins. (PO Decision C — Audit item 2.4, 2026-03-11)
In these cases, add a brief comment explaining why the loop is necessary.
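A justified loop with its explanatory comment might look like this (a sketch — `export_team_reports` is a hypothetical helper, illustrating exception 3, side effects per item):

```python
from __future__ import annotations

import tempfile
from pathlib import Path

import pandas as pd


def export_team_reports(team_stats_df: pd.DataFrame, out_dir: Path) -> list[Path]:
    """Write one CSV per team (hypothetical side-effect helper)."""
    written: list[Path] = []
    # Loop justified: writing one file per team is an inherently per-item
    # side effect with no vectorized equivalent (Section 5, exception 3).
    for team_id, team_df in team_stats_df.groupby("team_id"):
        path = out_dir / f"team_{team_id}.csv"
        team_df.to_csv(path, index=False)
        written.append(path)
    return written


# Demonstration against a temporary directory.
with tempfile.TemporaryDirectory() as _tmp:
    _paths = export_team_reports(
        pd.DataFrame({"team_id": [1, 1, 2], "points": [70, 80, 60]}),
        Path(_tmp),
    )
    n_written = len(_paths)
```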
6. Design Philosophy (PEP 20 - The Zen of Python)¶
“Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex.”
— PEP 20
While PEP 20 can’t be fully automated, we enforce its core principles through code review and tooling where possible.
Enforced Principles¶
Simple is Better Than Complex¶
Tooling: Ruff complexity checks (see Section 6.1)
Review: Max cyclomatic complexity = 10, max function length = 50 lines
Guideline: If a function is hard to name, it’s doing too much — split it
# GOOD: Simple, single-purpose function
def calculate_margin(home_score: int, away_score: int) -> int:
    """Calculate point margin (positive = home win)."""
    return home_score - away_score

# BAD: Too complex, doing multiple things
def process_game(game_data: dict) -> dict:
    """Process game... but what does this actually do?"""
    # 80 lines of nested logic with unclear responsibilities
    ...
Explicit is Better Than Implicit¶
Tooling: Mypy strict mode enforces explicit types
Review: No magic numbers, no implicit state changes, clear function names
Guideline: Code should read like prose — the reader shouldn’t have to guess
# GOOD: Explicit parameters and behavior
def calculate_elo_change(
    rating: int,
    opponent_rating: int,
    won: bool,
    k_factor: int = 32,
) -> int:
    """Calculate Elo rating change after a game."""
    expected = 1 / (1 + 10 ** ((opponent_rating - rating) / 400))
    actual = 1 if won else 0
    return int(k_factor * (actual - expected))

# BAD: Implicit behavior, magic numbers
def adjust_rating(r: int, o: int, w: bool) -> int:
    return int(32 * (w - 1 / (1 + 10 ** ((o - r) / 400))))
Readability Counts¶
Tooling: Ruff formatting (110 char lines), naming conventions
Review: Variable names reflect domain concepts, not abbreviations
Guideline: Write for humans first, computers second
# GOOD: Clear domain language
team_offensive_efficiency = total_points / total_possessions
# BAD: Abbreviations require mental translation
off_eff = pts / poss
Flat is Better Than Nested¶
Tooling: Ruff detects excessive nesting
Review: Max nesting depth = 3
Guideline: Use early returns to reduce nesting
# GOOD: Flat structure with early returns
def validate_game(game: Game) -> None:
    """Validate game data."""
    if game.home_score < 0:
        raise ValueError("Home score cannot be negative")
    if game.away_score < 0:
        raise ValueError("Away score cannot be negative")
    if game.date > datetime.now():
        raise ValueError("Game cannot be in the future")

# BAD: Nested structure
def validate_game(game: Game) -> None:
    if game.home_score >= 0:
        if game.away_score >= 0:
            if game.date <= datetime.now():
                return
            else:
                raise ValueError("Game cannot be in the future")
        else:
            raise ValueError("Away score cannot be negative")
    else:
        raise ValueError("Home score cannot be negative")
There Should Be One Obvious Way to Do It¶
Review: Follow established project patterns (e.g., vectorization for calculations)
Guideline: Check existing code before inventing new approaches
# GOOD: Follows project pattern (vectorized operations)
game_df["margin"] = game_df["home_score"] - game_df["away_score"]

# BAD: Invents custom approach (loops)
for idx in range(len(game_df)):
    game_df.loc[idx, "margin"] = game_df.loc[idx, "home_score"] - game_df.loc[idx, "away_score"]
Complex is Better Than Complicated¶
Review: Accept genuine complexity; reject unnecessary complication
Guideline: Some systems are inherently complex (simulation engines, Elo rating chains, cross-validation orchestrators). The goal is managed complexity — complex internals behind simple interfaces — not complication from tangled logic.
# GOOD: Complex but not complicated — clear interface, complex internals
class MonteCarloSimulator:
    """Simulate tournament outcomes via Monte Carlo sampling.

    The algorithm is inherently complex (bracket traversal, probability
    propagation, seeded randomness), but the interface is simple.
    """

    def simulate(self, bracket: Bracket, n_simulations: int = 10_000) -> SimulationResult:
        ...

# BAD: Complicated — unnecessary indirection, unclear purpose
def run_sim(data, opts, flags, ctx, mode="default", **kwargs):
    if mode == "default":
        ...  # 200 lines of nested conditionals
Sparse is Better Than Dense¶
Tooling: 110-character line length limit, Ruff formatting
Review: One operation per line, generous whitespace, no clever one-liners
Guideline: When a line does too much, split it. When a function packs too many operations together, extract named intermediates.
# GOOD: Sparse — each step is clear
expected_win_rate = 1 / (1 + 10 ** ((opponent_rating - team_rating) / 400))
actual_outcome = 1.0 if won else 0.0
rating_change = k_factor * (actual_outcome - expected_win_rate)

# BAD: Dense — everything crammed together
return int(32*(int(w)-1/(1+10**((o-r)/400))))
Special Cases Aren’t Special Enough / Practicality Beats Purity¶
Review: Allow pragmatic exceptions when properly documented
Guideline: The vectorization-first rule (Section 5) has legitimate exceptions: graph traversal, small fixed collections, side-effect loops. Data science code has domain-specific constants (`0.5` for coin-flip probability, `64` for tournament teams) that PLR2004 would flag — we suppress PLR2004 because practicality beats purity for these cases. The key: every exception must be documented and intentional, never silent.
# GOOD: Pragmatic exception with justification
# PLR2004 suppressed project-wide — domain constants like tournament size
# are clearer inline than as named constants
def build_bracket(teams: list[Team]) -> Bracket:
    if len(teams) != 64:  # NCAA tournament field size
        raise ValueError(f"Expected 64 teams, got {len(teams)}")
    ...

# BAD: Bare magic number with no context
def process(data):
    if len(data) > 42:  # What is 42?
        data = data[:42]
Errors Should Never Pass Silently / Unless Explicitly Silenced¶
Review: Every exception handler must log or re-raise; no bare `except: pass`
Guideline: This is the project's most-violated PEP 20 principle (see Pattern D from codebase audit). The convention:

- Default: Log at WARNING level with structured context, then re-raise or return a sentinel that callers check.
- Explicitly silenced: Log at DEBUG level with a code comment explaining why silence is acceptable. Use `# noqa` or `# type: ignore` with specific codes.
- Never: Bare `except Exception: pass` or `except Exception` that substitutes a value without logging.
# GOOD: Error surfaced with context
try:
    response = fetch_team_schedule(team_id)
except RequestError as exc:
    logger.warning("Failed to fetch schedule for team %d: %s", team_id, exc)
    raise

# GOOD: Explicitly silenced with rationale
try:
    optional_enrichment = fetch_extra_stats(team_id)
except RequestError:
    # Enrichment is non-critical; base stats are sufficient for predictions.
    logger.debug("Optional enrichment unavailable for team %d", team_id)
    optional_enrichment = None

# BAD: Silent swallowing — Pattern D violation
try:
    data = fetch_team_schedule(team_id)
except Exception:
    data = pd.DataFrame()  # Caller has no idea data is missing
Refuse the Temptation to Guess¶
Tooling: `mypy --strict` — forces explicit types everywhere
Review: No implicit type coercions, no unvalidated assumptions
Guideline: This aphorism is the philosophical foundation for `mypy --strict`. When types are explicit, the compiler catches wrong assumptions before runtime. When types are `Any` or missing, the code guesses — and guesses wrong at the worst possible time.
# GOOD: Explicit types — no guessing
def get_team_rating(team_id: int, ratings: dict[int, float]) -> float | None:
    return ratings.get(team_id)

# BAD: Implicit types — code guesses about structure
def get_team_rating(team_id, ratings):
    return ratings[team_id]  # KeyError? ratings is a list? Who knows.
If the Implementation Is Hard to Explain, It’s a Bad Idea / If It’s Easy to Explain, It May Be a Good Idea¶
Tooling: McCabe complexity ≤ 10, max 50 lines per function
Review: If you can’t describe a function in one sentence, it needs splitting
Guideline: This reinforces the complexity limits. If a code reviewer asks "what does this do?" and the answer requires a paragraph, the function is doing too much. The `run_training()` function in `cli/train.py` (70+ statements, 3 `noqa` suppressions) is the canonical example of code that is hard to explain.
# GOOD: Easy to explain — "calculates Brier score for probability predictions"
def brier_score(probabilities: np.ndarray, outcomes: np.ndarray) -> float:
    return float(np.mean((probabilities - outcomes) ** 2))

# BAD: Hard to explain — "it processes... things... with conditions..."
def process_and_validate_then_transform_with_fallback(data, config, mode, flags):
    ...  # 80 lines that require 5 minutes to explain
Code Review Checklist for PEP 20¶
During code review, verify:
[ ] Simplicity: Functions have single, clear responsibilities (McCabe complexity ≤ 10)
[ ] Single purpose: Each function does exactly one thing; max 50 lines, nesting ≤ 3
[ ] Explicitness: No magic numbers, parameters have clear names, behavior is obvious
[ ] Readability: Domain concepts use full words, not abbreviations
[ ] Flatness: Nesting depth ≤ 3, early returns preferred
[ ] Consistency: Follows existing project patterns (vectorization, type sharing, etc.)
[ ] Complexity vs complication: Inherent complexity is managed behind simple interfaces
[ ] Sparseness: One operation per line, no clever one-liners, named intermediates
[ ] Pragmatic exceptions: Deviations from rules are documented and intentional
[ ] Error handling: No silent exception swallowing; handlers log or re-raise
[ ] Explicit types: No `Any`, no missing annotations, no implicit coercions
[ ] Explainability: Every function describable in one sentence
[ ] No complexity-code overrides: No `# noqa` for PLR0911/PLR0912/PLR0913/C901 without PO approval; each includes a rationale
[ ] Docstring detail: Functions with 3+ operations have detailed description explaining how
6.1 Complexity Gates (Ruff Configuration)¶
Configured in: pyproject.toml → [tool.ruff.lint.mccabe]
| Metric | Limit | Enforced By |
|---|---|---|
| McCabe Cyclomatic Complexity | 10 | Ruff |
| Max Function Length | 50 lines | Manual review |
| Max Nesting Depth | 3 | Manual review |
| Max Arguments | 5 | Ruff |
See pyproject.toml for exact configuration.
6.2 Pure Functions vs Side Effects¶
Pure functions are easier to test, faster to run, and easier to reason about.
Design guideline: Keep business logic pure, push side effects to edges.
Pure Functions (Preferred)¶
Definition: Same input always produces same output, no side effects.
Characteristics:
Deterministic (predictable)
No external dependencies (no I/O, no database, no network, no time/randomness)
No state mutation
Easy to test (just input → output, no mocking)
Perfect for property-based testing (Hypothesis)
Fast (no I/O)
# PURE: Always deterministic, easy to test
def calculate_win_probability(rating_diff: int, k_factor: float = 32.0) -> float:
    """Calculate win probability from rating difference."""
    return 1 / (1 + 10 ** (-rating_diff / 400))

# PURE: Data transformation, no side effects
def normalize_team_names(names: pd.Series) -> pd.Series:
    """Normalize team names to standard format (vectorized)."""
    return names.str.strip().str.title().str.replace("St.", "Saint")

# PURE: Mathematical calculation (vectorized)
def calculate_margins(home_scores: np.ndarray, away_scores: np.ndarray) -> np.ndarray:
    """Calculate point margins (vectorized, pure)."""
    return home_scores - away_scores
Testing pure functions:
# Unit test: Simple input → output
def test_win_probability_equal_ratings():
    """Verify win probability is 50% for equal ratings."""
    assert calculate_win_probability(rating_diff=0) == 0.5

# Property test: Perfect for pure functions
@pytest.mark.property
@given(rating_diff=st.integers(-1000, 1000))
def test_win_probability_always_bounded(rating_diff):
    """Verify win probability always in [0, 1] (invariant)."""
    prob = calculate_win_probability(rating_diff)
    assert 0 <= prob <= 1
Side-Effect Functions (Push to Edges)¶
Definition: Functions that interact with the outside world or modify state.
Characteristics:
Non-deterministic (may vary based on external state)
External dependencies (files, database, network, time, randomness)
Modifies state (mutates objects, writes files, updates database)
Harder to test (requires mocks, stubs, fixtures)
Requires integration tests
# SIDE-EFFECT: Reads from file system
def load_games(path: Path) -> pd.DataFrame:
    """Load games from CSV file (I/O operation)."""
    return pd.read_csv(path)

# SIDE-EFFECT: Depends on current time (non-deterministic)
def is_game_started(game_start: datetime) -> bool:
    """Check if game has started."""
    return datetime.now() > game_start  # Changes over time

# SIDE-EFFECT: Mutates external state
def update_team_rating(team_id: int, new_rating: int) -> None:
    """Update team rating in database."""
    db.execute("UPDATE teams SET rating = ? WHERE id = ?", new_rating, team_id)
Testing side-effect functions:
# Integration test: Requires fixtures
@pytest.mark.integration
def test_load_games_returns_valid_dataframe(temp_data_dir):
    """Verify games can be loaded from CSV."""
    # Setup: Create test file
    test_file = temp_data_dir / "games.csv"
    test_file.write_text("game_id,home_team,away_team\n1,Duke,UNC\n")

    # Execute
    games = load_games(test_file)

    # Assert
    assert len(games) == 1
    assert "game_id" in games.columns
Good Separation: Pure Core + Side-Effect Shell¶
Pattern: Keep calculations pure, orchestrate I/O at edges.
# PURE: Core business logic (easy to test, vectorized)
def calculate_win_probabilities(
    home_ratings: np.ndarray,
    away_ratings: np.ndarray,
) -> np.ndarray:
    """Calculate win probabilities (pure, vectorized)."""
    rating_diff = home_ratings - away_ratings
    return 1 / (1 + 10 ** (-rating_diff / 400))

# SIDE-EFFECT: Orchestration at the edge
def simulate_tournament(games_path: Path, ratings_path: Path) -> pd.DataFrame:
    """Simulate tournament (orchestrates pure logic + I/O)."""
    # Side effects: Load data
    games = pd.read_csv(games_path)
    ratings = pd.read_csv(ratings_path, index_col="team")

    # Side effects: Data prep
    games = games.merge(ratings, left_on="home_team", right_index=True)
    games = games.merge(ratings, left_on="away_team", right_index=True,
                        suffixes=("_home", "_away"))

    # PURE: Core calculation (vectorized, easy to test separately)
    games["win_prob"] = calculate_win_probabilities(
        games["rating_home"].values,
        games["rating_away"].values,
    )
    return games
Why this is better:
- Pure function (`calculate_win_probabilities`):
  - Fast unit tests (no I/O)
  - Property-based tests (invariants)
  - Reusable in different contexts
  - Vectorized (meets NFR1)
- Side-effect function (`simulate_tournament`):
  - Thin orchestration layer
  - Easy to see where I/O happens
  - Core logic testable independently
Bad: Mixing Pure Logic with Side Effects¶
# BAD: Pure logic buried inside side effects
def simulate_tournament_bad() -> pd.DataFrame:
    """Simulate tournament (mixed design — hard to test)."""
    games = pd.read_csv("data/games.csv")  # Side effect
    results = []
    for _, game in games.iterrows():  # Side effect + non-vectorized!
        # Side effects: Database calls
        home_rating = db.query("SELECT rating FROM teams WHERE id = ?", game.home_team)
        away_rating = db.query("SELECT rating FROM teams WHERE id = ?", game.away_team)

        # Pure logic buried inside (can't test without database!)
        rating_diff = home_rating - away_rating
        prob = 1 / (1 + 10 ** (-rating_diff / 400))
        results.append(prob)
    return pd.DataFrame(results)
Problems:
Can’t test calculation logic without database
Can’t use property-based tests
Slow (I/O in loop)
Violates vectorization requirement
Hard to debug (mixed concerns)
Testing Strategy by Function Type¶
| Function Type | Test Type | Characteristics | Example |
|---|---|---|---|
| Pure | Unit test, Property-based | Fast, no mocking, deterministic | `calculate_win_probability` |
| Side-effect | Integration test | Slower, requires fixtures/mocks | `load_games` |
| Mixed | ❌ AVOID | Hard to test, refactor! | `simulate_tournament_bad` |
Code Review Checklist for Pure vs Side-Effect¶
During code review, verify:
[ ] Pure logic separated: Business calculations are pure functions
[ ] Side effects at edges: I/O, database, network calls in orchestration layer
[ ] No mixing: Pure functions don’t contain I/O operations
[ ] Vectorization: Pure functions use numpy/pandas operations (not loops)
[ ] Property tests: Pure functions have property-based tests for invariants
[ ] Integration tests: Side-effect functions have integration tests with fixtures
7. PR Checklist (Summary)¶
Every pull request must pass the following gates. The actual PR template is at
.github/pull_request_template.md. For
the philosophy behind this two-tier approach, see
docs/TESTING_STRATEGY.md (Section: Quality Gates and Execution Tiers).
| Gate | Tool | Timing |
|---|---|---|
| Lint pass | Ruff | Pre-commit (fast) |
| Type-check pass | Mypy (`--strict`) | Pre-commit (fast) |
| Test pass | Pytest | PR / CI |
| Docstring coverage | Manual review | PR review |
| Docstring detail | Manual review | PR review |
| No vectorization violations | Manual review | PR review |
| Conventional commit messages | Commitizen | Pre-commit |
| Function complexity | Manual review | PR review |
| No complexity-code overrides | Manual review | PR review |
| PEP 20 compliance | Manual review | PR review |
| SOLID principles | Manual review | PR review |
| Pure function design | Manual review | PR review |
Review Criteria¶
- Code follows naming conventions (Section 2).
- Imports are ordered correctly (Section 3).
- New public APIs have docstrings (Section 1).
- Functions with 3+ operations include a detailed docstring description explaining how (not just what) — see Section 1.
- No `for` loops over DataFrames for calculations (Section 5).
- Type annotations are complete (Section 4).
- Each function does one thing; manually verify single responsibility, max 50 lines, nesting ≤ 3.
- No `# noqa` for PLR0911, PLR0912, PLR0913, or C901 without PO approval; each must include a rationale (Section 3, Lint Suppression Policy).
- PEP 20 design principles respected (Section 6).
- Pure functions used for business logic, side effects at edges (Section 6.2).
- SOLID principles applied for testability (Section 10).
- Data structures shared between Logic and UI use Pydantic models or TypedDicts.
- Dashboard code never reads files directly — it calls `ncaa_eval` functions.
8. File & Module Organization¶
Project Layout¶
src/
  ncaa_eval/
    __init__.py      # Package root
    py.typed         # PEP 561 marker
    cli/             # CLI entry points (main, train)
    ingest/          # Data source connectors
    transform/       # Feature engineering
    model/           # Model ABC and implementations (singular, not models/)
    evaluation/      # Metrics, CV, simulation
    utils/           # Shared helpers (logger.py)
tests/
  __init__.py
  conftest.py        # Shared fixtures
  unit/              # Unit tests organized by domain
  integration/       # Integration tests
  fixtures/          # Test data fixtures
dashboard/           # Streamlit UI (imports ncaa_eval — no direct IO)
data/                # Local data store (git-ignored)
Rules¶
- One concept per module. A module should do one thing. If it grows beyond ~300 lines, consider splitting.
- Mirror `src/` in `tests/`. `src/ncaa_eval/model/elo.py` is tested by `tests/unit/test_elo.py` (or `tests/integration/test_elo_integration.py`).
- `__init__.py` re-exports. Public symbols should be importable from the submodule level (e.g., `from ncaa_eval.model import Model, get_model`). The root `ncaa_eval` package does not re-export anything. Concrete model classes self-register via `@register_model` and are accessed through the registry (`get_model("elo")`) or direct import (`from ncaa_eval.model.elo import EloModel`).
- No circular imports. If two modules need each other, extract shared types into a third module.
- Configuration lives in `pyproject.toml`. Do not create separate config files for tools that support `pyproject.toml`.
- Line length is 110. Not the default 88. Configured in `[tool.ruff] line-length = 110`.
9. Additional Architecture Rules¶
These rules come from the project architecture and apply across all code:
| Rule | Detail |
|---|---|
| Type sharing | All data structures shared between Logic and UI must use Pydantic models or TypedDicts. |
| No direct IO in UI | The Streamlit dashboard must call `ncaa_eval` functions; it never reads files or data directly. |
| Commit messages | Use conventional commits format (`feat:`, `fix:`, `docs:`, …), enforced by Commitizen. |
| Python version | |
10. SOLID Principles for Testability¶
SOLID principles make code testable. Violating SOLID = hard-to-test code.
These five object-oriented design principles improve maintainability and testability. While not automatically enforced, they’re checked during code review.
S - Single Responsibility Principle (SRP)¶
“A class should have only one reason to change.”
What it means: Each class/function does ONE thing well.
Already enforced by: PEP 20 complexity checks (Section 6)
# GOOD: Single responsibility
class GameLoader:
    """Loads games from CSV files."""

    def load(self, path: Path) -> pd.DataFrame:
        return pd.read_csv(path)

class GameValidator:
    """Validates game data."""

    def validate(self, games: pd.DataFrame) -> None:
        if games["home_score"].min() < 0:
            raise ValueError("Scores cannot be negative")

# BAD: Multiple responsibilities
class GameManager:
    """Does everything — loading, validating, processing, saving."""

    def do_everything(self, path: Path) -> None:
        # Too many reasons to change!
        pass
Testing impact: Easy to test (one thing = one test class)
O - Open/Closed Principle (OCP)¶
“Open for extension, closed for modification.”
What it means: Add new features without changing existing code.
# GOOD: Open for extension via inheritance
class RatingModel(ABC):
    @abstractmethod
    def predict(self, game: Game) -> float:
        pass

class EloModel(RatingModel):
    def predict(self, game: Game) -> float:
        return self._calculate_elo(game)

class GlickoModel(RatingModel):  # New model, no changes to existing code
    def predict(self, game: Game) -> float:
        return self._calculate_glicko(game)

# BAD: Must modify existing code for each new model
def predict(game: Game, model_type: str) -> float:
    if model_type == "elo":
        return calculate_elo(game)
    elif model_type == "glicko":  # Must modify this function
        return calculate_glicko(game)
Testing impact: Easy to mock/stub new implementations
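As a sketch of that payoff: a throwaway stub extends the hierarchy for a test without editing any production model. The minimal `Game` dataclass here is a hypothetical stand-in for the project's real type:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Game:
    """Hypothetical minimal stand-in for the project's Game type."""

    home: str
    away: str


class RatingModel(ABC):
    @abstractmethod
    def predict(self, game: Game) -> float: ...


class ConstantModel(RatingModel):
    """Test stub: a new subclass, zero changes to existing models."""

    def __init__(self, p: float) -> None:
        self.p = p

    def predict(self, game: Game) -> float:
        return self.p


assert ConstantModel(0.5).predict(Game("Duke", "UNC")) == 0.5
```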
L - Liskov Substitution Principle (LSP)¶
“Subtypes must be substitutable for their base types.”
What it means: If Dog inherits from Animal, you can use Dog anywhere you use Animal without breaking things.
Already enforced by: Property-based tests verify contracts (Section: Test Purpose Guide)
```python
# GOOD: Subclass honors parent contract
class RatingModel(ABC):
    @abstractmethod
    def predict(self, game: Game) -> float:
        """Return probability in [0, 1]."""


class EloModel(RatingModel):
    def predict(self, game: Game) -> float:
        # Always returns [0, 1] as promised
        # (rating and opp_rating are looked up from the teams in `game`)
        return 1 / (1 + 10 ** ((opp_rating - rating) / 400))


# BAD: Subclass violates parent contract
class BrokenModel(RatingModel):
    def predict(self, game: Game) -> float:
        # Returns values > 1.0, breaking the contract!
        return rating_difference * 100
```
Testing impact: Property tests verify contracts hold across all implementations
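As a sketch of what such a check does (the project uses property-based tests; this plain loop is a simplified stand-in), the `[0, 1]` contract can be verified across many random inputs:

```python
import random


def elo_predict(rating: float, opp_rating: float) -> float:
    """The Elo win probability formula from the GOOD example above."""
    return 1 / (1 + 10 ** ((opp_rating - rating) / 400))


random.seed(0)
for _ in range(1_000):
    rating = random.uniform(0.0, 3000.0)
    opp_rating = random.uniform(0.0, 3000.0)
    p = elo_predict(rating, opp_rating)
    # The LSP contract from the ABC docstring: probability in [0, 1].
    assert 0.0 <= p <= 1.0
```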
I - Interface Segregation Principle (ISP)¶
“Many specific interfaces are better than one general-purpose interface.”
What it means: Don’t force classes to implement methods they don’t need.
Already enforced by: MyPy strict mode with ABCs and Protocols (Section 4)
This project uses ABCs as the primary pattern for interface segregation (Model ABC, Repository ABC, Connector ABC) and Protocols as a complement for structural typing where duck typing is preferred (e.g., ProbabilityProvider, ScoringRule in the simulation engine). Choose based on context:
ABCs: when implementations must be explicitly registered and share behavior (e.g., all models inherit from Model and get common infrastructure).
Protocols: when you need structural typing, where any object with matching methods qualifies without inheriting from anything (e.g., ProbabilityProvider can be satisfied by any class with a predict_proba method).
```python
from abc import ABC, abstractmethod
from typing import Protocol


# PRIMARY: ABCs for explicit interface contracts
class Model(ABC):
    """Base class for all prediction models."""

    @abstractmethod
    def predict(self, game: Game) -> float: ...


class EloModel(Model):
    def predict(self, game: Game) -> float: ...


# COMPLEMENT: Protocols for structural typing (duck typing)
class ProbabilityProvider(Protocol):
    def predict_proba(self, matchup: Matchup) -> float: ...


# Any class with predict_proba() satisfies this — no inheritance needed
class EloProvider:
    def predict_proba(self, matchup: Matchup) -> float: ...


# BAD: Fat interface forces unused methods
class DoEverything(ABC):
    @abstractmethod
    def predict(self, game: Game) -> float: ...

    @abstractmethod
    def fit(self, games: pd.DataFrame) -> None: ...

    @abstractmethod
    def cross_validate(self, games: pd.DataFrame) -> dict[str, float]: ...

    # Simple models forced to implement methods they don't need!
```
Testing impact: Less mocking needed (small interfaces)
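A sketch of why small interfaces shrink test setup: a one-method fake satisfies `ProbabilityProvider` structurally, so no mock library and no inheritance are needed. `Matchup` and `expected_score` here are hypothetical stand-ins, not the project's real types:

```python
from typing import Protocol


class Matchup:
    """Hypothetical stand-in for the project's Matchup type."""


class ProbabilityProvider(Protocol):
    def predict_proba(self, matchup: Matchup) -> float: ...


class FixedProvider:
    """One-method fake: satisfies the Protocol with no inheritance."""

    def predict_proba(self, matchup: Matchup) -> float:
        return 0.75


def expected_score(provider: ProbabilityProvider, matchup: Matchup) -> float:
    # Hypothetical consumer that only needs the narrow interface.
    return provider.predict_proba(matchup) * 10.0


assert expected_score(FixedProvider(), Matchup()) == 7.5
```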
D - Dependency Inversion Principle (DIP)¶
“Depend on abstractions, not concretions.”
What it means: High-level code shouldn’t depend on low-level details.
Already enforced by: Architecture rule “Type sharing: use Pydantic models or TypedDicts” (Section 9)
```python
# GOOD: Depends on abstraction (Protocol)
class Simulator:
    def __init__(self, model: Predictable):  # Any model works
        self.model = model

    def simulate_tournament(self, games: list[Game]) -> pd.DataFrame:
        for game in games:
            pred = self.model.predict(game)  # Works with any Predictable
            ...


# BAD: Depends on concrete implementation
class Simulator:
    def __init__(self, elo_model: EloModel):  # Locked to EloModel
        self.model = elo_model

    def simulate_tournament(self, games: list[Game]) -> pd.DataFrame:
        # Can only use EloModel, not flexible
        pass
```
Testing impact: Easy to inject test doubles
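The injection payoff in miniature, as a sketch: the simplified `Simulator`, `Predictable` protocol, and `StubModel` below are illustrative stand-ins, not the project's real classes:

```python
from typing import Protocol


class Predictable(Protocol):
    def predict(self, game: str) -> float: ...


class StubModel:
    """Test double: satisfies Predictable with a fixed probability."""

    def predict(self, game: str) -> float:
        return 0.5


class Simulator:
    def __init__(self, model: Predictable) -> None:  # any Predictable works
        self.model = model

    def simulate_tournament(self, games: list[str]) -> list[float]:
        return [self.model.predict(game) for game in games]


# No Elo math, no data files: the test double makes outcomes deterministic.
assert Simulator(StubModel()).simulate_tournament(["a", "b"]) == [0.5, 0.5]
```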
SOLID Review Checklist¶
During code review, verify:
- [ ] SRP: Classes/functions have a single, clear responsibility (covered by PEP 20 complexity)
- [ ] OCP: New features added via extension (inheritance, composition), not modification
- [ ] LSP: Subtypes honor parent contracts (property tests verify this)
- [ ] ISP: Interfaces are small and focused (ABCs primary, Protocols for structural typing)
- [ ] DIP: Depends on abstractions (Protocols, TypedDicts), not concrete classes
Summary: What’s Already Covered¶
| SOLID Principle | Already Enforced By |
|---|---|
| SRP | PEP 20: “Simple is better than complex” (complexity ≤ 10) |
| OCP | Manual review (can’t automate) |
| LSP | Property tests: “probabilities in [0, 1]” invariants |
| ISP | MyPy strict mode: ABCs primary (Model, Repository, Connector), Protocols complement (ProbabilityProvider, ScoringRule) |
| DIP | Architecture: “Type sharing via Pydantic/TypedDicts” |
Result: SOLID compliance is mostly automated or already covered by existing standards. Code review adds a final check for OCP and overall SOLID adherence.
11. Quality Gate Architecture¶
This project has two complementary quality gate systems. Pre-commit is canonical; nox is the local convenience runner.
Pre-commit (Canonical)¶
Pre-commit hooks (.pre-commit-config.yaml) run automatically on every commit and
are enforced in CI. They are the single source of truth for code quality enforcement.
| Hook | Stage | What It Does |
|---|---|---|
| ruff | commit | Ruff linting with … |
| ruff-format | commit | Ruff formatting (auto-formats code) |
| mypy | commit | Mypy strict type checking (…) |
| pytest (smoke) | commit | Pytest smoke tests (…) |
| commitizen | commit-msg | Conventional commit message enforcement |
| actionlint | commit | GitHub Actions workflow linting |
Pre-commit also includes housekeeping hooks: end-of-file-fixer,
trailing-whitespace, check-yaml, check-toml, detect-private-key,
no-commit-to-branch, check-merge-conflict, debug-statements.
Nox (Local Convenience Runner)¶
Nox (noxfile.py) provides a single command (nox) that runs the full quality
pipeline locally. It is useful for manual full-suite runs before pushing but is NOT
used in CI.
| Session | What It Does |
|---|---|
| … | … |
| … | … |
| … | Full … |
| … | … |
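A hedged sketch of what such a `noxfile.py` can look like; the session names, dependencies, and commands here are illustrative assumptions, not the project's actual configuration:

```python
import nox


@nox.session
def lint(session: nox.Session) -> None:
    """Ruff lint plus format check (check-only, unlike pre-commit)."""
    session.install("ruff")
    session.run("ruff", "check", ".")
    session.run("ruff", "format", "--check", ".")


@nox.session
def typecheck(session: nox.Session) -> None:
    """Mypy strict type checking over the package sources."""
    session.install("mypy")
    session.run("mypy", "--strict", "src")


@nox.session
def tests(session: nox.Session) -> None:
    """Full pytest suite with coverage."""
    session.install("-e", ".[test]")
    session.run("pytest", "--cov=src/ncaa_eval")
```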
Known Divergences (Intentional)¶
| Area | Pre-commit | Nox | Rationale |
|---|---|---|---|
| ruff format | Auto-fix (rewrites files) | Check-only (…) | Pre-commit should fix on commit; nox should only report for manual inspection |
| CI usage | Runs in CI (…) | Not used in CI | Pre-commit + standalone pytest covers lint + typecheck + full tests in CI |
CI Pipeline¶
Two CI workflows run in GitHub Actions:
`.github/workflows/python-check.yaml` (runs on every pull request):

- Pre-commit — `pre-commit run --all-files` (all hooks: ruff, mypy, actionlint, etc.)
- Full test suite — `pytest --cov=src/ncaa_eval` with coverage reporting

`.github/workflows/main-updated.yaml` (runs on every push to main):

- Commitizen — version bump and changelog (skipped for `bump:` commits)
- Sphinx docs — `sphinx-apidoc` + `sphinx-build` → GitHub Pages deployment
Quality gates are enforced twice: by the developer’s local pre-commit hooks on every
commit/push, and again by python-check.yaml on every pull request before merge.
References¶
- `pyproject.toml` — Single source of truth for all tool configurations
- `specs/05-architecture-fullstack.md` Section 12 — Coding standards
- `specs/05-architecture-fullstack.md` Section 10 — Development workflow
- `specs/03-prd.md` Section 4 — Technical assumptions & constraints