Test Execution Tiers¶

This guide explains when tests and quality checks run across the 4-tier execution model.

Test Execution Scope (When Things Run)¶

This section defines the multi-tier approach that balances development speed with thorough validation. Each tier answers: What runs when?

Key question: Is this fast enough to run without disrupting the development workflow?

Tier 1: Pre-Commit (Smoke Tests + Fast Checks)¶

Overview¶

Fast, local checks that run automatically on every commit via pre-commit hooks.

Time budget¶

< 10 seconds total (all checks combined)

Design principle¶

Catch the most common issues before code ever leaves the developer’s machine. Must complete in seconds to avoid disrupting flow.

What Runs: Complete Checklist¶

Check	Tool	What It Catches	Command / Behavior
Lint	`ruff check .`	Style violations, import issues, anti-patterns	Auto-fix on commit
Format	`ruff format --check .`	Inconsistent formatting	Auto-fix on commit
Type-check	`mypy` (strict)	Missing annotations, type errors	`mypy src/ncaa_eval tests`
Smoke tests	`pytest -m smoke`	Broken imports, basic sanity failures, schema contract breaks	`pytest -m smoke`
Commit message	Commitizen	Non-conventional commit format	Automatic validation

Smoke Tests: What to Include¶

Smoke tests are a curated subset of tests designed for speed. Each smoke test should run in under 1 second, and the whole smoke suite in under 5 seconds.

Include:

✅ Import checks - Verify package imports work (catches circular imports, missing dependencies, broken __init__.py)
✅ Core function sanity - Critical functions accept valid input without crashing (not full correctness, just “doesn’t blow up”)
✅ Schema/contract tests - Pydantic models and TypedDicts validate with representative sample data (catches accidental field renames or type changes)
✅ Quick invariant checks - Fast property tests that verify basic invariants
✅ Fast regression tests - Regression tests for critical bugs that can be verified quickly

Exclude:

❌ Anything touching disk, network, or external services
❌ Tests that process large DataFrames or datasets
❌ Full correctness / edge-case tests (save for the complete suite)
❌ Integration tests with I/O
❌ Performance benchmarks
❌ Property-based tests that generate many examples (Hypothesis is slow)

Smoke Test Examples¶

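# Third-party imports used by the examples below
# (project-specific imports from ncaa_eval are omitted)
import numpy as np
import pytest
from hypothesis import given, strategies as st

# Smoke-eligible: Fast unit test (import check)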
@pytest.mark.smoke
def test_import_package():
    """Verify package can be imported without errors."""
    import ncaa_eval  # noqa: F401 — import itself is the assertion

# Smoke-eligible: Fast unit test (sanity check)
@pytest.mark.smoke
def test_brier_score_accepts_valid_input():
    """Verify Brier score accepts valid input without crashing."""
    predictions = np.array([0.8, 0.3])
    actuals = np.array([1, 0])
    result = brier_score(predictions, actuals)
    assert result is not None  # Just verify it doesn't crash

# Smoke-eligible: Schema contract test
@pytest.mark.smoke
def test_game_schema_validates():
    """Verify Game Pydantic model validates with sample data."""
    game = Game(season=2023, day_num=100, w_team_id=1234, l_team_id=5678, w_score=75, l_score=70)
    # If this constructs without error, schema is correct
    assert game.w_team_id == 1234

# Smoke-eligible: Fast regression test
@pytest.mark.smoke
@pytest.mark.regression
def test_elo_rating_never_negative_quick(elo_config):
    """Regression test: Elo ratings should never go negative (fast check)."""
    engine = EloFeatureEngine(elo_config)
    # Quick check on a freshly constructed engine; the fuller scenario
    # (a loss against a much stronger opponent) belongs in the complete suite
    assert engine.get_rating(team_id=1) >= 0

# NOT smoke-eligible: Integration test with I/O
@pytest.mark.integration
def test_load_games_from_disk(temp_data_dir):
    """Verify games can be loaded from disk (too slow for smoke)."""
    games = load_games(temp_data_dir)
    assert len(games) > 0

# NOT smoke-eligible: Property-based test (Hypothesis is slow)
@pytest.mark.property
@given(prob=st.floats(0, 1))
def test_probability_always_bounded(prob):
    """Verify adjusted probabilities stay in [0, 1]."""
    adjusted = adjust_probability(prob, home_advantage=0.05)
    assert 0 <= adjusted <= 1

Commands¶

Marker: @pytest.mark.smoke
Run: pytest -m smoke
When: Every commit (pre-commit hook)
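
For `pytest -m smoke` to select tests reliably, the markers used in this guide need to be registered with pytest. The following is a minimal sketch, assuming registration happens in a `conftest.py` through pytest's `pytest_configure` hook; a project may equally declare markers in its pytest configuration file. The marker descriptions are illustrative wording, not taken from the project's configuration.

```python
# conftest.py (sketch): register the markers used in this guide.
# The descriptions below are illustrative, not the project's actual wording.
MARKERS = {
    "smoke": "fast sanity checks, run on every commit",
    "regression": "guards against a previously fixed bug",
    "integration": "touches disk, network, or external services",
    "property": "Hypothesis property-based test",
    "performance": "benchmark or timing assertion",
    "slow": "too slow for pre-commit",
}


def pytest_configure(config):
    """Register each marker so that --strict-markers can reject typos."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, running pytest with `--strict-markers` turns a misspelled marker such as `@pytest.mark.smok` into an error instead of a test that silently drops out of the smoke run.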

Rationale¶

Fast feedback loop - catches broken imports, basic sanity failures, schema contract breaks before code is committed.

Tier 2: PR / CI (Complete Suite)¶

Overview¶

Thorough validation that runs when a pull request is opened or updated.

Time budget¶

Minutes (acceptable for PR review, not for pre-commit)

Design principle¶

Complete testing ensures nothing is broken before code reaches the main branch. Within the minutes-scale budget, thoroughness takes priority over speed.

What Runs: Complete Checklist¶

Check	Tool	What It Catches	Command
Unit tests	`pytest`	Logic regressions, broken contracts	`pytest` or `pytest -m "not slow"`
Integration tests	`pytest -m integration`	Component interaction failures	`pytest -m integration`
Property-based tests	`pytest -m property`	Invariant violations, edge cases	`pytest -m property`
Performance tests	`pytest -m performance`	Vectorization violations, speed regressions	`pytest -m performance`
Coverage	`pytest-cov`	Untested code paths	`pytest --cov=src/ncaa_eval --cov-report=term-missing`
Mutation testing (selective)	`mutmut`	Weak tests, gaps in test coverage	`mutmut run --paths-to-mutate=src/ncaa_eval/evaluation/metrics.py`

Complete Tests: What to Include¶

The full test suite including all tests - smoke tests plus everything too slow for pre-commit.

Include:

✅ All smoke tests - Fast tests run again as part of the complete suite
✅ Integration tests - Tests with I/O, database, or external dependencies
✅ Property-based tests - Hypothesis tests that generate hundreds of examples
✅ Performance tests - Benchmarks and timing assertions
✅ Mutation testing - Quality verification for high-priority modules
✅ Full edge-case coverage - Comprehensive correctness testing

Complete Test Examples¶

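# Standard-library and third-party imports used by the examples below
# (project-specific imports from ncaa_eval are omitted)
import timeit

import numpy as np
import pandas as pd
import pytest
from hypothesis import given, strategies as st

# Complete-only: Integration test (too slow for smoke)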
@pytest.mark.integration
def test_sync_command_fetches_and_stores_games(temp_data_dir):
    """Verify sync command successfully ingests and stores game data."""
    sync_games(source="test_api", output_dir=temp_data_dir)
    games_df = load_games(temp_data_dir)
    assert len(games_df) > 0
    assert "game_id" in games_df.columns

# Complete-only: Property-based test (Hypothesis is slow)
@pytest.mark.property
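# Bounded integers keep the resulting Series within the int64 range
@given(data=st.lists(st.integers(min_value=-10**6, max_value=10**6), min_size=1, max_size=100))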
def test_rolling_average_length_invariant(data):
    """Verify rolling average output length matches input."""
    series = pd.Series(data)
    result = calculate_rolling_average(series, window=5)
    assert len(result) == len(series)

# Complete-only: Performance test
@pytest.mark.slow
@pytest.mark.performance
def test_brier_score_vectorized_performance():
    """Verify Brier score meets performance target (vectorized)."""
    predictions = np.random.rand(100_000)
    actuals = np.random.randint(0, 2, 100_000)

    time_taken = timeit.timeit(
        lambda: brier_score(predictions, actuals),
        number=10
    ) / 10

    assert time_taken < 0.01, f"Too slow: {time_taken:.4f}s"

# Complete-only: Comprehensive edge-case test
@pytest.mark.parametrize("predictions,actuals,expected", [
    ([1.0, 0.0, 1.0], [1, 0, 1], 0.0),       # Perfect
    ([0.9, 0.1, 0.9], [1, 0, 1], 0.01),      # Near-perfect
    ([0.5, 0.5, 0.5], [1, 0, 1], 0.25),      # Random
    ([0.0, 1.0, 0.0], [1, 0, 1], 1.0),       # Worst case
])
def test_brier_score_edge_cases(predictions, actuals, expected):
    """Verify Brier score for known edge cases (comprehensive)."""
    result = brier_score(np.array(predictions), np.array(actuals))
    assert abs(result - expected) < 0.01
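
The expected values in a parametrized test like this one can be checked by hand. The following is a minimal sketch, assuming `brier_score` follows the standard definition: the mean squared difference between predicted probabilities and binary outcomes. `brier_reference` is a hypothetical pure-Python stand-in, not the project's implementation.

```python
def brier_reference(predictions, actuals):
    """Mean squared difference between probabilities and 0/1 outcomes."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return sum(squared_errors) / len(squared_errors)


outcomes = [1, 0, 1]
perfect = brier_reference([1.0, 0.0, 1.0], outcomes)    # every error is 0
near = brier_reference([0.9, 0.1, 0.9], outcomes)       # every error is 0.1
coin_flip = brier_reference([0.5, 0.5, 0.5], outcomes)  # every error is 0.5
worst = brier_reference([0.0, 1.0, 0.0], outcomes)      # every error is 1
```

Under this definition, three predictions that are each off by 0.1 score about 0.01, because the score averages the squared errors rather than summing them.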

Commands¶

Marker: No specific marker (all tests run by default)
Run: pytest (no filter)
When: Pull request / CI

Rationale¶

Comprehensive validation before merge - catches regressions, verifies test quality, ensures performance targets.

Tier 3: Code Review (AI Agent)¶

Overview¶

Code review performed by an AI agent via the code-review workflow (ideally using a different LLM than the one that implemented the story). Automated tooling and CI cannot catch everything.

What the AI agent reviews¶

Focus Area	What to Check
Docstring quality	Are public APIs documented with clear Google-style docstrings?
Vectorization compliance	No `for` loops over DataFrames for calculations (see STYLE_GUIDE.md Section 5)
Architecture compliance	Type sharing, no direct IO in UI, appropriate use of Pydantic vs TypedDict
Supporting evidence	Performance claims backed by benchmarks, bug fixes accompanied by regression tests
Design intent	Does the implementation match the story’s acceptance criteria and architectural intent?
Test quality	Are tests comprehensive? Do they test the right things? Are invariants verified?

Rationale¶

Higher-level concerns that automated tools can’t evaluate. Requires understanding of project architecture, domain knowledge, and design intent.

Tier 4: Owner Review¶

Overview¶

Final approval rests with the project owner, who focuses on areas beyond what automated tools and AI review cover.

What the owner reviews¶

Focus Area	Questions to Ask
Functional correctness	Does this actually solve the problem from a domain perspective?
Strategic alignment	Is the approach what I expected? Does it align with project direction?
Complexity appropriateness	Is the solution appropriately complex (not over-engineered, not under-engineered)?
Naming and clarity	Are names intuitive? Is the code self-explanatory?
Scope creep	Does this PR do only what was requested, or does it include unrelated changes?
Gut check	Anything feel off? Trust your instincts.

Rationale¶

Human judgment on strategic decisions, domain correctness, and alignment with project vision.

Decision Tree: Smoke vs. Complete¶

Is this test fast (< 1 second)?
├─ NO → Complete suite only
│        - Mark with appropriate scope/purpose markers (@pytest.mark.integration, @pytest.mark.slow, etc.)
│        - Examples: I/O tests, property-based tests, performance benchmarks
│
└─ YES → Could be smoke-eligible, check purpose:
         │
         ├─ Is it an import check, sanity check, or schema contract test?
         │  └─ YES → Add @pytest.mark.smoke (pre-commit eligible)
         │
         ├─ Is it a critical regression test for a high-severity bug?
         │  └─ YES → Add @pytest.mark.smoke + @pytest.mark.regression
         │
         └─ Is it a comprehensive correctness test with many edge cases?
            └─ YES → Complete suite only (save pre-commit time for sanity checks)
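
The same decision tree can be written as a small function, which makes the branch order explicit. This is a sketch only: the `Purpose` labels and the `smoke_markers` helper are hypothetical names introduced for illustration, and the 1-second threshold is the one stated in the tree.

```python
from enum import Enum


class Purpose(Enum):
    """Hypothetical labels for the purposes named in the decision tree."""

    IMPORT_CHECK = "import check"
    SANITY_CHECK = "sanity check"
    SCHEMA_CONTRACT = "schema contract"
    CRITICAL_REGRESSION = "critical regression"
    COMPREHENSIVE = "comprehensive correctness"


SMOKE_PURPOSES = {
    Purpose.IMPORT_CHECK,
    Purpose.SANITY_CHECK,
    Purpose.SCHEMA_CONTRACT,
}


def smoke_markers(duration_seconds: float, purpose: Purpose) -> list[str]:
    """Return the smoke-related markers the decision tree assigns to a test.

    An empty list means "complete suite only". Such a test may still carry
    scope markers (integration, slow, ...) that this sketch does not model.
    """
    if duration_seconds >= 1.0:
        return []  # not fast: complete suite only
    if purpose in SMOKE_PURPOSES:
        return ["smoke"]
    if purpose is Purpose.CRITICAL_REGRESSION:
        return ["smoke", "regression"]
    return []  # fast but comprehensive: keep it out of pre-commit
```

Speed is checked first, so a slow test never reaches the purpose questions, whatever it verifies.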

Important: If the smoke suite grows beyond 5 seconds, demote the slowest tests to complete-only. Pre-commit speed is critical for developer productivity.

Why This Multi-Tier Approach Matters¶

Fast feedback where it matters:

Pre-commit catches 80% of issues in seconds
Developers get immediate feedback without context switching
Reduces wasted time on broken code

Thorough validation before merge:

PR/CI catches the remaining 20% that requires full context
Comprehensive testing ensures quality without slowing development

Human judgment for strategic decisions:

AI and human review catch design issues that tools can’t
Ensures alignment with architectural intent and project vision

Avoiding common anti-patterns:

❌ Putting slow checks in pre-commit → developers use --no-verify
❌ Putting fast checks only in CI → issues caught too late
❌ No code review → missing architectural and design issues

The multi-tier approach keeps the feedback loop tight where it matters most.
