Test Execution Tiers

This guide explains when tests and quality checks run across the 4-tier execution model.


Test Execution Scope (When Things Run)

This section defines the multi-tier approach that balances development speed with thorough validation. Each tier answers: What runs when?

Key question: Is this fast enough to run without disrupting the development workflow?


Tier 1: Pre-Commit (Smoke Tests + Fast Checks)

Overview

Fast, local checks that run automatically on every commit via pre-commit hooks.

Time budget

< 10 seconds total (all checks combined)

Design principle

Catch the most common issues before code ever leaves the developer’s machine. Must complete in seconds to avoid disrupting flow.


What Runs: Complete Checklist

| Check | Tool | What It Catches | Command |
|---|---|---|---|
| Lint | ruff check . | Style violations, import issues, anti-patterns | Auto-fix on commit |
| Format | ruff format --check . | Inconsistent formatting | Auto-fix on commit |
| Type-check | mypy (strict) | Missing annotations, type errors | mypy src/ncaa_eval tests |
| Smoke tests | pytest -m smoke | Broken imports, basic sanity failures, schema contract breaks | pytest -m smoke |
| Commit message | Commitizen | Non-conventional commit format | Automatic validation |


Smoke Tests: What to Include

Smoke tests are a curated subset of tests designed for speed. Each smoke test should run in under 1 second, and the whole smoke suite in under 5 seconds.

Include:

  • Import checks - Verify package imports work (catches circular imports, missing dependencies, broken __init__.py)

  • Core function sanity - Critical functions accept valid input without crashing (not full correctness, just “doesn’t blow up”)

  • Schema/contract tests - Pydantic models and TypedDicts validate with representative sample data (catches accidental field renames or type changes)

  • Quick invariant checks - Fast property tests that verify basic invariants

  • Fast regression tests - Regression tests for critical bugs that can be verified quickly

Exclude:

  • ❌ Anything touching disk, network, or external services

  • ❌ Tests that process large DataFrames or datasets

  • ❌ Full correctness / edge-case tests (save for complete suite)

  • ❌ Integration tests with I/O

  • ❌ Performance benchmarks

  • ❌ Property-based tests that generate many examples (Hypothesis is slow)
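
As a hedged illustration of the schema/contract item in the Include list, the sketch below checks a TypedDict's field names and annotated types against a frozen expectation, using only the standard library. `GameRow`, `EXPECTED_FIELDS`, and `contract_violations` are hypothetical names loosely modelled on the Game example in this guide, not the project's actual types.

```python
from typing import TypedDict, get_type_hints


class GameRow(TypedDict):
    """Hypothetical record type used only for this illustration."""

    season: int
    w_team_id: int
    l_team_id: int
    w_score: int
    l_score: int


# Frozen copy of the contract; a rename or type change makes the check fail.
EXPECTED_FIELDS = {
    "season": int,
    "w_team_id": int,
    "l_team_id": int,
    "w_score": int,
    "l_score": int,
}


def contract_violations(schema: type, expected: dict[str, type]) -> list[str]:
    """Return human-readable differences between a schema and its contract."""
    actual = get_type_hints(schema)
    problems = []
    for name in sorted(expected.keys() - actual.keys()):
        problems.append(f"missing field: {name}")
    for name in sorted(actual.keys() - expected.keys()):
        problems.append(f"unexpected field: {name}")
    for name in sorted(expected.keys() & actual.keys()):
        if actual[name] is not expected[name]:
            problems.append(f"type changed: {name}")
    return problems


def test_game_row_contract():
    """Smoke-style check: no I/O, no large data, runs in microseconds."""
    assert contract_violations(GameRow, EXPECTED_FIELDS) == []
```

In a real test module this function would carry @pytest.mark.smoke; the decorator is left out here so the sketch runs without pytest installed.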


Smoke Test Examples

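# Third-party imports used by these examples; project imports
# (brier_score, Game, EloFeatureEngine, ...) are omitted.
import numpy as np
import pytest
from hypothesis import given, strategies as st

# Smoke-eligible: Fast unit test (import check)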
@pytest.mark.smoke
def test_import_package():
    """Verify package can be imported without errors."""
    import ncaa_eval  # noqa: F401 — import itself is the assertion

# Smoke-eligible: Fast unit test (sanity check)
@pytest.mark.smoke
def test_brier_score_accepts_valid_input():
    """Verify Brier score accepts valid input without crashing."""
    predictions = np.array([0.8, 0.3])
    actuals = np.array([1, 0])
    result = brier_score(predictions, actuals)
    assert result is not None  # Just verify it doesn't crash

# Smoke-eligible: Schema contract test
@pytest.mark.smoke
def test_game_schema_validates():
    """Verify Game Pydantic model validates with sample data."""
    game = Game(season=2023, day_num=100, w_team_id=1234, l_team_id=5678, w_score=75, l_score=70)
    # If this constructs without error, schema is correct
    assert game.w_team_id == 1234

# Smoke-eligible: Fast regression test
@pytest.mark.smoke
@pytest.mark.regression
def test_elo_rating_never_negative_quick(elo_config):
    """Regression test: Elo ratings should never go negative (fast check)."""
    engine = EloFeatureEngine(elo_config)
    # Quick check: the rating the engine reports for a team must be non-negative
    assert engine.get_rating(team_id=1) >= 0

# NOT smoke-eligible: Integration test with I/O
@pytest.mark.integration
def test_load_games_from_disk(temp_data_dir):
    """Verify games can be loaded from disk (too slow for smoke)."""
    games = load_games(temp_data_dir)
    assert len(games) > 0

# NOT smoke-eligible: Property-based test (Hypothesis is slow)
@pytest.mark.property
@given(prob=st.floats(0, 1))
def test_probability_always_bounded(prob):
    """Verify adjusted probabilities stay in [0, 1]."""
    adjusted = adjust_probability(prob, home_advantage=0.05)
    assert 0 <= adjusted <= 1

Commands

  • Marker: @pytest.mark.smoke

  • Run: pytest -m smoke

  • When: Every commit (pre-commit hook)
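
The markers used throughout this guide need to be registered so that pytest does not warn about unknown marks (and so that `--strict-markers`, if enabled, accepts them). The following is a minimal sketch of one way to do that in a conftest.py. The marker names come from this guide; the descriptions are illustrative wording, and the project may instead register its markers in its pytest configuration file.

```python
# conftest.py (sketch): register the markers this guide refers to.

# Descriptions are illustrative; only the marker names come from this guide.
MARKERS = {
    "smoke": "fast sanity checks run on every commit (pre-commit hook)",
    "integration": "tests with I/O or external dependencies",
    "property": "Hypothesis property-based tests",
    "performance": "benchmarks and timing assertions",
    "slow": "tests too slow for pre-commit",
    "regression": "tests guarding against previously fixed bugs",
}


def pytest_configure(config):
    """Pytest hook: declare each marker so it is not reported as unknown."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, `pytest -m smoke` selects only the tests decorated with @pytest.mark.smoke.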

Rationale

Fast feedback loop - catches broken imports, basic sanity failures, schema contract breaks before code is committed.


Tier 2: PR / CI (Complete Suite)

Overview

Thorough validation that runs when a pull request is opened or updated.

Time budget

Minutes (acceptable for PR review, not for pre-commit)

Design principle

Complete testing ensures nothing is broken before code reaches the main branch. Within the minutes-scale budget, thoroughness takes priority over speed.


What Runs: Complete Checklist

| Check | Tool | What It Catches | Command |
|---|---|---|---|
| Unit tests | pytest | Logic regressions, broken contracts | pytest or pytest -m "not slow" |
| Integration tests | pytest -m integration | Component interaction failures | pytest -m integration |
| Property-based tests | pytest -m property | Invariant violations, edge cases | pytest -m property |
| Performance tests | pytest -m performance | Vectorization violations, speed regressions | pytest -m performance |
| Coverage | pytest-cov | Untested code paths | pytest --cov=src/ncaa_eval --cov-report=term-missing |
| Mutation testing (selective) | mutmut | Weak tests, gaps in test coverage | mutmut run --paths-to-mutate=src/ncaa_eval/evaluation/metrics.py |


Complete Tests: What to Include

The complete suite runs every test: the smoke tests plus everything too slow for pre-commit.

Include:

  • All smoke tests - Fast tests run again as part of complete suite

  • Integration tests - Tests with I/O, database, or external dependencies

  • Property-based tests - Hypothesis tests that generate hundreds of examples

  • Performance tests - Benchmarks and timing assertions

  • Mutation testing - Quality verification for high-priority modules

  • Full edge-case coverage - Comprehensive correctness testing
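
Timing assertions tend to be noisy on shared CI runners. One hedged way to make them steadier, using only the standard library, is to repeat the measurement and compare the fastest run against the budget, since slower runs usually reflect interference rather than the code under test. The helper name, the default repeat counts, and the 0.05-second budget below are assumptions for illustration.

```python
import timeit
from typing import Callable


def best_seconds_per_call(func: Callable[[], object], number: int = 10, repeat: int = 5) -> float:
    """Return the fastest observed per-call time over several repeats."""
    # timeit.repeat returns one total time per repeat, each covering `number` calls.
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number


def test_sum_of_squares_is_fast_enough():
    """Illustrative timing assertion with a deliberately generous budget."""
    values = list(range(1_000))
    elapsed = best_seconds_per_call(lambda: sum(v * v for v in values))
    assert elapsed < 0.05, f"Too slow: {elapsed:.4f}s"
```

Taking the minimum rather than the mean keeps a single slow repeat from failing the build, at the cost of hiding occasional slow paths.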


Complete Test Examples

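# Imports used by these examples; project imports
# (sync_games, load_games, brier_score, ...) are omitted.
import timeit

import numpy as np
import pandas as pd
import pytest
from hypothesis import given, strategies as st

# Complete-only: Integration test (too slow for smoke)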
@pytest.mark.integration
def test_sync_command_fetches_and_stores_games(temp_data_dir):
    """Verify sync command successfully ingests and stores game data."""
    sync_games(source="test_api", output_dir=temp_data_dir)
    games_df = load_games(temp_data_dir)
    assert len(games_df) > 0
    assert "game_id" in games_df.columns

# Complete-only: Property-based test (Hypothesis is slow)
@pytest.mark.property
@given(data=st.lists(st.integers(), min_size=1, max_size=100))
def test_rolling_average_length_invariant(data):
    """Verify rolling average output length matches input."""
    series = pd.Series(data)
    result = calculate_rolling_average(series, window=5)
    assert len(result) == len(series)

# Complete-only: Performance test
@pytest.mark.slow
@pytest.mark.performance
def test_brier_score_vectorized_performance():
    """Verify Brier score meets performance target (vectorized)."""
    predictions = np.random.rand(100_000)
    actuals = np.random.randint(0, 2, 100_000)

    time_taken = timeit.timeit(
        lambda: brier_score(predictions, actuals),
        number=10
    ) / 10

    assert time_taken < 0.01, f"Too slow: {time_taken:.4f}s"

# Complete-only: Comprehensive edge-case test
@pytest.mark.parametrize("predictions,actuals,expected", [
    ([1.0, 0.0, 1.0], [1, 0, 1], 0.0),       # Perfect
    ([0.9, 0.1, 0.9], [1, 0, 1], 0.01),      # Near-perfect (each squared error is 0.01)
    ([0.5, 0.5, 0.5], [1, 0, 1], 0.25),      # Random
    ([0.0, 1.0, 0.0], [1, 0, 1], 1.0),       # Worst case
])
def test_brier_score_edge_cases(predictions, actuals, expected):
    """Verify Brier score for known edge cases (comprehensive)."""
    result = brier_score(np.array(predictions), np.array(actuals))
    assert abs(result - expected) < 0.01
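
For reference when reading the expected values in the parametrized cases: the Brier score is the mean squared difference between predicted probabilities and binary outcomes. The pure-Python function below is a hypothetical reference implementation for checking expected values by hand. It is not the project's `brier_score`, which this guide describes as vectorized.

```python
from typing import Sequence


def brier_score_reference(predictions: Sequence[float], actuals: Sequence[int]) -> float:
    """Mean of (prediction - outcome)^2; lower is better, 0.0 is perfect."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return sum(squared_errors) / len(squared_errors)


# Hand-checkable cases:
perfect = brier_score_reference([1.0, 0.0, 1.0], [1, 0, 1])    # every term is 0
coin_flip = brier_score_reference([0.5, 0.5, 0.5], [1, 0, 1])  # every term is 0.25
worst = brier_score_reference([0.0, 1.0, 0.0], [1, 0, 1])      # every term is 1
```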

Commands

  • Marker: No specific marker (all tests run by default)

  • Run: pytest (no filter)

  • When: Pull request / CI

Rationale

Comprehensive validation before merge - catches regressions, verifies test quality, ensures performance targets.


Tier 3: Code Review (AI Agent)

Overview

Automated tooling and CI cannot catch everything, so code review is performed by an AI agent via the code-review workflow, ideally using a different LLM than the one that implemented the story.

What the AI agent reviews

| Focus Area | What to Check |
|---|---|
| Docstring quality | Are public APIs documented with clear Google-style docstrings? |
| Vectorization compliance | No for loops over DataFrames for calculations (see STYLE_GUIDE.md Section 5) |
| Architecture compliance | Type sharing, no direct IO in UI, appropriate use of Pydantic vs TypedDict |
| Supporting evidence | Performance claims backed by benchmarks, bug fixes accompanied by regression tests |
| Design intent | Does the implementation match the story’s acceptance criteria and architectural intent? |
| Test quality | Are tests comprehensive? Do they test the right things? Are invariants verified? |

Rationale

Higher-level concerns that automated tools can’t evaluate. Requires understanding of project architecture, domain knowledge, and design intent.


Tier 4: Owner Review

Overview

Final approval rests with the project owner, who focuses on areas beyond what automated tools and AI review cover.

What the owner reviews

| Focus Area | Questions to Ask |
|---|---|
| Functional correctness | Does this actually solve the problem from a domain perspective? |
| Strategic alignment | Is the approach what I expected? Does it align with project direction? |
| Complexity appropriateness | Is the solution appropriately complex (not over-engineered, not under-engineered)? |
| Naming and clarity | Are names intuitive? Is the code self-explanatory? |
| Scope creep | Does this PR do only what was requested, or does it include unrelated changes? |
| Gut check | Anything feel off? Trust your instincts. |

Rationale

Human judgment on strategic decisions, domain correctness, and alignment with project vision.


Decision Tree: Smoke vs. Complete

Is this test fast (< 1 second)?
├─ NO → Complete suite only
│        - Mark with appropriate scope/purpose markers (@pytest.mark.integration, @pytest.mark.slow, etc.)
│        - Examples: I/O tests, property-based tests, performance benchmarks
│
└─ YES → Could be smoke-eligible, check purpose:
         │
         ├─ Is it an import check, sanity check, or schema contract test?
         │  └─ YES → Add @pytest.mark.smoke (pre-commit eligible)
         │
         ├─ Is it a critical regression test for a high-severity bug?
         │  └─ YES → Add @pytest.mark.smoke + @pytest.mark.regression
         │
         └─ Is it a comprehensive correctness test with many edge cases?
            └─ YES → Complete suite only (save pre-commit time for sanity checks)
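
The tree above can also be read as a small function. The sketch below is a hypothetical encoding for illustration only: the `Purpose` categories and the function name are assumptions, and the returned strings are the marker names used in this guide.

```python
from enum import Enum


class Purpose(Enum):
    """Hypothetical test-purpose categories mirroring the tree's branches."""

    SANITY = "import check, sanity check, or schema contract test"
    CRITICAL_REGRESSION = "critical regression test for a high-severity bug"
    COMPREHENSIVE = "comprehensive correctness test with many edge cases"


def smoke_markers(duration_seconds: float, purpose: Purpose) -> list[str]:
    """Return the smoke-related markers a test should carry.

    An empty list means "complete suite only"; such a test may still need
    other markers (integration, slow, ...) that this sketch does not decide.
    """
    if duration_seconds >= 1.0:
        return []  # Not fast: complete suite only.
    if purpose is Purpose.SANITY:
        return ["smoke"]
    if purpose is Purpose.CRITICAL_REGRESSION:
        return ["smoke", "regression"]
    return []  # Fast but comprehensive: keep pre-commit time for sanity checks.
```

For example, `smoke_markers(0.2, Purpose.CRITICAL_REGRESSION)` yields both the smoke and regression markers, while any test at or above the 1-second threshold yields none.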

Important: If the smoke suite grows beyond 5 seconds, demote the slowest tests to complete-only. Pre-commit speed is critical for developer productivity.


Why This Multi-Tier Approach Matters

Fast feedback where it matters:

  • Pre-commit catches 80% of issues in seconds

  • Developers get immediate feedback without context switching

  • Reduces wasted time on broken code

Thorough validation before merge:

  • PR/CI catches the remaining 20% that requires full context

  • Comprehensive testing ensures quality without slowing development

Human judgment for strategic decisions:

  • AI and human review catch design issues that tools can’t

  • Ensures alignment with architectural intent and project vision

Avoiding common anti-patterns:

  • ❌ Putting slow checks in pre-commit → developers use --no-verify

  • ❌ Putting fast checks only in CI → issues caught too late

  • ❌ No code review → missing architectural and design issues

The multi-tier approach keeps the feedback loop tight where it matters most.


See Also