Test Execution Tiers¶
This guide explains when tests and quality checks run across the 4-tier execution model.
Test Execution Scope (When Things Run)¶
This section defines the multi-tier approach that balances development speed with thorough validation. Each tier answers: What runs when?
Key question: Is this fast enough to run without disrupting the development workflow?
Tier 1: Pre-Commit (Smoke Tests + Fast Checks)¶
Overview¶
Fast, local checks that run automatically on every commit via pre-commit hooks.
Time budget¶
< 10 seconds total (all checks combined)
Design principle¶
Catch the most common issues before code ever leaves the developer’s machine. Must complete in seconds to avoid disrupting flow.
What Runs: Complete Checklist¶
| Check | Tool | What It Catches | Command |
|---|---|---|---|
| Lint | | Style violations, import issues, anti-patterns | Auto-fix on commit |
| Format | | Inconsistent formatting | Auto-fix on commit |
| Type-check | | Missing annotations, type errors | |
| Smoke tests | | Broken imports, basic sanity failures, schema contract breaks | |
| Commit message | Commitizen | Non-conventional commit format | Automatic validation |
Smoke Tests: What to Include¶
Smoke tests are a curated subset of tests designed for speed. Each smoke test should run in under 1 second, and the whole smoke suite in under 5 seconds.
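The per-test limit can be made explicit in code. The sketch below is one possible approach, not part of the project's tooling: `max_duration` is a hypothetical decorator that fails a test when it exceeds a wall-clock limit, using only the standard library.

```python
import functools
import time


def max_duration(limit_seconds: float):
    """Decorator that fails a test if it runs longer than limit_seconds.

    Hypothetical helper for illustration; the name is an assumption.
    """

    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = test_func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            # Fail loudly so a slow test is noticed and demoted.
            assert elapsed < limit_seconds, (
                f"{test_func.__name__} took {elapsed:.3f}s, "
                f"limit is {limit_seconds}s"
            )
            return result

        return wrapper

    return decorator


@max_duration(1.0)
def example_fast_check():
    """Stand-in for a smoke test body."""
    return sum(range(1000))
```

Wall-clock limits can be noisy on a loaded machine, so a limit like this is better treated as a tripwire than as a precise measurement.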
Include:
✅ Import checks - Verify package imports work (catches circular imports, missing dependencies, broken __init__.py)
✅ Core function sanity - Critical functions accept valid input without crashing (not full correctness, just “doesn’t blow up”)
✅ Schema/contract tests - Pydantic models and TypedDicts validate with representative sample data (catches accidental field renames or type changes)
✅ Quick invariant checks - Fast property tests that verify basic invariants
✅ Fast regression tests - Regression tests for critical bugs that can be verified quickly
Exclude:
❌ Anything touching disk, network, or external services
❌ Tests that process large DataFrames or datasets
❌ Full correctness / edge-case tests (save for complete suite)
❌ Integration tests with I/O
❌ Performance benchmarks
❌ Property-based tests that generate many examples (Hypothesis is slow)
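The network part of this exclusion list can be checked mechanically. The following is a minimal sketch under stated assumptions: `block_network`, `touches_network`, and `NetworkBlockedError` are hypothetical names, and the guard only covers code that creates sockets through the `socket` module attribute.

```python
import socket
from contextlib import contextmanager


class NetworkBlockedError(RuntimeError):
    """Raised when code tries to open a socket inside a guarded block."""


@contextmanager
def block_network():
    """Temporarily replace socket.socket so new socket creation fails."""
    original_socket = socket.socket

    def guarded_socket(*args, **kwargs):
        raise NetworkBlockedError("network access is not allowed in smoke tests")

    socket.socket = guarded_socket
    try:
        yield
    finally:
        # Always restore the real class, even if the guarded code raised.
        socket.socket = original_socket


def touches_network(func):
    """Return True if calling func() tries to create a socket."""
    with block_network():
        try:
            func()
        except NetworkBlockedError:
            return True
    return False
```

A test that trips this guard is a candidate for `@pytest.mark.integration` rather than `@pytest.mark.smoke`. Code that captured a reference to the socket class before the guard was installed would not be caught by this sketch.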
Smoke Test Examples¶
```python
import numpy as np
import pytest
from hypothesis import given, strategies as st

# Project imports (brier_score, Game, EloFeatureEngine, load_games,
# adjust_probability) are omitted; their module paths are not shown here.


# Smoke-eligible: Fast unit test (import check)
@pytest.mark.smoke
def test_import_package():
    """Verify package can be imported without errors."""
    import ncaa_eval  # noqa: F401 - import itself is the assertion


# Smoke-eligible: Fast unit test (sanity check)
@pytest.mark.smoke
def test_brier_score_accepts_valid_input():
    """Verify Brier score accepts valid input without crashing."""
    predictions = np.array([0.8, 0.3])
    actuals = np.array([1, 0])
    result = brier_score(predictions, actuals)
    assert result is not None  # Just verify it doesn't crash


# Smoke-eligible: Schema contract test
@pytest.mark.smoke
def test_game_schema_validates():
    """Verify Game Pydantic model validates with sample data."""
    game = Game(season=2023, day_num=100, w_team_id=1234, l_team_id=5678, w_score=75, l_score=70)
    # If this constructs without error, schema is correct
    assert game.w_team_id == 1234


# Smoke-eligible: Fast regression test
@pytest.mark.smoke
@pytest.mark.regression
def test_elo_rating_never_negative_quick(elo_config):
    """Regression test: Elo ratings should never go negative (fast check)."""
    engine = EloFeatureEngine(elo_config)
    # Minimal check on a freshly built engine; a fuller regression test would
    # first apply a loss against a much stronger opponent.
    assert engine.get_rating(team_id=1) >= 0


# NOT smoke-eligible: Integration test with I/O
@pytest.mark.integration
def test_load_games_from_disk(temp_data_dir):
    """Verify games can be loaded from disk (too slow for smoke)."""
    games = load_games(temp_data_dir)
    assert len(games) > 0


# NOT smoke-eligible: Property-based test (Hypothesis is slow)
@pytest.mark.property
@given(prob=st.floats(0, 1))
def test_probability_always_bounded(prob):
    """Verify adjusted probabilities stay in [0, 1]."""
    adjusted = adjust_probability(prob, home_advantage=0.05)
    assert 0 <= adjusted <= 1
```
Commands¶
Marker: @pytest.mark.smoke
Run: pytest -m smoke
When: Every commit (pre-commit hook)
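Custom markers such as `smoke` are normally registered so that pytest does not warn about unknown marks. The sketch below shows one way to do that from a `conftest.py` using pytest's `pytest_configure` hook and `config.addinivalue_line`. The marker names are the ones used in this guide; the descriptions and the `MARKERS` dictionary are illustrative assumptions, and the project may register its markers elsewhere.

```python
# conftest.py (sketch; descriptions are illustrative)
MARKERS = {
    "smoke": "fast sanity test, eligible for the pre-commit hook",
    "regression": "guards against a previously fixed bug",
    "integration": "touches disk, network, or external services",
    "property": "Hypothesis property-based test",
    "slow": "too slow for pre-commit",
    "performance": "benchmark or timing assertion",
}


def pytest_configure(config):
    """Register each custom marker with pytest at start-up."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, running pytest with `--strict-markers` turns a misspelled marker into an error instead of silently excluding the test from `pytest -m smoke`.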
Rationale¶
Fast feedback loop - catches broken imports, basic sanity failures, schema contract breaks before code is committed.
Tier 2: PR / CI (Complete Suite)¶
Overview¶
Thorough validation that runs when a pull request is opened or updated.
Time budget¶
Minutes (acceptable for PR review, not for pre-commit)
Design principle¶
Complete testing ensures nothing is broken before code reaches the main branch. Speed is secondary here (minutes are acceptable); thoroughness is the priority.
What Runs: Complete Checklist¶
| Check | Tool | What It Catches | Command |
|---|---|---|---|
| Unit tests | | Logic regressions, broken contracts | |
| Integration tests | | Component interaction failures | |
| Property-based tests | | Invariant violations, edge cases | |
| Performance tests | | Vectorization violations, speed regressions | |
| Coverage | | Untested code paths | |
| Mutation testing (selective) | | Weak tests, gaps in test coverage | |
Complete Tests: What to Include¶
The full test suite including all tests - smoke tests plus everything too slow for pre-commit.
Include:
✅ All smoke tests - Fast tests run again as part of complete suite
✅ Integration tests - Tests with I/O, database, or external dependencies
✅ Property-based tests - Hypothesis tests that generate hundreds of examples
✅ Performance tests - Benchmarks and timing assertions
✅ Mutation testing - Quality verification for high-priority modules
✅ Full edge-case coverage - Comprehensive correctness testing
Complete Test Examples¶
```python
import timeit

import numpy as np
import pandas as pd
import pytest
from hypothesis import given, strategies as st

# Project imports (sync_games, load_games, calculate_rolling_average,
# brier_score) are omitted; their module paths are not shown here.


# Complete-only: Integration test (too slow for smoke)
@pytest.mark.integration
def test_sync_command_fetches_and_stores_games(temp_data_dir):
    """Verify sync command successfully ingests and stores game data."""
    sync_games(source="test_api", output_dir=temp_data_dir)
    games_df = load_games(temp_data_dir)
    assert len(games_df) > 0
    assert "game_id" in games_df.columns


# Complete-only: Property-based test (Hypothesis is slow)
# Integers are bounded so the Series keeps a numeric dtype.
@pytest.mark.property
@given(data=st.lists(st.integers(-10**6, 10**6), min_size=1, max_size=100))
def test_rolling_average_length_invariant(data):
    """Verify rolling average output length matches input."""
    series = pd.Series(data)
    result = calculate_rolling_average(series, window=5)
    assert len(result) == len(series)


# Complete-only: Performance test
@pytest.mark.slow
@pytest.mark.performance
def test_brier_score_vectorized_performance():
    """Verify Brier score meets performance target (vectorized)."""
    predictions = np.random.rand(100_000)
    actuals = np.random.randint(0, 2, 100_000)
    time_taken = timeit.timeit(
        lambda: brier_score(predictions, actuals),
        number=10,
    ) / 10
    assert time_taken < 0.01, f"Too slow: {time_taken:.4f}s"


# Complete-only: Comprehensive edge-case test
# Expected values assume the standard definition: mean squared error.
@pytest.mark.parametrize("predictions,actuals,expected", [
    ([1.0, 0.0, 1.0], [1, 0, 1], 0.0),   # Perfect
    ([0.9, 0.1, 0.9], [1, 0, 1], 0.01),  # Near-perfect: each squared error is 0.01
    ([0.5, 0.5, 0.5], [1, 0, 1], 0.25),  # Random
    ([0.0, 1.0, 0.0], [1, 0, 1], 1.0),   # Worst case
])
def test_brier_score_edge_cases(predictions, actuals, expected):
    """Verify Brier score for known edge cases (comprehensive)."""
    result = brier_score(np.array(predictions), np.array(actuals))
    assert result == pytest.approx(expected, abs=1e-6)
```
Commands¶
Marker: No specific marker (all tests run by default)
Run: pytest (no filter)
When: Pull request / CI
Rationale¶
Comprehensive validation before merge - catches regressions, verifies test quality, ensures performance targets.
Tier 3: Code Review (AI Agent)¶
Overview¶
Code review performed by an AI agent via the code-review workflow (ideally using a different LLM than the one that implemented the story). Automated tooling and CI cannot catch everything.
What the AI agent reviews¶
| Focus Area | What to Check |
|---|---|
| Docstring quality | Are public APIs documented with clear Google-style docstrings? |
| Vectorization compliance | No |
| Architecture compliance | Type sharing, no direct IO in UI, appropriate use of Pydantic vs TypedDict |
| Supporting evidence | Performance claims backed by benchmarks, bug fixes accompanied by regression tests |
| Design intent | Does the implementation match the story’s acceptance criteria and architectural intent? |
| Test quality | Are tests comprehensive? Do they test the right things? Are invariants verified? |
Rationale¶
Higher-level concerns that automated tools can’t evaluate. Requires understanding of project architecture, domain knowledge, and design intent.
Tier 4: Owner Review¶
Overview¶
Final approval rests with the project owner, who focuses on areas that automated tools and AI review do not cover.
What the owner reviews¶
| Focus Area | Questions to Ask |
|---|---|
| Functional correctness | Does this actually solve the problem from a domain perspective? |
| Strategic alignment | Is the approach what I expected? Does it align with project direction? |
| Complexity appropriateness | Is the solution appropriately complex (not over-engineered, not under-engineered)? |
| Naming and clarity | Are names intuitive? Is the code self-explanatory? |
| Scope creep | Does this PR do only what was requested, or does it include unrelated changes? |
| Gut check | Anything feel off? Trust your instincts. |
Rationale¶
Human judgment on strategic decisions, domain correctness, and alignment with project vision.
Decision Tree: Smoke vs. Complete¶
```
Is this test fast (< 1 second)?
├─ NO → Complete suite only
│    - Mark with appropriate scope/purpose markers (@pytest.mark.integration, @pytest.mark.slow, etc.)
│    - Examples: I/O tests, property-based tests, performance benchmarks
│
└─ YES → Could be smoke-eligible, check purpose:
    │
    ├─ Is it an import check, sanity check, or schema contract test?
    │   └─ YES → Add @pytest.mark.smoke (pre-commit eligible)
    │
    ├─ Is it a critical regression test for a high-severity bug?
    │   └─ YES → Add @pytest.mark.smoke + @pytest.mark.regression
    │
    └─ Is it a comprehensive correctness test with many edge cases?
        └─ YES → Complete suite only (save pre-commit time for sanity checks)
```
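The same decision tree can be written as a small function, which some readers may find easier to scan. This is an illustrative sketch only: `choose_markers` and its parameters are hypothetical names, and the function decides just the smoke-related markers, leaving scope markers such as `integration` or `slow` to be chosen separately.

```python
def choose_markers(
    duration_seconds: float,
    is_sanity_check: bool = False,
    is_critical_regression: bool = False,
) -> list[str]:
    """Return the smoke-related pytest markers suggested by the decision tree.

    An empty list means "complete suite only, no smoke marker".
    """
    if duration_seconds >= 1.0:
        # Slow tests never qualify for pre-commit.
        return []
    if is_sanity_check:
        # Import checks, sanity checks, and schema contract tests.
        return ["smoke"]
    if is_critical_regression:
        return ["smoke", "regression"]
    # Fast but comprehensive correctness tests stay in the complete suite.
    return []


print(choose_markers(0.2, is_sanity_check=True))         # ['smoke']
print(choose_markers(0.2, is_critical_regression=True))  # ['smoke', 'regression']
print(choose_markers(3.0, is_sanity_check=True))         # []
```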
Important: If the smoke suite grows beyond 5 seconds, demote the slowest tests to complete-only. Pre-commit speed is critical for developer productivity.
Why This Multi-Tier Approach Matters¶
Fast feedback where it matters:
Pre-commit catches 80% of issues in seconds
Developers get immediate feedback without context switching
Reduces wasted time on broken code
Thorough validation before merge:
PR/CI catches the remaining 20% that requires full context
Comprehensive testing ensures quality without slowing development
Human judgment for strategic decisions:
AI and human review catch design issues that tools can’t
Ensures alignment with architectural intent and project vision
Avoiding common anti-patterns:
❌ Putting slow checks in pre-commit → developers use --no-verify
❌ Putting fast checks only in CI → issues caught too late
❌ No code review → missing architectural and design issues
The multi-tier approach keeps the feedback loop tight where it matters most.
See Also¶
Test Scope Guide - Unit vs Integration tests
Test Approach Guide - Example-based vs Property-based
Test Purpose Guide - Functional, Performance, Regression
Quality Assurance Guide - Mutation testing, coverage analysis
Conventions Guide - Fixtures, markers, organization
Domain Testing Guide - Performance and data leakage testing