# Test Execution Tiers

This guide explains **when tests and quality checks run** across the 4-tier execution model.

---

## Test Execution Scope (When Things Run)

This section defines the **multi-tier approach** that balances development speed with thorough validation. Each tier answers: **What runs when?**

**Key question:** Is this check fast enough to run without disrupting the development workflow?

---

### Tier 1: Pre-Commit (Smoke Tests + Fast Checks)

#### Overview

Fast, local checks that run automatically on every commit via pre-commit hooks.

#### Time budget

**< 10 seconds total** (all checks combined)

#### Design principle

Catch the most common issues before code ever leaves the developer's machine. Must complete in seconds to avoid disrupting flow.

---

#### What Runs: Complete Checklist

| Check | Tool | What It Catches | Command |
|-------|------|-----------------|---------|
| **Lint** | `ruff check .` | Style violations, import issues, anti-patterns | Auto-fix on commit |
| **Format** | `ruff format --check .` | Inconsistent formatting | Auto-fix on commit |
| **Type-check** | `mypy` (strict) | Missing annotations, type errors | `mypy src/ncaa_eval tests` |
| **Smoke tests** | `pytest -m smoke` | Broken imports, basic sanity failures, schema contract breaks | `pytest -m smoke` |
| **Commit message** | Commitizen | Non-conventional commit format | Automatic validation |

---

#### Smoke Tests: What to Include

Smoke tests are a **curated subset** of tests designed for speed. Individual smoke tests should take < 1 second each, **< 5 seconds total** (a timing sketch for enforcing this follows the examples below).

**Include:**

- ✅ **Import checks** - Verify package imports work (catches circular imports, missing dependencies, broken `__init__.py`)
- ✅ **Core function sanity** - Critical functions accept valid input without crashing (not full correctness, just "doesn't blow up")
- ✅ **Schema/contract tests** - Pydantic models and TypedDicts validate with representative sample data (catches accidental field renames or type changes)
- ✅ **Quick invariant checks** - Fast property tests that verify basic invariants
- ✅ **Fast regression tests** - Regression tests for critical bugs that can be verified quickly

**Exclude:**

- ❌ Anything touching disk, network, or external services
- ❌ Tests that process large DataFrames or datasets
- ❌ Full correctness / edge-case tests (save for the complete suite)
- ❌ Integration tests with I/O
- ❌ Performance benchmarks
- ❌ Property-based tests that generate many examples (Hypothesis is slow)

---

#### Smoke Test Examples

```python
import numpy as np
import pytest
from hypothesis import given, strategies as st

# brier_score, Game, EloFeatureEngine, load_games, and adjust_probability
# are project code; elo_config and temp_data_dir are fixtures.


# Smoke-eligible: Fast unit test (import check)
@pytest.mark.smoke
def test_import_package():
    """Verify package can be imported without errors."""
    import ncaa_eval  # noqa: F401 — import itself is the assertion


# Smoke-eligible: Fast unit test (sanity check)
@pytest.mark.smoke
def test_brier_score_accepts_valid_input():
    """Verify Brier score accepts valid input without crashing."""
    predictions = np.array([0.8, 0.3])
    actuals = np.array([1, 0])
    result = brier_score(predictions, actuals)
    assert result is not None  # Just verify it doesn't crash


# Smoke-eligible: Schema contract test
@pytest.mark.smoke
def test_game_schema_validates():
    """Verify Game Pydantic model validates with sample data."""
    game = Game(season=2023, day_num=100, w_team_id=1234, l_team_id=5678, w_score=75, l_score=70)
    # If this constructs without error, schema is correct
    assert game.w_team_id == 1234


# Smoke-eligible: Fast regression test
@pytest.mark.smoke
@pytest.mark.regression
def test_elo_rating_never_negative_quick(elo_config):
    """Regression test: Elo ratings should never go negative (fast check)."""
    engine = EloFeatureEngine(elo_config)
    # After a loss against a much stronger opponent, rating should stay non-negative
    assert engine.get_rating(team_id=1) >= 0


# NOT smoke-eligible: Integration test with I/O
@pytest.mark.integration
def test_load_games_from_disk(temp_data_dir):
    """Verify games can be loaded from disk (too slow for smoke)."""
    games = load_games(temp_data_dir)
    assert len(games) > 0


# NOT smoke-eligible: Property-based test (Hypothesis is slow)
@pytest.mark.property
@given(prob=st.floats(0, 1))
def test_probability_always_bounded(prob):
    """Verify adjusted probabilities stay in [0, 1]."""
    adjusted = adjust_probability(prob, home_advantage=0.05)
    assert 0 <= adjusted <= 1
```
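Because the per-test budget is easy to drift past unnoticed, it can be enforced mechanically. Below is a minimal `conftest.py` sketch that warns when a marked smoke test overruns its budget; the hook and the `SMOKE_TEST_BUDGET` constant are illustrative assumptions, not existing project code:

```python
# conftest.py - illustrative sketch, not existing project code.
import time

import pytest

SMOKE_TEST_BUDGET = 1.0  # assumed per-test budget in seconds


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_call(item):
    """Time each test call; warn if a @pytest.mark.smoke test overruns."""
    start = time.perf_counter()
    yield  # the test itself runs here
    elapsed = time.perf_counter() - start
    if item.get_closest_marker("smoke") is not None and elapsed > SMOKE_TEST_BUDGET:
        item.warn(
            pytest.PytestWarning(
                f"{item.nodeid} took {elapsed:.2f}s; smoke tests should stay under "
                f"{SMOKE_TEST_BUDGET:.0f}s - consider demoting it to the complete suite"
            )
        )
```

A warning rather than a hard failure keeps pre-commit usable while still surfacing candidates for demotion to the complete suite.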
"""Regression test: Elo ratings should never go negative (fast check).""" engine = EloFeatureEngine(elo_config) # After a loss against much stronger opponent, rating should stay non-negative assert engine.get_rating(team_id=1) >= 0 # NOT smoke-eligible: Integration test with I/O @pytest.mark.integration def test_load_games_from_disk(temp_data_dir): """Verify games can be loaded from disk (too slow for smoke).""" games = load_games(temp_data_dir) assert len(games) > 0 # NOT smoke-eligible: Property-based test (Hypothesis is slow) @pytest.mark.property @given(prob=st.floats(0, 1)) def test_probability_always_bounded(prob): """Verify adjusted probabilities stay in [0, 1].""" adjusted = adjust_probability(prob, home_advantage=0.05) assert 0 <= adjusted <= 1 ``` --- #### Commands - **Marker:** `@pytest.mark.smoke` - **Run:** `pytest -m smoke` - **When:** Every commit (pre-commit hook) #### Rationale Fast feedback loop - catches broken imports, basic sanity failures, schema contract breaks before code is committed. --- ### Tier 2: PR / CI (Complete Suite) #### Overview Thorough validation that runs when a pull request is opened or updated. #### Time budget **Minutes** (acceptable for PR review, not for pre-commit) #### Design principle Complete testing ensures nothing is broken before code reaches the main branch. Time is not a constraint - thoroughness is the priority. --- #### What Runs: Complete Checklist | Check | Tool | What It Catches | Command | |-------|------|-----------------|---------| | **Unit tests** | `pytest` | Logic regressions, broken contracts | `pytest` or `pytest -m "not slow"` | | **Integration tests** | `pytest -m integration` | Component interaction failures | `pytest -m integration` | | **Property-based tests** | `pytest -m property` | Invariant violations, edge cases | `pytest -m property` | | **Performance tests** | `pytest -m performance` | Vectorization violations, speed regressions | `pytest -m performance` | | **Coverage** | `pytest-cov` | Untested code paths | `pytest --cov=src/ncaa_eval --cov-report=term-missing` | | **Mutation testing (selective)** | `mutmut` | Weak tests, gaps in test coverage | `mutmut run --paths-to-mutate=src/ncaa_eval/evaluation/metrics.py` | --- #### Complete Tests: What to Include The full test suite including **all tests** - smoke tests plus everything too slow for pre-commit. 
---

#### Complete Tests: What to Include

The full test suite includes **all tests** - smoke tests plus everything too slow for pre-commit.

**Include:**

- ✅ **All smoke tests** - Fast tests run again as part of the complete suite
- ✅ **Integration tests** - Tests with I/O, database, or external dependencies
- ✅ **Property-based tests** - Hypothesis tests that generate hundreds of examples
- ✅ **Performance tests** - Benchmarks and timing assertions
- ✅ **Mutation testing** - Quality verification for high-priority modules
- ✅ **Full edge-case coverage** - Comprehensive correctness testing

---

#### Complete Test Examples

```python
import timeit

import numpy as np
import pandas as pd
import pytest
from hypothesis import given, strategies as st

# sync_games, load_games, calculate_rolling_average, and brier_score are
# project code; temp_data_dir is a fixture.


# Complete-only: Integration test (too slow for smoke)
@pytest.mark.integration
def test_sync_command_fetches_and_stores_games(temp_data_dir):
    """Verify sync command successfully ingests and stores game data."""
    sync_games(source="test_api", output_dir=temp_data_dir)
    games_df = load_games(temp_data_dir)
    assert len(games_df) > 0
    assert "game_id" in games_df.columns


# Complete-only: Property-based test (Hypothesis is slow)
@pytest.mark.property
@given(data=st.lists(st.integers(), min_size=1, max_size=100))
def test_rolling_average_length_invariant(data):
    """Verify rolling average output length matches input."""
    series = pd.Series(data)
    result = calculate_rolling_average(series, window=5)
    assert len(result) == len(series)


# Complete-only: Performance test
@pytest.mark.slow
@pytest.mark.performance
def test_brier_score_vectorized_performance():
    """Verify Brier score meets performance target (vectorized)."""
    predictions = np.random.rand(100_000)
    actuals = np.random.randint(0, 2, 100_000)
    time_taken = timeit.timeit(
        lambda: brier_score(predictions, actuals),
        number=10,
    ) / 10
    assert time_taken < 0.01, f"Too slow: {time_taken:.4f}s"


# Complete-only: Comprehensive edge-case test
@pytest.mark.parametrize("predictions,actuals,expected", [
    ([1.0, 0.0, 1.0], [1, 0, 1], 0.0),    # Perfect
    ([0.9, 0.1, 0.9], [1, 0, 1], 0.01),   # Near-perfect (mean squared error)
    ([0.5, 0.5, 0.5], [1, 0, 1], 0.25),   # Random
    ([0.0, 1.0, 0.0], [1, 0, 1], 1.0),    # Worst case
])
def test_brier_score_edge_cases(predictions, actuals, expected):
    """Verify Brier score for known edge cases (comprehensive)."""
    result = brier_score(np.array(predictions), np.array(actuals))
    assert abs(result - expected) < 0.01
```

---

#### Commands

- **Marker:** No specific marker (all tests run by default)
- **Run:** `pytest` (no filter)
- **When:** Pull request / CI

#### Rationale

Comprehensive validation before merge - catches regressions, verifies test quality, and ensures performance targets are met.
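All of the markers used across these tiers (`smoke`, `integration`, `property`, `performance`, `slow`, `regression`) must be registered, or `pytest --strict-markers` will reject them. A minimal sketch, assuming registration lives in `conftest.py` rather than `pyproject.toml` (the descriptions are illustrative):

```python
# conftest.py - register the markers used across the execution tiers.
def pytest_configure(config):
    """Register custom markers so --strict-markers accepts them."""
    for marker, description in [
        ("smoke", "fast pre-commit checks (< 1 second each)"),
        ("integration", "tests with I/O or external dependencies"),
        ("property", "Hypothesis property-based tests"),
        ("performance", "benchmarks and timing assertions"),
        ("slow", "tests excluded from quick runs"),
        ("regression", "regression tests for previously fixed bugs"),
    ]:
        config.addinivalue_line("markers", f"{marker}: {description}")
```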
---

### Tier 3: Code Review (AI Agent)

#### Overview

Code review performed by an AI agent via the `code-review` workflow (ideally using a different LLM than the one that implemented the story). Automated tooling and CI cannot catch everything.

#### What the AI agent reviews

| Focus Area | What to Check |
|------------|---------------|
| **Docstring quality** | Are public APIs documented with clear Google-style docstrings? |
| **Vectorization compliance** | No `for` loops over DataFrames for calculations (see [STYLE_GUIDE.md](../STYLE_GUIDE.md) Section 5) |
| **Architecture compliance** | Type sharing, no direct I/O in UI, appropriate use of Pydantic vs TypedDict |
| **Supporting evidence** | Performance claims backed by benchmarks, bug fixes accompanied by regression tests |
| **Design intent** | Does the implementation match the story's acceptance criteria and architectural intent? |
| **Test quality** | Are tests comprehensive? Do they test the right things? Are invariants verified? |

#### Rationale

These are higher-level concerns that automated tools can't evaluate; they require an understanding of project architecture, domain knowledge, and design intent.

---

### Tier 4: Owner Review

#### Overview

Final approval rests with the project owner. The focus areas go beyond what automated tools and AI review cover.

#### What the owner reviews

| Focus Area | Questions to Ask |
|------------|------------------|
| **Functional correctness** | Does this actually solve the problem from a domain perspective? |
| **Strategic alignment** | Is the approach what I expected? Does it align with project direction? |
| **Complexity appropriateness** | Is the solution appropriately complex (not over-engineered, not under-engineered)? |
| **Naming and clarity** | Are names intuitive? Is the code self-explanatory? |
| **Scope creep** | Does this PR do only what was requested, or does it include unrelated changes? |
| **Gut check** | Anything feel off? Trust your instincts. |

#### Rationale

Human judgment on strategic decisions, domain correctness, and alignment with project vision.

---

### Decision Tree: Smoke vs. Complete

```
Is this test fast (< 1 second)?
├─ NO → Complete suite only
│       - Mark with appropriate scope/purpose markers
│         (@pytest.mark.integration, @pytest.mark.slow, etc.)
│       - Examples: I/O tests, property-based tests, performance benchmarks
│
└─ YES → Could be smoke-eligible, check purpose:
    │
    ├─ Is it an import check, sanity check, or schema contract test?
    │   └─ YES → Add @pytest.mark.smoke (pre-commit eligible)
    │
    ├─ Is it a critical regression test for a high-severity bug?
    │   └─ YES → Add @pytest.mark.smoke + @pytest.mark.regression
    │
    └─ Is it a comprehensive correctness test with many edge cases?
        └─ YES → Complete suite only (save pre-commit time for sanity checks)
```

**Important:** If the smoke suite grows beyond 5 seconds, demote the slowest tests to complete-only. Pre-commit speed is critical for developer productivity.

---

### Why This Multi-Tier Approach Matters

**Fast feedback where it matters:**

- Pre-commit catches the bulk of routine issues in seconds
- Developers get immediate feedback without context switching
- Reduces wasted time on broken code

**Thorough validation before merge:**

- PR/CI catches the remaining issues that require full context
- Comprehensive testing ensures quality without slowing development

**Human judgment for strategic decisions:**

- AI and human review catch design issues that tools can't
- Ensures alignment with architectural intent and project vision

**Avoiding common anti-patterns:**

- ❌ Putting slow checks in pre-commit → developers use `--no-verify`
- ❌ Putting fast checks only in CI → issues caught too late
- ❌ No code review → missing architectural and design issues

**The multi-tier approach keeps the feedback loop tight where it matters most.**

---

## See Also

- [Test Scope Guide](test-scope-guide.md) - Unit vs Integration tests
- [Test Approach Guide](test-approach-guide.md) - Example-based vs Property-based
- [Test Purpose Guide](test-purpose-guide.md) - Functional, Performance, Regression
- [Quality Assurance Guide](quality.md) - Mutation testing, coverage analysis
- [Conventions Guide](conventions.md) - Fixtures, markers, organization
- [Domain Testing Guide](domain-testing.md) - Performance and data leakage testing