Domain-Specific Testing

This guide covers testing approaches specific to the ncaa_eval project’s domain requirements: performance testing (vectorization compliance) and data leakage prevention.


Performance Testing Guidelines

Context: Vectorization First Rule

The Architecture mandates “Vectorization First” (Section 12):

Reject PRs that use for loops over Pandas DataFrames for metric calculations.

This project has a 60-second backtest target (10-year Elo training + inference). NFR1 mandates vectorization; NFR2 mandates parallelism. Performance testing ensures these requirements are met.
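To make the mandate concrete, here is a minimal sketch (function names are illustrative, not the project's real API) contrasting a loop-based Brier score with the vectorized form that review requires:

```python
import numpy as np

def brier_score_loop(predictions, actuals):
    """The pattern the Architecture rejects: a Python loop over the data."""
    total = 0.0
    for p, a in zip(predictions, actuals):
        total += (p - a) ** 2
    return total / len(predictions)

def brier_score_vectorized(predictions, actuals):
    """The mandated form: a single NumPy expression, no Python loop."""
    predictions = np.asarray(predictions, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    return float(np.mean((predictions - actuals) ** 2))

preds = np.array([0.9, 0.2, 0.6])
wins = np.array([1, 0, 1])
print(round(brier_score_vectorized(preds, wins), 4))  # 0.07
```

Both return the same value; the vectorized version scales to the 10k-row benchmarks below without a Python-level loop.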


How to Test Vectorization

1. Assertion-Based Unit Tests (Quick Smoke Check)

Purpose: Catch obvious violations quickly during pre-commit

import inspect

import pytest

# Assumed import path for illustration; adjust to the real location of brier_score.
from ncaa_eval.evaluation.metrics import brier_score

@pytest.mark.smoke
def test_brier_score_is_vectorized():
    """Verify Brier score uses vectorized operations (no Python loops)."""
    source = inspect.getsource(brier_score)

    # Check that the implementation doesn't contain for loops over data
    # (allow for loops in fixture setup, just not in calculation logic).
    assert "for " not in source or "# vectorized" in source, \
        "Brier score must use vectorized numpy/pandas operations"

Pros: Fast, catches obvious violations.
Cons: Brittle; the substring check can flag the word "for" in comments, docstrings, or legitimate setup loops.

Pre-commit: YES - mark as @pytest.mark.smoke
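If the substring heuristic proves too brittle, a sturdier variant parses the source with Python's ast module and looks for actual for-statement nodes. A sketch (operating on source text rather than a live function, so the names here are illustrative):

```python
import ast
import textwrap

def source_has_for_loop(source: str) -> bool:
    """Return True if the given source contains a for/async-for statement."""
    tree = ast.parse(textwrap.dedent(source))
    return any(isinstance(node, (ast.For, ast.AsyncFor)) for node in ast.walk(tree))

loopy = """
def brier_loopy(preds, actuals):
    total = 0.0
    for p, a in zip(preds, actuals):
        total += (p - a) ** 2
    return total / len(preds)
"""

vectorized = """
def brier_vec(preds, actuals):  # for reference: no loop here
    return float(((preds - actuals) ** 2).mean())
"""

print(source_has_for_loop(loopy))       # True
print(source_has_for_loop(vectorized))  # False
```

Unlike the substring check, the AST check ignores the word "for" in comments and docstrings (note the comment in the second snippet), though it also skips comprehensions, which compile to different node types.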


2. Performance Benchmark Integration Tests

Purpose: Verify actual performance meets targets

import timeit

import numpy as np
import pytest

# Assumed import path for illustration; adjust to the real location of brier_score.
from ncaa_eval.evaluation.metrics import brier_score

@pytest.mark.slow
@pytest.mark.performance
@pytest.mark.integration
def test_brier_score_performance():
    """Verify Brier score meets performance target."""
    predictions = np.random.rand(10_000)
    actuals = np.random.randint(0, 2, 10_000)

    time_taken = timeit.timeit(
        lambda: brier_score(predictions, actuals),
        number=100
    )

    # Should complete 100 iterations in < 100ms (1ms per call for 10k predictions)
    assert time_taken < 0.1, f"Brier score too slow: {time_taken:.3f}s for 100 iterations"

Pros: Verifies actual performance, catches regressions.
Cons: Slow; results vary by machine, so thresholds must be generous.

Pre-commit: NO (too slow)
PR-time: YES - mark as @pytest.mark.slow and @pytest.mark.performance
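To soften the machine-dependence caveat, one option is to assert a relative speedup against a deliberately naive loop instead of an absolute wall-clock bound. A sketch (the 5x factor is an assumption to tune per module; function names are illustrative):

```python
import timeit

import numpy as np

def brier_loop(preds, actuals):
    # Naive reference implementation: Python-level loop.
    return sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds)

def brier_vec(preds, actuals):
    # Vectorized implementation under test.
    return float(np.mean((preds - actuals) ** 2))

preds = np.random.rand(10_000)
actuals = np.random.randint(0, 2, 10_000).astype(float)

t_loop = timeit.timeit(lambda: brier_loop(preds, actuals), number=20)
t_vec = timeit.timeit(lambda: brier_vec(preds, actuals), number=20)

# Relative check: holds on fast and slow machines alike.
assert t_vec * 5 < t_loop, f"expected >= 5x speedup, got {t_loop / t_vec:.1f}x"
```

A relative threshold trades an exact performance target for portability; keep one absolute test (like the 60-second backtest below) to anchor real-world expectations.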


3. pytest-benchmark (Optional, for Critical Paths)

⚠️ NOT INSTALLED: pytest-benchmark is optional and NOT currently in pyproject.toml. Use timeit-based benchmarks (Option 2) for now.

Purpose: Track performance over time with detailed statistics

import numpy as np

# The `benchmark` fixture is provided by the pytest-benchmark plugin.
def test_brier_score_benchmark(benchmark):
    """Benchmark Brier score calculation."""
    predictions = np.random.rand(10_000)
    actuals = np.random.randint(0, 2, 10_000)

    result = benchmark(brier_score, predictions, actuals)

    # Benchmark provides detailed statistics (mean, stddev, percentiles)
    assert result is not None

Pros: Detailed statistics, pytest integration, tracks performance over time.
Cons: Requires the pytest-benchmark plugin (not currently installed).

Pre-commit: NO
PR-time: YES (if the plugin is installed)


Comprehensive Performance Test Example

import timeit

import pytest

# `load_games_for_years` and `EloModel` are illustrative project imports.
@pytest.mark.slow
@pytest.mark.performance
@pytest.mark.integration
def test_full_backtest_meets_60_second_target():
    """Verify 10-year Elo backtest completes within 60 seconds (NFR1)."""
    games = load_games_for_years(range(2013, 2023))

    start_time = timeit.default_timer()
    model = EloModel()
    model.fit(games)
    predictions = model.predict(games)
    elapsed = timeit.default_timer() - start_time

    assert elapsed < 60.0, f"Backtest too slow: {elapsed:.2f}s (target: < 60s)"

Vectorization Smoke Test Pattern

import inspect

import pytest

@pytest.mark.smoke
@pytest.mark.performance
def test_metric_calculations_are_vectorized():
    """Verify critical metric calculations don't use Python loops (NFR1)."""
    from ncaa_eval.evaluation import metrics

    # Check that metric module doesn't contain for loops over DataFrames
    source = inspect.getsource(metrics)

    # This is a heuristic - manual review is still needed
    forbidden_patterns = [
        ".iterrows()",       # also covers "for _ in df.iterrows()"
        ".itertuples()",
        "for row in df",
    ]

    for pattern in forbidden_patterns:
        assert pattern not in source, \
            f"Metrics module contains non-vectorized pattern: {pattern}"

Note: The ingest layer (src/ncaa_eval/ingest/) is excluded from the iterrows() forbidden-pattern check only. iterrows() is explicitly permitted there for one-time-per-sync data transformation (see Style Guide Section 5, Exception #4). The .itertuples() and for row in df patterns remain prohibited in business-logic and metric-calculation code; their use elsewhere (ingest connectors, model prediction loops) is evaluated case-by-case against the vectorization mandate.
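For reference, the usual fix when one of these patterns is flagged is to replace the row loop with a column expression. A sketch with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({"pred": [0.9, 0.2, 0.6], "won": [1, 0, 1]})

# Forbidden in metric code: per-row iteration.
# errors = [(row.pred - row.won) ** 2 for _, row in df.iterrows()]

# Vectorized equivalent: one column-wise expression.
df["sq_error"] = (df["pred"] - df["won"]) ** 2
print(round(df["sq_error"].mean(), 4))  # 0.07
```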


Recommendation

  • Use assertion-based tests for quick smoke checks (mark as @pytest.mark.smoke)

  • Use performance benchmarks for critical modules (evaluation/metrics.py, evaluation/simulation.py) as @pytest.mark.slow tests (PR-time only)

  • Consider pytest-benchmark for performance regression tracking (optional, not currently installed)


Data Leakage Prevention Testing

Context: Temporal Boundary Enforcement

The Architecture mandates “Data Safety: Temporal boundaries enforced strictly in the API to prevent data leakage” (Section 11.2).

NFR4 requires strict temporal boundaries: training data must never include information from future games. This is critical for model validity.


How to Test Temporal Integrity

1. Unit Tests (API Contract)

Purpose: Verify API rejects requests for future data

import pytest

# `ChronologicalDataServer` is the project's chronological data API.
def test_get_chronological_season_enforces_cutoff():
    """Verify chronological API rejects requests for future data."""
    api = ChronologicalDataServer()

    with pytest.raises(ValueError, match="Cannot access future data"):
        api.get_games_before(date="2025-12-31", cutoff_date="2025-01-01")

Tests: API raises error when requesting data beyond cutoff

Pre-commit: YES (fast, no I/O) - mark as @pytest.mark.smoke
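For context, the guard this contract implies might look like the following sketch (a hypothetical stand-in, not the real ChronologicalDataServer):

```python
from datetime import date

def get_games_before(request_date: str, cutoff_date: str) -> list:
    """Hypothetical guard: reject any request reaching past the cutoff."""
    if date.fromisoformat(request_date) > date.fromisoformat(cutoff_date):
        raise ValueError("Cannot access future data")
    return []  # a real implementation would query the game store here

print(get_games_before("2024-12-31", "2025-01-01"))  # [] -- within cutoff

try:
    get_games_before("2025-12-31", "2025-01-01")
except ValueError as exc:
    print(exc)  # Cannot access future data
```

Putting the check at the API boundary means every caller inherits the protection, which is what the unit test above verifies.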


2. Integration Tests (End-to-End Workflow)

Purpose: Verify end-to-end workflows prevent leakage

@pytest.mark.integration
def test_walk_forward_validation_prevents_leakage(sample_games):
    """Verify walk-forward CV never trains on future data."""
    splitter = WalkForwardSplitter(years=range(2015, 2025))

    for train_data, test_data, year in splitter.split(sample_games):
        # Verify all training games occur before test games
        train_max_date = train_data['date'].max()
        test_min_date = test_data['date'].min()

        assert train_max_date < test_min_date, \
            f"Data leakage detected in {year} fold: train_max={train_max_date}, test_min={test_min_date}"

Tests: Cross-validation splitter never leaks future data into training

Pre-commit: NO (too slow)
PR-time: YES - mark as @pytest.mark.integration
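For readers unfamiliar with the splitter, a minimal sketch of the walk-forward idea (the real WalkForwardSplitter API may differ; this illustrates the invariant the test checks):

```python
import pandas as pd

def walk_forward_split(games: pd.DataFrame, years):
    """Yield (train, test, year): train on seasons strictly before `year`,
    test on `year` itself, so no future game can enter training."""
    for year in years:
        train = games[games["season"] < year]
        test = games[games["season"] == year]
        if len(train) and len(test):
            yield train, test, year

games = pd.DataFrame({
    "season": [2015, 2015, 2016, 2017],
    "date": pd.to_datetime(["2015-01-10", "2015-02-01", "2016-01-15", "2017-01-20"]),
})

for train, test, year in walk_forward_split(games, range(2016, 2018)):
    # The leakage invariant: every training game predates every test game.
    assert train["date"].max() < test["date"].min()
```

Note the loop here is over years, not DataFrame rows, so it does not conflict with the vectorization mandate.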


3. Property-Based Tests (Invariants)

Purpose: Verify temporal boundary holds across all possible inputs

import pytest
from hypothesis import given, strategies as st

@pytest.mark.property
@given(cutoff_year=st.integers(2015, 2025))
def test_temporal_boundary_invariant(cutoff_year):
    """Verify no API call can access data beyond cutoff year."""
    api = ChronologicalDataServer()
    games = api.get_games_before(cutoff_year=cutoff_year)

    # Invariant: ALL games must be from or before cutoff year
    assert all(game.season <= cutoff_year for game in games), \
        f"API returned games beyond cutoff year {cutoff_year}"

Tests: Temporal boundary holds across all possible cutoff years

Pre-commit: NO (Hypothesis is slow)
PR-time: YES - mark as @pytest.mark.property


Comprehensive Data Leakage Test Example

@pytest.mark.regression
@pytest.mark.integration
def test_chronological_api_rejects_exact_cutoff_date():
    """Regression test: API allowed games ON cutoff date (data leakage).

    Bug: Issue #87 - get_games_before() used <= instead of < for date comparison.
    Fixed: 2026-02-10 - Changed to strict less-than comparison.
    """
    api = ChronologicalDataServer()
    cutoff = "2023-03-15"

    # Load games before cutoff
    games = api.get_games_before(cutoff_date=cutoff)

    # NO game should have date >= cutoff (strict less-than)
    from datetime import datetime
    cutoff_dt = datetime.fromisoformat(cutoff)

    for game in games:
        game_date = datetime.fromisoformat(game.date)
        assert game_date < cutoff_dt, \
            f"Game on/after cutoff leaked through: {game.date} (cutoff: {cutoff})"

Recommendation

  • Add unit tests for API contract violations (fast, pre-commit safe)

  • Add integration tests for end-to-end workflows like walk-forward CV (PR-time only)

  • Add property-based tests for temporal boundary invariants (PR-time only)


Summary

Performance Testing Checklist

  • ✅ Assertion-based smoke tests for vectorization compliance (@pytest.mark.smoke)

  • ✅ Performance benchmarks for critical paths (@pytest.mark.slow, @pytest.mark.performance)

  • ✅ Full backtest timing test (60-second target)

  • ✅ Forbidden pattern detection (.iterrows(), etc.)

Data Leakage Prevention Checklist

  • ✅ Unit tests for API contract enforcement (@pytest.mark.smoke)

  • ✅ Integration tests for end-to-end workflows (@pytest.mark.integration)

  • ✅ Property-based tests for temporal boundary invariants (@pytest.mark.property)

  • ✅ Regression tests for previously discovered leakage bugs (@pytest.mark.regression)


See Also