Domain-Specific Testing¶
This guide covers testing approaches specific to the ncaa_eval project’s domain requirements: performance testing (vectorization compliance) and data leakage prevention.
Performance Testing Guidelines¶
Context: Vectorization First Rule¶
The Architecture mandates “Vectorization First” (Section 12):
Reject PRs that use `for` loops over Pandas DataFrames for metric calculations.
This project has a 60-second backtest target (10-year Elo training + inference). NFR1 mandates vectorization; NFR2 mandates parallelism. Performance testing ensures these requirements are met.
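For context, "vectorized" here means a single array expression with no Python-level loop over rows. A minimal sketch of what a compliant Brier score implementation might look like (the project's actual `brier_score` signature is assumed, not confirmed):

```python
import numpy as np


def brier_score_vectorized(predictions: np.ndarray, actuals: np.ndarray) -> float:
    """Mean squared error between predicted probabilities and binary outcomes.

    Vectorized: one NumPy expression, no Python-level loop over elements.
    """
    return float(np.mean((predictions - actuals) ** 2))


# Perfect predictions score 0.0; maximally wrong predictions score 1.0.
print(brier_score_vectorized(np.array([1.0, 0.0]), np.array([1, 0])))  # 0.0
print(brier_score_vectorized(np.array([0.0, 1.0]), np.array([1, 0])))  # 1.0
```

The same calculation written as a `for` loop over a DataFrame would be orders of magnitude slower at the 10k-row scale used in the benchmarks below, which is why the rule is enforced structurally rather than left to review.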
How to Test Vectorization¶
1. Assertion-Based Unit Tests (Quick Smoke Check)¶
Purpose: Catch obvious violations quickly during pre-commit
```python
import inspect

import pytest

from ncaa_eval.evaluation.metrics import brier_score


@pytest.mark.smoke
def test_brier_score_is_vectorized():
    """Verify Brier score uses vectorized operations (no Python loops)."""
    source = inspect.getsource(brier_score)
    # Check that the implementation doesn't contain for loops over data
    # (allow for loops in fixture setup, just not in calculation logic).
    assert "for " not in source or "# vectorized" in source, \
        "Brier score must use vectorized numpy/pandas operations"
```
Pros: Fast, catches obvious violations
Cons: Brittle (may trigger false positives on legitimate loops in comments)

Pre-commit: ✅ YES - Mark as `@pytest.mark.smoke`
2. Performance Benchmark Integration Tests¶
Purpose: Verify actual performance meets targets
```python
import timeit

import numpy as np
import pytest

from ncaa_eval.evaluation.metrics import brier_score


@pytest.mark.slow
@pytest.mark.performance
@pytest.mark.integration
def test_brier_score_performance():
    """Verify Brier score meets performance target."""
    predictions = np.random.rand(10_000)
    actuals = np.random.randint(0, 2, 10_000)

    time_taken = timeit.timeit(
        lambda: brier_score(predictions, actuals),
        number=100,
    )

    # Should complete 100 iterations in < 100ms (1ms per call for 10k predictions).
    assert time_taken < 0.1, f"Brier score too slow: {time_taken:.3f}s for 100 iterations"
```
Pros: Verifies actual performance, catches regressions
Cons: Slow; results vary by machine (need generous thresholds)

Pre-commit: ❌ NO (too slow)
PR-time: ✅ YES - Mark as `@pytest.mark.slow` and `@pytest.mark.performance`
3. pytest-benchmark (Optional, for Critical Paths)¶
⚠️ NOT INSTALLED: `pytest-benchmark` is optional and NOT currently in `pyproject.toml`. Use timeit-based benchmarks (Option 2) for now.
Purpose: Track performance over time with detailed statistics
```python
import numpy as np

from ncaa_eval.evaluation.metrics import brier_score


def test_brier_score_benchmark(benchmark):
    """Benchmark Brier score calculation (requires the pytest-benchmark plugin)."""
    predictions = np.random.rand(10_000)
    actuals = np.random.randint(0, 2, 10_000)

    result = benchmark(brier_score, predictions, actuals)

    # The benchmark fixture reports detailed statistics (mean, stddev, percentiles).
    assert result is not None
```
Pros: Detailed statistics, integrated with pytest, tracks performance over time
Cons: Requires pytest-benchmark plugin (not currently installed)
Pre-commit: ❌ NO
PR-time: ✅ YES (if plugin is installed)
Comprehensive Performance Test Example¶
```python
import timeit

import pytest


@pytest.mark.slow
@pytest.mark.performance
@pytest.mark.integration
def test_full_backtest_meets_60_second_target():
    """Verify a 10-year Elo backtest completes within 60 seconds (NFR1)."""
    games = load_games_for_years(range(2013, 2023))

    start_time = timeit.default_timer()
    model = EloModel()
    model.fit(games)
    predictions = model.predict(games)
    elapsed = timeit.default_timer() - start_time

    assert elapsed < 60.0, f"Backtest too slow: {elapsed:.2f}s (target: < 60s)"
```
Vectorization Smoke Test Pattern¶
```python
import inspect

import pytest


@pytest.mark.smoke
@pytest.mark.performance
def test_metric_calculations_are_vectorized():
    """Verify critical metric calculations don't use Python loops (NFR1)."""
    from ncaa_eval.evaluation import metrics

    # Check that the metrics module doesn't contain for loops over DataFrames.
    # This is a heuristic - manual review is still needed.
    source = inspect.getsource(metrics)

    forbidden_patterns = [
        "for _ in df.iterrows()",
        ".iterrows()",
        ".itertuples()",
        "for row in df",
    ]
    for pattern in forbidden_patterns:
        assert pattern not in source, \
            f"Metrics module contains non-vectorized pattern: {pattern}"
```
Note: The ingest layer (`src/ncaa_eval/ingest/`) is excluded from the `.iterrows()` forbidden-pattern check only. `.iterrows()` is explicitly permitted there for one-time-per-sync data transformation (see Style Guide Section 5, Exception #4). The `.itertuples()` and `for row in df` patterns remain prohibited in business-logic and metric-calculation code; their use elsewhere (ingest connectors, model prediction loops) is evaluated case-by-case against the vectorization mandate.
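The per-module scoping described above can be factored into a small helper that the smoke test calls with different settings per layer. This is an illustrative sketch; the helper name and its `allow_iterrows` flag are hypothetical, not project API:

```python
FORBIDDEN_PATTERNS = (".iterrows()", ".itertuples()", "for row in df")


def find_forbidden_patterns(source: str, allow_iterrows: bool = False) -> list[str]:
    """Return the forbidden loop patterns present in a module's source text.

    allow_iterrows=True models the ingest-layer exception: .iterrows() is
    tolerated there, but the other patterns are still flagged.
    """
    patterns = FORBIDDEN_PATTERNS if not allow_iterrows else tuple(
        p for p in FORBIDDEN_PATTERNS if p != ".iterrows()"
    )
    return [p for p in patterns if p in source]


ingest_src = "for _, row in df.iterrows():\n    load(row)\n"
print(find_forbidden_patterns(ingest_src))                       # ['.iterrows()']
print(find_forbidden_patterns(ingest_src, allow_iterrows=True))  # []
```

Centralizing the pattern list this way keeps the metrics-module test and any future ingest-layer check from drifting apart.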
Recommendation¶
- Use assertion-based tests for quick smoke checks (mark as `@pytest.mark.smoke`)
- Use performance benchmarks for critical modules (`evaluation/metrics.py`, `evaluation/simulation.py`) as `@pytest.mark.slow` tests (PR-time only)
- Consider pytest-benchmark for performance regression tracking (optional, not currently installed)
Data Leakage Prevention Testing¶
Context: Temporal Boundary Enforcement¶
The Architecture mandates “Data Safety: Temporal boundaries enforced strictly in the API to prevent data leakage” (Section 11.2).
NFR4 requires strict temporal boundaries: training data must never include information from future games. This is critical for model validity.
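The core rule is a strict less-than comparison against the cutoff: a game ON the cutoff date counts as future data. A dependency-free sketch of that rule (the data shape here is illustrative, not the project's schema):

```python
from datetime import date


def filter_games_before(games: list[tuple[date, str]], cutoff: date) -> list[tuple[date, str]]:
    """Keep only games strictly before the cutoff; a game ON the cutoff is excluded."""
    return [g for g in games if g[0] < cutoff]


games = [
    (date(2023, 3, 14), "A vs B"),
    (date(2023, 3, 15), "C vs D"),  # on the cutoff: must be excluded
    (date(2023, 3, 16), "E vs F"),  # after the cutoff: must be excluded
]
print(filter_games_before(games, date(2023, 3, 15)))
# [(datetime.date(2023, 3, 14), 'A vs B')]
```

Using `<=` instead of `<` here is exactly the off-by-one that the regression test at the end of this section guards against.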
How to Test Temporal Integrity¶
1. Unit Tests (API Contract)¶
Purpose: Verify API rejects requests for future data
```python
import pytest


@pytest.mark.smoke
def test_get_chronological_season_enforces_cutoff():
    """Verify the chronological API rejects requests for future data."""
    api = ChronologicalDataServer()

    with pytest.raises(ValueError, match="Cannot access future data"):
        api.get_games_before(date="2025-12-31", cutoff_date="2025-01-01")
```
Tests: API raises error when requesting data beyond cutoff
Pre-commit: ✅ YES (fast, no I/O) - Mark as @pytest.mark.smoke
2. Integration Tests (End-to-End Workflow)¶
Purpose: Verify end-to-end workflows prevent leakage
```python
import pytest


@pytest.mark.integration
def test_walk_forward_validation_prevents_leakage(sample_games):
    """Verify walk-forward CV never trains on future data."""
    splitter = WalkForwardSplitter(years=range(2015, 2025))

    for train_data, test_data, year in splitter.split(sample_games):
        # Verify all training games occur before all test games.
        train_max_date = train_data['date'].max()
        test_min_date = test_data['date'].min()
        assert train_max_date < test_min_date, \
            f"Data leakage detected in {year} fold: " \
            f"train_max={train_max_date}, test_min={test_min_date}"
```
Tests: Cross-validation splitter never leaks future data into training
Pre-commit: ❌ NO (too slow)
PR-time: ✅ YES - Mark as @pytest.mark.integration
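`WalkForwardSplitter` above is project code; at season granularity the underlying idea reduces to a few lines. An illustrative sketch, not the real implementation:

```python
def walk_forward_splits(seasons: list[int]):
    """Yield (train_seasons, test_season) pairs where training strictly precedes testing."""
    for i in range(1, len(seasons)):
        yield seasons[:i], seasons[i]


for train, test in walk_forward_splits([2015, 2016, 2017, 2018]):
    # The invariant the integration test asserts: no training season reaches the test season.
    assert max(train) < test
    print(train, "->", test)
```

Each fold trains on everything before the test season and nothing after, which is what makes the `train_max_date < test_min_date` assertion hold by construction.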
3. Property-Based Tests (Invariants)¶
Purpose: Verify temporal boundary holds across all possible inputs
```python
import pytest
from hypothesis import given, strategies as st


@pytest.mark.property
@given(cutoff_year=st.integers(2015, 2025))
def test_temporal_boundary_invariant(cutoff_year):
    """Verify no API call can access data beyond the cutoff year."""
    api = ChronologicalDataServer()
    games = api.get_games_before(cutoff_year=cutoff_year)

    # Invariant: ALL games must be from or before the cutoff year.
    assert all(game.season <= cutoff_year for game in games), \
        f"API returned games beyond cutoff year {cutoff_year}"
```
Tests: Temporal boundary holds across all possible cutoff years
Pre-commit: ❌ NO (Hypothesis is slow)
PR-time: ✅ YES - Mark as @pytest.mark.property
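If Hypothesis is unavailable in an environment, the same invariant can be spot-checked with stdlib random sampling. The coverage is weaker (no shrinking, no edge-case generation), but it is dependency-free; the data source here is a stand-in, not the project API:

```python
import random


def games_up_to(cutoff_year: int) -> list[int]:
    """Stand-in data source: seasons 2010-2025, filtered at the cutoff."""
    return [year for year in range(2010, 2026) if year <= cutoff_year]


random.seed(0)  # deterministic sampling for reproducible CI runs
for _ in range(100):
    cutoff = random.randint(2015, 2025)
    assert all(season <= cutoff for season in games_up_to(cutoff)), \
        f"data beyond cutoff year {cutoff}"
```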
Comprehensive Data Leakage Test Example¶
```python
from datetime import datetime

import pytest


@pytest.mark.regression
@pytest.mark.integration
def test_chronological_api_rejects_exact_cutoff_date():
    """Regression test: API allowed games ON the cutoff date (data leakage).

    Bug: Issue #87 - get_games_before() used <= instead of < for date comparison.
    Fixed: 2026-02-10 - Changed to strict less-than comparison.
    """
    api = ChronologicalDataServer()
    cutoff = "2023-03-15"

    # Load games before the cutoff.
    games = api.get_games_before(cutoff_date=cutoff)

    # NO game should have a date >= cutoff (strict less-than).
    cutoff_dt = datetime.fromisoformat(cutoff)
    for game in games:
        game_date = datetime.fromisoformat(game.date)
        assert game_date < cutoff_dt, \
            f"Game on/after cutoff leaked through: {game.date} (cutoff: {cutoff})"
```
Recommendation¶
- Add unit tests for API contract violations (fast, pre-commit safe)
- Add integration tests for end-to-end workflows like walk-forward CV (PR-time only)
- Add property-based tests for temporal boundary invariants (PR-time only)
Summary¶
Performance Testing Checklist¶
- ✅ Assertion-based smoke tests for vectorization compliance (`@pytest.mark.smoke`)
- ✅ Performance benchmarks for critical paths (`@pytest.mark.slow`, `@pytest.mark.performance`)
- ✅ Full backtest timing test (60-second target)
- ✅ Forbidden-pattern detection (`.iterrows()`, etc.)
Data Leakage Prevention Checklist¶
- ✅ Unit tests for API contract enforcement (`@pytest.mark.smoke`)
- ✅ Integration tests for end-to-end workflows (`@pytest.mark.integration`)
- ✅ Property-based tests for temporal boundary invariants (`@pytest.mark.property`)
- ✅ Regression tests for previously discovered leakage bugs (`@pytest.mark.regression`)
See Also¶
Test Purpose Guide - Performance and Regression testing
Test Scope Guide - Unit vs Integration tests
Test Approach Guide - Example-based vs Property-based
Execution Guide - When tests/checks run (4-tier model)
Quality Assurance Guide - Mutation testing, coverage analysis
STYLE_GUIDE.md Section 5 - Vectorization First rule