# Test Purpose Guide: Functional, Performance, and Regression Testing

This guide explains **why you're writing a test** - the purpose or goal of the test.

## Overview

**Test purpose** is one of the four orthogonal dimensions of testing. Purpose is independent of both scope and approach - a single test can serve multiple purposes.

---

## Functional Testing

### What it is

Tests that verify the system behaves correctly according to functional requirements.

### Purpose

Ensure the code produces the correct output for given inputs and meets business/domain requirements.

### When to use

- **Always** - This is the primary purpose of most tests
- Verifying business logic correctness
- Validating API contracts and interfaces
- Ensuring data transformations produce expected results

### Characteristics

- Tests **correctness** of behavior
- Verifies outputs match specifications
- Can be any scope (unit, integration) and any approach (example-based, property-based)

### Examples

**Functional unit test (example-based):**

```python
import numpy as np


def test_brier_score_correct_formula():
    """Verify Brier score calculation follows the correct mathematical formula."""
    predictions = np.array([0.8, 0.3, 0.6])
    actuals = np.array([1, 0, 1])

    # Brier = mean((prediction - actual)^2)
    expected = ((0.8 - 1) ** 2 + (0.3 - 0) ** 2 + (0.6 - 1) ** 2) / 3
    result = brier_score(predictions, actuals)  # function under test

    assert abs(result - expected) < 1e-10
```

**Functional integration test (property-based):**

```python
import pytest
from hypothesis import given, strategies as st


@pytest.mark.integration
@pytest.mark.property
@given(season=st.integers(2015, 2025))
def test_games_loaded_have_required_fields(season):
    """Verify all loaded games contain required fields (functional correctness)."""
    games = load_games_for_season(season)

    required_fields = ["season", "day_num", "w_team_id", "l_team_id", "w_score", "l_score"]
    for game in games:
        for field in required_fields:
            assert hasattr(game, field), f"Game missing required field: {field}"
```

### Marker

No specific marker (default assumption for all tests). Combine with scope/approach markers.
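Combining markers works best when every custom marker is registered with pytest - unknown markers raise a warning (or an error under `--strict-markers`). Below is a minimal `conftest.py` sketch; the marker names follow this guide's conventions, and each description is an illustrative assumption:

```python
# conftest.py - a minimal sketch; marker names follow this guide's conventions,
# and the wording of each description is an illustrative assumption.
def pytest_configure(config):
    for line in (
        "integration: tests spanning multiple components",
        "property: property-based (Hypothesis) tests",
        "performance: speed/memory/throughput tests",
        "regression: tests pinning previously fixed bugs",
        "smoke: fast tests safe to run at pre-commit",
        "slow: tests excluded from pre-commit",
    ):
        config.addinivalue_line("markers", line)
```

Once registered, purpose markers can drive test selection, e.g. `pytest -m performance`.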
---

## Performance Testing

### What it is

Tests that verify the system meets performance requirements (speed, memory, throughput).

### Purpose

Ensure code executes within acceptable time/resource bounds, especially for performance-critical operations.

### When to use

- **Vectorization compliance** - Verify critical paths don't use Python loops (NFR1)
- **Benchmark targets** - Ensure operations meet speed requirements (e.g., the 60-second backtest)
- **Regression prevention** - Detect performance degradation over time
- **Resource limits** - Verify memory usage stays within bounds (see the sketch after the examples below)

### Characteristics

- Tests **speed/efficiency** of execution
- Often includes timing assertions or benchmarks
- Critical for this project due to the 60-second backtest target
- Typically marked as `@pytest.mark.slow` (excluded from pre-commit)

### Examples

**Performance unit test (assertion-based):**

```python
import timeit

import numpy as np
import pytest


@pytest.mark.slow
@pytest.mark.performance
def test_brier_score_vectorized_performance():
    """Verify Brier score calculation meets performance target (vectorized)."""
    predictions = np.random.rand(100_000)
    actuals = np.random.randint(0, 2, 100_000)

    # Should complete in < 10ms for 100k predictions (vectorized)
    time_taken = timeit.timeit(
        lambda: brier_score(predictions, actuals),
        number=10
    ) / 10  # Average per iteration

    assert time_taken < 0.01, f"Brier score too slow: {time_taken:.4f}s (target: < 0.01s)"
```

**Performance integration test (benchmark-based):**

```python
import timeit

import pytest


@pytest.mark.slow
@pytest.mark.performance
@pytest.mark.integration
def test_full_backtest_meets_60_second_target():
    """Verify 10-year Elo backtest completes within 60 seconds (NFR1)."""
    games = load_games_for_years(range(2013, 2023))

    start_time = timeit.default_timer()
    model = EloModel()
    model.fit(games)
    predictions = model.predict(games)
    elapsed = timeit.default_timer() - start_time

    assert elapsed < 60.0, f"Backtest too slow: {elapsed:.2f}s (target: < 60s)"
```

**Vectorization compliance test:**

```python
import inspect

import pytest


@pytest.mark.smoke
@pytest.mark.performance
def test_metric_calculations_are_vectorized():
    """Verify critical metric calculations don't use Python loops (NFR1)."""
    from ncaa_eval.evaluation import metrics

    # Check that the metric module's source doesn't contain row-wise loops
    source = inspect.getsource(metrics)

    # This is a substring heuristic - manual review is still needed
    forbidden_patterns = [
        ".iterrows()",
        ".itertuples()",
        "for row in df",
    ]

    for pattern in forbidden_patterns:
        assert pattern not in source, \
            f"Metrics module contains non-vectorized pattern: {pattern}"
```

> **Note:** The ingest layer (`src/ncaa_eval/ingest/`) is excluded from the `iterrows()`
> forbidden-pattern check only: `iterrows()` is explicitly permitted there for one-time-per-sync
> data transformation (see Style Guide Section 5, Exception #4). The `.itertuples()` and
> `for row in df` patterns remain prohibited in business-logic and metric-calculation code;
> their use elsewhere (ingest connectors, model prediction loops) is evaluated case-by-case
> against the vectorization mandate.
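The **Resource limits** bullet above has no example of its own. Here is a minimal sketch using the standard-library `tracemalloc` module; the 500 MB budget is an illustrative assumption, not a project NFR, and `load_games_for_years`/`EloModel` are the same helpers used in the examples above.

**Resource-limit test (sketch):**

```python
import tracemalloc

import pytest


@pytest.mark.slow
@pytest.mark.performance
def test_backtest_memory_within_budget():
    """Sketch: peak allocation during a backtest stays under an assumed budget.

    The 500 MB budget is illustrative, not a project requirement.
    """
    tracemalloc.start()

    games = load_games_for_years(range(2013, 2023))
    model = EloModel()
    model.fit(games)

    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
    tracemalloc.stop()

    peak_mb = peak / (1024 * 1024)
    assert peak_mb < 500, f"Peak memory {peak_mb:.1f} MB exceeds 500 MB budget"
```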
### Marker

`@pytest.mark.performance` (combine with scope markers like `@pytest.mark.integration`)

### Pre-commit

Generally ❌ **NO** (too slow), except for quick vectorization smoke checks

### PR-time

✅ **YES** - Essential for NFR1 compliance

---

## Regression Testing

### What it is

Tests written specifically to prevent previously discovered bugs from recurring.

### Purpose

Ensure fixed bugs stay fixed as the codebase evolves.

### When to use

- **After fixing a bug** - Always write a regression test that would have caught the bug
- **For critical bugs** - High-severity or hard-to-debug issues need regression coverage
- **Before refactoring** - Write regression tests that pin existing behavior before major changes (see the characterization sketch below)

### Characteristics

- Tests **specific scenarios that previously failed**
- Often includes a reference to the bug/issue (e.g., "Regression test for #42")
- Can be any scope (unit, integration) and any approach (example-based, property-based)
- Usually example-based (tests the specific failing case)

### Examples

**Regression unit test (example-based):**

```python
import pytest


@pytest.mark.regression
def test_elo_rating_never_negative_after_extreme_losses():
    """Regression test: Elo ratings went negative after 50+ consecutive losses.

    Bug: Issue #42 - Elo.update() didn't properly floor ratings at minimum value.
    Fixed: 2026-01-15 - Added max(rating, MIN_RATING) in update logic.
    """
    engine = EloFeatureEngine(elo_config)  # elo_config: fixture with default settings

    # Simulate 100 consecutive losses to a much higher-rated opponent
    for game in losing_streak_games:
        engine.update_game(game)

    # Should never go below the minimum rating (e.g., 0 or 100)
    assert all(r >= 0 for r in engine.ratings.values()), "Elo rating went negative"
```

**Regression integration test (example-based):**

```python
from datetime import datetime

import pytest


@pytest.mark.regression
@pytest.mark.integration
def test_chronological_api_rejects_exact_cutoff_date():
    """Regression test: API allowed games ON the cutoff date (data leakage).

    Bug: Issue #87 - get_games_before() used <= instead of < for date comparison.
    Fixed: 2026-02-10 - Changed to strict less-than comparison.
    """
    api = ChronologicalDataServer()
    cutoff = "2023-03-15"

    # Load games before the cutoff
    games = api.get_games_before(cutoff_date=cutoff)

    # NO game should have date >= cutoff (strict less-than)
    cutoff_dt = datetime.fromisoformat(cutoff)
    for game in games:
        game_date = datetime.fromisoformat(game.date)
        assert game_date < cutoff_dt, \
            f"Game on/after cutoff leaked through: {game.date} (cutoff: {cutoff})"
```
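For the **Before refactoring** case, a characterization (pinning) test records what the current implementation does, so a refactor that changes behavior fails loudly. A sketch - the pinned rating, team id, and `sample_games_2018` fixture are hypothetical:

**Characterization test before refactoring (sketch):**

```python
import pytest


@pytest.mark.regression
def test_elo_ratings_pinned_before_refactor():
    """Characterization test: pin current Elo output ahead of refactoring.

    The expected rating below is a hypothetical value captured by running the
    current implementation once - it documents behavior, not a specification.
    Update it only when a behavior change is intentional and reviewed.
    """
    engine = EloFeatureEngine(EloConfig(k_factor=32.0))
    for game in sample_games_2018:  # hypothetical fixture: one season of games
        engine.update_game(game)

    assert engine.ratings[1101] == pytest.approx(1534.87, abs=0.01)
```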
""" config = EloConfig(k_factor=k_factor) engine = EloFeatureEngine(config) engine.update_game(sample_game) # Rating should stay within reasonable bounds for rating in engine.ratings.values(): assert 0 <= rating <= 3000, f"Rating exploded with k_factor={k_factor}: {rating}" ``` ### Best Practice - Always include bug context ```python @pytest.mark.regression def test_specific_bug_fixed(): """Regression test: [Brief description of bug] Bug: Issue #[number] or [Description of when it was discovered] Fixed: [Date] - [Brief description of fix] Test: [What this test verifies] """ # Test code that would have failed before the fix pass ``` ### Marker `@pytest.mark.regression` (combine with scope markers) ### Pre-commit ✅ **YES** if fast (mark as `@pytest.mark.smoke`) ❌ **NO** if slow ### PR-time ✅ **YES** - Always run regression tests --- ## Combining Multiple Purposes A single test can have multiple purposes: **Functional + Performance:** ```python @pytest.mark.performance def test_brier_score_correct_and_fast(): """Verify Brier score is correct AND meets performance target.""" predictions = np.array([0.8, 0.3]) actuals = np.array([1, 0]) # Functional: Verify correctness expected = ((0.8-1)**2 + (0.3-0)**2) / 2 result = brier_score(predictions, actuals) assert abs(result - expected) < 1e-10 # Performance: Verify speed for large datasets large_preds = np.random.rand(100_000) large_actuals = np.random.randint(0, 2, 100_000) import timeit time_taken = timeit.timeit( lambda: brier_score(large_preds, large_actuals), number=10 ) / 10 assert time_taken < 0.01 # < 10ms for 100k predictions ``` **Functional + Regression:** ```python @pytest.mark.regression @pytest.mark.integration def test_walk_forward_cv_prevents_leakage_bug(): """Regression + Functional: Verify walk-forward CV doesn't leak data. Bug: Issue #92 - CV splitter leaked one day of overlap between train/test. Fixed: 2026-02-12 - Changed to strict < comparison for split boundaries. """ splitter = WalkForwardSplitter(years=range(2015, 2020)) games = load_games_for_years(range(2015, 2020)) for train_data, test_data, year in splitter.split(games): train_max_date = train_data['date'].max() test_min_date = test_data['date'].min() # Functional: Verify no overlap assert train_max_date < test_min_date # Regression: Specific case that failed before fix if year == 2017: # This specific split had overlap before the fix assert train_max_date != test_min_date ``` **All three purposes:** ```python @pytest.mark.integration @pytest.mark.property @pytest.mark.performance @pytest.mark.regression @given(season=st.integers(2015, 2025)) def test_game_loading_comprehensive(season): """Functional + Performance + Regression: Comprehensive game loading test. 
---

## Combining Multiple Purposes

A single test can have multiple purposes:

**Functional + Performance:**

```python
import timeit

import numpy as np
import pytest


@pytest.mark.performance
def test_brier_score_correct_and_fast():
    """Verify Brier score is correct AND meets performance target."""
    predictions = np.array([0.8, 0.3])
    actuals = np.array([1, 0])

    # Functional: verify correctness
    expected = ((0.8 - 1) ** 2 + (0.3 - 0) ** 2) / 2
    result = brier_score(predictions, actuals)
    assert abs(result - expected) < 1e-10

    # Performance: verify speed for large inputs
    large_preds = np.random.rand(100_000)
    large_actuals = np.random.randint(0, 2, 100_000)
    time_taken = timeit.timeit(
        lambda: brier_score(large_preds, large_actuals),
        number=10
    ) / 10
    assert time_taken < 0.01  # < 10ms for 100k predictions
```

**Functional + Regression:**

```python
import pytest


@pytest.mark.regression
@pytest.mark.integration
def test_walk_forward_cv_prevents_leakage_bug():
    """Regression + Functional: Verify walk-forward CV doesn't leak data.

    Bug: Issue #92 - CV splitter leaked one day of overlap between train/test.
    Fixed: 2026-02-12 - Changed to strict < comparison for split boundaries.
    """
    splitter = WalkForwardSplitter(years=range(2015, 2020))
    games = load_games_for_years(range(2015, 2020))

    for train_data, test_data, year in splitter.split(games):
        train_max_date = train_data['date'].max()
        test_min_date = test_data['date'].min()

        # Functional: verify no overlap
        assert train_max_date < test_min_date

        # Regression: the 2017 split is the case that overlapped before the fix
        if year == 2017:
            assert train_max_date != test_min_date
```

**All three purposes:**

```python
import timeit
import tracemalloc

import pytest
from hypothesis import given, strategies as st


@pytest.mark.integration
@pytest.mark.property
@pytest.mark.performance
@pytest.mark.regression
@given(season=st.integers(2015, 2025))
def test_game_loading_comprehensive(season):
    """Functional + Performance + Regression: Comprehensive game loading test.

    Functional: Verify all games have required fields
    Performance: Verify loading completes in reasonable time
    Regression: Issue #103 - Loading 2019 season caused memory overflow
    """
    # Performance + Regression: time the load and trace its peak allocation
    tracemalloc.start()
    start = timeit.default_timer()
    games = load_games_for_season(season)
    elapsed = timeit.default_timer() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # Functional: verify correctness
    assert len(games) > 0
    required_fields = ["season", "day_num", "w_team_id", "l_team_id"]
    for game in games:
        for field in required_fields:
            assert hasattr(game, field)

    # Performance: should complete in < 5 seconds for any season
    # (tracemalloc adds some overhead, so the bound is generous)
    assert elapsed < 5.0

    # Regression: the 2019 season specifically caused a memory blow-up
    if season == 2019:
        assert peak_bytes / (1024 * 1024) < 100  # should stay under 100MB
```

---

## See Also

- [Test Scope Guide](test-scope-guide.md) - Unit vs Integration tests
- [Test Approach Guide](test-approach-guide.md) - Example-based vs Property-based
- [Domain Testing Guide](domain-testing.md) - Performance and data leakage testing details
- [Execution Guide](execution.md) - When tests/checks run (4-tier model)
- [Quality Assurance Guide](quality.md) - Mutation testing, coverage analysis