Test Purpose Guide: Functional, Performance, and Regression Testing

This guide explains why you’re writing a test - the purpose or goal it serves.

Overview

Test purpose is one of the four orthogonal dimensions of testing. The purpose categories below are orthogonal to both scope and approach - a single test can serve multiple purposes, at any scope, with any approach.


Functional Testing

What it is

Tests that verify the system behaves correctly according to functional requirements.

Purpose

Ensure the code produces the correct output for given inputs and meets business/domain requirements.

When to use

  • Always - Functional correctness is the primary purpose of most tests

  • Verifying business logic correctness

  • Validating API contracts and interfaces

  • Ensuring data transformations produce expected results

Characteristics

  • Tests correctness of behavior

  • Verifies outputs match specifications

  • Can be any scope (unit, integration) and any approach (example-based, property-based)

Examples

Functional unit test (example-based):

import numpy as np

def test_brier_score_correct_formula():
    """Verify Brier score calculation follows the correct mathematical formula."""
    predictions = np.array([0.8, 0.3, 0.6])
    actuals = np.array([1, 0, 1])

    # Brier = mean((prediction - actual)^2)
    expected = ((0.8-1)**2 + (0.3-0)**2 + (0.6-1)**2) / 3
    result = brier_score(predictions, actuals)

    assert abs(result - expected) < 1e-10

Functional integration test (property-based):

import pytest
from hypothesis import given, strategies as st

@pytest.mark.integration
@pytest.mark.property
@given(season=st.integers(2015, 2025))
def test_games_loaded_have_required_fields(season):
    """Verify all loaded games contain required fields (functional correctness)."""
    games = load_games_for_season(season)

    required_fields = ["season", "day_num", "w_team_id", "l_team_id", "w_score", "l_score"]
    for game in games:
        for field in required_fields:
            assert hasattr(game, field), f"Game missing required field: {field}"

Marker

No specific marker (default assumption for all tests). Combine with scope/approach markers.
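Note that custom markers like the scope/approach markers used throughout this guide must be registered, or pytest warns about unknown marks (and fails under --strict-markers). A minimal conftest.py sketch - the marker descriptions here are illustrative assumptions, and registration can equally live in pytest.ini or pyproject.toml:

```python
# conftest.py - register the custom markers used in this guide so pytest
# does not warn about (or reject) unknown marks.
def pytest_configure(config):
    markers = (
        "smoke: fast checks safe for pre-commit",
        "integration: multi-component tests",
        "property: property-based (Hypothesis) tests",
        "performance: timing/benchmark assertions",
        "regression: guards against previously fixed bugs",
        "slow: excluded from pre-commit runs",
    )
    for marker in markers:
        config.addinivalue_line("markers", marker)
```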


Performance Testing

What it is

Tests that verify the system meets performance requirements (speed, memory, throughput).

Purpose

Ensure code executes within acceptable time/resource bounds, especially for performance-critical operations.

When to use

  • Vectorization compliance - Verify critical paths don’t use Python loops (NFR1)

  • Benchmark targets - Ensure operations meet speed requirements (e.g., 60-second backtest)

  • Regression prevention - Detect performance degradation over time

  • Resource limits - Verify memory usage stays within bounds

Characteristics

  • Tests speed/efficiency of execution

  • Often includes timing assertions or benchmarks

  • Critical for this project due to 60-second backtest target

  • Typically marked as @pytest.mark.slow (excluded from pre-commit)

Examples

Performance unit test (assertion-based):

import timeit

import numpy as np
import pytest

@pytest.mark.slow
@pytest.mark.performance
def test_brier_score_vectorized_performance():
    """Verify Brier score calculation meets performance target (vectorized)."""
    predictions = np.random.rand(100_000)
    actuals = np.random.randint(0, 2, 100_000)

    # Should complete in < 10ms for 100k predictions (vectorized)
    time_taken = timeit.timeit(
        lambda: brier_score(predictions, actuals),
        number=10
    ) / 10  # Average per iteration

    assert time_taken < 0.01, f"Brier score too slow: {time_taken:.4f}s (target: < 0.01s)"

Performance integration test (benchmark-based):

@pytest.mark.slow
@pytest.mark.performance
@pytest.mark.integration
def test_full_backtest_meets_60_second_target():
    """Verify 10-year Elo backtest completes within 60 seconds (NFR1)."""
    games = load_games_for_years(range(2013, 2023))

    start_time = timeit.default_timer()
    model = EloModel()
    model.fit(games)
    predictions = model.predict(games)
    elapsed = timeit.default_timer() - start_time

    assert elapsed < 60.0, f"Backtest too slow: {elapsed:.2f}s (target: < 60s)"

Vectorization compliance test:

import inspect

@pytest.mark.smoke
@pytest.mark.performance
def test_metric_calculations_are_vectorized():
    """Verify critical metric calculations don't use Python loops (NFR1)."""
    from ncaa_eval.evaluation import metrics

    # Check that metric module doesn't contain for loops over DataFrames
    source = inspect.getsource(metrics)

    # This is a heuristic - manual review is still needed
    # (".iterrows()" also matches loop headers like "for _ in df.iterrows()")
    forbidden_patterns = [
        ".iterrows()",
        ".itertuples()",
        "for row in df",
    ]

    for pattern in forbidden_patterns:
        assert pattern not in source, \
            f"Metrics module contains non-vectorized pattern: {pattern}"

Note: The ingest layer (src/ncaa_eval/ingest/) is excluded from the iterrows() forbidden-pattern check only. iterrows() is explicitly permitted there for one-time-per-sync data transformation (see Style Guide Section 5, Exception #4). The .itertuples() and for row in df patterns remain prohibited in business-logic and metric-calculation code; their use elsewhere (ingest connectors, model prediction loops) is evaluated case-by-case against the vectorization mandate.
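For context, the loop-to-vectorized rewrite that the mandate asks for typically looks like the following pure-NumPy sketch (array names are illustrative):

```python
import numpy as np

preds = np.array([0.8, 0.3, 0.6])
actuals = np.array([1, 0, 1])

# Non-vectorized: explicit per-element Python loop - the shape of code
# NFR1 forbids in metric-calculation paths
brier_loop = sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds)

# Vectorized: one NumPy expression over whole arrays - same result,
# dramatically faster on large inputs
brier_vec = np.mean((preds - actuals) ** 2)
```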

Marker

@pytest.mark.performance (combine with scope markers like @pytest.mark.integration)

Pre-commit

Generally ❌ NO (too slow), except for quick vectorization smoke checks

PR-time

YES - Essential for NFR1 compliance


Regression Testing

What it is

Tests written specifically to prevent previously discovered bugs from recurring.

Purpose

Ensure fixed bugs stay fixed as the codebase evolves.

When to use

  • After fixing a bug - Always write a regression test that would have caught the bug

  • For critical bugs - High-severity or hard-to-debug issues need regression coverage

  • Before refactoring - Write regression tests for existing behavior before major changes

Characteristics

  • Tests specific scenarios that previously failed

  • Often includes a reference to the bug/issue (e.g., “Regression test for #42”)

  • Can be any scope (unit, integration) and any approach (example-based, property-based)

  • Usually example-based (tests the specific failing case)

Examples

Regression unit test (example-based):

@pytest.mark.regression
def test_elo_rating_never_negative_after_extreme_losses():
    """Regression test: Elo ratings went negative after 50+ consecutive losses.

    Bug: Issue #42 - Elo.update() didn't properly floor ratings at minimum value.
    Fixed: 2026-01-15 - Added max(rating, MIN_RATING) in update logic.
    """
    engine = EloFeatureEngine(elo_config)  # elo_config: test fixture

    # Simulate 100 consecutive losses to a much higher-rated opponent
    for game in losing_streak_games:  # losing_streak_games: test fixture
        engine.update_game(game)

    # Should never go below minimum rating (e.g., 0 or 100)
    assert all(r >= 0 for r in engine.ratings.values()), "Elo rating went negative"

Regression integration test (example-based):

@pytest.mark.regression
@pytest.mark.integration
def test_chronological_api_rejects_exact_cutoff_date():
    """Regression test: API allowed games ON cutoff date (data leakage).

    Bug: Issue #87 - get_games_before() used <= instead of < for date comparison.
    Fixed: 2026-02-10 - Changed to strict less-than comparison.
    """
    api = ChronologicalDataServer()
    cutoff = "2023-03-15"

    # Load games before cutoff
    games = api.get_games_before(cutoff_date=cutoff)

    # NO game should have date >= cutoff (strict less-than)
    from datetime import datetime
    cutoff_dt = datetime.fromisoformat(cutoff)

    for game in games:
        game_date = datetime.fromisoformat(game.date)
        assert game_date < cutoff_dt, \
            f"Game on/after cutoff leaked through: {game.date} (cutoff: {cutoff})"

Regression test with property-based approach:

@pytest.mark.regression
@pytest.mark.property
@given(k_factor=st.floats(1.0, 100.0))
def test_elo_update_stable_for_all_k_factors(k_factor):
    """Regression test: Elo exploded with large k_factors.

    Bug: Issue #56 - No upper bound on rating change led to overflow.
    Fixed: 2026-01-20 - Clamped rating changes to reasonable bounds.
    """
    config = EloConfig(k_factor=k_factor)
    engine = EloFeatureEngine(config)
    engine.update_game(sample_game)  # sample_game: any representative game fixture

    # Rating should stay within reasonable bounds
    for rating in engine.ratings.values():
        assert 0 <= rating <= 3000, f"Rating exploded with k_factor={k_factor}: {rating}"

Best Practice - Always include bug context

@pytest.mark.regression
def test_specific_bug_fixed():
    """Regression test: [Brief description of bug]

    Bug: Issue #[number] or [Description of when it was discovered]
    Fixed: [Date] - [Brief description of fix]
    Test: [What this test verifies]
    """
    # Test code that would have failed before the fix
    pass

Marker

@pytest.mark.regression (combine with scope markers)

Pre-commit

YES if fast (mark as @pytest.mark.smoke); ❌ NO if slow

PR-time

YES - Always run regression tests


Combining Multiple Purposes

A single test can have multiple purposes:

Functional + Performance:

import numpy as np

@pytest.mark.performance
def test_brier_score_correct_and_fast():
    """Verify Brier score is correct AND meets performance target."""
    predictions = np.array([0.8, 0.3])
    actuals = np.array([1, 0])

    # Functional: Verify correctness
    expected = ((0.8-1)**2 + (0.3-0)**2) / 2
    result = brier_score(predictions, actuals)
    assert abs(result - expected) < 1e-10

    # Performance: Verify speed for large datasets
    large_preds = np.random.rand(100_000)
    large_actuals = np.random.randint(0, 2, 100_000)

    import timeit
    time_taken = timeit.timeit(
        lambda: brier_score(large_preds, large_actuals),
        number=10
    ) / 10

    assert time_taken < 0.01  # < 10ms for 100k predictions

Functional + Regression:

@pytest.mark.regression
@pytest.mark.integration
def test_walk_forward_cv_prevents_leakage_bug():
    """Regression + Functional: Verify walk-forward CV doesn't leak data.

    Bug: Issue #92 - CV splitter leaked one day of overlap between train/test.
    Fixed: 2026-02-12 - Changed to strict < comparison for split boundaries.
    """
    splitter = WalkForwardSplitter(years=range(2015, 2020))
    games = load_games_for_years(range(2015, 2020))

    for train_data, test_data, year in splitter.split(games):
        train_max_date = train_data['date'].max()
        test_min_date = test_data['date'].min()

        # Functional: Verify no overlap
        assert train_max_date < test_min_date

        # Regression: Specific case that failed before fix
        if year == 2017:
            # This specific split had overlap before the fix
            assert train_max_date != test_min_date

All three purposes:

@pytest.mark.integration
@pytest.mark.property
@pytest.mark.performance
@pytest.mark.regression
@given(season=st.integers(2015, 2025))
def test_game_loading_comprehensive(season):
    """Functional + Performance + Regression: Comprehensive game loading test.

    Functional: Verify all games have required fields
    Performance: Verify loading completes in reasonable time
    Regression: Issue #103 - Loading 2019 season caused memory overflow
    """
    import timeit

    # Performance: Time the loading
    start = timeit.default_timer()
    games = load_games_for_season(season)
    elapsed = timeit.default_timer() - start

    # Functional: Verify correctness
    assert len(games) > 0
    required_fields = ["season", "day_num", "w_team_id", "l_team_id"]
    for game in games:
        for field in required_fields:
            assert hasattr(game, field)

    # Performance: Should complete in < 5 seconds for any season
    assert elapsed < 5.0

    # Regression: 2019 season specifically caused issues
    if season == 2019:
        import sys
        # NOTE: sys.getsizeof() is shallow - it counts the list object only,
        # not the game objects it holds; a real memory check should use a
        # deep-size helper or process-level RSS measurement.
        memory_mb = sys.getsizeof(games) / (1024 * 1024)
        assert memory_mb < 100  # Should stay under 100MB

See Also