Testing Strategy

A quick reference for the ncaa_eval project's testing approach. For detailed explanations and examples, see the testing guides.


Table of Contents

  1. Overview

  2. Detailed Guides

  3. Quick Decision Trees

  4. Test Markers Reference

  5. Test Commands Reference

  6. Test Organization

  7. Coverage Targets

  8. Testing Tools

  9. Domain-Specific Testing

  10. References


Overview

Key Principles

  1. Fast feedback via Tier 1 (pre-commit, < 10s total)

  2. Thorough validation via Tier 2 (PR/CI, complete suite)

  3. Four orthogonal dimensions - choose appropriate combination

  4. Coverage is a signal, not a gate - identify gaps, don’t block

  5. Mutation testing evaluates test quality (critical modules only)

  6. Vectorization compliance via performance testing (NFR1)

  7. Temporal integrity via data leakage testing (NFR4)

  8. 4-tier execution model - Tier 1 (pre-commit) → Tier 2 (PR/CI) → Tier 3 (AI review) → Tier 4 (owner review)

Four Orthogonal Dimensions

This strategy separates four independent dimensions of testing. Choose the appropriate combination for each test case:

  1. Test Scope - What you’re testing → Scope Guide

    • Unit: Single function/class in isolation

    • Integration: Multiple components working together

  2. Test Approach - How you write the test → Approach Guide

    • Example-based: Concrete inputs → expected outputs

    • Property-based (Hypothesis): Invariants that should hold for all inputs

    • Fuzz-based (Hypothesis): Random/mutated inputs to find crashes and error handling gaps (no dedicated marker — use @pytest.mark.slow)

  3. Test Purpose - Why you’re writing the test → Purpose Guide

    • Functional: Correctness of behavior (default)

    • Performance: Speed/efficiency compliance (NFR1: vectorization)

    • Regression: Prevent previously fixed bugs from recurring

  4. Execution Scope - When tests/checks run → Execution Guide

    • Tier 1 (Pre-commit): Smoke tests + fast checks (< 10s total)

    • Tier 2 (PR/CI): Complete suite + coverage + mutation

    • Tier 3/4: AI + Owner review

Note: Mutation testing and coverage are not test types - they’re quality assurance tools. See Quality Assurance Guide.

Execution Tiers (When Checks Run)

The project uses a 4-tier execution model that balances speed with thoroughness:

Tier 1: Pre-Commit (< 10s total)

Fast, local checks that run on every commit:

| Check | Tool | What It Catches |
|---|---|---|
| Lint | `ruff check .` | Style violations, import issues |
| Format | `ruff format --check .` | Inconsistent formatting |
| Type-check | `mypy` (strict) | Missing annotations, type errors |
| Smoke tests | `pytest -m smoke` | Broken imports, sanity failures |

Rationale: Catch 80% of issues in seconds before code leaves your machine.
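The Tier 1 checks above are typically wired up as pre-commit hooks. A hypothetical `.pre-commit-config.yaml` sketch (hook ids and names are illustrative, not the project's actual configuration):

```yaml
repos:
  - repo: local
    hooks:
      - id: ruff-lint
        name: ruff check
        entry: ruff check .
        language: system
        pass_filenames: false
      - id: ruff-format
        name: ruff format
        entry: ruff format --check .
        language: system
        pass_filenames: false
      - id: mypy
        name: mypy (strict)
        entry: mypy
        language: system
        pass_filenames: false
      - id: smoke-tests
        name: pytest smoke
        entry: pytest -m smoke
        language: system
        pass_filenames: false
```

Local `system` hooks keep the pre-commit run on the project's own toolchain rather than pinned mirror repos.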

Tier 2: PR/CI (minutes)

Comprehensive validation before merge:

| Check | Tool | What It Catches |
|---|---|---|
| Full test suite | `pytest` | All regressions, edge cases |
| Integration tests | `pytest -m integration` | Component interaction failures |
| Property-based | `pytest -m property` | Invariant violations |
| Performance | `pytest -m performance` | Vectorization violations, speed regressions |
| Coverage | `pytest-cov` | Untested code paths |
| Mutation (Tier 1 modules) | `mutmut` | Weak tests, coverage gaps |

Rationale: Catch remaining 20% requiring full project context.

Tier 3: AI Code Review

Docstring quality, vectorization compliance, architecture alignment, test quality, design intent.

Tier 4: Owner Review

Functional correctness, strategic alignment, complexity appropriateness, scope creep prevention.

See Execution Guide for complete details on each tier.


Detailed Guides

For comprehensive explanations, examples, and best practices:


Quick Decision Trees

Which test scope?

```mermaid
flowchart TD
    Start{Does it interact with<br/>external systems?<br/>files, DB, network}
    Start -->|YES| Integration[Integration test<br/>@pytest.mark.integration<br/>PR-time only]
    Start -->|NO| Unit[Unit test<br/>fast, pre-commit eligible if smoke]
```

Which approach?

```mermaid
flowchart TD
    Start{Testing error handling<br/>or crash resilience?}
    Start -->|YES| Fuzz[Fuzz-based<br/>Hypothesis st.text/st.binary]
    Start -->|NO| Known{Have specific<br/>known scenarios?}
    Known -->|YES| Example[Example-based<br/>parametrize for multiple cases]
    Known -->|NO| Invariant{Can you state<br/>an invariant?}
    Invariant -->|YES| Property[Property-based<br/>@pytest.mark.property<br/>Hypothesis]
    Invariant -->|NO| ExampleAlt[Example-based<br/>test specific examples]
```
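As a sketch of the property-based branch above, state an invariant and let Hypothesis search for counterexamples (the `clip_probability` function and the invariant are illustrative, not from the project):

```python
import pytest
from hypothesis import given, strategies as st


def clip_probability(p: float) -> float:
    """Clamp a raw model output into the valid probability range."""
    return min(max(p, 0.0), 1.0)


@pytest.mark.property
@given(st.floats(allow_nan=False, allow_infinity=False))
def test_clip_probability_stays_bounded(p: float) -> None:
    """Invariant: for any finite input, the result lies in [0, 1]."""
    assert 0.0 <= clip_probability(p) <= 1.0
```

The `@given` decorator replaces a handful of hand-picked examples with hundreds of generated ones, which is why these tests run in Tier 2 rather than pre-commit.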

Which execution tier?

```mermaid
flowchart TD
    Start{Is test fast?<br/>under 1 second}
    Start -->|NO| Tier2Slow[Tier 2 only<br/>@pytest.mark.slow]
    Start -->|YES| Critical{Import/sanity/schema check<br/>OR critical regression?}
    Critical -->|YES| Tier1[Tier 1 eligible<br/>@pytest.mark.smoke]
    Critical -->|NO| Tier2Fast[Tier 2 only<br/>save pre-commit budget]
```
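A typical Tier 1 smoke test is just an import/sanity check. A minimal sketch (in the real suite the loop would cover `ncaa_eval` modules; stdlib names keep this example self-contained):

```python
import importlib

import pytest


@pytest.mark.smoke
def test_core_modules_import() -> None:
    """Tier 1 sanity check: key modules import without side-effect errors."""
    for module in ("json", "csv", "sqlite3"):  # stand-ins for project modules
        assert importlib.import_module(module)
```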

Test Markers Reference

| Marker | Dimension | Command |
|---|---|---|
| `@pytest.mark.smoke` | Speed | `pytest -m smoke` |
| `@pytest.mark.slow` | Speed | `pytest -m "not slow"` |
| `@pytest.mark.unit` | Scope | `pytest -m unit` |
| `@pytest.mark.integration` | Scope | `pytest -m integration` |
| `@pytest.mark.property` | Approach | `pytest -m property` |
| `@pytest.mark.performance` | Purpose | `pytest -m performance` |
| `@pytest.mark.regression` | Purpose | `pytest -m regression` |
| `@pytest.mark.no_mutation` | Quality | Tests incompatible with mutmut runner |

Combine markers across dimensions:

```python
@pytest.mark.integration
@pytest.mark.property
@pytest.mark.regression
```
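For instance, a single test can carry one marker per dimension it exercises. A hedged sketch (the test body and filename are illustrative, not a test from the project):

```python
import tempfile
from pathlib import Path

import pytest


@pytest.mark.integration  # scope: touches the filesystem
@pytest.mark.regression   # purpose: guards a previously fixed bug
def test_repository_roundtrip_preserves_rows() -> None:
    """Rows written to the repository file are read back unchanged."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "teams.csv"
        path.write_text("team_id,name\n1101,Duke\n")
        assert "1101,Duke" in path.read_text()
```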

Test Commands Reference

| Context | Command | What Runs |
|---|---|---|
| Tier 1 (Pre-commit) | `pytest -m smoke` | Smoke tests only (< 5s; Tier 1 overall < 10s) |
| Tier 2 (PR/CI - full) | `pytest` | All tests |
| Tier 2 (PR/CI - coverage) | `pytest --cov=src/ncaa_eval --cov-report=term-missing` | All + coverage report |
| Tier 2 (exclude slow) | `pytest -m "not slow"` | All except slow tests |
| Filter by dimension | `pytest -m integration` | Filter by marker |
| Combined filters | `pytest -m "integration and regression"` | Intersection |


Test Organization

```text
tests/
├── __init__.py
├── conftest.py                          # Shared fixtures
├── fixtures/
│   ├── .gitkeep
│   └── kaggle/
│       ├── MNCAATourneyCompactResults.csv
│       ├── MRegularSeasonCompactResults.csv
│       ├── MSeasons.csv
│       └── MTeams.csv
├── integration/
│   ├── __init__.py
│   ├── test_documented_commands.py
│   ├── test_elo_integration.py
│   ├── test_feature_serving_integration.py
│   └── test_sync.py
└── unit/
    ├── __init__.py
    ├── test_bracket_page.py
    ├── test_bracket_renderer.py
    ├── test_calibration.py
    ├── test_chronological_serving.py
    ├── test_cli_train.py
    ├── test_connector_base.py
    ├── test_dashboard_app.py
    ├── test_dashboard_filters.py
    ├── test_deep_dive_page.py
    ├── test_elo.py
    ├── test_espn_connector.py
    ├── test_evaluation_backtest.py
    ├── test_evaluation_metrics.py
    ├── test_evaluation_plotting.py
    ├── test_evaluation_simulation.py
    ├── test_evaluation_splitter.py
    ├── test_feature_serving.py
    ├── test_framework_validation.py
    ├── test_fuzzy.py
    ├── test_graph.py
    ├── test_home_page.py
    ├── test_imports.py
    ├── test_kaggle_connector.py
    ├── test_leaderboard_page.py
    ├── test_logger.py
    ├── test_model_base.py
    ├── test_model_elo.py
    ├── test_model_logistic_regression.py
    ├── test_model_registry.py
    ├── test_model_tracking.py
    ├── test_model_xgboost.py
    ├── test_normalization.py
    ├── test_opponent.py
    ├── test_package_structure.py
    ├── test_pool_scorer_page.py
    ├── test_repository.py
    ├── test_run_store_metrics.py
    ├── test_schema.py
    └── test_sequential.py
```

Naming conventions:

  • Test files: test_<module_name>.py

  • Test functions: test_<function>_<scenario>()

  • Fixtures: Descriptive names (e.g., sample_teams, elo_config, temp_data_dir)

See Conventions Guide for details.


Coverage Targets

| Module | Line | Branch | Rationale |
|---|---|---|---|
| `evaluation/metrics.py` | 95% | 90% | Critical - errors invalidate all evaluations |
| `evaluation/simulation.py` | 90% | 85% | Monte Carlo simulator |
| `model/` | 90% | 85% | Core abstraction |
| `transform/` | 85% | 80% | Feature correctness, leakage prevention |
| `ingest/` | 80% | 75% | Data quality |
| `utils/` | 75% | 70% | Lower priority |
| Overall | 80% | 75% | Balanced |

Coverage is a signal, not a gate. Use to identify gaps, not block PRs.

See Conventions Guide for details.


Testing Tools

| Tool | Purpose | Configuration |
|---|---|---|
| Pytest | Testing framework | `pyproject.toml` `[tool.pytest.ini_options]` |
| Hypothesis | Property-based + fuzz testing | Dev dependency |
| Mutmut | Mutation testing (quality) | Dev dependency |
| pytest-cov | Coverage reporting | `[tool.coverage.report]` |
| Nox | Session orchestration | `noxfile.py` |
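Custom markers must be registered with pytest, or strict-marker runs will reject them. A hypothetical `pyproject.toml` fragment (the descriptions are illustrative, not the project's actual configuration):

```toml
[tool.pytest.ini_options]
markers = [
    "smoke: fast sanity checks eligible for Tier 1 (pre-commit)",
    "slow: tests excluded from fast runs (Tier 2 only)",
    "unit: single function/class in isolation",
    "integration: multiple components or external systems",
    "property: property-based tests (Hypothesis)",
    "performance: NFR1 vectorization/speed compliance",
    "regression: guards previously fixed bugs",
    "no_mutation: tests incompatible with the mutmut runner",
]
```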


Domain-Specific Testing

Performance Testing (NFR1: Vectorization)

  • Smoke: Assertion-based vectorization checks (< 1s)

  • PR-time: Performance benchmarks, 60-second backtest target

```python
@pytest.mark.smoke
@pytest.mark.performance
def test_metrics_are_vectorized():
    """Quick check: no .iterrows() in metrics."""
    # See domain-testing.md for example
```
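One way an assertion-based vectorization check can work is to scan a function's source for row-wise iteration. A minimal sketch (the helper, the banned-pattern list, and the toy metric are assumptions, not the project's actual implementation):

```python
import inspect
from typing import Callable


def assert_vectorized(func: Callable) -> None:
    """Fail if the function's source contains row-wise pandas iteration."""
    source = inspect.getsource(func)
    for pattern in (".iterrows(", ".itertuples("):
        assert pattern not in source, f"{func.__name__} uses {pattern!r}"


def brier_score(predicted: list[float], actual: list[int]) -> float:
    """Toy stand-in for a vectorized evaluation metric."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


assert_vectorized(brier_score)  # passes: no row-wise iteration in the source
```

Source scanning is a heuristic, so the PR-time benchmarks remain the authoritative NFR1 check.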

Data Leakage Prevention (NFR4: Temporal Boundaries)

  • Smoke: API contract unit tests (fast)

  • PR-time: End-to-end workflow tests, property-based invariants

```python
@pytest.mark.smoke
def test_api_enforces_cutoff():
    """Quick check: API rejects future data."""
    # See domain-testing.md for example
```
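The temporal-boundary contract itself can be sketched in a few lines (the `serve_features` function is hypothetical, standing in for the project's feature-serving API):

```python
from datetime import date


def serve_features(rows: list[dict], cutoff: date) -> list[dict]:
    """Hypothetical serving API: only games strictly before the cutoff."""
    return [row for row in rows if row["game_date"] < cutoff]


def test_api_excludes_future_games() -> None:
    """NFR4 contract: features never leak games at or after the cutoff."""
    rows = [
        {"game_date": date(2023, 3, 1), "elo": 1500},
        {"game_date": date(2023, 4, 1), "elo": 1550},  # after cutoff
    ]
    served = serve_features(rows, cutoff=date(2023, 3, 15))
    assert [row["elo"] for row in served] == [1500]
```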

See Domain Testing Guide for comprehensive examples.


References

  • STYLE_GUIDE.md - Coding standards, vectorization rule

  • specs/05-architecture-fullstack.md - Architecture, nox workflow

  • specs/03-prd.md - Non-functional requirements (NFR1-NFR5)

  • pyproject.toml - Pytest configuration

  • .github/pull_request_template.md - PR checklist