Testing Strategy

A quick reference for the ncaa_eval project's testing approach. For detailed explanations and examples, see the testing guides.


Table of Contents

  1. Overview

  2. Detailed Guides

  3. Quick Decision Trees

  4. Test Markers Reference

  5. Test Commands Reference

  6. Test Organization

  7. Coverage Targets

  8. Testing Tools

  9. Domain-Specific Testing

  10. References


Overview

Key Principles

  1. Fast feedback via Tier 1 (pre-commit, < 10s total)

  2. Thorough validation via Tier 2 (PR/CI, complete suite)

  3. Four orthogonal dimensions - choose appropriate combination

  4. Coverage is a signal, not a gate - identify gaps, don’t block

  5. Mutation testing evaluates test quality (critical modules only)

  6. Vectorization compliance via performance testing (NFR1)

  7. Temporal integrity via data leakage testing (NFR4)

  8. 4-tier execution model - Tier 1 (pre-commit) → Tier 2 (PR/CI) → Tier 3 (AI review) → Tier 4 (owner review)

Four Orthogonal Dimensions

This strategy separates four independent dimensions of testing. Choose the appropriate combination for each test case:

  1. Test Scope - What you’re testing → Scope Guide

    • Unit: Single function/class in isolation

    • Integration: Multiple components working together

  2. Test Approach - How you write the test → Approach Guide

    • Example-based: Concrete inputs → expected outputs

    • Property-based (Hypothesis): Invariants that should hold for all inputs

    • Fuzz-based (Hypothesis): Random/mutated inputs to find crashes and error handling gaps (no dedicated marker — use @pytest.mark.slow)

  3. Test Purpose - Why you’re writing the test → Purpose Guide

    • Functional: Correctness of behavior (default)

    • Performance: Speed/efficiency compliance (NFR1: vectorization)

    • Regression: Prevent previously fixed bugs from recurring

  4. Execution Scope - When tests/checks run → Execution Guide

    • Tier 1 (Pre-commit): Smoke tests + fast checks (< 10s total)

    • Tier 2 (PR/CI): Complete suite + coverage + mutation

    • Tier 3/4: AI + Owner review

Note: Mutation testing and coverage are not test types - they’re quality assurance tools. See Quality Assurance Guide.

Execution Tiers (When Checks Run)

The project uses a 4-tier execution model that balances speed with thoroughness:

Tier 1: Pre-Commit (< 10s total)

Fast, local checks that run on every commit:

| Check | Tool | What It Catches |
|---|---|---|
| Lint | `ruff check .` | Style violations, import issues |
| Format | `ruff format --check .` | Inconsistent formatting |
| Type-check | `mypy` (strict) | Missing annotations, type errors |
| Smoke tests | `pytest -m smoke` | Broken imports, sanity failures |

Rationale: Catch 80% of issues in seconds before code leaves your machine.
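The Tier 1 checks above are typically wired up as pre-commit hooks. A hypothetical `.pre-commit-config.yaml` sketch (hook ids and names are illustrative, not the project's actual configuration):

```yaml
repos:
  - repo: local
    hooks:
      - id: ruff-lint
        name: ruff check
        entry: ruff check .
        language: system
        pass_filenames: false
      - id: ruff-format
        name: ruff format
        entry: ruff format --check .
        language: system
        pass_filenames: false
      - id: mypy
        name: mypy (strict)
        entry: mypy
        language: system
        pass_filenames: false
      - id: smoke-tests
        name: pytest smoke
        entry: pytest -m smoke
        language: system
        pass_filenames: false
```

Local `system` hooks keep the pre-commit run on the project's own toolchain rather than pinned mirror repos.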

Tier 2: PR/CI (minutes)

Comprehensive validation before merge:

| Check | Tool | What It Catches |
|---|---|---|
| Full test suite | `pytest` | All regressions, edge cases |
| Integration tests | `pytest -m integration` | Component interaction failures |
| Property-based | `pytest -m property` | Invariant violations |
| Performance | `pytest -m performance` | Vectorization violations, speed regressions |
| Coverage | `pytest-cov` | Untested code paths |
| Mutation (Tier 1 modules) | `mutmut` | Weak tests, coverage gaps |

Rationale: Catch remaining 20% requiring full project context.

Tier 3: AI Code Review

Docstring quality, vectorization compliance, architecture alignment, test quality, design intent.

Tier 4: Owner Review

Functional correctness, strategic alignment, complexity appropriateness, scope creep prevention.

See Execution Guide for complete details on each tier.


Detailed Guides

For comprehensive explanations, examples, and best practices:


Quick Decision Trees

Which test scope?

```mermaid
flowchart TD
    Start{Does it interact with<br/>external systems?<br/>files, DB, network}
    Start -->|YES| Integration[Integration test<br/>@pytest.mark.integration<br/>PR-time only]
    Start -->|NO| Unit[Unit test<br/>fast, pre-commit eligible if smoke]
```

Which approach?

```mermaid
flowchart TD
    Start{Testing error handling<br/>or crash resilience?}
    Start -->|YES| Fuzz[Fuzz-based<br/>Hypothesis st.text/st.binary]
    Start -->|NO| Known{Have specific<br/>known scenarios?}
    Known -->|YES| Example[Example-based<br/>parametrize for multiple cases]
    Known -->|NO| Invariant{Can you state<br/>an invariant?}
    Invariant -->|YES| Property[Property-based<br/>@pytest.mark.property<br/>Hypothesis]
    Invariant -->|NO| ExampleAlt[Example-based<br/>test specific examples]
```
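As a sketch of the property-based branch above, state an invariant and let Hypothesis search for counterexamples (the `clip_probability` function and the invariant are illustrative, not from the project):

```python
import pytest
from hypothesis import given, strategies as st


def clip_probability(p: float) -> float:
    """Clamp a raw model output into the valid probability range."""
    return min(max(p, 0.0), 1.0)


@pytest.mark.property
@given(st.floats(allow_nan=False, allow_infinity=False))
def test_clip_probability_stays_bounded(p: float) -> None:
    """Invariant: for any finite input, the result lies in [0, 1]."""
    assert 0.0 <= clip_probability(p) <= 1.0
```

The `@given` decorator replaces a handful of hand-picked examples with hundreds of generated ones, which is why these tests run in Tier 2 rather than pre-commit.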

Which execution tier?

```mermaid
flowchart TD
    Start{Is test fast?<br/>under 1 second}
    Start -->|NO| Tier2Slow[Tier 2 only<br/>@pytest.mark.slow]
    Start -->|YES| Critical{Import/sanity/schema check<br/>OR critical regression?}
    Critical -->|YES| Tier1[Tier 1 eligible<br/>@pytest.mark.smoke]
    Critical -->|NO| Tier2Fast[Tier 2 only<br/>save pre-commit budget]
```
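A typical Tier 1 smoke test is just an import/sanity check. A minimal sketch (in the real suite the loop would cover `ncaa_eval` modules; stdlib names keep this example self-contained):

```python
import importlib

import pytest


@pytest.mark.smoke
def test_core_modules_import() -> None:
    """Tier 1 sanity check: key modules import without side-effect errors."""
    for module in ("json", "csv", "sqlite3"):  # stand-ins for project modules
        assert importlib.import_module(module)
```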

Test Markers Reference

| Marker | Dimension | Command |
|---|---|---|
| `@pytest.mark.smoke` | Speed | `pytest -m smoke` |
| `@pytest.mark.slow` | Speed | `pytest -m "not slow"` |
| `@pytest.mark.unit` | Scope | `pytest -m unit` |
| `@pytest.mark.integration` | Scope | `pytest -m integration` |
| `@pytest.mark.property` | Approach | `pytest -m property` |
| `@pytest.mark.performance` | Purpose | `pytest -m performance` |
| `@pytest.mark.regression` | Purpose | `pytest -m regression` |
| `@pytest.mark.no_mutation` | Quality | Tests incompatible with mutmut runner |

Combine markers across dimensions:

```python
@pytest.mark.integration
@pytest.mark.property
@pytest.mark.regression
```
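For instance, a single test can carry one marker per dimension it exercises. A hedged sketch (the test body and filename are illustrative, not a test from the project):

```python
import tempfile
from pathlib import Path

import pytest


@pytest.mark.integration  # scope: touches the filesystem
@pytest.mark.regression   # purpose: guards a previously fixed bug
def test_repository_roundtrip_preserves_rows() -> None:
    """Rows written to the repository file are read back unchanged."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "teams.csv"
        path.write_text("team_id,name\n1101,Duke\n")
        assert "1101,Duke" in path.read_text()
```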

Test Commands Reference

| Context | Command | What Runs |
|---|---|---|
| Tier 1 (Pre-commit) | `pytest -m smoke` | Smoke tests only (< 5s; Tier 1 overall < 10s) |
| Tier 2 (PR/CI - full) | `pytest` | All tests |
| Tier 2 (PR/CI - coverage) | `pytest --cov=src/ncaa_eval --cov-report=term-missing` | All + coverage report |
| Tier 2 (exclude slow) | `pytest -m "not slow"` | All except slow tests |
| Filter by dimension | `pytest -m integration` | Filter by marker |
| Combined filters | `pytest -m "integration and regression"` | Intersection |


Test Organization

```text
tests/
├── __init__.py
├── conftest.py                          # Shared fixtures
├── fixtures/
│   ├── .gitkeep
│   └── kaggle/
│       ├── MNCAATourneyCompactResults.csv
│       ├── MRegularSeasonCompactResults.csv
│       ├── MSeasons.csv
│       └── MTeams.csv
├── integration/
│   ├── __init__.py
│   ├── test_documented_commands.py
│   ├── test_elo_integration.py
│   ├── test_feature_serving_integration.py
│   └── test_sync.py
└── unit/
    ├── __init__.py
    ├── test_bracket_page.py
    ├── test_bracket_renderer.py
    ├── test_calibration.py
    ├── test_chronological_serving.py
    ├── test_cli_train.py
    ├── test_connector_base.py
    ├── test_dashboard_app.py
    ├── test_dashboard_filters.py
    ├── test_deep_dive_page.py
    ├── test_elo.py
    ├── test_espn_connector.py
    ├── test_evaluation_backtest.py
    ├── test_evaluation_metrics.py
    ├── test_evaluation_plotting.py
    ├── test_evaluation_simulation.py
    ├── test_evaluation_splitter.py
    ├── test_feature_serving.py
    ├── test_framework_validation.py
    ├── test_fuzzy.py
    ├── test_graph.py
    ├── test_home_page.py
    ├── test_imports.py
    ├── test_kaggle_connector.py
    ├── test_leaderboard_page.py
    ├── test_logger.py
    ├── test_model_base.py
    ├── test_model_elo.py
    ├── test_model_logistic_regression.py
    ├── test_model_registry.py
    ├── test_model_tracking.py
    ├── test_model_xgboost.py
    ├── test_normalization.py
    ├── test_opponent.py
    ├── test_package_structure.py
    ├── test_pool_scorer_page.py
    ├── test_repository.py
    ├── test_run_store_metrics.py
    ├── test_schema.py
    └── test_sequential.py
```

Naming conventions:

  • Test files: test_<module_name>.py

  • Test functions: test_<function>_<scenario>()

  • Fixtures: Descriptive names (e.g., sample_teams, elo_config, temp_data_dir)

See Conventions Guide for details.


Coverage Targets

| Module | Line | Branch | Rationale |
|---|---|---|---|
| `evaluation/metrics.py` | 95% | 90% | Critical - errors invalidate all evaluations |
| `evaluation/simulation.py` | 90% | 85% | Monte Carlo simulator |
| `model/` | 90% | 85% | Core abstraction |
| `transform/` | 85% | 80% | Feature correctness, leakage prevention |
| `ingest/` | 80% | 75% | Data quality |
| `utils/` | 75% | 70% | Lower priority |
| Overall | 80% | 75% | Balanced |

Coverage is a signal, not a gate. Use to identify gaps, not block PRs.

See Conventions Guide for details.


Testing Tools

| Tool | Purpose | Configuration |
|---|---|---|
| Pytest | Testing framework | `pyproject.toml` `[tool.pytest.ini_options]` |
| Hypothesis | Property-based + fuzz testing | Dev dependency |
| Mutmut | Mutation testing (quality) | Dev dependency |
| pytest-cov | Coverage reporting | `[tool.coverage.report]` |
| Nox | Session orchestration | `noxfile.py` |
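Custom markers must be registered with pytest, or strict-marker runs will reject them. A hypothetical `pyproject.toml` fragment (the descriptions are illustrative, not the project's actual configuration):

```toml
[tool.pytest.ini_options]
markers = [
    "smoke: fast sanity checks eligible for Tier 1 (pre-commit)",
    "slow: tests excluded from fast runs (Tier 2 only)",
    "unit: single function/class in isolation",
    "integration: multiple components or external systems",
    "property: property-based tests (Hypothesis)",
    "performance: NFR1 vectorization/speed compliance",
    "regression: guards previously fixed bugs",
    "no_mutation: tests incompatible with the mutmut runner",
]
```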


Domain-Specific Testing

Performance Testing (NFR1: Vectorization)

  • Smoke: Assertion-based vectorization checks (< 1s)

  • PR-time: Performance benchmarks, 60-second backtest target

```python
@pytest.mark.smoke
@pytest.mark.performance
def test_metrics_are_vectorized():
    """Quick check: no .iterrows() in metrics."""
    # See domain-testing.md for example
```
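One way an assertion-based vectorization check can work is to scan a function's source for row-wise iteration. A minimal sketch (the helper, the banned-pattern list, and the toy metric are assumptions, not the project's actual implementation):

```python
import inspect
from typing import Callable


def assert_vectorized(func: Callable) -> None:
    """Fail if the function's source contains row-wise pandas iteration."""
    source = inspect.getsource(func)
    for pattern in (".iterrows(", ".itertuples("):
        assert pattern not in source, f"{func.__name__} uses {pattern!r}"


def brier_score(predicted: list[float], actual: list[int]) -> float:
    """Toy stand-in for a vectorized evaluation metric."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


assert_vectorized(brier_score)  # passes: no row-wise iteration in the source
```

Source scanning is a heuristic, so the PR-time benchmarks remain the authoritative NFR1 check.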

Data Leakage Prevention (NFR4: Temporal Boundaries)

  • Smoke: API contract unit tests (fast)

  • PR-time: End-to-end workflow tests, property-based invariants

```python
@pytest.mark.smoke
def test_api_enforces_cutoff():
    """Quick check: API rejects future data."""
    # See domain-testing.md for example
```
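The temporal-boundary contract itself can be sketched in a few lines (the `serve_features` function is hypothetical, standing in for the project's feature-serving API):

```python
from datetime import date


def serve_features(rows: list[dict], cutoff: date) -> list[dict]:
    """Hypothetical serving API: only games strictly before the cutoff."""
    return [row for row in rows if row["game_date"] < cutoff]


def test_api_excludes_future_games() -> None:
    """NFR4 contract: features never leak games at or after the cutoff."""
    rows = [
        {"game_date": date(2023, 3, 1), "elo": 1500},
        {"game_date": date(2023, 4, 1), "elo": 1550},  # after cutoff
    ]
    served = serve_features(rows, cutoff=date(2023, 3, 15))
    assert [row["elo"] for row in served] == [1500]
```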

See Domain Testing Guide for comprehensive examples.


References

  • STYLE_GUIDE.md - Coding standards, vectorization rule

  • specs/05-architecture-fullstack.md - Architecture, nox workflow

  • specs/03-prd.md - Non-functional requirements (NFR1-NFR5)

  • pyproject.toml - Pytest configuration

  • .github/pull_request_template.md - PR checklist