# Test Suite Quality Assurance

This guide explains **how to ensure test quality** through mutation testing and coverage analysis.

---

## Overview

Tools and techniques for **evaluating the quality and effectiveness of your test suite**. These are NOT test types - they are meta-testing practices that evaluate your existing tests.

---

## 1. Mutation Testing (Mutmut)

### What it is

A technique that evaluates test quality by mutating production code and verifying that tests fail.

### How it works

1. Mutmut introduces small changes to your code (e.g., `+` → `-`, `>` → `>=`, `True` → `False`)
2. Runs your existing test suite against each mutation
3. Verifies that at least one test fails for each mutation
4. **Surviving mutants** = mutations that didn't cause test failures = potential gaps in test coverage

### Purpose

Verify that your tests are effective at catching bugs, not just achieving high coverage.

### When to use

- **After the initial test suite is written** - Mutation testing measures test quality, not code quality
- **Validating test quality for high-risk modules** - Ensure critical code has strong test coverage
- **Periodic quality audits** - Not every PR (too slow), but periodically to validate test effectiveness
- **Before major refactoring** - Ensure tests will catch regressions

### NOT a test type

Mutation testing doesn't write tests - it evaluates your existing tests.

### Characteristics

- **Extremely slow:** Mutates code and reruns the entire test suite for each mutation
- **Quality signal:** Measures test effectiveness (do tests catch bugs?), not just coverage (are lines executed?)
- **Selective application:** Only run on designated high-priority modules
- **Orthogonal to test dimensions:** Works with any test scope, approach, or purpose

### Priority Tiers

| Tier | Modules | Frequency | Rationale |
|------|---------|-----------|-----------|
| **Tier 1 (Always)** | `evaluation/metrics.py`, `evaluation/simulation.py` | Every PR | Critical for correctness - errors invalidate all evaluations |
| **Tier 2 (Periodic)** | `model/`, `transform/` | Weekly or before release | Important but less critical than metrics |
| **Tier 3 (Rare)** | `ingest/`, `utils/` | Monthly or as needed | Lower priority, test coverage may suffice |

### Usage

```bash
# Run mutation testing on a high-priority module
mutmut run --paths-to-mutate=src/ncaa_eval/evaluation/metrics.py

# Review results
mutmut results

# Show details of surviving mutants (gaps in test coverage)
mutmut show

# Run only against fast tests (for quicker feedback)
mutmut run --paths-to-mutate=src/ncaa_eval/evaluation/metrics.py --runner="pytest -m smoke"
```

### Interpreting results

- **Mutation Score = (Killed Mutants / Total Mutants) × 100%**
- **Target:** 80%+ for Tier 1 modules, 70%+ for Tier 2
- **Surviving mutants** indicate:
  - Missing test cases
  - Weak assertions (e.g., `assert result is not None` instead of `assert result == expected`) - see the sketch below
  - Dead code (code that can be changed without affecting behavior)
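The weak-assertion failure mode is worth seeing concretely. Below is a minimal sketch (the `win_probability_to_odds` function and both tests are hypothetical, not from this codebase): a test that only checks for a non-None result survives an operator mutation, while a test asserting the exact value kills it.

```python
# Hypothetical production code (illustrative only)
def win_probability_to_odds(p: float) -> float:
    return p / (1 - p)


# Weak assertion: SURVIVES a mutant that flips `/` to `*`,
# because the mutated result (0.1875) is still "not None".
def test_odds_weak():
    assert win_probability_to_odds(0.75) is not None


# Specific assertion: KILLS the same mutant,
# because 0.75 * 0.25 == 0.1875, not the expected 3.0.
def test_odds_strong():
    assert win_probability_to_odds(0.75) == 3.0
```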
### Example surviving mutant (test gap)

```python
# Original code
def calculate_margin(home_score, away_score):
    return home_score - away_score

# Mutant (changed - to +)
def calculate_margin(home_score, away_score):
    return home_score + away_score  # Mutmut changed this

# If this mutant survives, no test verifies the margin calculation is correct!
# Solution: add a test with known inputs/outputs
def test_calculate_margin_correctness():
    assert calculate_margin(100, 80) == 20  # Would catch the + mutation
```

### Marker

No dedicated marker — mutation testing is run via `mutmut` as a quality tool, not as a pytest marker.

### Execution Tier

- **Tier 1 (Pre-commit):** ❌ NO (extremely slow)
- **Tier 2 (PR/CI):** ✅ YES, but only for designated Tier 1 modules

See [Execution Guide](execution.md) for details on execution tiers.

---

## 2. Coverage Analysis

### What it is

Measurement of what percentage of your code is executed during test runs.

### Types of coverage

- **Line coverage:** Percentage of code lines executed
- **Branch coverage:** Percentage of conditional branches (if/else) taken - see the sketch after this list
- **Function coverage:** Percentage of functions called
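The line-vs-branch distinction is easiest to see in code. Here is a minimal sketch (the `clamp_score` function and its test are hypothetical): a single test executes every line, giving 100% line coverage, yet branch coverage still flags the `if` as partially covered because its false branch is never taken.

```python
# Hypothetical example: 100% line coverage, incomplete branch coverage.
def clamp_score(score: int) -> int:
    if score < 0:
        score = 0
    return score


def test_clamp_negative():
    # Executes every line of clamp_score (100% line coverage),
    # but the false branch of `if score < 0` is never taken,
    # so running pytest with --cov-branch reports the conditional
    # as partially covered.
    assert clamp_score(-5) == 0
```

Adding a second test with a non-negative input (e.g., `assert clamp_score(10) == 10`) would cover the remaining branch.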
### Purpose

Identify untested code paths. Coverage is NOT a measure of test quality.

### Important distinction

- **High coverage ≠ Good tests:** You can have 100% coverage with weak assertions
- **Mutation testing** verifies test quality; **coverage** only verifies code execution

### Tool

`pytest-cov`

### Usage

```bash
# Generate coverage report
pytest --cov=src/ncaa_eval --cov-report=term-missing

# HTML report for detailed analysis
pytest --cov=src/ncaa_eval --cov-report=html
open htmlcov/index.html

# Branch coverage (more thorough than line coverage)
pytest --cov=src/ncaa_eval --cov-branch --cov-report=term-missing
```

### Coverage is a signal, not a gate

Use coverage to identify gaps, but don't block PRs solely for coverage numbers. A few well-written tests with 70% coverage are better than many weak tests with 100% coverage.

---

## 3. Combining Quality Assurance Tools

Use mutation testing and coverage analysis together for a comprehensive quality assessment:

### Workflow

1. **Write tests** for your feature
2. **Check coverage** - Identify untested code paths
3. **Add tests** to cover gaps
4. **Run mutation testing** - Verify tests actually catch bugs
5. **Fix surviving mutants** - Add stronger assertions or test cases

### Example - Comprehensive quality check

```bash
# Step 1: Run tests with coverage
pytest --cov=src/ncaa_eval/evaluation/metrics.py --cov-report=term-missing
# Coverage: 95% ✓ (high coverage)

# Step 2: Run mutation testing
mutmut run --paths-to-mutate=src/ncaa_eval/evaluation/metrics.py
# Mutation Score: 60% ✗ (many surviving mutants = weak tests!)

# Step 3: Review surviving mutants
mutmut results
mutmut show 42
# Mutation: changed `>` to `>=` and tests still passed
```

```python
# Step 4: Add a test to catch this mutation
def test_threshold_boundary():
    """Regression test: Ensure > is used, not >= (catches mutant #42)."""
    result = is_above_threshold(value=50, threshold=50)
    assert result is False  # Should be False because 50 is NOT > 50
```

---

## Key Principles

1. ✅ **Coverage shows what code runs** - It identifies untested code paths
2. ✅ **Mutation testing shows test effectiveness** - It verifies tests catch bugs
3. ✅ **Use both together** - Coverage finds gaps, mutation testing validates quality
4. ✅ **Selective mutation testing** - Only run on high-priority modules (too slow for everything)
5. ✅ **Coverage is a signal, not a gate** - Don't block PRs solely on coverage numbers
6. ✅ **Target 80%+ mutation score** - For critical modules (Tier 1)
7. ✅ **Weak assertions fail mutation tests** - Use specific assertions, not just `is not None`

---

## See Also

- [Test Scope Guide](test-scope-guide.md) - Unit vs Integration tests
- [Test Approach Guide](test-approach-guide.md) - Example-based vs Property-based
- [Test Purpose Guide](test-purpose-guide.md) - Functional, Performance, Regression
- [Execution Guide](execution.md) - When tests and checks run (4-tier model)
- [Conventions Guide](conventions.md) - Fixtures, markers, organization, coverage targets
- [Domain Testing Guide](domain-testing.md) - Performance and data leakage testing