Test Suite Quality Assurance

This guide explains how to ensure test quality through mutation testing and coverage analysis.


Overview

Tools and techniques for evaluating the quality and effectiveness of your test suite. These are NOT test types - they are meta-testing practices that evaluate your existing tests.


1. Mutation Testing (Mutmut)

What it is

A technique that evaluates test quality by mutating production code and verifying that tests fail.

How it works

  1. Mutmut introduces small changes to your code (e.g., + becomes -, > becomes >=, True becomes False)

  2. Runs your existing test suite against each mutation

  3. Verifies that at least one test fails for each mutation

  4. Mutations that do not cause any test failure are called surviving mutants; each one points to a potential gap in test coverage
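The loop above can be sketched in a few lines of plain Python. This is a toy illustration of the principle only, not how mutmut works internally: the "mutants" here are hand-written stand-ins for the operator swaps mutmut generates automatically.

```python
# Toy mutation-testing loop: patch in each mutant, rerun the test,
# and record whether the mutant was killed (test failed) or survived.

def calculate_margin(home_score, away_score):
    return home_score - away_score

def test_margin():
    assert calculate_margin(100, 80) == 20

# Hand-written mutants mimicking mutmut's operator swaps (- to +, etc.).
mutants = {
    "minus_to_plus": lambda h, a: h + a,
    "swap_operands": lambda h, a: a - h,
}

def run_mutation_check():
    results = {}
    original = calculate_margin
    for name, mutant in mutants.items():
        globals()["calculate_margin"] = mutant  # install the mutant
        try:
            test_margin()
            results[name] = "survived"  # test passed: coverage gap
        except AssertionError:
            results[name] = "killed"    # test failed: mutant caught
        finally:
            globals()["calculate_margin"] = original
    return results

print(run_mutation_check())
# -> {'minus_to_plus': 'killed', 'swap_operands': 'killed'}
```

Both mutants are killed here because the test pins an exact expected value; a weaker assertion (e.g., checking only that the result is not None) would let both survive.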

Purpose

Verify that your tests are effective at catching bugs, not just achieving high coverage.

When to use

  • After the initial test suite is written - Mutation testing measures test quality, not code quality

  • Validating test quality for high-risk modules - Ensure critical code has strong test coverage

  • Periodic quality audits - Not every PR (too slow), but periodically to validate test effectiveness

  • Before major refactoring - Ensure tests will catch regressions

NOT a test type

Mutation testing doesn’t write tests - it evaluates your existing tests.

Characteristics

  • Extremely slow: Mutates code and reruns entire test suite for each mutation

  • Quality signal: Measures test effectiveness (do tests catch bugs?), not just coverage (are lines executed?)

  • Selective application: Only run on designated high-priority modules

  • Orthogonal to test dimensions: Works with any test scope, approach, or purpose

Priority Tiers

Tier               Modules                                           Frequency                  Rationale
Tier 1 (Always)    evaluation/metrics.py, evaluation/simulation.py   Every PR                   Critical for correctness - errors invalidate all evaluations
Tier 2 (Periodic)  model/, transform/                                Weekly or before release   Important but less critical than metrics
Tier 3 (Rare)      ingest/, utils/                                   Monthly or as needed       Lower priority, test coverage may suffice

Usage

# Run mutation testing on high-priority module
mutmut run --paths-to-mutate=src/ncaa_eval/evaluation/metrics.py

# Review results
mutmut results

# Show details of surviving mutants (gaps in test coverage)
mutmut show <mutant-id>

# Run only against fast tests (for quicker feedback)
mutmut run --paths-to-mutate=src/ncaa_eval/evaluation/metrics.py --runner="pytest -m smoke"
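Rather than retyping the flags, the same settings can live in setup.cfg, which mutmut reads automatically. This is a sketch against mutmut 2.x; check mutmut --help for the options your installed version supports (the smoke marker is the same one used in the runner flag above):

```ini
# setup.cfg
[mutmut]
paths_to_mutate=src/ncaa_eval/evaluation/metrics.py
runner=python -m pytest -x -m smoke
tests_dir=tests/
```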

Interpreting results

  • Mutation Score = (Killed Mutants / Total Mutants) × 100%

  • Target: 80%+ for Tier 1 modules, 70%+ for Tier 2

  • Surviving mutants indicate:

    • Missing test cases

    • Weak assertions (e.g., assert result is not None instead of assert result == expected)

    • Dead code (code that can be changed without affecting behavior)
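The score formula is simple arithmetic; a small helper (hypothetical, not part of mutmut) makes the tier targets from this guide executable:

```python
# Mutation Score = (Killed Mutants / Total Mutants) x 100%
def mutation_score(killed: int, total: int) -> float:
    return 100.0 * killed / total

# Targets from this guide: 80%+ for Tier 1 modules, 70%+ for Tier 2.
TARGETS = {"tier1": 80.0, "tier2": 70.0}

def meets_target(killed: int, total: int, tier: str) -> bool:
    return mutation_score(killed, total) >= TARGETS[tier]

print(mutation_score(45, 50))         # -> 90.0
print(meets_target(30, 50, "tier1"))  # 60% < 80%, so -> False
```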

Example surviving mutant (test gap)

# Original code
def calculate_margin(home_score, away_score):
    return home_score - away_score

# Mutant (changed - to +)
def calculate_margin(home_score, away_score):
    return home_score + away_score  # Mutmut changed this

# If this mutant survives, it means no test verifies the margin calculation is correct!
# Solution: Add a test with known inputs/outputs
def test_calculate_margin_correctness():
    assert calculate_margin(100, 80) == 20  # Would catch the + mutation

Marker

No dedicated marker — mutation testing is run via mutmut as a quality tool, not as a pytest marker.

Execution Tier

  • Tier 1 (Pre-commit): ❌ NO (extremely slow)

  • Tier 2 (PR/CI): ✅ YES, but only for designated Tier 1 modules

See Execution Guide for details on execution tiers.


2. Coverage Analysis

What it is

Measurement of what percentage of your code is executed during test runs.

Types of coverage

  • Line coverage: Percentage of code lines executed

  • Branch coverage: Percentage of conditional branches (if/else) taken

  • Function coverage: Percentage of functions called
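The difference between line and branch coverage is easiest to see on a function with an implicit else. In this hypothetical example, a single test executes every line (100% line coverage), yet the branch where the condition is false is never exercised, which only branch coverage would flag:

```python
def clamp_margin(margin, cap=50):
    # Implicit else: when margin <= cap, the if-body is simply skipped.
    if margin > cap:
        margin = cap
    return margin

# This one test runs every line of clamp_margin (the if-body executes),
# so line coverage reports 100%. The False arm of the condition is never
# taken, so branch coverage reports a missing partial branch.
def test_clamp_caps_large_margin():
    assert clamp_margin(80) == 50
```

Adding a second test such as assert clamp_margin(10) == 10 exercises the missing branch.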

Purpose

Identify untested code paths, NOT a measure of test quality.

Important distinction

  • High coverage ≠ Good tests: You can have 100% coverage with weak assertions

  • Mutation testing verifies test quality; coverage only verifies code execution
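A concrete (hypothetical) illustration of that distinction: both tests below execute every line of calculate_margin, so each alone yields 100% coverage, but only the second would kill the "- to +" mutant from the earlier example.

```python
def calculate_margin(home_score, away_score):
    return home_score - away_score

# Weak test: 100% line coverage, but it still passes if `-` mutates
# to `+`, because 180 is also "not None".
def test_margin_weak():
    assert calculate_margin(100, 80) is not None

# Strong test: identical coverage, but it pins the exact expected value,
# so an operator mutation makes it fail.
def test_margin_strong():
    assert calculate_margin(100, 80) == 20
```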

Tool

pytest-cov

Usage

# Generate coverage report
pytest --cov=src/ncaa_eval --cov-report=term-missing

# HTML report for detailed analysis
pytest --cov=src/ncaa_eval --cov-report=html
open htmlcov/index.html

# Branch coverage (more thorough than line coverage)
pytest --cov=src/ncaa_eval --cov-branch --cov-report=term-missing
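Branch coverage and the source path can also be set once in configuration so every pytest run picks them up. A sketch using coverage.py's pyproject.toml support (the keys shown are standard coverage.py options; adjust paths to your layout):

```toml
# pyproject.toml
[tool.coverage.run]
source = ["src/ncaa_eval"]
branch = true

[tool.coverage.report]
show_missing = true
```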

Coverage is a signal, not a gate

Use coverage to identify gaps, but don’t block PRs solely for coverage numbers. A few well-written tests with 70% coverage are better than many weak tests with 100% coverage.


3. Combining Quality Assurance Tools

Use both mutation testing and coverage analysis together for comprehensive quality assessment:

Workflow

  1. Write tests for your feature

  2. Check coverage - Identify untested code paths

  3. Add tests to cover gaps

  4. Run mutation testing - Verify tests actually catch bugs

  5. Fix surviving mutants - Add stronger assertions or test cases

Example - Comprehensive quality check

# Step 1: Run tests with coverage
pytest --cov=src/ncaa_eval/evaluation/metrics.py --cov-report=term-missing

# Coverage: 95% ✓ (high coverage)

# Step 2: Run mutation testing
mutmut run --paths-to-mutate=src/ncaa_eval/evaluation/metrics.py

# Mutation Score: 60% ✗ (many surviving mutants = weak tests!)

# Step 3: Review surviving mutants
mutmut results
mutmut show 42  # Mutation: changed `>` to `>=` and tests still passed

# Step 4: Add test to catch this mutation
def test_threshold_boundary():
    """Regression test: Ensure > is used, not >= (catches mutant #42)."""
    result = is_above_threshold(value=50, threshold=50)
    assert result is False  # Should be False because 50 is NOT > 50

Key Principles

  1. Coverage shows what code runs - It identifies untested code paths

  2. Mutation testing shows test effectiveness - It verifies tests catch bugs

  3. Use both together - Coverage finds gaps, mutation testing validates quality

  4. Selective mutation testing - Only run on high-priority modules (too slow for everything)

  5. Coverage is a signal, not a gate - Don’t block PRs solely on coverage numbers

  6. Target 80%+ mutation score - For critical modules (Tier 1)

  7. Weak assertions fail mutation tests - Use specific assertions, not just is not None


See Also