You are measuring code coverage wrong. Not because code coverage is a bad metric, but because the way people are using it makes it one.
I agree with most of the criticism. And that is exactly why I know how to fix it.
The problem is part tooling, part education, and part teams cargo-culting a number without understanding what it actually tells them. Let's break it down.
The case against coverage (and why it's half right)
Teams often end up in the same vicious cycle: "code coverage must be X% or your code is bad." The number becomes the goal instead of a tool. Developers get frustrated. Criticism follows.
The most common argument against coverage:
You can just write tests that call a bunch of functions without asserting anything, and get 100% coverage!
This is true. And it is also not a coverage problem — it is a testing culture problem. We will get to that.
The drama
Theo made a video in which he criticizes code coverage, among other topics. I actually agree with his point. He is objectively right: 100% code coverage does not mean your code is well tested.
An example he gives:
While working at Twitch [...] we had a hard line set at minimum 85% code coverage.
I was rewriting this function from scratch, swapping 100K lines of code (LoC) at 90% code coverage for 2K LoC at 100% coverage.
Obvious win, ... right?
But he was not able to merge this PR, despite removing 98% of the code and having 100% coverage. Why?
Because that would mean project code coverage would drop; the massive function had great coverage and weighed a lot in the metric.
That is a REAL and surprisingly common situation. It makes developers loathe code coverage. It turns a quality metric into unnecessary bureaucracy that blocks good code from shipping.
Especially when non-technical management sets coverage goals, it becomes a checkbox exercise: "hit the number so corporate's reports look good." That is not quality engineering. That is theater.
Where you went wrong
Full disclosure: I build OtterWise, a code quality platform that tracks coverage metrics. I have a bias, and I also have years of watching how teams actually use — and misuse — these metrics. Where I mention OtterWise features below, it is because we built them specifically to solve these exact problems. Not because I am sneaking in a sales pitch; because we saw these failures and engineered around them.
Code coverage as a metric by itself is not incredibly useful. I agree.
I like to compare it to agile. Agile gets a bad rep, and almost always because teams shoehorn it in and use it wrong. Same with coverage.
The wrong metric for the wrong goal
Twitch's problem was a strict project-level code coverage requirement of 85%. How do we fix this?
Lowering the threshold is a band-aid, not a fix. Being "more lax" is just admitting you do not know what to track.
The problem is not the 85%, or the strictness, or even that it is project-level tracking. The problem is the why. Why are they tracking code coverage? What is the actual goal? The goal is to improve code quality and write good tests.
Here is the fix:
Track patch coverage, not just project coverage
Project-level code coverage (sum of all covered lines divided by all lines) is decent for tracking progress over time, but it has the problem Theo hit: removing well-covered code tanks the number, making a great PR look bad.
Patch coverage measures how much of the modified code is covered. Theo had 100% patch coverage. His PR was perfect.
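The idea is simple enough to sketch in a few lines. This is a minimal illustration, not any tool's actual implementation; the data shapes (changed and covered line numbers per file) are assumptions about what a diff parser and a coverage report would give you:

```python
# Minimal sketch of patch coverage: of the lines a PR touched,
# how many does the test suite execute? (Hypothetical data shapes.)

def patch_coverage(changed_lines: dict[str, set[int]],
                   covered_lines: dict[str, set[int]]) -> float:
    """Return the fraction of changed executable lines that are covered."""
    changed = sum(len(lines) for lines in changed_lines.values())
    covered = sum(
        len(lines & covered_lines.get(path, set()))
        for path, lines in changed_lines.items()
    )
    return covered / changed if changed else 1.0  # empty diff counts as covered

# A refactor that only deletes code shrinks `changed_lines`, so it can
# never be punished the way project-level coverage punishes it.
pct = patch_coverage(
    changed_lines={"billing.py": {10, 11, 12, 13}},
    covered_lines={"billing.py": {10, 11, 12}},
)
print(f"{pct:.0%}")  # → 75%
```

Note the key property: deleted lines simply drop out of the denominator, so a big deletion PR with well-tested remaining changes scores 100%.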
We have seen teams massively increase deployment frequency after switching to patch coverage tracking. It gave developers and reviewers greater confidence that new code was tested properly, without punishing refactors.
OtterWise tracks patch coverage as a first-class metric, alongside deployment frequency, time-to-merge, and more.
Combine coverage with complexity
On top of patch coverage, look at Cyclomatic Complexity and CRAP scores. Theo's refactor — removing 98k lines of code — would have massively improved complexity scores.
CRAP = Change Risk Anti-Pattern, a metric that combines code coverage and cyclomatic complexity. High complexity + low coverage = high change risk. Read more about CRAP.
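The standard CRAP formula is comp² × (1 − cov)³ + comp, with coverage as a fraction. A quick sketch shows why it behaves well as a signal:

```python
def crap_score(complexity: int, coverage: float) -> float:
    """CRAP = comp^2 * (1 - cov)^3 + comp, with coverage as a 0..1 fraction."""
    return complexity ** 2 * (1 - coverage) ** 3 + complexity

# High complexity with no tests is explosive...
print(crap_score(25, 0.0))  # → 650.0
# ...while full coverage reduces CRAP to the complexity itself.
print(crap_score(25, 1.0))  # → 25.0
```

The cubic term means coverage only tames the score when complexity is high; simple, untested code stays low-risk, and complex, untested code screams.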
Replace the old 85% project-level rule with:
- 85% minimum patch coverage = 85% of added/modified code must have test coverage
- Average CRAP cannot increase by more than 5% = Complexity stays roughly in check, or improves
Now you are ensuring code changes are covered by tests and not making the codebase unnecessarily complex. That is a real quality signal.
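As a sketch, the two rules boil down to a tiny gate function. The function names and thresholds here are illustrative, not any specific CI product's API:

```python
# Hypothetical CI gate implementing the two replacement rules.
def gate(patch_cov: float, crap_before: float, crap_after: float) -> bool:
    """Pass a PR only if it meets both quality rules."""
    if patch_cov < 0.85:
        return False                  # rule 1: 85% minimum patch coverage
    if crap_after > crap_before * 1.05:
        return False                  # rule 2: avg CRAP may rise at most 5%
    return True

print(gate(patch_cov=1.0, crap_before=100.0, crap_after=96.0))  # → True
```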
But we are still not addressing the "I can just write bad tests and tick the coverage box" argument.
Code review as a quality gate
This is where most teams drop the ball. "Foster a good testing culture" is advice everyone gives and nobody acts on. Here are four things that actually work:
1. Review tests before code.
Open the test file first. Read the test names like a spec. Do the assertions match the described behavior? Only then look at the implementation. If the tests do not make sense on their own, the implementation review is premature.
Example PR comment:
"The tests cover the happy path, but what happens when the payment gateway returns a timeout? Can you add a test that asserts the retry behavior?"
2. Name tests as behavioral specs.
Good: it calculates shipping cost with multiple discount codes applied
Bad: test_calculateShipping_3
A test name should tell you exactly what broke when it fails. If your test names read like a feature spec, your test suite becomes documentation.
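In code, the difference is immediate. The `calculate_shipping` function below is a made-up stand-in just to make the example runnable:

```python
# Hypothetical calculate_shipping, defined only to make the test runnable.
def calculate_shipping(subtotal: float, discount_codes: list[str]) -> float:
    cost = 5.0 if subtotal < 50 else 0.0
    for code in discount_codes:
        if code == "FREESHIP":
            cost = 0.0
    return cost

# Good: the name reads like a spec line and tells you exactly what broke.
def test_it_calculates_shipping_cost_with_multiple_discount_codes_applied():
    assert calculate_shipping(20.0, ["FREESHIP", "SAVE10"]) == 0.0

test_it_calculates_shipping_cost_with_multiple_discount_codes_applied()
```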
3. Automate the threshold, humanize the review.
Use automated status checks to enforce patch coverage and CRAP thresholds on every PR. This removes the human cost of "policing" coverage numbers. The reviewer can focus entirely on test quality — does this test catch real bugs? — because the tool already handles quantity.
4. Celebrate good tests.
Do not only flag missing coverage. When a teammate writes a great edge case test, say so:
Nice catch on the null input edge case — this is exactly the kind of test that prevents production bugs.
This shifts culture from punitive ("your coverage is too low") to constructive ("great test, more like this"). That shift is worth more than any metric.
Story points are a useful analogy here. A manager who says "are you sure this is an 8? Can it be a 5?" is using story points wrong. Same goes for "number high = code good." Coverage is a tool for quality conversations, not a leaderboard.
With OtterWise, you can track Code Coverage, contributor stats, code quality, and much more.
Free for open source
Beyond coverage: complementary metrics
Frankly, the tooling has been lackluster, and that is part of why code coverage is misused so often. Coverage alone was never supposed to carry the entire quality burden.
Mutation testing
Mutation testing is the nuclear option against the "empty test" argument. It works by making small changes (mutations) to your code — flipping a condition, changing a return value, removing a line — and then running your test suite. If a test fails, the mutant was "killed." If all tests pass, the mutant "survived," which means your tests missed a real code change.
A test with no assertions would let every mutant survive. Mutation testing exposes those tests immediately.
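A toy example makes the mechanism concrete. Here the "mutation" is applied by hand for illustration; real mutation tools generate and run mutants automatically:

```python
# Toy mutation test: flip one comparison operator, then see which tests notice.
def is_adult(age: int) -> bool:
    return age >= 18

def mutant_is_adult(age: int) -> bool:
    return age > 18              # mutation: >= flipped to >

def assertion_free_test(fn) -> bool:
    fn(30)                       # calls the function but asserts nothing
    return True                  # "passes" no matter what the code does

def real_test(fn) -> bool:
    return fn(18) is True        # the boundary case kills the mutant

# The assertion-free test lets the mutant survive; the real test kills it.
print(assertion_free_test(mutant_is_adult))  # → True  (mutant survived)
print(real_test(mutant_is_adult))            # → False (mutant killed)
```

Both tests give `is_adult` 100% line coverage, but only the second one would stop a real bug. That gap is exactly what mutation testing measures.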
So why not use it on every PR? Because it is expensive. Mutation testing runs your test suite once per mutation, which can mean hundreds or thousands of test runs for a single change. That is too slow for CI on every pull request.
The practical approach: Run mutation testing quarterly or on critical code paths (payment processing, authentication, data integrity). Use patch coverage and CRAP for daily, per-PR quality checks. Mutation testing is an audit tool, not a gate.
Churn-weighted coverage risk
Tests exist to ensure functionality works as intended, and more importantly: that code changes do not break existing functionality.
Code that never changes does not need 100% coverage. A config file untouched for six months with 0% coverage? Low risk. A payment processing module changed 47 times this quarter with 22% coverage? That is a ticking time bomb.
Churn-weighted coverage risk combines how often a file changes with how well it is tested. High churn + low coverage = highest risk. This is where your testing effort should go first.
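One simple way to combine the two signals is a product: commits touching the file times its uncovered fraction. The weighting below is a sketch, not OtterWise's actual scoring; the inputs are assumed to come from your VCS history and coverage report:

```python
# Sketch of a churn-weighted risk score. Inputs are assumed to come from
# commit counts per file (e.g. git history) and per-file line coverage.
def risk_score(churn: int, coverage: float) -> float:
    """Higher churn and lower coverage both raise the risk."""
    return churn * (1 - coverage)

files = {
    "config/settings.py": (0, 0.0),     # untouched for months, 0% coverage
    "billing/payments.py": (47, 0.22),  # changed 47 times, 22% coverage
}
ranked = sorted(files, key=lambda f: risk_score(*files[f]), reverse=True)
print(ranked[0])  # → billing/payments.py: test this file first
```

Note how the untouched config file scores zero despite 0% coverage, while the hot payment module shoots to the top of the list.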
OtterWise's risk reports surface these files automatically, so you can prioritize testing where it matters most instead of chasing a global coverage number.
The full picture
Different metrics serve different purposes at different cadences:
| Metric | Cadence | Scope | What it catches |
|---|---|---|---|
| Patch coverage | Every PR | Changed code | Untested new/modified code |
| CRAP score | Every PR | Changed code | Complex code with low coverage |
| Churn-weighted risk | Weekly/Monthly | Project-wide | High-risk files that need tests |
| Mutation testing | Quarterly | Critical paths | Tests that pass without catching bugs |
No single metric tells the full story. Used together, they form a system that actually catches quality problems.
Why this matters more in the AI era
AI coding assistants — Copilot, Cursor, Claude — are changing how fast teams ship code. Developers are generating more code per day than ever before. That is mostly a good thing. But it has a side effect: the volume of untested code grows proportionally if coverage discipline does not keep up.
AI-generated code has a specific failure mode: it looks correct. It passes a glance review. It often handles the happy path well. But it frequently misses edge cases, error handling, and subtle integration issues that a human developer would have caught while writing the code by hand.
"Vibe coding" — shipping AI output with minimal review — makes automated quality gates more critical, not less. When you are moving faster, your guardrails need to be stronger.
And yes, AI can write tests too. But AI-written tests need validation just as much as AI-written code. Patch coverage confirms the tests run through the code. CRAP scores flag when AI generates something needlessly complex. Mutation testing (even periodically) reveals whether the AI-written tests actually assert meaningful behavior.
The teams that treat coverage as a legacy metric are the ones most likely to get burned by AI-generated code slipping through without proper tests. Automated coverage tracking is not a relic — it is more relevant than ever.
You were measuring code coverage wrong
Not because coverage is useless — because you were using a single number as a checkbox instead of a system of metrics that actually catches problems.
Track patch coverage on every PR. Monitor complexity with CRAP scores. Identify high-risk files with churn-weighted analysis. Audit test quality with mutation testing. And make code review about test quality, not test quantity.
That is how you measure code coverage right.