The real cost of a 5% test failure rate

A 5% flaky test rate sounds manageable. Run the numbers on rerun compute, engineer interruptions, and delayed releases, and it stops sounding that way.

BuildPulse Team

June 19, 2026

The number that sounds fine until you do the math

Five percent. If someone told you 5% of your pull requests had a bug, you'd stop the line. But when 5% of your CI test runs fail due to flakiness, most teams treat it as background noise — a rounding error, not a budget line.

I've watched engineering orgs absorb that noise for months, sometimes years, before someone finally asked: what is this actually costing us? The answer is never comfortable. And if you're at a company where CI gates are part of your change-management controls (SOC 2, ISO 27001, FedRAMP), the cost isn't just money. It's audit exposure and a documented process that's quietly broken.

This post builds a model you can populate with your own numbers. None of this is hypothetical. Every bucket here is a cost I've seen show up in real post-mortems.

What "5% failure rate" actually means in practice

Before the model, let's define the input. A 5% test failure rate means: across all test suite runs on all branches and PRs in a given month, one in twenty exits non-zero due to a test that isn't deterministically failing — it just failed this time.

For a team triggering 200 suite runs per working day, with a suite that takes 15 minutes end-to-end, that's 10 failed runs per day that need a human decision: retry, investigate, or override.

Now let's put prices on each of those decisions.

Cost bucket 1: Rerun compute

This one shows up directly on your cloud or CI bill, which makes it the easiest to defend in a budget conversation.

Assume:

  • 200 suite runs per working day (about 4,400 a month over ~22 working days)
  • 15-minute suite, running on 4 parallel workers
  • $0.008 per worker-minute (roughly the long-standing GitHub Actions rate for a standard Linux runner; substitute your own)
  • 5% of runs fail from flakiness, and each failure triggers one full rerun

Monthly runs:          200/day × ~22 working days = 4,400
Failed runs (5%):      10/day × ~22 working days = 220/month
Compute per run:       15 min × 4 workers = 60 worker-min
Cost per rerun:        60 × $0.008 = $0.48
Monthly rerun cost:    220 × $0.48 = ~$106

$106/month sounds trivial. Scale to a 50-engineer org with five teams each running their own suites, and you're at $530/month — over $6,000/year — just for the retries that don't require human review. When you add matrix builds, preview deployments, and integration suites, that multiplier gets uncomfortable fast.

And this assumes one retry per failure. Teams with deeply entrenched flakiness often configure two or three automatic retries, which doubles or triples the number above.
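The arithmetic in this section can be sketched in a few lines of Python. This is a rough illustration, not a billing tool: `monthly_rerun_cost` and its parameter names are hypothetical, and the `retries_per_failure` argument is an added assumption that models the multiple-retry case.

```python
def monthly_rerun_cost(
    failed_runs_per_month: int,
    suite_minutes: float,
    workers: int,
    price_per_worker_minute: float,
    retries_per_failure: int = 1,
) -> float:
    """Dollars per month spent re-running suites that failed from flakiness."""
    worker_minutes_per_run = suite_minutes * workers
    cost_per_rerun = worker_minutes_per_run * price_per_worker_minute
    return failed_runs_per_month * retries_per_failure * cost_per_rerun


# Inputs from this section: 220 failed runs, 15-minute suite, 4 workers, $0.008/min.
one_team = monthly_rerun_cost(220, 15, 4, 0.008)
three_retries = monthly_rerun_cost(220, 15, 4, 0.008, retries_per_failure=3)

print(f"one team, one retry:     ${one_team:,.2f}/month")
print(f"five teams, one retry:   ${one_team * 5:,.2f}/month")
print(f"one team, three retries: ${three_retries:,.2f}/month")
```

The unrounded single-team figure is $105.60, which the text above rounds to $106 before scaling, so the five-team result here comes out slightly under $530.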

Cost bucket 2: Engineer time — the dominant cost

Compute is cheap. Engineers are not.

Every flaky failure that isn't auto-resolved creates an interruption. Someone has to look at it. Even a "probably flaky, re-running" decision costs time — context switch included.

Commonly cited estimates of developer interruption cost put a single context switch (leaving flow, triaging the failure, deciding it's safe to retry, returning to the original task) at 10–20 minutes. I'll use 12 minutes, toward the conservative end of that range.

Failed runs requiring human triage: 220/month
Time per triage:                    12 minutes
Total triage time:                  220 × 12 = 2,640 min/month = 44 hours
Fully-loaded engineer cost:         $150/hour (mid-senior IC, US)
Monthly engineer cost:              44 × $150 = $6,600

For a single team. $6,600/month is $79,200/year — for one team doing nothing but staring at red CI they already don't trust.

Scale that across five teams: $396,000 annually. That's a headcount. That is a person you could hire.
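The same triage arithmetic, as a minimal sketch you can rerun with your own inputs. The function name `monthly_triage_cost` is hypothetical, and it assumes, as the figures above do, that every flaky failure costs one triage interruption.

```python
def monthly_triage_cost(
    failed_runs_per_month: int,
    minutes_per_triage: float,
    loaded_hourly_rate: float,
) -> tuple[float, float]:
    """Return (hours, dollars) spent per month triaging flaky failures."""
    hours = failed_runs_per_month * minutes_per_triage / 60
    return hours, hours * loaded_hourly_rate


# Inputs from this section: 220 failures, 12 minutes each, $150/hour fully loaded.
hours, dollars = monthly_triage_cost(220, 12, 150)

print(f"{hours:.0f} hours/month, ${dollars:,.0f}/month, ${dollars * 12:,.0f}/year")
print(f"five teams: ${dollars * 12 * 5:,.0f}/year")
```

Swapping in a longer triage time (say, the 30-minute investigations mentioned below) shows how sensitive the total is to that one input.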

And that's just the triage cost. It doesn't count the cases where someone investigates for 30 minutes before concluding it's flaky, or where a flaky failure blocks a junior engineer for half a day because they assume they broke something.

Cost bucket 3: Delayed releases

This one is harder to quantify but often the largest in practice.

If a flaky test blocks a deployment pipeline and the on-call engineer isn't available to override it, the release waits. In a trunk-based team shipping multiple times a day, a two-hour block is a significant incident. In a team on a weekly release cadence, a flaky failure the morning of release day creates pressure to either delay or override controls — neither of which is free.

For compliance-conscious teams, "override the gate" isn't a casual decision. It requires documentation, approval, and sometimes a change-management ticket. The flaky test just generated 45 minutes of process overhead per occurrence in addition to the engineering interruption.

I won't put a formula on release delay because it depends too much on your deployment model. But ask yourself: how many times in the last quarter did a release slip by one day due to a CI issue that turned out not to be a real bug? Price that at the opportunity cost that matters to your business.

Cost bucket 4: Eroded trust — the compounding cost

This is the cost that doesn't show up anywhere until it's already done serious damage.

When engineers learn to ignore red CI, they stop using CI as a signal, and signal is the entire point of having CI. A team that has mentally written off their test suite is a team that will eventually ship a real bug under cover of flakiness: "oh, CI was probably just being flaky again."

I've seen this happen. A production incident traced back to a real regression that went unnoticed for two weeks because the suite had cried wolf enough times that nobody looked carefully anymore. The post-mortem finding: "engineers had low confidence in test results and were applying judgment about which failures to investigate." That judgment, however reasonable it seemed individually, was systematically wrong.

The remediation cost for a trust deficit isn't a line item — it's an engineering culture project that takes quarters, not sprints. Rebuilding confidence in CI after it's been eroded requires not just fixing the flaky tests but demonstrating a track record of reliability long enough that the learned behavior reverses.

Detecting and quarantining flaky tests before that trust erodes is orders of magnitude cheaper than fixing it after. This is the argument for treating flakiness as a first-class defect, not a rerun problem.

Putting it together: the model

Here's the full cost summary for a single 10-engineer team with a 5% flaky failure rate:

Cost bucket                      Monthly         Annual
───────────────────────────────────────────────────────
Rerun compute                       $106         $1,272
Engineer triage time              $6,600        $79,200
Release delays                       ???            ???
CI trust erosion (lagging)           ???            ???
───────────────────────────────────────────────────────
Conservative total                $6,706        $80,472

The question-mark rows are not zero. They're just harder to defend in a spreadsheet. In my experience they often exceed the engineer triage cost once you account for even a single production incident attributable to the trust deficit.

To adapt this to your org:

  1. Pull your actual CI run count for the last 30 days
  2. Multiply by your observed flaky failure rate (most CI platforms surface this; BuildPulse can give you this by test, not just by run)
  3. Apply your actual compute cost per run-minute and your fully-loaded engineering cost
  4. Count release pipeline blocks in the last quarter and price them honestly
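
The four steps above can be wired together in one small script. This is a sketch under stated assumptions: `FlakyCostInputs` and `monthly_costs` are hypothetical names, the release-block cost is whatever estimate you supply, and trust erosion is left out because there is no defensible formula for it. The example uses 4,400 runs a month, which is the volume that produces 220 failed runs at a 5% rate.

```python
from dataclasses import dataclass


@dataclass
class FlakyCostInputs:
    runs_per_month: int              # step 1: CI runs in the last 30 days
    flaky_failure_rate: float        # step 2: fraction of runs that fail from flakiness
    suite_minutes: float             # wall-clock length of one suite run
    workers: int                     # parallel workers per run
    price_per_worker_minute: float   # step 3: your compute rate
    minutes_per_triage: float        # engineer time lost per flaky failure
    loaded_hourly_rate: float        # step 3: fully-loaded engineering cost
    release_blocks_per_quarter: int = 0   # step 4: count from your own records
    cost_per_release_block: float = 0.0   # step 4: your own estimate


def monthly_costs(inputs: FlakyCostInputs) -> dict[str, float]:
    """Monthly dollars per cost bucket; trust erosion is deliberately left out."""
    failed_runs = inputs.runs_per_month * inputs.flaky_failure_rate
    rerun = (failed_runs * inputs.suite_minutes * inputs.workers
             * inputs.price_per_worker_minute)
    triage = failed_runs * inputs.minutes_per_triage / 60 * inputs.loaded_hourly_rate
    release = inputs.release_blocks_per_quarter * inputs.cost_per_release_block / 3
    return {"rerun compute": rerun, "engineer triage": triage, "release delays": release}


example = FlakyCostInputs(
    runs_per_month=4400, flaky_failure_rate=0.05, suite_minutes=15, workers=4,
    price_per_worker_minute=0.008, minutes_per_triage=12, loaded_hourly_rate=150,
)
costs = monthly_costs(example)
for bucket, monthly in costs.items():
    print(f"{bucket:<18}${monthly:>10,.2f}/month  ${monthly * 12:>12,.2f}/year")
print(f"{'total':<18}${sum(costs.values()):>10,.2f}/month")
```

With the release inputs left at zero, the total lands within a dollar of the table above; the gap comes from rounding the compute figure to $106.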

Why "just rerun it" is a debt strategy, not a solution

Automatic retries are seductive. They make the red go green, the PR unblocks, and nobody has to file a ticket. But every automatic retry is a vote to carry the debt forward, and like financial debt, flakiness compounds. If the underlying non-determinism is never addressed, a test that passes on retry 80% of the time today tends to drift toward needing two retries, then three, then failing even on retry.

The retry also masks the signal you need to fix the problem. If a test only fails once in five runs and you always retry, you may never accumulate enough failure data to diagnose the root cause. You need to know which tests are flaky, how often, and under what conditions — and retrying silently destroys that data.
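
To see how quickly retries hide a flaky test, here is a rough sketch. It assumes each attempt fails independently with the same probability, which is a simplification (real flakiness is often correlated with load or shared state), and `visible_failure_rate` is a hypothetical helper name.

```python
def visible_failure_rate(flake_rate: float, auto_retries: int) -> float:
    """Chance a flaky test still shows red after all automatic retries.

    Assumes each attempt fails independently with probability flake_rate,
    so the run is red only if the first attempt and every retry fail.
    """
    return flake_rate ** (auto_retries + 1)


# A test that fails once in five attempts, under 0 to 3 automatic retries.
for retries in range(4):
    rate = visible_failure_rate(0.2, retries)
    print(f"{retries} retries: red in {rate:.2%} of runs")
```

Under that assumption, a test that fails one attempt in five shows up red in under 1% of runs once two automatic retries are configured. Unless first-attempt failures are recorded somewhere, the evidence disappears with them.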

Tracking flaky tests systematically — by test name, failure rate, and failure pattern — is what turns a vague cultural problem ("CI is unreliable") into an engineering task with a ticket and an owner. That's the difference between a flaky test and a fixed one.
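
As a minimal sketch of what that tracking can look like, the snippet below flags a test as flaky when it both passed and failed on the same commit, then reports its failure rate across all recorded attempts. The `RunRecord` shape and the same-commit rule are assumptions for illustration, not a description of how any particular tool detects flakiness.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class RunRecord(NamedTuple):
    commit: str
    test_name: str
    passed: bool


def flaky_rates(results: Iterable[RunRecord]) -> dict[str, float]:
    """Failure rate per flaky test.

    A test counts as flaky if it both passed and failed on at least one
    commit (same code, different outcome). The rate is failures divided
    by all recorded attempts of that test.
    """
    outcomes: dict[tuple[str, str], list[bool]] = defaultdict(list)
    for record in results:
        outcomes[(record.test_name, record.commit)].append(record.passed)

    attempts: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    flaky: set[str] = set()
    for (name, _commit), passed_list in outcomes.items():
        attempts[name] += len(passed_list)
        failures[name] += passed_list.count(False)
        if True in passed_list and False in passed_list:
            flaky.add(name)

    return {name: failures[name] / attempts[name] for name in sorted(flaky)}


history = [
    RunRecord("abc123", "test_checkout", False),
    RunRecord("abc123", "test_checkout", True),   # passed on retry, same commit
    RunRecord("abc123", "test_login", True),
    RunRecord("def456", "test_checkout", True),
    RunRecord("def456", "test_search", False),    # fails every time: a real failure
    RunRecord("def456", "test_search", False),
]
print(flaky_rates(history))
```

Only `test_checkout` is reported here: `test_search` fails consistently, which makes it a real failure rather than a flaky one under this rule.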

The compliance angle

For teams operating under SOC 2 or ISO 27001, there's a dimension to this that goes beyond cost: your CI gate is a documented control. If that control is routinely bypassed (through manual overrides, suppressed failures, or retries that pass non-deterministically), you have a control that looks effective on paper and isn't in practice.

Auditors don't look at your flaky test rate. But they do look at override logs, change-management exceptions, and deployment approvals that bypass normal gates. A pattern of "CI was red, we overrode because it was probably flaky" is exactly the kind of finding that generates a management letter comment. Fixing your flaky tests is, in this light, a compliance hygiene item — not a nice-to-have.

What an acceptable failure rate actually looks like

The uncomfortable answer: for most teams, an acceptable flaky test rate is closer to 0.1% than 5%. That's not perfectionism — it's the threshold below which the noise doesn't meaningfully degrade CI signal. Google's internal research on test flakiness, which has been discussed publicly by their engineers, suggests that even a 1% flaky rate is enough to materially erode developer trust over time.

Getting from 5% to under 0.5% is a project, not an afternoon. It requires identifying which tests are flaky (harder than it sounds without systematic tracking), triaging them into quarantine so they don't block CI while being fixed, and then actually fixing or deleting them. Quarantine is the step most teams skip — and it's what prevents the "we're fixing flaky tests" project from itself becoming a source of CI disruption.
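
The quarantine step can be sketched as a simple gate: quarantined tests still run and still get reported, but only failures outside the quarantine list block the merge. The function name `should_block_merge` and the set-based inputs are hypothetical; a real setup would hook this into whatever your CI system uses to decide pass or fail.

```python
def should_block_merge(
    failed_tests: set[str],
    quarantined: set[str],
) -> tuple[bool, set[str]]:
    """Decide whether a run blocks the merge.

    Returns (block, quarantined_failures). Failures in quarantined tests
    are reported so they can still be tracked, but they do not block;
    any other failure does.
    """
    blocking = failed_tests - quarantined
    return bool(blocking), failed_tests & quarantined


quarantine_list = {"test_checkout_timeout"}

# Only a quarantined test failed: the run does not block, but the failure is kept.
print(should_block_merge({"test_checkout_timeout"}, quarantine_list))

# A non-quarantined test failed too: the run blocks.
print(should_block_merge({"test_checkout_timeout", "test_login"}, quarantine_list))
```

Keeping the quarantined failures in the return value matters: it preserves the failure data needed to fix the test, instead of hiding it the way a silent retry does.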

The math isn't subtle. A 5% flaky test rate costs a 10-engineer team about 44 hours a month in triage alone: roughly a quarter of one engineer's time, or about $79,000 a year, before you count compute, releases, or the slow-motion trust collapse that makes the whole thing worse. If that number doesn't show up anywhere in your engineering budget conversation, it's because nobody has run the model yet.

Now you have one.