Change failure rate is lying to you if your tests are flaky

If flaky tests trigger rollbacks or reruns, your change failure rate is counting noise as incidents. Here's how to fix the signal before it reaches leadership.

BuildPulse Team

June 22, 2026

The number your VP is staring at might be fiction

Your change failure rate is on the Q3 engineering health slide. It ticked up two points last month. Someone in the room says "we need to slow down and stabilize." Someone else says deployments have been fine — nothing actually broke in production. Both people are looking at the same number and reaching opposite conclusions.

That's not a communication problem. That's a data quality problem. And in most engineering orgs running CI pipelines with at least a handful of flaky tests, change failure rate (CFR) — the DORA metric that measures what fraction of deployments cause a production incident or require a rollback — is quietly absorbing noise it was never designed to handle.

What CFR is actually measuring (and what it thinks it's measuring)

DORA defines a "failure" in CFR as a deployment that results in a degraded-service event or requires remediation — a hotfix, a rollback, a patch deployment. The intent is to capture real production failures: a deploy that took down checkout, a regression that corrupted user data, a config change that pegged CPU.

The problem is how most teams operationalize the measurement. In practice, CFR gets computed from whatever signals are easiest to collect:

  • Pipeline failures that gate the deployment
  • Deployment automation that marks a release as "failed" when health checks don't pass
  • Manual rollback events triggered by on-call engineers
  • Re-deployment events within N hours of an initial deploy

Every one of those signals is corruptible by flaky tests. A test that fails randomly in the post-deploy smoke suite will mark an otherwise-healthy deployment as failed. A flaky integration test in your rollout health check will trigger an automated rollback — not because anything broke, but because the test had a bad day. The deployment automation doesn't know the difference between "this broke production" and "this test is untrustworthy."

Your CFR calculation doesn't know either.

The three ways flakiness inflates CFR

1. Automated rollbacks from flaky post-deploy checks

This is the most direct path. You run a smoke suite or canary validation after each deployment. A flaky test fails. Your rollout tooling (Argo Rollouts, Spinnaker, a custom shell script) marks the deployment as failed and rolls back. The event gets logged. CFR goes up. Nothing was actually wrong.

I've watched teams spend a quarter chasing a "CFR spike" that turned out to be a single Selenium test that failed whenever the canary environment had more than 80% CPU load — which is exactly when canary environments run post-deploy checks.

2. Manual reruns that shadow-increment failure counts

Your pipeline fails in staging or pre-prod. An engineer clicks "Retry." It passes. They deploy. But if your CFR tooling is watching pipeline outcomes rather than deployment outcomes, that initial failure may already be counted. Worse: if the rerun-to-pass pattern is common, you have a quiet convention that "one retry is normal" — which means real failures get one free rerun before anyone pays attention. The signal is eroded in both directions.
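One way to see how common the rerun-to-pass pattern is: collapse retries per commit before counting anything. The sketch below is a minimal illustration, assuming you can export pipeline attempts as (commit, attempt number, passed) tuples. The tuple layout and function names are hypothetical, not taken from any particular CI system.

```python
from collections import defaultdict

# Hypothetical export format: (commit_sha, attempt_number, passed)
Attempt = tuple[str, int, bool]


def group_attempts(attempts: list[Attempt]) -> dict[str, list[bool]]:
    """Group pass/fail results by commit, ordered by attempt number."""
    by_commit: dict[str, list[tuple[int, bool]]] = defaultdict(list)
    for sha, number, passed in attempts:
        by_commit[sha].append((number, passed))
    return {sha: [p for _, p in sorted(runs)] for sha, runs in by_commit.items()}


def rerun_to_pass_commits(attempts: list[Attempt]) -> list[str]:
    """Commits whose first attempt failed and whose last attempt passed."""
    grouped = group_attempts(attempts)
    return sorted(
        sha
        for sha, results in grouped.items()
        if len(results) > 1 and not results[0] and results[-1]
    )


attempts = [
    ("a1", 1, False), ("a1", 2, True),   # failed, then passed on retry
    ("b2", 1, True),                     # passed first time
    ("c3", 1, False), ("c3", 2, False),  # failed twice
]
print(rerun_to_pass_commits(attempts))  # ['a1']
```

If your CFR tooling counts the first attempt for a commit like `a1` as a failure, that is noise entering the metric. Counting only the final outcome per deployed commit is one way to keep it out.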

3. Revert commits that look like rollbacks

A flaky test fails on a branch, and someone reverts the merge on the assumption that the failure is real. The revert shows up as a rollback event. CFR picks it up. Again: nothing failed in production. The production system was never touched. But the metric moved.

Why this is especially painful in compliance-conscious environments

If you're operating under SOC 2 Type II, ISO 27001, or a regulated environment with change-management controls, CFR isn't just a dashboard number — it's potentially part of your audit trail. Change records with "rollback" in the resolution field carry weight in evidence reviews. Auditors reading a change log don't parse the difference between "rolled back due to flaky test" and "rolled back due to production incident." Both look like failures.

Getting that wrong in your controls documentation is a different kind of problem than just having a noisy engineering metric. It's worth being deliberate about what your change-management tooling records as a failure versus what your engineering metrics tooling counts as one.

Cleaning the signal: a practical approach

The fix isn't to stop measuring CFR. It's to stop feeding it garbage. Here's the sequence that actually works:

Step 1: Tag flaky test failures explicitly at the pipeline level

Before you can exclude flaky noise from CFR, you need the pipeline to know what a flaky failure looks like. This requires a feedback loop from your test analytics back into CI.

In GitHub Actions, you can do this with job outputs and step conditions:

jobs:
  smoke-tests:
    runs-on: ubuntu-latest
    outputs:
      # Empty when the smoke suite passed; 'flaky' or 'real' when it failed
      failure_type: ${{ steps.classify.outputs.failure_type }}
    steps:
      - name: Run smoke suite
        id: smoke
        run: ./scripts/run-smoke-tests.sh
        # Keep the job going so the failure can be classified below
        continue-on-error: true

      - name: Classify failure
        id: classify
        # 'outcome' is the step result before continue-on-error is applied
        if: steps.smoke.outcome == 'failure'
        run: |
          # Query your test analytics API for known-flaky test IDs
          # that appeared in this run's JUnit XML output
          python scripts/classify_failure.py --junit-xml test-results.xml \
            --output-env GITHUB_OUTPUT

  deploy-gate:
    needs: smoke-tests
    runs-on: ubuntu-latest
    steps:
      # Skipped (so the gate passes) when the failure was classified as flaky
      - name: Block on real failures only
        if: needs.smoke-tests.outputs.failure_type == 'real'
        run: exit 1

The classify_failure.py script compares failed test IDs against a list of known-flaky tests from your test analytics platform. If every failing test in the run is a known flake, the failure is tagged flaky. If any failing test is not known-flaky, it's tagged real.

This is exactly the kind of quarantine signal BuildPulse exposes via API — a machine-readable list of tests that have exhibited non-deterministic behavior across recent runs, which you can query at deploy time.
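For illustration, the classification logic could look something like the sketch below. It is not the actual classify_failure.py: the function names and the `classname::name` test ID format are assumptions, and fetching the known-flaky list from your analytics platform is left out. One conservative choice made here: a failed run with no failing test cases in the XML (a crash, say) is tagged real, because it can't be attributed to a known flake.

```python
import xml.etree.ElementTree as ET


def failed_test_ids(junit_xml: str) -> set[str]:
    """Return 'classname::name' for each <testcase> with a <failure> or <error> child."""
    root = ET.fromstring(junit_xml)
    failed = set()
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.add(f"{case.get('classname', '')}::{case.get('name', '')}")
    return failed


def classify(failed: set[str], known_flaky: set[str]) -> str:
    """'flaky' only if every failing test is a known flake; otherwise 'real'."""
    if not failed:
        # Assumption: the suite failed without a failing test case, so stay conservative
        return "real"
    return "flaky" if failed <= known_flaky else "real"


def write_output(path: str, failure_type: str) -> None:
    """Append a step output line; in GitHub Actions, path is the file named by $GITHUB_OUTPUT."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"failure_type={failure_type}\n")


sample = """
<testsuite>
  <testcase classname="smoke.Checkout" name="test_pay"><failure message="timeout"/></testcase>
  <testcase classname="smoke.Login" name="test_ok"/>
</testsuite>
"""
known_flaky = {"smoke.Checkout::test_pay"}
print(classify(failed_test_ids(sample), known_flaky))  # flaky
```

The subset check (`failed <= known_flaky`) is what implements the "every failing test is a known flake" rule: a single unknown failure is enough to tag the run real.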

Step 2: Separate "pipeline outcome" from "deployment health" in your change records

Your change-management tooling (Jira, ServiceNow, a homegrown CMDB) should distinguish between:

  • Pipeline failure: the CI/CD pipeline exited non-zero before or after a deployment
  • Deployment failure: the deployed artifact caused a degraded production state

These are not the same thing. A deployment that passed CI and caused a production outage is a CFR event. A deployment that was blocked by a flaky test and never reached production is not. A deployment that reached production and triggered an automated rollback from a flaky health check is not — assuming you've tagged the health check failure correctly.

If your change records can't carry a failure_reason field that distinguishes these, add one. It's a one-time schema change that will save you from misread quarterly reviews for years.
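As a rough sketch of what that field might carry, here is one possible shape for a change record. The enum values, the record fields, and the `counts_toward_cfr` rule are all illustrative assumptions; map them onto whatever your change-management tool actually supports.

```python
from dataclasses import dataclass
from enum import Enum


class FailureReason(Enum):
    # Hypothetical value set; extend it to fit your own process
    NONE = "none"
    FLAKY_TEST = "flaky_test"
    PRODUCTION_INCIDENT = "production_incident"


@dataclass(frozen=True)
class ChangeRecord:
    change_id: str
    reached_production: bool
    rolled_back: bool
    failure_reason: FailureReason


def counts_toward_cfr(record: ChangeRecord) -> bool:
    """Only a change that reached production and degraded it is a CFR event."""
    return (
        record.reached_production
        and record.failure_reason is FailureReason.PRODUCTION_INCIDENT
    )


outage = ChangeRecord("CHG-1", True, True, FailureReason.PRODUCTION_INCIDENT)
blocked_by_flake = ChangeRecord("CHG-2", False, False, FailureReason.FLAKY_TEST)
flaky_rollback = ChangeRecord("CHG-3", True, True, FailureReason.FLAKY_TEST)

for rec in (outage, blocked_by_flake, flaky_rollback):
    print(rec.change_id, counts_toward_cfr(rec))
```

Note that `rolled_back` stays on the record either way. The rollback still happened and your audit trail should say so; the reason field is what keeps it out of the failure count.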

Step 3: Recompute CFR with a flake-exclusion filter

Once you have tagged failure data, recompute CFR over your last 90 days with and without flaky-attributed events. The gap between those two numbers is your flake tax — the amount by which your CFR has been overstated.
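A minimal sketch of that recomputation, assuming each deployment is exported as a dict with a `failed` flag and an optional `failure_reason` (both field names are hypothetical). Flake-attributed failures are treated as successful deployments here, so they stay in the denominator; that is a design choice, not the only defensible one. The numbers are made up for illustration.

```python
def change_failure_rate(deployments: list[dict], exclude_flaky: bool = False) -> float:
    """Failed deployments divided by all deployments.

    With exclude_flaky=True, failures whose failure_reason is "flaky_test"
    are not counted as failures, but still count as deployments.
    """
    if not deployments:
        return 0.0
    failures = 0
    for d in deployments:
        if not d["failed"]:
            continue
        if exclude_flaky and d.get("failure_reason") == "flaky_test":
            continue
        failures += 1
    return failures / len(deployments)


# Illustrative data only: 50 deployments, 6 marked failed, 2 of those traced to flaky tests
deployments = (
    [{"failed": False}] * 44
    + [{"failed": True, "failure_reason": "production_incident"}] * 4
    + [{"failed": True, "failure_reason": "flaky_test"}] * 2
)

raw = change_failure_rate(deployments)
adjusted = change_failure_rate(deployments, exclude_flaky=True)
print(f"raw CFR: {raw:.1%}")                      # raw CFR: 12.0%
print(f"flake-adjusted CFR: {adjusted:.1%}")      # flake-adjusted CFR: 8.0%
print(f"flake tax: {(raw - adjusted) / raw:.0%} of the raw figure")  # flake tax: 33% of the raw figure
```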

For most teams that haven't actively managed test flakiness, this gap is 15–40% of the raw figure. If your CFR is sitting at 12%, the real number might be anywhere from about 7% to 10%. At the low end of that range, it's the difference between "we have a stability problem" and "our shipping process is actually pretty healthy."

Publishing both numbers on your engineering health dashboard — raw CFR and flake-adjusted CFR — is more honest than picking one. It also creates a visible incentive to drive the flaky test count down: as you fix flakes, the two numbers converge.

The deeper issue: test reliability is a prerequisite for trustworthy metrics

CFR is just the most visible victim. Mean time to recovery gets distorted when engineers can't tell whether a failing health check is real. Deployment frequency gets distorted when teams slow their release cadence to avoid triggering flaky post-deploy checks. The DORA metrics framework assumes your automation is giving you honest signals. The moment your CI infrastructure is unreliable, every metric downstream of it becomes suspect.

This is the argument I'd make to any engineering leader who's treating their DORA numbers as ground truth: your metrics are only as good as the infrastructure that generates them. If you haven't actively measured and managed flaky tests, you don't know how distorted those numbers are. You're flying with a miscalibrated altimeter.

Fixing flaky tests isn't just about developer experience — though the developer experience cost is real. It's about the integrity of the signals you use to make engineering leadership decisions. Slowing down your release cadence because CFR is high is a strategic call. Making that call based on noise is expensive.

What to do this week

Three concrete things:

  1. Audit your last 30 rollback events. For each one, trace the trigger. How many originated from a test failure rather than a production health signal? That's your starting data point.

  2. Pull your known-flaky test list and cross-reference it against your post-deploy check suite. If any of the same tests are in both places, you have a direct path from flaky test to inflated CFR — and a quick win by either fixing or quarantining those tests.

  3. Add a failure_reason field to your change records so future rollbacks carry enough metadata to distinguish flake-driven from incident-driven. You want this data before you need it, not after.
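The first two items are mostly counting and set intersection. A small sketch, assuming you can export rollback events with a `trigger` label and both test lists as plain test IDs (the field name, trigger labels, and IDs below are hypothetical):

```python
from collections import Counter


def tally_triggers(rollbacks: list[dict]) -> Counter:
    """Count rollback events by what triggered them."""
    return Counter(r["trigger"] for r in rollbacks)


def flaky_post_deploy_tests(known_flaky: list[str], post_deploy_suite: list[str]) -> list[str]:
    """Tests that are both known-flaky and part of the post-deploy check suite."""
    return sorted(set(known_flaky) & set(post_deploy_suite))


# Illustrative data only
rollbacks = [
    {"id": "RB-1", "trigger": "test_failure"},
    {"id": "RB-2", "trigger": "production_alert"},
    {"id": "RB-3", "trigger": "test_failure"},
]
known_flaky = ["smoke.Checkout::test_pay", "unit.Cart::test_rounding"]
post_deploy_suite = ["smoke.Checkout::test_pay", "smoke.Login::test_ok"]

print(tally_triggers(rollbacks)["test_failure"])                # 2
print(flaky_post_deploy_tests(known_flaky, post_deploy_suite))  # ['smoke.Checkout::test_pay']
```

Any test ID in the second result is a direct path from a flaky test to a rollback event.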

Your CFR can be a genuinely useful signal. It's not there yet if you haven't cleaned the test reliability layer underneath it.