Keeping AI-generated code from breaking your test suite

AI coding assistants raised your PR throughput — and quietly raised your flaky-test rate with it. Here's how to keep your CI signal trustworthy at the new volume.

BuildPulse Team

June 13, 2026

The volume problem nobody budgeted for

A director I talked to recently pulled up two charts side by side. The first: PRs merged per week, up about 35% since the team rolled out AI coding assistants. The second: CI failure rate on main, up almost exactly the same amount over the same period. Nobody had changed the test strategy. Nobody had changed the infrastructure. The team was just shipping more code — and more tests — than the existing review and CI practices were built to absorb.

Here's the uncomfortable part: the assistants were doing what everyone asked. They wrote tests. Lots of tests. The suite grew 30% in a quarter, which looked great in the coverage dashboard. But test count is not test quality, and AI assistants are exceptionally good at producing tests that pass on the machine where they were written, once.

If you're an engineering leader, this is the AI code-quality problem that actually lands on your desk. Not "will the model write a bug" — humans write bugs too, and your review process exists for that. The real problem is subtler: AI-generated tests fail differently than human-written ones, they arrive in volume, and they degrade the one signal your whole delivery process depends on — whether a red build means anything.

How AI-written tests go wrong

I've reviewed a lot of assistant-generated test code at this point. The failure modes are remarkably consistent, because they all come from the same root cause: the model optimizes for plausible test code, not correct test code. Four patterns show up over and over.

1. Timing and ordering assumptions

Ask an assistant to test async code and you'll frequently get something like this:

test('processes the upload queue', async () => {
  const queue = new UploadQueue();
  queue.enqueue(fakeFile);

  queue.start();
  await new Promise((r) => setTimeout(r, 500));

  expect(queue.completed).toHaveLength(1);
});

This passes on the developer's M3 laptop every time. It passes on a warm CI runner most of the time. It fails on a cold runner under load just often enough to make people start clicking the rerun button. The model has seen ten thousand examples of setTimeout-based test synchronization in its training data, so that's what it reaches for. A human senior engineer would await the actual completion promise or use fake timers. The model will too — but only if someone asks, and at 35% higher PR volume, fewer people are asking.
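For comparison, here is a minimal sketch of the "await the actual completion" approach, written in Python with asyncio rather than Jest. The `UploadQueue` class and its `drained()` method are hypothetical stand-ins invented for this illustration, not an API from the example above. The point is only that the test waits on a real completion signal and uses a timeout purely as a hang guard.

```python
import asyncio


class UploadQueue:
    """Hypothetical stand-in for the queue in the example above."""

    def __init__(self) -> None:
        self.pending = []
        self.completed = []
        self._worker = None

    def enqueue(self, name: str) -> None:
        self.pending.append(name)

    def start(self) -> None:
        # Kick off background processing; must be called inside a running loop.
        self._worker = asyncio.create_task(self._run())

    async def _run(self) -> None:
        while self.pending:
            await asyncio.sleep(0.01)  # simulated upload latency
            self.completed.append(self.pending.pop(0))

    async def drained(self) -> None:
        """Return once the worker has finished, however long that takes."""
        if self._worker is not None:
            await self._worker


async def check_processes_the_upload_queue() -> int:
    queue = UploadQueue()
    queue.enqueue("fake-file")
    queue.start()

    # Wait on the real completion signal. The timeout is an upper bound that
    # only matters if the code hangs; it is not a guess about runner speed.
    await asyncio.wait_for(queue.drained(), timeout=5)

    assert len(queue.completed) == 1
    return len(queue.completed)


if __name__ == "__main__":
    print(asyncio.run(check_processes_the_upload_queue()))
```

On a fast machine this finishes in milliseconds, and on a slow runner it simply takes longer instead of failing, which is the property the fixed 500 ms sleep lacks.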

2. Asserting against the mock

This one is insidious because the test looks rigorous:

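from decimal import Decimal

from billing import invoice


def test_calculate_invoice_total(mocker):
    mocker.patch(
        "billing.tax.get_rate",
        return_value=Decimal("0.08"),
    )
    # Anti-pattern: this replaces the very function the test claims to verify.
    mock_calc = mocker.patch(
        "billing.invoice.calculate_total",
        return_value=Decimal("108.00"),
    )

    # Looked up through the module, so this call hits the mock, not the real code.
    result = invoice.calculate_total(line_items)

    assert result == Decimal("108.00")
    mock_calc.assert_called_once()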

The test mocks the function under test and then asserts the mock returned what the mock was told to return. It will never fail, no matter what calculate_total actually does. It inflates the test count, and reviewers skimming the diff see assertions and move on. You now have a test that is pure liability: it adds CI time, adds maintenance surface, and certifies nothing.
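A version that can actually fail patches only the dependency and lets the function under test run for real. The sketch below is self-contained, so it makes some assumptions: `TaxService` and this `calculate_total` are hypothetical stand-ins for the `billing.tax` and `billing.invoice` code in the example, and it uses the standard library's `unittest.mock` instead of the pytest-mock `mocker` fixture.

```python
from decimal import Decimal
from unittest import mock


class TaxService:
    """Hypothetical external dependency (stand-in for billing.tax)."""

    @staticmethod
    def get_rate(region: str) -> Decimal:
        raise RuntimeError("network call; patch this in unit tests")


def calculate_total(line_items, region: str) -> Decimal:
    """Function under test: subtotal plus tax at the region's rate."""
    subtotal = sum(line_items, Decimal("0"))
    total = subtotal * (Decimal("1") + TaxService.get_rate(region))
    return total.quantize(Decimal("0.01"))


def test_total_includes_tax_at_region_rate() -> None:
    # Patch only the dependency. calculate_total itself is NOT mocked.
    with mock.patch.object(
        TaxService, "get_rate", return_value=Decimal("0.08")
    ) as get_rate:
        result = calculate_total([Decimal("60.00"), Decimal("40.00")], "CA")

    # 100.00 of line items at 8% tax: fails if the arithmetic is wrong.
    assert result == Decimal("108.00")
    get_rate.assert_called_once_with("CA")


if __name__ == "__main__":
    test_total_includes_tax_at_region_rate()
    print("ok")
```

If someone changes `calculate_total` to drop the tax or round differently, this test goes red, which is the whole job of a test.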

3. Change-detector tests

Assistants love snapshot-style assertions because they're easy to generate: run the code, record the output, assert the output. The result is a test that doesn't encode intent — it encodes current behavior, bugs included. Six weeks later someone makes a legitimate change, forty snapshot tests fail, and the fix is --update-snapshots applied without reading. The test suite has become a ritual rather than a control.

4. Tests that enshrine the bug

The sneakiest case: the assistant writes the implementation and the test in the same session, so the test faithfully verifies whatever the implementation does — including the off-by-one, the wrong rounding mode, the timezone assumption. The test is green, the coverage is high, and the bug now has a unit test guarding it against being fixed.

Why this hits your CI signal hardest

Each of these failure modes existed before AI assistants. What's new is the rate. Flakiness in a test suite scales roughly with test count and test complexity; reviewer attention does not scale at all. When the suite grows 30% in a quarter and the share of tests written by a model grows from zero to half, the math is straightforward: more flaky tests enter the suite per week than your team has ever had to triage.

And the default human response makes it worse. When a test fails and the developer is confident their change is unrelated (which, with AI-generated tests they didn't write and don't recognize, is nearly always), they rerun the job. The rerun culture that follows is corrosive everywhere, but if you're in a SOC 2 or change-managed environment, it's worse than corrosive. Your required CI checks are change-management controls. If the de facto process is "rerun until green, then merge," the control still exists on paper, but what it actually attests has quietly changed from "the tests passed" to "the tests passed eventually." That's a gap you'd rather find yourself than have someone else find for you.

We've written before about what flaky tests actually cost teams — the short version is that the damage isn't the rerun minutes, it's the erosion of trust. Once developers stop believing red means broken, you've lost the thing CI exists to provide, and no amount of AI-assisted throughput compensates for shipping on a signal nobody trusts.

Gates that hold at the new volume

The answer is not "ban AI-generated tests." That ship has sailed, and honestly, assistants writing the boring 80% of test scaffolding is one of the genuinely good uses of the technology. The answer is to assume more test code, written faster, with less per-line human scrutiny — and build gates that hold under those conditions.

Review test code like it's production code — because it is

The biggest single change is cultural: test files in a PR get read, not skimmed. Give reviewers a short, concrete checklist for AI-written tests:

Can this test fail? Mentally break the implementation. If no assertion would catch it, the test is decorative.
What does it mock? If the function under test appears in a mock or patch call, reject it.
How does it wait? Any fixed sleep or arbitrary timeout in async code is a flake waiting for a slow runner.
Does it assert intent or behavior? "The output equals this blob" is a change detector. "The total includes tax at the customer's rate" is a test.
Would the test have caught a plausible bug in this PR? If the test and the code were generated together, this is the question that catches enshrined bugs.

Five items. A reviewer can apply them in two minutes per test file, and between them they cover all four failure patterns described above.

Detect flakiness systematically, because you can't eyeball it anymore

At human-scale test growth, a senior engineer's memory was a workable flake detector — "oh yeah, that checkout spec, it does that." At AI-scale growth, that breaks down completely. Nobody recognizes tests nobody wrote. You need detection that watches every test result across every build and flags the tests that fail and pass on the same commit.

That starts with actually capturing results. If your CI throws away JUnit XML on failure, fix that first:

- name: Run tests
  run: pnpm test -- --reporters=jest-junit

- name: Upload test results to BuildPulse
  if: always()  # capture results on failure too — that's the whole point
  uses: buildpulse/buildpulse-action@main
  with:
    account: ${{ env.BUILDPULSE_ACCOUNT_ID }}
    repository: ${{ env.BUILDPULSE_REPOSITORY_ID }}
    path: test-results/**/*.xml
    key: ${{ secrets.BUILDPULSE_ACCESS_KEY_ID }}
    secret: ${{ secrets.BUILDPULSE_SECRET_ACCESS_KEY }}

With historical results in one place, the question "is this failure my change or a known flake?" gets answered by data instead of by vibes — which also means the answer is defensible, a property your compliance team will appreciate more than they'll say out loud.
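The core detection rule, "fails and passes on the same commit," is simple enough to sketch. The following is an illustration of the idea only, not how BuildPulse or any other tool implements it; the `Result` record and `find_flaky` function are hypothetical names, and real input would come from parsed JUnit XML rather than a hand-written list.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Result(NamedTuple):
    commit: str
    test_id: str
    passed: bool


def find_flaky(results: Iterable[Result]) -> dict:
    """Return {test_id: {commits on which it both passed and failed}}."""
    outcomes = defaultdict(set)
    for r in results:
        outcomes[(r.test_id, r.commit)].add(r.passed)

    flaky = defaultdict(set)
    for (test_id, commit), seen in outcomes.items():
        if seen == {True, False}:  # same code, different verdicts
            flaky[test_id].add(commit)
    return dict(flaky)


history = [
    Result("abc123", "test_checkout", True),
    Result("abc123", "test_checkout", False),  # rerun of the same commit
    Result("abc123", "test_login", False),
    Result("def456", "test_login", True),      # different commit: a real fix
]

if __name__ == "__main__":
    print(find_flaky(history))
```

Here `test_checkout` is flagged because it disagreed with itself on `abc123`, while `test_login` is not, since its failure and its pass happened on different commits.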

Quarantine instead of rerun

When a flaky test is identified, the wrong move is letting every developer independently rediscover it via rerun. The right move is quarantining it: pull it out of the merge-blocking path, keep running it, and put it in a triage queue with an owner and a deadline. Quarantine converts an ambient tax on every engineer into a bounded, visible work item. It also restores the meaning of red: if a non-quarantined test fails, your change broke it. Full stop.
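As a rough sketch of what that gate looks like in code, assuming a quarantine list kept as simple records with an owner and a due date (the `QuarantineEntry` type and `evaluate_build` function are hypothetical, invented for this illustration):

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QuarantineEntry:
    test_id: str
    owner: str
    due: date  # deadline for fixing or deleting the test


def evaluate_build(failed, quarantine, today):
    """Split failures into merge-blocking and tolerated (quarantined) ones."""
    by_id = {entry.test_id: entry for entry in quarantine}
    blocking = sorted(t for t in failed if t not in by_id)
    tolerated = sorted(t for t in failed if t in by_id)
    # Quarantine is a bounded work item, so surface entries past their deadline.
    overdue = sorted(e.test_id for e in quarantine if e.due < today)
    return {
        "mergeable": not blocking,
        "blocking": blocking,
        "tolerated": tolerated,
        "overdue": overdue,
    }


quarantine = [
    QuarantineEntry("test_upload_queue", owner="dana", due=date(2026, 6, 1)),
    QuarantineEntry("test_checkout", owner="lee", due=date(2026, 7, 1)),
]

if __name__ == "__main__":
    report = evaluate_build(
        failed={"test_upload_queue", "test_invoice_total"},
        quarantine=quarantine,
        today=date(2026, 6, 13),
    )
    print(report)
```

The quarantined test still runs and still shows up in the report, but only the non-quarantined failure blocks the merge, and the overdue list keeps the quarantine from turning into a permanent parking lot.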

For AI-generated tests specifically, quarantine triage often has a pleasantly cheap resolution: the test was low-value to begin with, and the fix is deletion. Be comfortable with that. A suite that grew 30% and then shed the worst 5% is healthier than one that kept everything.

Stop trusting coverage as a quality metric

Coverage was always a weak proxy for test quality. With AI-generated tests, it's actively misleading: a test can execute every line of a function and assert almost nothing about the result, and the coverage number rises either way. The mock-asserting test in the example above is the limiting case, since it never runs the real function at all and still shows up as a passing test. If you report coverage to leadership or auditors, pair it with something that measures assertion strength: mutation testing on critical paths, or at minimum periodic spot-audits where someone breaks the code on purpose and counts how many tests notice. If the number is embarrassing, better to learn it in an exercise than in an incident.
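To make "break the code on purpose and count how many tests notice" concrete, here is a hand-rolled sketch of a mutation spot-audit. Everything in it is hypothetical and deliberately tiny: two hand-written mutants of a `total_with_tax` function, one weak test, and one strong test. Dedicated tools such as mutmut (Python) or Stryker (JavaScript) generate mutants automatically; this only shows the scoring idea.

```python
from decimal import Decimal


def total_with_tax(subtotal: Decimal, rate: Decimal) -> Decimal:
    return subtotal * (1 + rate)


# Deliberately broken variants ("mutants") of the function above.
def mutant_ignores_tax(subtotal: Decimal, rate: Decimal) -> Decimal:
    return subtotal


def mutant_subtracts_tax(subtotal: Decimal, rate: Decimal) -> Decimal:
    return subtotal * (1 - rate)


def weak_test(fn) -> None:
    # Executes every line of fn (full coverage) but checks only the type.
    assert isinstance(fn(Decimal("100"), Decimal("0.08")), Decimal)


def strong_test(fn) -> None:
    assert fn(Decimal("100"), Decimal("0.08")) == Decimal("108.00")


def kills(test, mutant) -> bool:
    """True if the test fails when run against the broken implementation."""
    try:
        test(mutant)
    except AssertionError:
        return True
    return False


def mutation_score(tests, mutants) -> float:
    """Fraction of mutants that at least one test catches."""
    killed = sum(any(kills(t, m) for t in tests) for m in mutants)
    return killed / len(mutants)


MUTANTS = [mutant_ignores_tax, mutant_subtracts_tax]

if __name__ == "__main__":
    print("weak suite:  ", mutation_score([weak_test], MUTANTS))
    print("strong suite:", mutation_score([strong_test], MUTANTS))
```

Both tests pass against the real function and both produce identical line coverage, yet the weak one catches neither mutant while the strong one catches both. That difference is the number worth reporting next to coverage.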

What I'd tell a VP rolling this out

If your org is leaning into AI-assisted development — and it should be — treat test-suite integrity as a first-class part of the rollout, not a cleanup project for next year:

Instrument before you accelerate. Get test-result reporting and flaky-test detection in place before assistant adoption peaks, so you have a baseline and can see the curve bend.
Make the review checklist policy, not folklore. Put it in the PR template. Five questions, two minutes.
Track flaky-test count and time-to-resolution as KPIs alongside your throughput metrics. Throughput that's up 35% while CI trust collapses is not a win you want to present twice.
Write down the rerun policy. In a change-managed environment, "reruns require a linked quarantine ticket" is the difference between a control and a checkbox.

AI assistants changed the economics of writing code, and they changed the economics of writing bad tests right along with it. The teams that come out ahead won't be the ones that generated the most tests. They'll be the ones whose CI still means something at the end of the year.
