CI for AI agents: testing non-deterministic systems

Agents are the hardest software to test. A practical blueprint for CI for AI agents: trajectory assertions, tool mocking, eval harnesses, and what 'green' actually attests to.

BuildPulse Team

June 13, 2026

The demo worked; the test suite is where agents go to die

I watched a team ship a support agent that could look up orders, check refund policy, and issue refunds through Stripe. The demo was flawless. Then someone asked the obvious question — "how do we test this in CI?" — and the room got quiet.

Six weeks later their pipeline had a job called agent-tests that passed about 70% of the time. The fix everyone reached for was adding retries: 3 to the workflow file. With three attempts at a 70% pass rate, the job now went green about 97% of the time, nobody trusted it, and one Tuesday it merged a change that made the agent issue refunds before checking the policy. The tests were green. They were just green for the wrong reasons.

Agents are genuinely the hardest software most of us have ever had to test. They're multi-step, so a small early deviation compounds. They're stateful, so order matters. And they're non-deterministic at the core, so the same input legitimately produces different outputs. Your existing CI mental model — same input, same output, red means broken — doesn't survive contact with that. But "we can't really test it" isn't an answer either, especially if your CI gates are part of your change-management story and an auditor will eventually ask what a passing check actually verified.

Here's the model I've landed on after shipping (and un-shipping) a few of these.

What does "green" even mean for an agent?

For a normal test suite, green means "this code behaves as specified." For an agent, you have to split that claim into three layers, because no single test can make it:

  • The scaffolding works. Tool dispatch, schema validation, state management, retry logic, guardrails. This is deterministic code and it should be tested deterministically.
  • The agent behaves acceptably on known scenarios. Given a recorded or live model, it reaches correct end states and its trajectory respects your invariants.
  • Quality hasn't regressed. Across a scenario suite, success rates and scores are at or above where they were yesterday.

Most teams smash all three into one pytest run against a live model and wonder why CI is a coin flip. Each layer needs different machinery, different gates, and — critically — a different interpretation when it fails.

Layer 1: test the deterministic shell like normal software

Here's an underrated fact: most of an agent's code isn't the model. It's the loop around the model. Parsing tool calls, validating arguments against schemas, enforcing step budgets, handling tool errors, persisting state between turns. All of that is plain code, and a shocking amount of agent breakage lives there — not in the prompt.

So test it with the model stubbed out entirely:

def test_malformed_tool_args_get_rejected_and_retried():
    model = ScriptedModel([
        tool_call("issue_refund", {"order_id": "not-a-number"}),
        tool_call("issue_refund", {"order_id": 4821, "amount_cents": 2999}),
    ])
    agent = SupportAgent(model=model, tools=TOOLS)

    result = agent.run("refund order 4821")

    # The bad call never reached the tool
    assert stripe_stub.refund_calls == [(4821, 2999)]
    # The agent fed the validation error back to the model
    assert "order_id must be an integer" in model.messages_received[1]

No network, no tokens, fully deterministic. These tests run on every PR and a failure here means your code is broken, full stop. If a test in this layer starts flaking, treat it like any other flaky test — it's a test bug or an infra bug, not "the model being spicy," because the model isn't even in the room.
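The test above leans on a scripted stand-in for the model. Here is a minimal sketch of what that could look like. The names ScriptedModel and tool_call come from the test, but the complete() method and the shape of ToolCall are my assumptions about the agent's model interface, not a real library API.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolCall:
    """A model response asking the agent loop to invoke one tool."""
    name: str
    args: dict[str, Any]


def tool_call(name: str, args: dict[str, Any]) -> ToolCall:
    """Convenience constructor so scripts read like the test above."""
    return ToolCall(name=name, args=args)


class ScriptedModel:
    """Returns pre-written responses in order and records every message it is sent.

    Assumes (hypothetically) that the agent loop calls model.complete(message).
    """

    def __init__(self, script: list[ToolCall]):
        self._script = list(script)  # copy so the caller's list is not mutated
        self.messages_received: list[str] = []

    def complete(self, message: str) -> ToolCall:
        self.messages_received.append(message)
        if not self._script:
            # More model calls than scripted responses means the loop misbehaved.
            raise AssertionError("ScriptedModel exhausted: unscripted model call")
        return self._script.pop(0)
```

Raising when the script runs out is deliberate: an agent loop that makes an extra model call is itself a behavior change worth failing on.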

Layer 2: trajectory and end-state testing

Now the interesting part: testing the agent's actual behavior. The trap here is writing tests that assert the exact sequence of steps. Those tests are wrong twice — they fail when the agent finds an equally valid path, and they pass when the agent takes the expected path to a bad outcome.

Instead, assert two things:

  • End state: did the world finish in the right configuration? The refund exists, the PR was opened, the ticket was updated. This is what the user cares about.
  • Trajectory invariants: properties of the path, not the path itself. Ordering constraints, safety constraints, budgets.

def test_refund_flow(replayed_model):
    agent = SupportAgent(model=replayed_model, tools=TOOLS)
    result = agent.run("Order #4821 arrived broken, I want a refund")

    # End state: the only thing the customer cares about
    assert result.outcome == "refund_issued"
    assert stripe_stub.refunds == [Refund(order_id=4821, amount_cents=2999)]

    # Trajectory invariants: properties, not a script
    tools_used = [step.tool for step in result.steps]
    assert tools_used.index("check_refund_policy") < tools_used.index("issue_refund")
    assert tools_used.count("issue_refund") == 1          # no double refunds, ever
    assert len(result.steps) <= 8                          # step budget
    assert not any(s.tool in DESTRUCTIVE_TOOLS and not s.confirmed for s in result.steps)

Notice what's not asserted: whether the agent looked up the order before or after reading the complaint, what it said in intermediate reasoning, how it phrased the confirmation. Two different trajectories can both be correct. Your assertions should be exactly as strict as your actual requirements — and "policy check happens before money moves" is a real requirement; "the agent calls tools in this exact order" almost never is.

That team I mentioned at the top? Their refund-before-policy-check bug would have been caught by one ordering assertion. Their brittle exact-sequence tests caught nothing and flaked constantly.
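That one ordering assertion is worth packaging as a reusable helper. The sketch below is one way to write it, assuming a trajectory can be reduced to a list of tool names. The function names and signatures are illustrative, not taken from any framework.

```python
def ordering_ok(tools_used: list[str], *, before: str, after: str) -> bool:
    """True if the first call to `after` is preceded by at least one call to `before`.

    If `after` never ran, there is nothing to guard and the invariant holds.
    """
    if after not in tools_used:
        return True
    first_after = tools_used.index(after)
    return before in tools_used[:first_after]


def at_most_once(tools_used: list[str], tool: str) -> bool:
    """True if `tool` was called zero or one times (e.g. no double refunds)."""
    return tools_used.count(tool) <= 1


# Two different but equally valid paths both satisfy the invariant...
path_a = ["lookup_order", "check_refund_policy", "issue_refund"]
path_b = ["check_refund_policy", "lookup_order", "issue_refund"]
# ...while the refund-before-policy bug does not.
buggy = ["lookup_order", "issue_refund", "check_refund_policy"]
```

Unlike a bare list.index() comparison, this version returns False instead of raising when the policy check never happened, which keeps the failure message about the invariant rather than about a missing list element.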

Mock the tools, replay the model

Two separate boundaries, two separate strategies.

Tools get mocked or sandboxed, always. Your CI run should never touch production Stripe, never open real PRs, never email a customer. Build fakes with enough fidelity to return realistic payloads and record what was called. This is table stakes, and it also makes tool-level failure injection trivial — what does the agent do when lookup_order times out? You should have a test for that, and you can't write it against the real API.
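To make the failure-injection point concrete, here is a rough sketch of a fake lookup_order tool. The class name, the fail_next() method, and the payload fields are all hypothetical. The idea is just a fake that records calls and can be told to fail on demand.

```python
class FakeOrderLookup:
    """In-memory stand-in for a lookup_order tool (hypothetical interface)."""

    def __init__(self, orders: dict[int, dict]):
        self._orders = orders
        self.calls: list[int] = []              # every order_id the agent asked for
        self._queued_failures: list[Exception] = []

    def fail_next(self, exc: Exception) -> None:
        """Queue an exception to be raised by the next call."""
        self._queued_failures.append(exc)

    def __call__(self, order_id: int) -> dict:
        self.calls.append(order_id)             # record even the calls that fail
        if self._queued_failures:
            raise self._queued_failures.pop(0)
        if order_id not in self._orders:
            raise KeyError(f"no such order: {order_id}")
        return dict(self._orders[order_id])     # copy so tests can't mutate the store


lookup_order = FakeOrderLookup({4821: {"status": "delivered", "amount_cents": 2999}})
lookup_order.fail_next(TimeoutError("lookup_order timed out"))
```

With this in place, a test can assert that the agent retried after the timeout (two recorded calls) or gave up gracefully, neither of which you could provoke reliably against a real API.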

The model gets record/replay for the PR lane. Capture real model responses for each scenario, store them as fixtures, and replay them in CI. Yes, this means you're not testing the live model on every PR — that's a feature. A PR that changes your tool-dispatch code shouldn't fail because the model provider had a weird afternoon. Replay gives you determinism where determinism is the point: verifying that your changes didn't break the loop.

The trade-off is real — recorded fixtures go stale when you change prompts or models — so treat re-recording as part of the change. A prompt change without re-recorded fixtures is like a schema change without a migration.
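One way to make stale fixtures fail loudly instead of silently is to key each recording on a hash of everything the model saw, system prompt included. This is a sketch under that assumption. ReplayModel, record, and StaleFixtureError are illustrative names, and the message format is a guess at a typical chat-style payload.

```python
import hashlib
import json


def request_key(system_prompt: str, messages: list[dict]) -> str:
    """Stable hash of the full model input, so any prompt change changes the key."""
    payload = json.dumps({"system": system_prompt, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StaleFixtureError(LookupError):
    """Raised when CI asks for a response that was never recorded."""


def record(fixtures: dict[str, dict], system_prompt: str,
           messages: list[dict], response: dict) -> None:
    """Store a captured model response under the hash of its request."""
    fixtures[request_key(system_prompt, messages)] = response


class ReplayModel:
    """Serves recorded responses and refuses to guess when none matches."""

    def __init__(self, system_prompt: str, fixtures: dict[str, dict]):
        self.system_prompt = system_prompt
        self.fixtures = fixtures

    def complete(self, messages: list[dict]) -> dict:
        key = request_key(self.system_prompt, messages)
        if key not in self.fixtures:
            raise StaleFixtureError(
                "no recording for this request; re-record fixtures after prompt changes"
            )
        return self.fixtures[key]
```

The design choice here is that a prompt edit without a re-record turns into an explicit StaleFixtureError in the replay lane, which reads very differently from an assertion failure caused by a real behavior change.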

Layer 3: the eval harness, where probability lives

At some point you have to test against the live model, because that's the system you actually ship. This is agent evaluation proper, and the cardinal rule is: a single run is not a data point. Run each scenario k times and gate on the pass rate.

@eval_scenario(runs=5, required_pass_rate=0.8)
def eval_refund_flow(agent):
    result = agent.run("Order #4821 arrived broken, I want a refund")
    return (
        result.outcome == "refund_issued"
        and ordering_ok(result, before="check_refund_policy", after="issue_refund")
        and len(result.steps) <= 8
    )

A scenario passing 5/5 yesterday and 3/5 today is signal. A scenario passing once is an anecdote. The pass-rate threshold makes your tolerance explicit and reviewable, which matters a lot if you're in a SOC 2 or otherwise regulated environment, because "this check attests that the refund agent succeeds at least 80% of the time on the policy-compliance suite, measured over 5 runs" is a claim you can stand behind in a control narrative. "The test passed after two retries" is not.
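The eval_scenario decorator above isn't from a library I can point you to, so here is a minimal sketch of how one might be built. Everything in it is an assumption: the .run(make_agent) entry point, the EvalResult fields, and the choice to count an exception as a failed run.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class EvalResult:
    name: str
    passes: int
    runs: int
    required_pass_rate: float

    @property
    def pass_rate(self) -> float:
        return self.passes / self.runs

    @property
    def ok(self) -> bool:
        return self.pass_rate >= self.required_pass_rate


def eval_scenario(runs: int, required_pass_rate: float):
    """Attach a .run(make_agent) method that executes the scenario `runs` times."""
    def decorate(fn: Callable[[Any], bool]):
        def run(make_agent: Callable[[], Any]) -> EvalResult:
            passes = 0
            for _ in range(runs):
                try:
                    # Fresh agent per run so state never leaks between attempts.
                    passed = bool(fn(make_agent()))
                except Exception:
                    passed = False  # a crash counts as a failed run, not a skipped one
                passes += passed
            return EvalResult(fn.__name__, passes, runs, required_pass_rate)

        fn.run = run
        return fn
    return decorate
```

The important property is that the gate looks at the aggregate (passes out of runs) and never at an individual attempt, so there is no retry-until-green path hiding inside the harness.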

Wire it into CI as a distinct lane:

jobs:
  agent-evals:
    if: github.event_name == 'schedule' || contains(github.event.pull_request.labels.*.name, 'run-evals')
    runs-on: ubuntu-latest
    timeout-minutes: 45
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - name: Run agent eval suite
        run: |
          pytest evals/ \
            --eval-runs=5 \
            --junitxml=reports/evals.xml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: reports/

Emit JUnit XML even for evals. It means your eval results flow through the same reporting pipeline as everything else, and you can see scenario-level history instead of a single red/green blob.
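If your eval runner isn't pytest, producing that report by hand is not much code. The sketch below writes the commonly used subset of the JUnit XML format (testsuite, testcase, failure) with the standard library. The result dictionary keys are my own invention, and since JUnit XML has no single official schema, check what your reporting tool actually expects.

```python
import xml.etree.ElementTree as ET


def to_junit_xml(results: list[dict]) -> str:
    """Render one <testcase> per scenario, failing those below their threshold."""

    def failed(r: dict) -> bool:
        return r["passes"] / r["runs"] < r["required_pass_rate"]

    suite = ET.Element(
        "testsuite",
        name="agent-evals",
        tests=str(len(results)),
        failures=str(sum(failed(r) for r in results)),
    )
    for r in results:
        case = ET.SubElement(suite, "testcase", classname="evals", name=r["name"])
        if failed(r):
            # Put the measured rate in the message so history shows more than red/green.
            ET.SubElement(
                case,
                "failure",
                message=f"pass rate {r['passes']}/{r['runs']} below {r['required_pass_rate']:.0%}",
            )
    return ET.tostring(suite, encoding="unicode")
```

Putting the measured rate in the failure message is what gives you scenario-level history: a report that says 3/5 tells you far more than one that only says "failed."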

The CI topology that actually works

Putting the layers together:

  • Every PR: deterministic shell tests + replay-based trajectory tests. Fast, hermetic, hard gate. Red means your change broke something. No retries, no excuses.
  • Merge to main (or labeled PRs): a small live smoke suite — your five most business-critical scenarios at k=3. Hard gate on safety invariants, soft gate on quality.
  • Nightly: the full eval suite at k=5+. Gate releases on it, alert on regressions, trend the scores.
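The k values in these lanes are worth sanity-checking with a little binomial arithmetic. The sketch below assumes runs are independent and that the agent has a fixed true success rate, which is a simplification, and computes how often a scenario would clear a pass-rate gate purely by chance.

```python
from math import ceil, comb


def gate_pass_probability(true_rate: float, runs: int, required_pass_rate: float) -> float:
    """Probability that at least the required number of `runs` succeed.

    Assumes independent runs with a constant success probability `true_rate`.
    """
    # Small epsilon guards against float noise when required_pass_rate * runs is integral.
    needed = ceil(required_pass_rate * runs - 1e-9)
    return sum(
        comb(runs, k) * true_rate**k * (1 - true_rate) ** (runs - k)
        for k in range(needed, runs + 1)
    )


healthy = gate_pass_probability(0.9, runs=5, required_pass_rate=0.8)
degraded = gate_pass_probability(0.6, runs=5, required_pass_rate=0.8)
```

Under those assumptions, an agent that truly succeeds 90% of the time clears a 4-of-5 gate about 92% of the time, and one that has degraded to 60% still clears it about 34% of the time. That is why a small k works as a smoke check, while the nightly lane should run more attempts and trend the rate rather than trust any single night's verdict.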

This topology also fixes the failure-interpretation problem, which is where most teams hurt themselves. When everything runs in one undifferentiated suite, every red build becomes "eh, probably the model" and people rerun until green — the exact normalization of deviance that flaky tests breed, now with an even better excuse. Separated lanes give every failure a default diagnosis: deterministic lane red means your code, replay lane red means your change or stale fixtures, eval rate dropped means your prompt or the model.

And be honest about residual flakiness in the deterministic lanes. Even with the model stubbed, agent test suites lean hard on async runtimes, sandboxed services, and timeouts — classic flake territory. Track those failures the same way you'd track any flaky test, and quarantine the ones that fail without a corresponding code change so they stop poisoning the signal. What you must not do is quarantine an eval. A flaky test is noise to be removed; a declining eval pass rate is your product getting worse. Confusing the two is how a quality regression rides a "known flaky" label into production.
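"Fails without a corresponding code change" can be checked mechanically: a test that has both passed and failed on the same commit is flaky by definition. Here is a small sketch of that check, assuming you can export run history as (test id, commit SHA, passed) tuples. The function name and tuple shape are illustrative.

```python
from collections import defaultdict
from typing import Iterable


def find_flaky(history: Iterable[tuple[str, str, bool]]) -> list[str]:
    """Return test ids that both passed and failed on at least one commit.

    Intended for the deterministic lanes only; eval scenarios should never be fed in.
    """
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for test_id, commit_sha, passed in history:
        outcomes[(test_id, commit_sha)].add(passed)
    flaky = {test_id for (test_id, _), seen in outcomes.items() if seen == {True, False}}
    return sorted(flaky)


history = [
    ("test_tool_timeout", "abc123", True),
    ("test_tool_timeout", "abc123", False),   # same commit, different outcome: flaky
    ("test_refund_flow", "abc123", True),
    ("test_refund_flow", "def456", False),    # outcome changed with the code: not flaky
]
```

Note what this deliberately does not flag: test_refund_flow went red on a new commit, which points at the change, not at the test.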

What green should mean

When this is set up well, a green check on an agent PR means something specific: the scaffolding is correct, recorded scenarios still produce valid end states, and every replayed trajectory respected the safety invariants. Green on the live lanes adds the last piece: success rates are at or above threshold. That's not "the agent is perfect." It's a bounded, documented, defensible claim, which is all any CI gate has ever been, for any software.

The teams that struggle with testing AI agents are mostly trying to get binary certainty out of a probabilistic system. The teams that succeed change the question: not "did it pass?" but "what exactly does passing claim, and at what confidence?" Answer that precisely, encode it in layers, and CI for AI agents stops being a coin flip and starts being an instrument you can actually read.