Flaky evals: when your LLM tests are non-deterministic

Your LLM eval passes on Monday, fails on Wednesday, and you changed nothing. That's not a model problem — that's a flaky eval. Here's how to fix it.

BuildPulse Team

June 5, 2026

The eval that lied to you

You ship a prompt change. CI goes green. You deploy. Two days later a user files a bug that your eval was supposed to catch. You re-run the eval locally — it passes. You run it again — it fails. You haven't touched the code.

Congratulations. You have a flaky eval.

This isn't exotic. Every team I've talked to that runs LLM evals in CI eventually hits this. The response is usually the same: developers start ignoring eval failures, treating them as noise, clicking re-run until the board turns green. Sound familiar? It's the exact trust-erosion loop that flaky unit tests create — just with higher stakes, because now you're making deployment decisions based on model behavior.

Eval flakiness is worth taking seriously. Here's how to think about it.

Why evals flake

Unit tests flake because of non-determinism: race conditions, time-dependent logic, shared mutable state, network calls. LLM evals flake for the same reason, just with a different source of randomness: the model itself.

A few specific culprits:

  • Temperature > 0. Every inference call at temperature > 0 samples from a probability distribution. Run the same prompt twice, get two different outputs. That's by design — but it means your eval outcome can vary even when everything else is identical.
  • No seed, or ignored seeds. Many inference APIs accept a seed parameter, but plenty of teams never set it. Even when they do, some providers treat seeds as best-effort, not guaranteed.
  • Evaluator-model variability. If your eval uses a second LLM to judge output quality ("LLM-as-judge"), that judge is also non-deterministic. You've stacked two sources of randomness.
  • String-match evals on generative outputs. Checking output == "Yes" against a model that sometimes says "Yes", sometimes "Yes.", and sometimes "yes" is a fragile assertion, not a stable test.
  • External state. Evals that hit live APIs, retrieve from a changing vector index, or depend on today's date will drift over time regardless of the model.
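For the string-match culprit, one low-effort mitigation is to normalize the output before comparing. A minimal sketch, assuming a yes/no style answer; normalize_answer and is_yes are illustrative names, not part of any library:

```python
import string

def normalize_answer(raw: str) -> str:
    """Trim surrounding whitespace and punctuation, then lowercase."""
    return raw.strip(string.punctuation + string.whitespace).lower()

def is_yes(raw: str) -> bool:
    # Compare the normalized form instead of the raw model output.
    return normalize_answer(raw) == "yes"
```

This only absorbs cosmetic variation. An answer like "Yes, but only if..." still fails the check, which is usually what you want.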

The fix isn't to accept flakiness as the cost of working with LLMs. It's to treat eval non-determinism as an engineering problem with engineering solutions.

Detecting flaky evals before they erode trust

The first step is measurement. You can't fix what you haven't named.

The blunt instrument: run each eval case multiple times in a single CI pass and check for disagreement. If the same input produces different pass/fail outcomes across runs, the eval is flaky.

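def run_eval_with_flake_detection(eval_fn, input_case, n=5):
    """Run one eval case n times and report disagreement between runs."""
    # eval_fn is expected to return a truthy value on pass, falsy on fail.
    results = [bool(eval_fn(input_case)) for _ in range(n)]
    pass_rate = sum(results) / n
    # A rate strictly between 0 and 1 means the runs disagreed.
    if 0 < pass_rate < 1:
        print(f"[FLAKY] pass_rate={pass_rate:.2f} over {n} runs: {input_case['id']}")
    return pass_rate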

This is expensive — you're paying for n inference calls per case — but even a single nightly job that runs your eval suite 5x will quickly surface the worst offenders.

A cheaper signal: track eval results over time in CI and flag cases with high variance across builds on the same commit. This is exactly what flaky-test detection does with unit tests: it watches the history, not just the current run. If you're already using a tool like BuildPulse for flaky test detection in your test suite, you can apply the same pattern to eval results by feeding them as JUnit XML — more on that below.
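The history-based check can be sketched in a few lines. This assumes you store one record per eval run with hypothetical keys "commit", "case_id", and "passed"; the storage layer and the function name are illustrative:

```python
from collections import defaultdict

def find_flip_flops(history):
    """Return eval case ids that both passed and failed on the same commit.

    `history` is an iterable of dicts with the (assumed) keys
    "commit", "case_id", and "passed".
    """
    outcomes = defaultdict(set)
    for record in history:
        key = (record["commit"], record["case_id"])
        outcomes[key].add(bool(record["passed"]))
    # Two distinct outcomes for one (commit, case) pair means the code
    # did not change but the verdict did.
    flaky = {case_id for (_, case_id), seen in outcomes.items() if len(seen) == 2}
    return sorted(flaky)
```

A case that fails on one commit and passes on the next is not flagged here, because that could be a real fix or a real regression.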

Stabilizing evals you can fix

1. Lock temperature to zero for deterministic cases

For evals that test factual recall, classification, structured extraction, or any behavior where you want a single correct answer, set temperature to 0 and set a seed.

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=0,
    seed=42,
)

This won't give you byte-identical outputs forever — models get updated, provider infrastructure changes — but it dramatically narrows the variance within a stable model version. If you're pinning to a specific model snapshot (e.g., gpt-4o-2024-08-06), you'll get much more consistent results.
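To tell "the eval flaked" apart from "the backend changed under you", it helps to record which model and backend produced each result. OpenAI's Chat Completions responses include model and system_fingerprint fields; other providers may expose something different or nothing at all. A rough sketch, assuming you copy those two fields into your own run records (the record shape and function name are hypothetical):

```python
def backend_changed(runs):
    """Return True if the recorded runs span more than one backend.

    `runs` is a list of dicts with the (assumed) keys "model" and
    "system_fingerprint", copied from each API response at eval time.
    """
    # A missing fingerprint is treated as its own value (None).
    backends = {(r["model"], r.get("system_fingerprint")) for r in runs}
    return len(backends) > 1
```

If this returns True for a set of disagreeing runs, compare outcomes within each backend before labeling the case flaky.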

2. Sample multiple times and use majority vote

For evals where you legitimately need temperature > 0 (testing creative tasks, or checking that a model can produce a certain kind of output), a single pass/fail is a bad signal. Sample N times and report the majority outcome.

def majority_eval(eval_fn, input_case, n=7):
    results = [eval_fn(input_case) for _ in range(n)]
    passes = sum(results)
    # Majority vote with a clear threshold
    return passes >= (n // 2 + 1), passes / n

Odd sample counts (5, 7) avoid ties. The pass rate you get out of this is also more honest than a binary result — a case that passes 4/7 times is a near-miss worth investigating, not a clean pass.
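To see why voting helps, you can compute how often the majority verdict comes out as a pass for a given per-sample pass probability. This is a back-of-the-envelope sketch that assumes samples are independent, which real model calls only approximate; the function name is illustrative:

```python
from math import comb

def majority_pass_probability(p: float, n: int) -> float:
    """Probability that more than half of n independent samples pass,
    when each sample passes with probability p."""
    needed = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )
```

Under that assumption, a case that passes 80% of the time yields a passing verdict 80% of the time with a single sample, about 90% with n=3, and about 97% with n=7. A case sitting at 50% stays a coin flip no matter how many samples you take, which is exactly the kind of case the pass rate should surface.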

3. Replace brittle string matching with tolerance bands

If you're asserting on numeric outputs — scores, ratings, confidence values — don't check for equality. Check for range.

def score_eval(output: str, expected: float, tolerance: float = 0.15) -> bool:
    try:
        actual = float(output.strip())
        return abs(actual - expected) <= tolerance
    except ValueError:
        return False

For semantic evals ("does this output capture the key idea?"), embedding similarity with a threshold is more stable than exact-match or a binary LLM judge:

from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def semantic_eval(actual_embedding, expected_embedding, threshold=0.85):
    return cosine_similarity(actual_embedding, expected_embedding) >= threshold

Set the threshold based on empirical data from your dataset, not intuition. Sample 50-100 outputs, plot the distribution, and pick a threshold that separates real failures from noise.
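One simple way to turn that labeled sample into a number is to pick the threshold that misclassifies the fewest outputs. A sketch, assuming you have hand-labeled similarity scores for acceptable and unacceptable outputs; pick_threshold is an illustrative helper, not a library function:

```python
def pick_threshold(scores_good, scores_bad):
    """Pick the similarity threshold that misclassifies the fewest samples.

    scores_good: similarity scores for outputs a human marked acceptable.
    scores_bad:  similarity scores for outputs a human marked as failures.
    Returns (threshold, error_count), or (None, None) with no data.
    """
    candidates = sorted(set(scores_good) | set(scores_bad))
    best_threshold, best_errors = None, None
    for t in candidates:
        # A good output below t would be a false failure;
        # a bad output at or above t would be a missed failure.
        errors = sum(s < t for s in scores_good) + sum(s >= t for s in scores_bad)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors
```

If the best threshold still leaves many errors, the two distributions overlap and embedding similarity alone probably can't separate pass from fail for that eval.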

4. Freeze external dependencies

If your eval retrieves from a vector store, snapshot the retrieval results. If it calls a live API, mock it. Evals should be hermetic. Any eval that can return different results because the data changed is testing your data pipeline, not your model behavior — and that's a different test with different ownership.
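Snapshotting retrieval can be as simple as a record-and-replay wrapper. A minimal sketch, assuming retrieval results are JSON-serializable; retrieve_fn stands in for whatever hypothetical function queries your live vector store:

```python
import json
from pathlib import Path

def frozen_retrieve(query, retrieve_fn, snapshot_path):
    """Return retrieval results for `query`, replaying a snapshot if present.

    `retrieve_fn` is a placeholder for the call that hits the live store.
    """
    path = Path(snapshot_path)
    snapshots = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    if query not in snapshots:
        # First run only: record the live results so later runs are hermetic.
        snapshots[query] = retrieve_fn(query)
        path.write_text(json.dumps(snapshots, indent=2, sort_keys=True), encoding="utf-8")
    return snapshots[query]
```

Commit the snapshot file alongside the eval cases so every CI run sees the same retrieved context, and regenerate it deliberately when you want to test against fresh data.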

Quarantining what you can't yet fix

Some evals will remain flaky after you've done all of the above. Maybe they test genuinely stochastic behaviors. Maybe the LLM-as-judge is inherently variable on edge cases. Maybe you haven't had time to fix them yet.

Don't delete them. Don't ignore them. Quarantine them — just like you'd quarantine a flaky unit test.

The pattern: tag flaky evals, run them but don't gate CI on them, and report them separately.

# .github/workflows/evals.yml
jobs:
  evals-stable:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python run_evals.py --tag stable --output stable-results.xml
      - name: Fail CI on stable eval failures
        run: python check_results.py stable-results.xml --fail-on-error

  evals-quarantine:
    runs-on: ubuntu-latest
    continue-on-error: true   # Never blocks merge
    steps:
      - uses: actions/checkout@v4
      - run: python run_evals.py --tag flaky --output quarantine-results.xml
      - name: Upload for tracking (no gate)
        if: always()   # Upload results even if the eval step exits non-zero
        uses: actions/upload-artifact@v4
        with:
          name: quarantine-eval-results
          path: quarantine-results.xml

The quarantine job runs on every PR but can never block a merge. You still collect the data, which is the important part: you want to know if a quarantined eval stops flaking, so you can promote it back to the stable suite.
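The promotion decision itself can be automated from the collected history. A sketch under stated assumptions: the window of 20 runs is an arbitrary illustration rather than a recommendation, and quarantine_status is a hypothetical helper:

```python
def quarantine_status(recent_runs, min_runs=20):
    """Suggest what to do with a quarantined eval.

    `recent_runs` is a list of booleans (True = pass), newest last.
    Returns "promote", "investigate", or "keep".
    """
    window = recent_runs[-min_runs:]
    if len(window) < min_runs:
        return "keep"  # not enough evidence yet
    if all(window):
        return "promote"  # consistently passing: move back to the stable suite
    if not any(window):
        return "investigate"  # consistently failing: likely a real failure, not a flake
    return "keep"  # still disagreeing with itself
```

The "investigate" branch matters: an eval that fails every time is no longer flaky, but promoting it as-is would block every merge.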

Reporting eval results like test results

If you emit your eval results as JUnit XML, any tool that already ingests JUnit reports can consume them, including flaky test detection.

import xml.etree.ElementTree as ET

def write_junit(results, output_path):
    """Write eval results as a single JUnit <testsuite> file."""
    # Each result is a dict with id, category, passed, duration_s,
    # plus reason and actual_output for failures.
    suite = ET.Element("testsuite", name="llm-evals",
                       tests=str(len(results)),
                       failures=str(sum(1 for r in results if not r["passed"])))
    for r in results:
        case = ET.SubElement(suite, "testcase",
                             classname=r["category"],
                             name=r["id"],
                             time=str(r["duration_s"]))
        if not r["passed"]:
            fail = ET.SubElement(case, "failure", message=r["reason"])
            fail.text = r["actual_output"]
    tree = ET.ElementTree(suite)
    # An explicit utf-8 encoding avoids relying on how "unicode" is
    # resolved when writing to a file path.
    tree.write(output_path, encoding="utf-8", xml_declaration=True)

Once you're emitting JUnit XML, tools like BuildPulse can track eval pass rates over time, surface which eval cases flip-flop across runs on the same commit, and flag the ones that need quarantine — the same way it does for Jest or pytest. The eval case becomes a first-class citizen in your CI health dashboard rather than a CSV someone exports once a quarter.
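For a quick local check without any external tool, the same file can be read back with the standard library. This is one hypothetical shape for a results checker like the one the workflow above invokes; the function name and behavior are assumptions, not the actual script:

```python
import xml.etree.ElementTree as ET

def failed_cases(junit_path):
    """Return the names of test cases that contain a <failure> or <error>."""
    root = ET.parse(junit_path).getroot()
    failed = []
    # iter() also finds <testcase> elements nested under a <testsuites> root.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(case.get("name"))
    return failed
```

A gating script could exit non-zero whenever this list is non-empty for the stable suite, and just log it for the quarantined one.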

The real cost of ignoring this

Flaky unit tests cost you time and trust. Developers learn to re-run until green. The test suite becomes a ritual rather than a signal.

Flaky evals cost you something worse: false confidence in model behavior. If your eval suite is noisy enough that failures are assumed to be flakes, you will eventually deploy a real regression and rationalize the eval failure as noise. The team that clicked re-run one too many times.

The good news is this is a solved problem in the unit-test world, and the same solutions transfer cleanly. Measure variance, reduce non-determinism where you can, use statistical aggregation where you can't, quarantine the rest, and track everything over time. That's not AI-specific wisdom — it's just good engineering applied to a new domain.

Your evals should be the thing you trust. Do the work to make them trustworthy.