How to test LLM-powered applications

Traditional assertions fall apart when your function returns something different every time. Here's how to build a test suite that actually catches regressions in LLM apps.

BuildPulse Team

June 3, 2026

How to Test LLM-Powered Applications | BuildPulse Blog

The assertion that lies to you every time

Here's a test I've seen in more than one real codebase:

def test_summarize():
    result = summarize("The quick brown fox jumps over the lazy dog.")
    assert result == "A fox jumps over a dog."

It passes on the developer's laptop — once. Then it fails in CI. Then it passes again. Nobody touches it for three weeks. Then a model upgrade ships and it fails forever.

This is the fundamental problem with testing LLM applications: the function under test is not a pure function. The same input can produce a different output on every call. Exact-match assertions aren't just fragile here; they're actively misleading. A test that flickers between pass and fail tells you nothing useful about whether your application is actually working.

So let's talk about what does work.

Four strategies that actually hold up

1. Structural and schema assertions

Before you worry about whether the LLM said the right thing, verify it returned something in the right shape. This sounds obvious but it catches a surprising number of real regressions — especially after prompt changes or model swaps.

If your endpoint is supposed to return JSON with specific fields, assert on the structure:

import jsonschema

SUMMARY_SCHEMA = {
    "type": "object",
    "required": ["summary", "sentiment", "key_points"],
    "properties": {
        "summary": {"type": "string", "minLength": 10},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "key_points": {"type": "array", "items": {"type": "string"}, "minItems": 1}
    }
}

def test_summary_structure():
    result = analyze_article(SAMPLE_ARTICLE)
    jsonschema.validate(result, SUMMARY_SCHEMA)  # raises on violation

For classification tasks, assert the output is one of a known set of values. For generation tasks with length constraints, assert on min/max token counts or character ranges. For RAG pipelines, assert that citations actually appear in the retrieved context.
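Here's a rough sketch of those three checks as small helpers. The function names, the label set, the character limits, and the assumption that your pipeline returns citations as a list of quoted snippets are all illustrative, not something your stack necessarily provides.

```python
ALLOWED_LABELS = {"positive", "negative", "neutral"}  # assumed label set


def check_label(label: str) -> None:
    """Classification: the output must be one of a known set of values."""
    assert label in ALLOWED_LABELS, f"Unexpected label: {label!r}"


def check_length(text: str, min_chars: int = 20, max_chars: int = 600) -> None:
    """Generation: the output must fall inside a character range."""
    assert min_chars <= len(text) <= max_chars, (
        f"Length {len(text)} outside [{min_chars}, {max_chars}]"
    )


def check_citations(citations: list[str], retrieved_context: str) -> None:
    """RAG: every quoted snippet must appear verbatim in the retrieved context."""
    missing = [c for c in citations if c not in retrieved_context]
    assert not missing, f"Citations not found in context: {missing}"
```

Verbatim substring matching is the strictest version of the citation check. If your model paraphrases its sources, you'd need something looser, such as the similarity approach in the next section.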

These tests are deterministic even when the content isn't. They're fast. They give you a floor of confidence that's worth having.

2. Semantic similarity assertions

When you need to verify meaning rather than exact wording, cosine similarity against an embedding model is your friend. The idea: embed both the model output and a reference string, then assert that similarity exceeds a threshold.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(a: str, b: str) -> float:
    embeddings = model.encode([a, b], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

def test_summarize_captures_main_idea():
    output = summarize(ARTICLE_ABOUT_CLIMATE_CHANGE)
    score = semantic_similarity(output, "global warming and its effects on the environment")
    assert score > 0.75, f"Summary seems off-topic (similarity={score:.2f})"

The threshold is the thing you'll tune over time. Start conservative (0.70–0.75), run it against a batch of known-good outputs to calibrate, and keep the calibrated value in source control so that any change to it goes through code review. Don't let thresholds drift silently.
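One way to do that calibration, as a minimal sketch: score a batch of outputs a human has already approved, then set the threshold a small margin below the weakest one. The `calibrate_threshold` helper, the margin, and the scores below are made up for illustration; in practice the scores would come from `semantic_similarity`.

```python
import statistics


def calibrate_threshold(known_good_scores: list[float], margin: float = 0.05) -> float:
    """Suggest a threshold slightly below the weakest known-good score."""
    if not known_good_scores:
        raise ValueError("need at least one known-good score")
    return round(min(known_good_scores) - margin, 2)


# Hypothetical similarity scores for outputs a reviewer already approved.
known_good = [0.81, 0.78, 0.84, 0.79, 0.88]

SIMILARITY_THRESHOLD = calibrate_threshold(known_good)
print(f"mean={statistics.mean(known_good):.2f} threshold={SIMILARITY_THRESHOLD}")
```

Hard-coding the resulting number as a named constant, rather than recomputing it on every run, is what makes a threshold change show up in a diff.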

One caveat: semantic similarity doesn't catch factual errors. A summary that's topically similar but factually wrong will score high. Use this for topic relevance, not factual accuracy — that's a different tool.

3. LLM-as-judge

This is the most powerful strategy in the toolkit, and also the one most likely to give you a false sense of rigor if you don't set it up carefully.

The pattern: send the model output to a separate LLM call (often a more capable model than the one under test, or a cheaper one when the rubric is simple) with a rubric, and parse a structured verdict.

import openai
import json

# Literal braces in the JSON example are doubled so that str.format()
# only substitutes {question} and {response}.
JUDGE_PROMPT = """
You are evaluating the output of a customer support AI assistant.

User question: {question}
Assistant response: {response}

Evaluate the response on these criteria. Return JSON only.
{{
  "answers_question": true/false,
  "is_polite": true/false,
  "contains_hallucination": true/false,
  "score": 1-5
}}
"""

def judge_response(question: str, response: str) -> dict:
    result = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, response=response
        )}],
        response_format={"type": "json_object"},
        temperature=0,  # keeps the judge as stable as possible across runs
    )
    return json.loads(result.choices[0].message.content)

def test_support_response_quality():
    response = support_bot("How do I cancel my subscription?")
    verdict = judge_response("How do I cancel my subscription?", response)
    assert verdict["answers_question"] is True
    assert verdict["contains_hallucination"] is False
    assert verdict["score"] >= 4

A few things I've learned building these:

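Set temperature=0 on your judge. You want the evaluation to be stable even if the generation isn't. Temperature 0 reduces run-to-run variance but doesn't guarantee identical verdicts, so spot-check the judge for consistency. A flickering judge is worse than no judge.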
Make the rubric specific. "Is this a good response?" is useless. "Does this response include a link to the cancellation page?" is testable.
Sanity-check your judge against known failures. Feed it obviously broken outputs and make sure it catches them. A judge that misses 30% of hallucinations is just expensive noise.
Track judge cost separately. LLM-as-judge can run up a real API bill in CI if you're not watching it.
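The sanity check from that list can be sketched as a catch-rate measurement: feed the judge responses you know are broken and count how many it rejects. Everything here is illustrative, including the `judge_catch_rate` name, the two sample failures, and the `stub_judge` stand-in that lets the sketch run without an API key.

```python
from typing import Callable

# Hypothetical hand-written failures that any reasonable judge must reject.
KNOWN_BAD = [
    {"question": "How do I cancel my subscription?",
     "response": "Our office is closed on Sundays."},
    {"question": "How do I cancel my subscription?",
     "response": "Call 555-0100 and ask for the wizard."},
]


def judge_catch_rate(judge: Callable[[str, str], dict], cases: list[dict]) -> float:
    """Fraction of known-bad responses scored below 4, the pass bar used in the test above."""
    caught = sum(
        1 for case in cases
        if judge(case["question"], case["response"])["score"] < 4
    )
    return caught / len(cases)


def stub_judge(question: str, response: str) -> dict:
    """Offline stand-in for judge_response so this sketch runs without network access."""
    return {"score": 5 if "cancel" in response.lower() else 1}


assert judge_catch_rate(stub_judge, KNOWN_BAD) == 1.0
```

In a real suite you'd pass `judge_response` instead of the stub and fail the build if the catch rate drops below whatever floor you've agreed on.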

4. Golden datasets with regression tracking

The three strategies above handle individual test cases. Golden datasets let you track quality trends across many examples over time — which is how you actually catch model degradation or prompt regressions before they hit production.

The setup:

Curate a set of representative inputs (50–200 examples).
Run your pipeline on all of them and store the outputs + scores.
On every relevant code change, re-run and compare aggregate metrics to the stored baseline.

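# eval_runner.py
import argparse
import json
from pathlib import Path

# my_pipeline and judge_response come from your application code;
# import them from wherever they live in your project.

def run_eval(dataset_path: str, baseline_path: str, threshold: float = 0.05):
    dataset = json.loads(Path(dataset_path).read_text())
    baseline = json.loads(Path(baseline_path).read_text())

    # Score every golden example with the judge.
    scores = []
    for example in dataset:
        output = my_pipeline(example["input"])
        score = judge_response(example["input"], output)["score"]
        scores.append(score)

    current_mean = sum(scores) / len(scores)
    baseline_mean = baseline["mean_score"]

    print(f"Current: {current_mean:.3f} | Baseline: {baseline_mean:.3f}")

    # Fail only when the mean drops by more than the allowed threshold.
    delta = baseline_mean - current_mean
    assert delta <= threshold, (
        f"Quality regression detected: score dropped by {delta:.3f} "
        f"(threshold={threshold})"
    )

if __name__ == "__main__":
    # Parse the flags that the CI job passes on the command line.
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--threshold", type=float, default=0.05)
    args = parser.parse_args()
    run_eval(args.dataset, args.baseline, args.threshold)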

You wire this into CI as a scheduled job or as a gate on model/prompt changes:

# .github/workflows/llm-eval.yml
name: LLM quality eval

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/pipeline/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - run: python eval_runner.py --dataset evals/golden.json --baseline evals/baseline.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The baseline file lives in source control. Intentional improvements to the model get a deliberate baseline update — a PR that says "improved accuracy by 4%, updating baseline" — rather than silently shifting the floor.

The flakiness problem is real here

Even with all of this, LLM tests will be flakier than your unit tests. That's the honest truth. Temperature > 0 means you'll occasionally get edge-case outputs. Network calls to model APIs will time out. Rate limits will bite you at 3am.

A few things that help:

Cache responses during local development. Only hit the real API in CI. A simple JSON file keyed on (model, prompt hash) saves a lot of grief and money.
Retry with backoff on infrastructure failures, not on assertion failures. If the test fails because the output was wrong, that's signal. If it fails because the API returned 429, that's noise.
Mark known-flaky LLM tests explicitly and track their pass rates over time. A test that fails 15% of the time isn't a reliable gate — it's a guess. If you're already using a flaky test detection tool in CI, route LLM eval failures through the same pipeline. Distinguishing "this test is inherently probabilistic" from "this test broke with the last deploy" is exactly the kind of signal you need.
Run evals in parallel when the dataset is large. Sequential LLM calls across 200 examples will time out your CI job.
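Three of those tips (caching, retrying only on infrastructure errors, and running in parallel) fit in a short sketch. Treat the names as assumptions: `TransientAPIError` stands in for whatever rate-limit and timeout exceptions your client library raises, and `call` is any function that takes a model name and a prompt and returns text.

```python
import hashlib
import json
import random
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


class TransientAPIError(Exception):
    """Hypothetical stand-in for rate-limit (429) and timeout errors."""


def cache_key(model: str, prompt: str) -> str:
    """Stable key derived from the model name and the prompt text."""
    return hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()


def cached_call(model: str, prompt: str,
                call: Callable[[str, str], str], cache_path: Path) -> str:
    """Return a cached response if present; otherwise call the API and store the result."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = cache_key(model, prompt)
    if key not in cache:
        cache[key] = call(model, prompt)
        cache_path.write_text(json.dumps(cache, indent=2))
    return cache[key]


def with_retry(fn: Callable[[], str], attempts: int = 4, base_delay: float = 0.5) -> str:
    """Retry only TransientAPIError; assertion failures and other errors propagate."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientAPIError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with a little jitter.
            time.sleep(base_delay * 2 ** attempt * random.uniform(1.0, 1.5))


def run_parallel(inputs: list[str], fn: Callable[[str], str], workers: int = 8) -> list[str]:
    """Apply fn to every input concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, inputs))
```

The JSON-file cache here is not safe for concurrent writers, which is fine for local development (where the cache is meant to be used) but means you shouldn't combine `cached_call` with `run_parallel` as written.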

What to skip

A few approaches I've seen teams waste time on:

BLEU/ROUGE scores. BLEU was designed for machine translation and ROUGE for summarization; both measure n-gram overlap with a reference text. They correlate poorly with actual quality for most LLM tasks. Skip them unless you're specifically evaluating translation or summarization against a well-defined reference corpus.

Exact-match on any free-text generation. Already covered, but worth repeating. If your test will fail when the LLM says "I'd be happy to help" instead of "I'm happy to help", your test is wrong, not the model.

Testing every possible edge case manually. You can't enumerate the input space of a language model. Invest in the golden dataset + judge approach and let aggregate metrics do the work.
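To see why n-gram overlap (the first item above) misleads, here's a toy illustration. `unigram_f1` is a simplified ROUGE-1-style score I wrote for this example, not the official metric (no stemming, no tokenizer), and the three sentences are invented.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercase whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared word occurrences
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "you can cancel your subscription from the billing page"
paraphrase = "head to billing settings and end the plan there"
wrong = "you cannot cancel your subscription from the billing page"

# The factually wrong answer shares almost every word with the reference,
# so it outscores the helpful paraphrase.
assert unigram_f1(wrong, reference) > unigram_f1(paraphrase, reference)
```

A one-word change flips the meaning but barely moves the score, while a correct answer in different words gets punished.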

The part nobody talks about: test maintenance

LLM test suites rot faster than regular ones because the underlying model changes under you. GPT-4 Turbo behaves differently than GPT-4o. Claude 3 Opus behaves differently than Claude 3.5 Sonnet. When you upgrade, your tests need to be reviewed — not just rerun.

The teams I've seen do this well treat model upgrades like dependency upgrades: a dedicated PR, a deliberate baseline re-evaluation, and a sign-off from whoever owns the product behavior. Not just "CI was green so we shipped it."

Build your eval pipeline so that re-baselining is cheap and visible. If updating the baseline requires deleting a file and regenerating it, that's a five-minute task. The important thing is that it happens in a PR, with a diff, where someone has to consciously approve the quality change.
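A re-baselining helper can be as small as the sketch below. Only the `mean_score` key comes from the eval runner shown earlier; the `write_baseline` name and the extra fields (`num_examples`, `updated`, `note`) are assumptions I've added so the diff tells a reviewer what changed and why.

```python
import json
from datetime import date
from pathlib import Path


def write_baseline(scores: list[float], baseline_path: str, note: str) -> dict:
    """Overwrite the baseline file so the change shows up as a reviewable diff."""
    if not scores:
        raise ValueError("cannot baseline an empty score list")
    baseline = {
        "mean_score": round(sum(scores) / len(scores), 3),
        "num_examples": len(scores),        # assumed extra field
        "updated": date.today().isoformat(),  # assumed extra field
        "note": note,                        # assumed extra field
    }
    # Sorted keys and a trailing newline keep diffs small and stable.
    Path(baseline_path).write_text(
        json.dumps(baseline, indent=2, sort_keys=True) + "\n"
    )
    return baseline
```

You'd run it once after a deliberate change, for example `write_baseline(scores, "evals/baseline.json", "model upgrade")`, and commit the result in the same PR as the change itself.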

Testing LLM applications is harder than testing regular software, but it's not impossible. Structure first, semantics second, judge for the subtle stuff, golden datasets for trends. That stack will catch most real regressions — and unlike a flickering exact-match assertion, it'll actually tell you something when it fails.
