How to test LLM-powered applications

Testing LLM-powered applications is genuinely hard — but most teams are making it harder than it needs to be. Here's what actually works.

BuildPulse Team

May 30, 2026

The test you wrote yesterday is wrong today

You shipped a feature backed by GPT-4. It worked great in staging. You wrote a test that asserted the response contained the word "summary". Two weeks later, your prompt changed slightly, the model started returning "overview" instead, and your test is red — but the feature is fine.

Welcome to LLM testing, where the wrong kind of assertion is worse than no assertion at all.

Testing LLM-powered applications isn't impossible, but it requires a different mental model than testing deterministic code. The output isn't a specific value — it's a distribution. Your job isn't to pin down exact outputs; it's to define the properties that must hold across that distribution. Get that framing right and the rest starts to fall into place.

Separate what's deterministic from what isn't

The biggest mistake I see teams make is treating the entire LLM feature as a black box and either over-testing (brittle exact-match assertions) or under-testing ("we'll just eyeball it"). Neither works.

Most LLM features have layers:

  1. Input construction — assembling the prompt from user input, retrieved context, system instructions
  2. The LLM call — the non-deterministic part
  3. Output parsing and handling — extracting structured data, routing, error handling

Layers 1 and 3 are fully deterministic. Test them like normal code. Layer 2 is where you need a different strategy.

# testable: pure function, no LLM involved
def build_prompt(user_query: str, context_docs: list[str]) -> str:
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"""Answer the question using only the provided context.

Context:
{context}

Question: {user_query}"""


def test_build_prompt_includes_all_context():
    docs = ["The sky is blue.", "Water is wet."]
    prompt = build_prompt("What color is the sky?", docs)
    assert "The sky is blue." in prompt
    assert "Water is wet." in prompt
    assert "What color is the sky?" in prompt

This is boring, obvious advice — but I've seen codebases where the prompt builder had a subtle bug that dropped context beyond a certain length, and nobody caught it because every test mocked the LLM call and never inspected the actual prompt being sent. Don't do that.
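A regression test for that kind of bug can be small. The sketch below repeats a builder of the same shape so it runs on its own; `missing_docs` and the document sizes are illustrative choices, not something from a real codebase.

```python
def build_prompt(user_query: str, context_docs: list[str]) -> str:
    # Same shape as the builder above, repeated so this example is self-contained.
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer the question using only the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )


def missing_docs(prompt: str, context_docs: list[str]) -> list[str]:
    """Return every context document that does not appear verbatim in the prompt."""
    return [doc for doc in context_docs if doc not in prompt]


def test_build_prompt_keeps_long_context():
    # 500 distinct documents of about 2,000 characters each: large enough
    # to trip most accidental truncation limits.
    docs = [f"doc-{i}: " + "x" * 2000 for i in range(500)]
    prompt = build_prompt("What is in doc-499?", docs)
    assert missing_docs(prompt, docs) == []
    # The question must survive too, and must come after the context.
    assert prompt.rfind("What is in doc-499?") > prompt.rfind("doc-499: ")
```

If your builder truncates on purpose (to fit a token budget, say), the same helper can assert exactly which documents were dropped, so truncation is a tested decision and not an accident.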

Mock the LLM for unit tests — but carefully

For unit tests, mock the LLM. Don't fight this. Your CI pipeline shouldn't be making OpenAI API calls on every push: it's slow, it costs money, and it introduces flakiness that has nothing to do with your code.

But what you mock matters. Most teams mock at the wrong level.

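from unittest.mock import MagicMock, patch

# avoid: patching the whole client call and asserting only on the result.
# Nothing here checks what was actually sent to the model.
# (openai.ChatCompletion is the pre-1.0 SDK interface; point the patch
# at whatever client your code actually imports.)
mock_response = MagicMock()
with patch("openai.ChatCompletion.create", return_value=mock_response):
    result = my_feature(user_input)

# better: mock at the boundary your code controls, so the code on either
# side of the call can be tested directly with hand-written model output
def test_parser_handles_markdown_code_blocks():
    raw_llm_output = """Sure! Here's the JSON:\n```json\n{"action": "approve"}\n```"""
    result = parse_llm_response(raw_llm_output)
    assert result == {"action": "approve"}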
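To check what would have been sent to the model, one option is to pass the model call into your feature as a plain function and swap in a recording fake during tests. This is a sketch under assumptions: `answer_question`, `CompleteFn`, and `FakeCompletion` are hypothetical names, and your real feature will have a different signature.

```python
from typing import Callable

# Hypothetical names throughout: `answer_question` stands in for your feature,
# and `complete` is whatever thin wrapper your code uses to call the model.
CompleteFn = Callable[[str], str]


def answer_question(user_query: str, context_docs: list[str], complete: CompleteFn) -> str:
    context = "\n".join(f"- {doc}" for doc in context_docs)
    prompt = f"Context:\n{context}\n\nQuestion: {user_query}"
    return complete(prompt).strip()


class FakeCompletion:
    """Test double that records every prompt and returns a canned reply."""

    def __init__(self, reply: str):
        self.reply = reply
        self.prompts: list[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.reply


def test_feature_sends_context_and_question():
    fake = FakeCompletion("  The sky is blue.\n")
    result = answer_question("What color is the sky?", ["The sky is blue."], fake)

    # Output handling: whitespace from the model is stripped.
    assert result == "The sky is blue."
    # Input construction: exactly one call, carrying both context and question.
    assert len(fake.prompts) == 1
    assert "- The sky is blue." in fake.prompts[0]
    assert "Question: What color is the sky?" in fake.prompts[0]
```

Because the fake is an ordinary callable, there is no patching by import path, and the test keeps working if you switch SDKs or providers.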

Test your output parser against the full range of outputs the model might plausibly return: JSON wrapped in markdown fences, JSON with a preamble sentence, JSON with trailing commas (yes, models do this), refusals, empty strings. You don't need the model to generate these — write them by hand. This is where a lot of silent production failures live.
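Here is one way those hand-written cases could look, together with a tolerant parser to run them against. Treat it as a sketch: this `parse_llm_response` is an assumed implementation, not the one from your codebase, and the trailing-comma cleanup is deliberately naive (it would also rewrite a comma followed by a closing brace inside a string value).

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # built this way only to avoid a literal fence inside this example
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)
TRAILING_COMMA_RE = re.compile(r",\s*([}\]])")


def parse_llm_response(raw: str) -> Optional[dict]:
    """Best-effort extraction of one JSON object; returns None if none is found."""
    fenced = FENCE_RE.search(raw)
    candidate = fenced.group(1) if fenced else raw
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    snippet = TRAILING_COMMA_RE.sub(r"\1", candidate[start : end + 1])
    try:
        parsed = json.loads(snippet)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Hand-written outputs covering the shapes listed above.
CASES = [
    ("Sure! Here's the JSON:\n" + FENCE + 'json\n{"action": "approve"}\n' + FENCE,
     {"action": "approve"}),
    ('Here you go: {"action": "approve"}', {"action": "approve"}),
    ('{"action": "approve",}', {"action": "approve"}),
    ("I'm sorry, I can't help with that.", None),
    ("", None),
]


def test_parser_handles_messy_outputs():
    for raw, expected in CASES:
        assert parse_llm_response(raw) == expected, raw
```

Returning `None` for refusals and empty strings forces the caller to handle the "no usable output" path explicitly, which is usually where the silent failures hide.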

Evals are your integration tests

Once you've unit-tested the deterministic parts, you need a way to assert something meaningful about actual model outputs. This is what the ML world calls evals — evaluation pipelines that run a set of example inputs through your system and score the outputs against some criteria.

The simplest eval is a dataset of (input, expected_property) pairs:

EVAL_CASES = [
    {
        "input": "Summarize this contract clause: The licensor grants...",
        "must_contain": ["licensor", "grant"],
        "must_not_contain": ["I cannot", "I'm unable"],
        "max_words": 100,
    },
    # ...
]

def run_evals(cases, model_fn):
    """Run each case through model_fn and record whether every property holds."""
    results = []
    for case in cases:
        output = model_fn(case["input"])
        # Compare case-insensitively so "Licensor" still satisfies "licensor".
        text = output.lower()
        passed = True
        if any(term.lower() not in text for term in case.get("must_contain", [])):
            passed = False
        if any(term.lower() in text for term in case.get("must_not_contain", [])):
            passed = False
        if "max_words" in case and len(output.split()) > case["max_words"]:
            passed = False
        results.append({"case": case["input"][:50], "passed": passed, "output": output})
    return results

This is unsophisticated — and that's the point. Start here. You can layer in LLM-as-judge scoring later. The goal right now is to build a repeatable, automated signal about whether your prompts are regressing.

Note the assertions here aren't "did the model return exactly X" — they're behavioral properties: does the response stay on topic, does it avoid refusals on valid inputs, is it within a reasonable length. This is the mental shift that makes LLM testing tractable.

Running evals in CI

Evals belong in CI — just not necessarily on every push. They're slower and cost real money, so the right pattern is usually:

  • On every push: unit tests with mocked LLM (fast, free)
  • On PR merge or nightly: eval suite against real models (slower, small cost, high signal)

Here's a minimal GitHub Actions setup:

name: LLM evals

on:
  schedule:
    - cron: '0 3 * * *'  # nightly at 03:00 UTC (GitHub Actions cron runs in UTC)
  workflow_dispatch:      # manual trigger for prompt changes

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python -m pytest evals/ -v --tb=short

      - name: Upload eval results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval_results.json

The workflow_dispatch trigger is underrated here. When a prompt engineer or product manager changes a system prompt, they should be able to manually kick off the eval suite and see results before merging. That feedback loop is the difference between "we tested the prompt" and "we eyeballed the prompt".
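One gap worth closing: the workflow uploads eval_results.json, so something in the eval run has to write that file. A minimal sketch, assuming results shaped like the output of `run_evals`; the helper names and the 0.9 threshold are placeholders, not recommendations.

```python
import json
from pathlib import Path


def summarize(results: list[dict]) -> dict:
    """Collapse per-case results into a small report."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "failures": [r for r in results if not r["passed"]],
    }


def write_report(results: list[dict], path: str = "eval_results.json") -> dict:
    report = summarize(results)
    Path(path).write_text(json.dumps(report, indent=2), encoding="utf-8")
    return report


def check_eval_suite(
    results: list[dict],
    min_pass_rate: float = 0.9,  # hypothetical threshold; tune it to your suite
    path: str = "eval_results.json",
) -> None:
    """Write the report first, then assert, so the artifact exists on red runs too."""
    report = write_report(results, path)
    assert report["pass_rate"] >= min_pass_rate, (
        f"{len(report['failures'])} of {report['total']} eval cases failed"
    )
```

Calling `check_eval_suite(run_evals(EVAL_CASES, model_fn))` from a pytest test gives you a failing job and a downloadable report from the same run, which is what the `if: always()` upload step is there for.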

Prompt regression testing

Prompts change — and every prompt change is a potential regression. Treat your prompts like code: version them, diff them, and run evals when they change.

The simplest approach is storing prompts in files alongside your code and running targeted evals in CI when those files change:

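# Needs the previous commit: set fetch-depth: 2 on actions/checkout,
# otherwise HEAD~1 does not exist in the default shallow clone and
# this check silently reports "no change".
- name: Check if prompts changed
  id: prompt_diff
  run: |
    if git diff --name-only HEAD~1 HEAD | grep -q 'prompts/'; then
      echo "changed=true" >> "$GITHUB_OUTPUT"
    fi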

- name: Run prompt evals
  if: steps.prompt_diff.outputs.changed == 'true'
  run: python -m pytest evals/prompt_regression/ -v

This is table stakes. A prompt is a dependency. When it changes, you run tests. Simple.
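Versioning pays off most when eval results record which prompt produced them. One lightweight way to do that is to fingerprint the prompt files and store the fingerprint next to each eval run. This sketch assumes prompts live as `.txt` files under one directory; both function names are made up for illustration.

```python
import hashlib
from pathlib import Path


def prompt_fingerprint(prompt_dir: str) -> dict[str, str]:
    """Map each prompt file (relative path) to a short SHA-256 digest of its contents."""
    root = Path(prompt_dir)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()[:12]
        for path in sorted(root.rglob("*.txt"))
    }


def changed_prompts(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Names of prompts that were added, removed, or edited between two fingerprints."""
    names = before.keys() | after.keys()
    return sorted(name for name in names if before.get(name) != after.get(name))
```

Comparing the fingerprint of last night's run with today's tells you whether a score change coincides with a prompt edit or with something else, such as a model update on the provider's side.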

The flakiness problem

Here's the thing nobody wants to admit: even with all of this, LLM-backed tests are flakier than deterministic tests. The same eval case might pass 19 out of 20 runs with a temperature > 0. That's not your CI being broken — that's the nature of sampling from a probability distribution.

Your options:

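  1. Run evals at temperature 0 when the model supports it. Not always representative of production, and hosted models are not guaranteed to be fully deterministic even then, but dramatically more stable.
  2. Use a pass-rate threshold: run each eval case k times and pass it only if the success rate clears a threshold. (This is often loosely called pass@k, although pass@k strictly means at least one of k samples succeeds.) More expensive, more honest.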
  3. Treat eval flakiness as signal, not noise. If a case fails 30% of the time, that's a real problem with your prompt or parser.
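Options 2 and 3 fit in a few lines. The sketch below is one possible shape, with `pass_rate`, `classify`, and the 0.9 threshold all chosen for illustration. The three-way verdict keeps unstable cases visible even when you decide not to block on them.

```python
from typing import Callable


def pass_rate(check: Callable[[], bool], k: int = 10) -> float:
    """Run one eval check k times and return the fraction of runs that passed."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for _ in range(k) if check()) / k


def classify(rate: float, threshold: float = 0.9) -> str:
    """Three-way verdict: clears the bar, never passes, or somewhere in between."""
    if rate >= threshold:
        return "pass"
    return "unstable" if rate > 0 else "fail"


# Usage with a deterministic stand-in for a sampled model: 4 passes in 5 runs.
outcomes = iter([True, True, False, True, True])
rate = pass_rate(lambda: next(outcomes), k=5)
print(rate, classify(rate))  # 0.8 unstable
```

Record the raw rate as well as the verdict. A case that drifts from 1.0 to 0.9 still passes here, but that drift is exactly the signal option 3 is about.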

The worst thing you can do is retry the flaky eval until it passes. That's not a green test suite — it's a polite lie. If your eval suite is flaky, invest in figuring out why before you start ignoring it.

This is the same discipline that applies to any flaky test: track failure rate over time, identify which cases are consistently unstable, and fix the root cause. The tools you'd use to track flaky unit tests — failure history, test analytics — are directly applicable here. An eval that's red 40% of the time is a flaky test, full stop.
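If you don't have test analytics in place yet, even a flat log of eval runs is enough to get started. A rough sketch, assuming one record per case per run with hypothetical `case` and `passed` keys; the 5% and 95% cutoffs are arbitrary starting points.

```python
from collections import defaultdict


def failure_rates(history: list[dict]) -> dict[str, float]:
    """history holds one record per eval run per case, e.g. {"case": "refund-01", "passed": True}."""
    runs: dict[str, int] = defaultdict(int)
    fails: dict[str, int] = defaultdict(int)
    for record in history:
        runs[record["case"]] += 1
        if not record["passed"]:
            fails[record["case"]] += 1
    return {case: fails[case] / runs[case] for case in runs}


def unstable_cases(history: list[dict], low: float = 0.05, high: float = 0.95) -> list[str]:
    """Cases that neither reliably pass nor reliably fail, highest failure rate first."""
    rates = failure_rates(history)
    flaky = [case for case, rate in rates.items() if low <= rate <= high]
    return sorted(flaky, key=lambda case: rates[case], reverse=True)
```

Cases that fail every time are excluded on purpose: those are ordinary regressions, and they need a fix, not a flakiness investigation.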

LLM-as-judge: powerful, but earn it

For sufficiently complex outputs — long-form content, multi-step reasoning, nuanced tone — heuristic checks aren't enough. This is where LLM-as-judge comes in: you use a separate model call to evaluate the quality of your primary model's output.

# Assumes the template is filled with JUDGE_PROMPT.format(issue=..., response=...),
# so the literal JSON braces are doubled to survive str.format.
JUDGE_PROMPT = """
You are evaluating a customer support response.
Rate the following response on a scale of 1-5 for each criterion:
1. Accuracy: Does it correctly address the customer's issue?
2. Tone: Is it professional and empathetic?
3. Completeness: Does it fully resolve or escalate the issue?

Respond with JSON only: {{"accuracy": N, "tone": N, "completeness": N}}

Customer issue: {issue}
Response to evaluate: {response}
"""

LLM-as-judge is genuinely useful, but don't reach for it first. It adds cost, latency, and its own reliability concerns — the judge model can be wrong, biased toward longer responses, or inconsistent across runs. Use heuristic evals for 80% of your cases and reserve LLM-as-judge for the cases where heuristics genuinely can't capture what matters.
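Since the judge can be wrong or inconsistent, it helps to treat its reply like any other untrusted model output: validate it, and average over a few runs. The sketch below assumes the judge is reachable through a plain `judge(prompt) -> str` callable and uses a shortened prompt; every name in it is illustrative.

```python
import json
from statistics import mean
from typing import Callable, Optional

CRITERIA = ("accuracy", "tone", "completeness")


def parse_judge_scores(raw: str) -> Optional[dict[str, int]]:
    """Return validated 1-5 scores, or None if the judge reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    scores = {}
    for name in CRITERIA:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            return None
        scores[name] = value
    return scores


def judge_response(
    issue: str,
    response: str,
    judge: Callable[[str], str],
    runs: int = 3,
) -> Optional[dict[str, float]]:
    """Ask the judge several times and average, to dampen run-to-run inconsistency."""
    # Shortened prompt for the example; a real one would carry the full rubric.
    prompt = f"Customer issue: {issue}\nResponse to evaluate: {response}"
    parsed = [parse_judge_scores(judge(prompt)) for _ in range(runs)]
    valid = [scores for scores in parsed if scores is not None]
    if not valid:
        return None
    return {name: mean(scores[name] for scores in valid) for name in CRITERIA}
```

A `None` result means the judge itself failed, which is worth counting separately from a low score: it tells you about the evaluator, not about the response being evaluated.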

What a mature testing pyramid looks like

Put it all together and you get a testing pyramid that should feel familiar:

  • Unit tests (fast, free, every push): test prompt builders, output parsers, routing logic — all deterministic code, LLM mocked
  • Eval suite (moderate cost, nightly or on prompt change): behavioral assertions over (input, output) pairs against real models
  • LLM-as-judge (higher cost, targeted): quality scoring for complex outputs where heuristics fall short
  • Human review (highest cost, highest signal): spot-checks on a sample of real production outputs, feeding back into your eval dataset

The human review loop is the one teams skip — and it's the most important one. Your eval dataset is only as good as your coverage of real failure modes, and real failure modes come from production. Build a process to regularly review a sample of live outputs and add the interesting failures to your eval suite. That's how the system improves over time.
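The mechanics of that loop can be very plain. As a sketch only: the functions below assume production records are dicts with an `input` key and that the eval dataset is a JSON Lines file, neither of which comes from a specific tool.

```python
import json
import random
from pathlib import Path


def sample_for_review(records: list[dict], n: int, seed: int) -> list[dict]:
    """Pick n production records uniformly at random; the seed makes the pick reproducible."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


def add_to_eval_dataset(record: dict, properties: dict, path: str) -> None:
    """Append one reviewed failure to a JSON Lines eval file, one case per line."""
    case = {"input": record["input"], **properties}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
```

The reviewer supplies `properties` in the same vocabulary the eval suite already uses (for example `must_not_contain`), so each production failure becomes a permanent regression case.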

The teams that ship reliable LLM features aren't the ones who found a magic testing framework. They're the ones who applied the same engineering discipline they'd bring to any complex system: clear layers, automated checks at each layer, and honest signal about where things break.