6 min read

Setting up CI for prompt and eval regression testing

A prompt change that tanks quality is the silent killer of LLM features. Here's how to run evals in CI so every PR gets a score — and failing scores block the merge.

BuildPulse Team

June 8, 2026


The problem with "it looked fine in my testing"

You've probably seen this movie. An engineer tweaks a system prompt — tightens the wording, swaps "helpful assistant" for "expert advisor", cuts some tokens — ships it, and two days later someone notices the model started hallucinating product names. Nobody caught it because nobody had a test for it. The PR looked clean. The vibes were good.

This is the defining CI gap for teams shipping LLM features right now. Traditional software has deterministic outputs: you change a function, you run the tests, red means broken. With LLMs the outputs are probabilistic, and quality lives in a distribution. You can't assertEqual your way out of this. What you can do is score outputs against a rubric and gate the merge on those scores — the same way you'd gate on test coverage or a passing test suite.

That's what this post is about: wiring evals into CI so a prompt change can't silently regress quality.

What an eval pipeline actually needs to do

Before writing any YAML, get clear on the job:

1. Run a fixed set of inputs through the prompt under test.
2. Score each output against some rubric (another LLM, a heuristic, a human-labeled reference).
3. Aggregate scores into pass/fail thresholds.
4. Report results back to the PR so reviewers see them without hunting through logs.
5. Stay cheap enough that you'd actually run it on every PR.

Items 1–3 are the eval framework's job. Items 4–5 are CI's job. Most teams nail the framework and completely ignore the CI integration — and then the evals only run manually, which means they rarely run at all.
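
As a rough illustration of the run, score, and aggregate steps, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: fake_model stands in for a real LLM call, the scorers are simple heuristics rather than a judge model, and names such as EvalCase and run_suite are hypothetical, not part of any framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    user_message: str
    scorer: Callable[[str], float]  # returns a score in [0, 1]
    threshold: float                # minimum score for this case to pass


def fake_model(prompt: str, user_message: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    if "joke" in user_message.lower():
        return "I can only help with support questions."
    return "Go to Settings, choose Security, then click Reset password."


def run_suite(prompt: str, cases: list[EvalCase], min_pass_rate: float) -> dict:
    results = []
    for case in cases:
        output = fake_model(prompt, case.user_message)  # run the fixed input
        score = case.scorer(output)                     # score against the rubric
        results.append(
            {"name": case.name, "score": score, "passed": score >= case.threshold}
        )
    # Aggregate per-case results into a single pass/fail decision.
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return {"results": results, "pass_rate": pass_rate, "ok": pass_rate >= min_pass_rate}


CASES = [
    EvalCase("stays on topic", "Tell me a joke",
             lambda out: 1.0 if "support" in out.lower() else 0.0, 0.9),
    EvalCase("concise", "How do I reset my password?",
             lambda out: 1.0 if len(out.split()) < 150 else 0.0, 1.0),
]

if __name__ == "__main__":
    report = run_suite("You are a support agent.", CASES, min_pass_rate=1.0)
    for r in report["results"]:
        print(f'{r["name"]}: {r["score"]:.2f} {"PASS" if r["passed"] else "FAIL"}')
    print("gate:", "ok" if report["ok"] else "blocked")
```

A real framework replaces fake_model with provider calls and the lambdas with rubric graders, but the shape of the loop stays the same.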

Picking an eval framework

I'm not going to tell you which framework to use — that depends on your stack — but a few are worth knowing:

promptfoo — YAML-driven, batteries included, great CI integration, outputs JUnit XML natively.
Braintrust — hosted evals with a nice SDK, good for teams that want a managed eval store.
RAGAS — specifically for RAG pipelines, scores faithfulness, answer relevancy, context recall.
Roll your own — if your eval logic is domain-specific enough, a Python script that calls your LLM and scores outputs is totally legitimate.

For the rest of this post I'll use promptfoo for examples because it speaks YAML and outputs JUnit XML, which plugs cleanly into any CI system.

A minimal eval suite

Start with a promptfooconfig.yaml at the root of your repo:

# promptfooconfig.yaml
prompts:
  - file://prompts/support_agent.txt

providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0

tests:
  - description: Doesn't hallucinate product names
    vars:
      user_message: "What versions of Acme Widget are available?"
    assert:
      - type: llm-rubric
        value: "Response only mentions product versions explicitly listed in the context. Does not invent version numbers."
        threshold: 0.8

  - description: Stays on topic
    vars:
      user_message: "Tell me a joke"
    assert:
      - type: llm-rubric
        value: "Response politely declines and redirects to support topics."
        threshold: 0.9

  - description: Concise response
    vars:
      user_message: "How do I reset my password?"
    assert:
      - type: javascript
        value: "output.split(' ').length < 150"

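The llm-rubric assertions use a judge model to score the output. By default promptfoo picks an OpenAI model for this when OPENAI_API_KEY is set; which model that is depends on the promptfoo version, so set the grader explicitly if you need scores to stay comparable across upgrades. The threshold is a 0–1 score; fall below it and the test fails. The javascript assertion is a cheap heuristic: no LLM call, no cost, instant.

Setting temperature: 0 is non-negotiable in CI. It reduces run-to-run variation without eliminating it. Determinism is already hard enough with LLMs; don't make it worse.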

Wiring it into GitHub Actions

# .github/workflows/eval.yml
name: Prompt evals

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Cache promptfoo results
        uses: actions/cache@v4
        with:
          path: ~/.promptfoo/cache
          key: promptfoo-${{ hashFiles('promptfooconfig.yaml', 'prompts/**') }}
          restore-keys: promptfoo-

      - name: Install promptfoo
        run: npm install -g promptfoo@latest

      - name: Run evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          promptfoo eval \
            --output results.json \
            --output junit-results.xml \
            --max-concurrency 4

      - name: Publish test results
        uses: mikepenz/action-junit-report@v4
        if: always()
        with:
          report_paths: junit-results.xml
          check_name: Prompt eval results
          fail_on_failure: true

      - name: Upload full results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-results
          path: results.json

A few things worth calling out:

The paths filter is important. Only run evals when prompts or the eval config actually change. There's no reason to spend money on eval runs when someone updates a README.

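fail_on_failure: true on the JUnit report step makes the check fail whenever an eval falls below its threshold. The check appears on the PR, but it only blocks the merge if you mark it as required in the branch protection rules for the target branch. The job's token also needs permission to create checks (checks: write) for the report to show up.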

--max-concurrency 4 throttles parallel LLM calls. Don't set this to 20 and then complain about rate limits — or a $40 CI bill.
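
To make the effect of a concurrency cap concrete, here is a small sketch using asyncio. It is an illustration of the general technique (a semaphore bounding in-flight requests), not a description of how promptfoo implements its flag. call_model is a hypothetical stand-in that sleeps instead of calling an API.

```python
import asyncio


async def call_model(case_id: int, gate: asyncio.Semaphore, stats: dict) -> str:
    """Hypothetical stand-in for one LLM request."""
    async with gate:  # at most `limit` coroutines are inside this block at once
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulated network latency
        stats["in_flight"] -= 1
        return f"output-{case_id}"


async def run_all(n_cases: int, limit: int) -> tuple[list[str], int]:
    gate = asyncio.Semaphore(limit)
    stats = {"in_flight": 0, "peak": 0}
    outputs = await asyncio.gather(
        *(call_model(i, gate, stats) for i in range(n_cases))
    )
    return list(outputs), stats["peak"]


if __name__ == "__main__":
    outputs, peak = asyncio.run(run_all(n_cases=20, limit=4))
    print(f"{len(outputs)} calls finished, peak concurrency {peak}")
```

All 20 cases still run; the cap only controls how many requests hit the provider at the same moment, which is what keeps you under rate limits.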

Managing cost and latency

Running LLM evals on every PR sounds expensive. It doesn't have to be.

Cache aggressively. promptfoo caches responses by default, keyed on the prompt + input + model. The cache path above means identical test cases that haven't changed are served from the cache instead of calling the API again. If someone edits one prompt file, only tests that reference that prompt trigger fresh calls. On a mature eval suite this alone can cut cost by 60–80%.
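
The idea behind a cache keyed on prompt + input + model can be sketched in a few lines. This is a generic content-addressed cache written for illustration, assuming a SHA-256 digest as the key; it is not promptfoo's actual implementation, and ResponseCache and cache_key are hypothetical names.

```python
import hashlib
import json


def cache_key(prompt: str, variables: dict, model: str) -> str:
    """Stable digest of everything that determines the model's response."""
    payload = json.dumps(
        {"prompt": prompt, "vars": variables, "model": model},
        sort_keys=True,  # key order must not change the digest
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.misses = 0

    def get_or_call(self, prompt, variables, model, call):
        key = cache_key(prompt, variables, model)
        if key not in self._store:
            self.misses += 1  # only a miss costs an API call
            self._store[key] = call(prompt, variables, model)
        return self._store[key]


if __name__ == "__main__":
    cache = ResponseCache()
    fake_call = lambda prompt, variables, model: f"reply to {variables['user_message']}"
    question = {"user_message": "How do I reset my password?"}
    cache.get_or_call("v1 prompt", question, "some-model", fake_call)
    cache.get_or_call("v1 prompt", question, "some-model", fake_call)  # cache hit
    cache.get_or_call("v2 prompt", question, "some-model", fake_call)  # prompt edited
    print("API calls made:", cache.misses)
```

Changing any one of the three inputs produces a new key, which is why editing a single prompt file invalidates only the cases that use it.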

Use a cheaper judge model for unit-level evals. GPT-4o-mini as the judge model instead of GPT-4o costs more than 10x less per token at OpenAI's list prices (check current pricing for the exact ratio) and is often good enough for rubric scoring on focused assertions. Reserve the expensive model for your production prompt, not the judge.

Keep the fast suite fast. Separate your eval cases into two tiers:

Fast suite (< 60s): heuristic assertions, small rubrics, cheap model. Runs on every PR.
Full suite: comprehensive rubrics, expensive model, 30+ test cases. Runs nightly or on merge to main.

In GitHub Actions you can express this with a matrix or separate workflow files. The key insight is that a 30-second eval suite that catches 80% of regressions is infinitely more useful than a 10-minute suite that nobody wants to wait for.

Budget caps. If you're using OpenAI, set a usage limit on the CI service account key. You want to know if a misconfigured eval is burning through budget, not discover it on your invoice.

Score thresholds as merge gates

This is where teams get squeamish. "What if a legitimate prompt change makes a test fail?" Yes — that's the point. If your change drops an eval score below threshold, you either:

a) Fix the prompt until it passes, or
b) Update the threshold because the old one was wrong, and document why in the PR.

Option (b) is fine. The eval isn't sacred. What's not fine is merging a change that silently drops quality with no record that anyone noticed.

Set thresholds conservatively at first. If your baseline score on a rubric is 0.85, set the gate at 0.75: you want signal, not noise. Tighten over time as you build confidence.
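
One way to make that rule of thumb repeatable is to derive the gate from a handful of baseline runs. The sketch below assumes you have collected rubric scores for the current prompt on main; suggest_gate and the 0.10 margin are hypothetical choices for illustration, not a recommendation from any framework.

```python
from statistics import mean


def suggest_gate(baseline_scores: list[float], margin: float = 0.10) -> float:
    """Return a permissive starting threshold: baseline mean minus a margin."""
    if not baseline_scores:
        raise ValueError("need at least one baseline score")
    return round(max(0.0, mean(baseline_scores) - margin), 2)


if __name__ == "__main__":
    scores_on_main = [0.85, 0.90, 0.80]  # rubric scores from repeated baseline runs
    print("suggested gate:", suggest_gate(scores_on_main))
```

Shrinking the margin over time is the "tighten as you build confidence" step expressed as a single number in review.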

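For aggregate gates ("at least 90% of test cases must pass"), promptfoo documents a PROMPTFOO_PASS_RATE_THRESHOLD environment variable that sets the minimum pass rate as a percentage. Run promptfoo eval --help to confirm what your installed version supports. Pair that with the JUnit output and you get clean pass/fail at the job level.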
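
If you would rather enforce the aggregate rule in your own script, a small checker over results.json is enough. This sketch assumes the same .results.stats.successes and .results.stats.failures fields that the jq commands in this post read; the script name, the default path, and the 0.90 minimum are hypothetical.

```python
import json
import sys
from pathlib import Path


def pass_rate(results: dict) -> float:
    """Fraction of test cases that passed, from the stats block."""
    stats = results["results"]["stats"]
    total = stats["successes"] + stats["failures"]
    return stats["successes"] / total if total else 0.0


def main(path: str = "results.json", minimum: float = 0.90) -> int:
    rate = pass_rate(json.loads(Path(path).read_text(encoding="utf-8")))
    print(f"pass rate {rate:.0%} (minimum {minimum:.0%})")
    return 0 if rate >= minimum else 1  # non-zero exit fails the CI step


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:2]))
```

Run it as a step after the eval; the exit code is what the CI job gates on.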

Reporting results back to the PR

The JUnit report action above adds a check to the PR with individual test names and failure messages, which is good for answering "which specific eval failed?". For a one-line summary in the conversation itself, pull the counts out of results.json with jq and post them as a comment using the GitHub CLI:

      - name: Post eval summary as PR comment
        # Assumes the job's GITHUB_TOKEN has pull-requests: write permission.
        if: always() && github.event_name == 'pull_request'
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Read pass/fail counts from promptfoo's JSON output.
          PASS=$(jq -r '.results.stats.successes' results.json)
          FAIL=$(jq -r '.results.stats.failures' results.json)
          # Derive the total instead of relying on a dedicated stats field.
          TOTAL=$((PASS + FAIL))
          gh pr comment ${{ github.event.pull_request.number }} \
            --body "**Prompt eval results:** $PASS/$TOTAL passed, $FAIL failed. See the eval-results artifact on the workflow run for full output."

Simple. Reviewers see the score without leaving the PR. If you want to go further, you can post a table of per-test scores using jq to pull from results.json — but a summary line is usually enough to answer "did something break?".
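
If you do want the per-test table, a short Python script is an alternative to a long jq expression. The field names used here (results.results as the list of cases, plus testCase.description, score, and success on each entry) are assumptions about the layout of results.json; check them against a file produced by your promptfoo version before relying on this.

```python
import json
from pathlib import Path


def summary_table(results: dict) -> str:
    """Build a Markdown table with one row per eval case."""
    rows = ["| Test | Score | Result |", "| --- | --- | --- |"]
    for item in results["results"]["results"]:  # assumed location of per-case results
        name = item.get("testCase", {}).get("description", "(unnamed)")
        score = item.get("score")
        shown = f"{score:.2f}" if isinstance(score, (int, float)) else "n/a"
        status = "pass" if item.get("success") else "FAIL"
        rows.append(f"| {name} | {shown} | {status} |")
    return "\n".join(rows)


if __name__ == "__main__":
    data = json.loads(Path("results.json").read_text(encoding="utf-8"))
    print(summary_table(data))
```

The printed table can be written to a file and passed to gh pr comment with --body-file.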

If you're already using BuildPulse for test analytics, the JUnit XML output from promptfoo plugs into the same pipeline as your regular test results. Flaky evals — cases that pass sometimes and fail sometimes due to LLM non-determinism — show up in the flaky test dashboard. That's actually useful signal: if an eval is flaky at temperature 0, your rubric is probably too vague.
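
The definition of a flaky eval used above (passes sometimes, fails sometimes on unchanged inputs) is easy to check yourself on a small scale. This sketch assumes you have recorded pass/fail per case name across several runs of the same commit; find_flaky is a hypothetical helper, not part of BuildPulse or promptfoo.

```python
from collections import defaultdict


def find_flaky(runs: list[dict[str, bool]]) -> dict[str, float]:
    """Map each case with mixed outcomes to its pass rate across runs."""
    outcomes: dict[str, list[bool]] = defaultdict(list)
    for run in runs:
        for name, passed in run.items():
            outcomes[name].append(passed)
    return {
        name: sum(seen) / len(seen)
        for name, seen in outcomes.items()
        if len(set(seen)) > 1  # saw both a pass and a fail
    }


if __name__ == "__main__":
    runs = [
        {"stays on topic": True, "concise": True},
        {"stays on topic": False, "concise": True},
        {"stays on topic": True, "concise": True},
    ]
    for name, rate in find_flaky(runs).items():
        print(f"{name}: passed {rate:.0%} of runs, consider tightening the rubric")
```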

The prompts-as-code mindset

All of this only works if your prompts live in version control alongside the code. No prompts edited directly in a playground and copy-pasted into production. No prompt management systems that bypass the repo.

This sounds obvious until you watch a team ship a prompt change through their "prompt management UI" and wonder why the evals didn't catch anything. The eval only runs if the change goes through the PR. The change only goes through the PR if prompts are code.

Store prompts as plain .txt or .md files in a prompts/ directory. Reference them by path in your eval config. Treat a prompt change like a code change — because it is one.
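
On the application side, the simplest way to keep production and evals in sync is to have both read the same file. Here is a minimal sketch of a loader, assuming prompts use {{ variable }} placeholders like the user_message variable in the eval config; load_prompt and its regex-based substitution are hypothetical and far simpler than a real template engine.

```python
import re
from pathlib import Path

# Matches {{ name }} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def load_prompt(path: str, **variables: str) -> str:
    """Read a prompt file from the repo and fill in its placeholders."""
    template = Path(path).read_text(encoding="utf-8")

    def fill(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing prompt variable: {name}")
        return str(variables[name])

    return _PLACEHOLDER.sub(fill, template)


if __name__ == "__main__":
    # Assumes prompts/support_agent.txt exists, as in the eval config.
    print(load_prompt("prompts/support_agent.txt", user_message="How do I reset my password?"))
```

Because the file path is the single source of truth, any edit to the prompt shows up in a diff and triggers the paths filter on the eval workflow.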

Where to start

If you have zero evals today, don't try to boil the ocean. Pick the two or three behaviors that, if they broke, would cause an incident. Write rubric assertions for those. Wire it into CI with a permissive threshold. Watch it run for a few weeks. Tighten the threshold as you build confidence.

The hardest part isn't the tooling — promptfoo makes this genuinely straightforward. The hardest part is the discipline to treat prompt changes as changes that need verification, not just vibes.

Your LLM feature has a test suite now. It just doesn't know it yet.
