Testing RAG pipelines: what actually breaks

RAG systems break in two very different places — retrieval and generation — and treating them as one testable unit is how you end up debugging blindly in production.

BuildPulse Team

June 12, 2026

The lie that RAG is one thing

I've watched teams ship a RAG feature, get great demo results, and then spend three weeks playing whack-a-mole with production bugs they couldn't reproduce locally. The culprit wasn't the LLM. It wasn't the prompt. It was the retrieval layer returning subtly wrong chunks — and nobody had written a single test for it.

RAG is not one system. It's two systems bolted together: a retrieval system (search, chunking, embeddings, vector store) and a generation system (the LLM plus your prompt template). They fail independently, they fail differently, and they need different tests. If your entire evaluation strategy is "I asked it a question and the answer seemed right," you're going to get burned.

Let's break down what actually breaks, and what to do about it.

Stage 1: Retrieval — the silent killer

Retrieved context that's wrong or incomplete will poison every answer downstream. The LLM is just the last step — garbage in, hallucination out.

The bugs that bite

Stale chunks. Your knowledge base got updated. Your vector index did not. A user asks about your current refund policy; the retriever surfaces a chunk from six months ago. The LLM confidently answers with the old policy. No error, no warning — just wrong.

Bad chunking. You split documents every 512 tokens. Sounds fine until a critical answer requires context that spans chunk 3 and chunk 4, and your retriever returns chunk 3 alone. The model hallucinates the rest. This is especially vicious with tables, numbered lists, and code examples that span chunk boundaries.

Embedding drift. You update your embedding model. All your new chunks are encoded with text-embedding-3-large; old chunks are still encoded with text-embedding-ada-002. The cosine distances are now comparing apples to bowling balls. Retrieval degrades silently, and you notice three weeks later when support tickets spike.

Wrong top-k cutoff. With k=5, everything looked fine. You bump it to k=10 to "give the model more context," and now irrelevant chunks crowd out the relevant ones. Your precision tanks.
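
Embedding drift in particular is easy to guard against mechanically, provided every stored chunk records which model produced its vector. Here's a minimal sketch of that check. The `StoredChunk` record and its `embedding_model` field are assumptions for illustration; your vector store will expose metadata in its own shape.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredChunk:
    # Hypothetical record shape; real vector stores expose metadata differently.
    chunk_id: str
    embedding_model: str


def find_mixed_embedding_models(chunks: list[StoredChunk], expected_model: str) -> list[str]:
    """Return IDs of chunks whose vectors were not produced by expected_model."""
    return [c.chunk_id for c in chunks if c.embedding_model != expected_model]


def summarize_models(chunks: list[StoredChunk]) -> Counter:
    """Count chunks per embedding model, handy for a quick index health report."""
    return Counter(c.embedding_model for c in chunks)
```

A CI test can then assert that `find_mixed_embedding_models(...)` returns an empty list for the whole index, so a half-finished re-embedding job fails the build instead of degrading retrieval quietly.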

How to test retrieval

Retrieval tests should be mechanical: no LLM in the loop. Given a query, you know exactly which chunk(s) should come back. Assert that they do.

Build a golden dataset: a set of (query, expected_chunk_ids) pairs. Run your retriever against them in CI. Track hit rate, precision@k, and mean reciprocal rank (MRR).

# tests/test_retrieval.py
import pytest
from myapp.retriever import retrieve

GOLDEN = [
    {
        "query": "What is the current return window for electronics?",
        "must_contain": ["chunk_policy_v3_returns_electronics"],
    },
    {
        "query": "How do I reset my 2FA?",
        "must_contain": ["chunk_security_2fa_reset"],
    },
]

@pytest.mark.parametrize("case", GOLDEN)
def test_retrieval_returns_expected_chunks(case):
    results = retrieve(case["query"], k=5)
    result_ids = [r.chunk_id for r in results]
    for expected_id in case["must_contain"]:
        assert expected_id in result_ids, (
            f"Query '{case['query']}' did not retrieve '{expected_id}'. Got: {result_ids}"
        )

This runs in milliseconds, costs nothing, and will catch stale index bugs, chunking regressions, and embedding model mismatches before they hit prod.
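
The three metrics mentioned earlier (hit rate, precision@k, MRR) are short enough to compute yourself. This is a sketch under simple assumptions: each query's results are a ranked list of chunk IDs, and the golden set gives a set of relevant IDs per query. The function names are mine, not from any particular library.

```python
def hit_rate(retrieved: list[list[str]], expected: list[set[str]]) -> float:
    """Fraction of queries where at least one expected chunk was retrieved."""
    hits = sum(1 for got, want in zip(retrieved, expected) if want & set(got))
    return hits / len(retrieved)


def precision_at_k(got: list[str], want: set[str], k: int) -> float:
    """Fraction of the top-k slots filled with relevant chunks.

    Divides by k even if fewer than k results came back, which is the usual convention.
    """
    return sum(1 for chunk_id in got[:k] if chunk_id in want) / k


def mean_reciprocal_rank(retrieved: list[list[str]], expected: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant chunk per query (0 if none was found)."""
    total = 0.0
    for got, want in zip(retrieved, expected):
        for rank, chunk_id in enumerate(got, start=1):
            if chunk_id in want:
                total += 1.0 / rank
                break
    return total / len(retrieved)
```

Tracking these per commit makes the top-k tradeoff visible: raising k can lift hit rate while precision@k drops.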

For staleness specifically: add a test that re-indexes a known document, then immediately queries for content from the updated version.

def test_index_reflects_updated_document(indexer, retriever):
    indexer.upsert("doc_returns_policy", "Return window is now 60 days for all categories.")
    results = retriever.retrieve("How long is the return window?", k=3)
    top_chunk_text = results[0].text
    assert "60 days" in top_chunk_text

If your index pipeline is async, you'll need to flush or poll — but the test is worth it.
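
For the polling case, a small helper keeps the test from turning into a pile of `sleep` calls. This is a generic sketch; `wait_until` is a name I made up, and the timeout values are placeholders to tune for your pipeline.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout_s: float = 10.0, interval_s: float = 0.2) -> bool:
    """Poll condition() until it returns True or timeout_s elapses.

    Returns True as soon as the condition holds, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

In the staleness test, the final assertion would become something like `assert wait_until(lambda: "60 days" in retriever.retrieve("How long is the return window?", k=3)[0].text)`, which fails after the timeout rather than flaking on indexing lag.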

Stage 2: Generation — where prompt assumptions go to die

Assume retrieval is perfect. The right chunks are in the context window. The LLM can still fail you in a handful of ways.

The bugs that bite

Context overflow. You're running a long conversation plus a system prompt plus five retrieved chunks. At some query complexity, you blow past the model's context limit. Some APIs and frameworks truncate silently; others reject the request with an error. When the trimming is silent, some of your carefully retrieved context just disappeared, and the model fills the gap with a confabulation.

Lost-in-the-middle. There's solid research showing that LLMs pay less attention to context in the middle of a long prompt. If your most relevant chunk is chunk 3 of 7, the model may effectively ignore it. Chunk order matters, and most RAG implementations don't account for it.

Prompt template rot. Someone edits the system prompt to fix one behavior and breaks another. The model starts ignoring instructions to cite sources. Or it stops refusing out-of-scope questions. If you don't have generation tests that assert on specific behaviors, you'll never catch this until users complain.

Hallucination when context is insufficient. The retrieved chunks don't actually answer the question. A well-behaved model should say so. Yours says something plausible-sounding instead.
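
One common mitigation for lost-in-the-middle is to reorder chunks so the most relevant ones sit at the start and end of the context, with the weakest in the middle. Here's a sketch of that reordering, assuming the input list is already sorted most-relevant first. Whether it helps depends on the model, so measure it with your own evals rather than taking it on faith.

```python
def reorder_for_long_context(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at the edges and the least relevant in the middle.

    Input must be sorted most-relevant first. Even-indexed chunks fill the front
    in order; odd-indexed chunks fill the back in reverse order.
    """
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]
```

With five chunks ranked c1 (best) through c5 (worst), the result is c1, c3, c5, c4, c2: the two strongest chunks end up first and last.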

How to test generation

Generation tests are harder because the output is non-deterministic. The answer to this is not to give up — it's to test behaviors rather than exact strings.

For anything mechanical (does the model cite sources? does it refuse out-of-scope questions? does it stay under a token budget?), write deterministic assertions:

def test_refuses_out_of_scope_question(rag_chain):
    response = rag_chain.query(
        "What's the capital of France?",
        context_chunks=[],  # no relevant docs retrieved
    )
    refusal_phrases = ["I don't have information", "outside my knowledge", "can't help with"]
    assert any(phrase.lower() in response.answer.lower() for phrase in refusal_phrases), (
        f"Expected a refusal, got: {response.answer}"
    )

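def generate_chunk(tokens):
    # Hypothetical helper: filler text of roughly `tokens` tokens (one short word per token is an approximation).
    return " ".join(["policy"] * tokens)

def test_context_overflow_does_not_silently_drop_chunks(rag_chain):
    # Construct a payload that approaches the context limit
    big_chunks = [generate_chunk(tokens=800) for _ in range(10)]
    response = rag_chain.query("Summarize the policy.", context_chunks=big_chunks)
    assert response.chunks_used >= 5, "Too many chunks were dropped during context trimming"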
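
A test like the overflow one is only meaningful if trimming is something your code does on purpose. Here's a sketch of an explicit trimming step that reports what it dropped. The `TrimResult` type and `fit_chunks_to_budget` are illustrative names, and `count_tokens` is a crude whitespace stand-in that you'd replace with your model's real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class TrimResult:
    kept: list[str]
    dropped: list[str]


def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace-separated words. Swap in your model's tokenizer.
    return len(text.split())


def fit_chunks_to_budget(chunks: list[str], budget_tokens: int) -> TrimResult:
    """Walk chunks in priority order, keeping each one that still fits the budget.

    A chunk that does not fit is recorded as dropped; later, smaller chunks
    may still be kept if they fit in the remaining budget.
    """
    kept: list[str] = []
    dropped: list[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost <= budget_tokens:
            kept.append(chunk)
            used += cost
        else:
            dropped.append(chunk)
    return TrimResult(kept=kept, dropped=dropped)
```

Because the dropped chunks are returned rather than discarded, you can log them, assert on them in tests, or surface them in traces when an answer looks wrong.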

For quality — "is this answer actually correct?" — you have two good options.

LLM-as-judge: Use a second, inexpensive model call to evaluate whether the response is grounded in the provided context. This is the approach evaluation frameworks such as RAGAS take. It's not free, but it's cheap enough to run on a representative sample in CI.

# Using a lightweight judge model to check groundedness
def test_answer_is_grounded_in_context(rag_chain, judge_llm):
    context = "Refunds are processed within 5-7 business days."
    response = rag_chain.query("How long do refunds take?", context_chunks=[context])
    
    verdict = judge_llm.evaluate(
        claim=response.answer,
        context=context,
        instruction="Does the claim contradict or go beyond the context? Answer YES or NO."
    )
    assert verdict.strip().upper() == "NO"

Snapshot testing: For prompt stability, render the full prompt (system + retrieved chunks + user query) and snapshot it. Any prompt template change will show up as a diff in CI, forcing an intentional review instead of an accidental regression.
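
A snapshot check doesn't need much machinery. This sketch uses a plain file on disk; `render_prompt` is a hypothetical stand-in for whatever function builds your real prompt, and `check_snapshot` is an illustrative helper, not part of any plugin.

```python
from pathlib import Path


def render_prompt(system: str, chunks: list[str], question: str) -> str:
    # Hypothetical template; substitute your real prompt-rendering function.
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {question}"


def check_snapshot(rendered: str, snapshot_path: Path, update: bool = False) -> bool:
    """Return True if rendered matches the stored snapshot.

    Writes the snapshot (and returns True) when it is missing or update=True.
    """
    if update or not snapshot_path.exists():
        snapshot_path.parent.mkdir(parents=True, exist_ok=True)
        snapshot_path.write_text(rendered, encoding="utf-8")
        return True
    return snapshot_path.read_text(encoding="utf-8") == rendered
```

Commit the snapshot file alongside the tests. A prompt edit then fails the check until someone regenerates the snapshot with `update=True`, and the regenerated file shows up in the PR diff for review.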

Putting it in CI

The instinct is to run everything on every PR. That's wrong. LLM-in-the-loop tests are slow and cost money. Structure your test suite in layers:

Layer 1 — every PR, no LLM:

Retrieval golden tests (pytest tests/test_retrieval.py)
Chunking unit tests (correct splits, no boundary truncation)
Prompt template snapshot tests
Index freshness smoke test
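
The chunking unit tests in this layer can be tiny. Here's a sketch using a toy word-based chunker with overlap; `chunk_words` is illustrative, not your production splitter, but the shape of the assertion carries over: pick a phrase that straddles a boundary and check that some chunk still contains it whole.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def some_chunk_contains(chunks: list[str], phrase: str) -> bool:
    """True if at least one chunk contains the phrase in full."""
    return any(phrase in chunk for chunk in chunks)


TEXT = "Returns are accepted within 30 days of delivery for all electronics purchased online"

# Without overlap, "within 30 days" is split across the first two chunks.
assert not some_chunk_contains(chunk_words(TEXT, size=4, overlap=0), "within 30 days")
# With a two-word overlap, one chunk holds the phrase intact.
assert some_chunk_contains(chunk_words(TEXT, size=4, overlap=2), "within 30 days")
```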

Layer 2 — nightly or on main merge, LLM involved:

Groundedness evals on a representative query set
Refusal behavior tests
Regression evals against your golden Q&A dataset

Here's a minimal GitHub Actions setup that separates the two layers:

# .github/workflows/rag-tests.yml
name: RAG tests

on:
  pull_request:
  push:
    branches: [main]  # needed so the generation-evals job can run on merges to main
  schedule:
    - cron: '0 3 * * *'  # nightly at 3am UTC

jobs:
  retrieval-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - run: pytest tests/test_retrieval.py tests/test_chunking.py -v
        env:
          VECTOR_STORE_URL: ${{ secrets.VECTOR_STORE_URL }}

  generation-evals:
    if: github.event_name == 'schedule' || github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      # --timeout comes from the pytest-timeout plugin, which must be listed in requirements.txt
      - run: pytest tests/test_generation_evals.py -v --timeout=120
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          VECTOR_STORE_URL: ${{ secrets.VECTOR_STORE_URL }}

Flaky evals are a real problem here — an LLM judge that changes its verdict between runs will generate noise that erodes trust in the entire suite. Treat flaky eval tests the same way you'd treat any flaky test: track them, quarantine them, fix the root cause. If you're already using BuildPulse for flaky test detection on your main CI suite, the same JUnit XML output from pytest works for eval tests too.
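
One way to dampen judge noise is to call the judge several times and take the majority verdict, keeping the agreement rate as a signal of its own. This is a sketch, not a cure: it multiplies judge cost by the number of runs, and `majority_verdict` is an illustrative helper that takes any zero-argument callable returning the judge's raw answer.

```python
from collections import Counter
from typing import Callable


def majority_verdict(judge: Callable[[], str], runs: int = 5) -> tuple[str, float]:
    """Call judge() `runs` times; return the most common verdict and its share of the votes.

    Verdicts are normalized with strip() and upper() before counting.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    votes = Counter(judge().strip().upper() for _ in range(runs))
    verdict, count = votes.most_common(1)[0]
    return verdict, count / runs
```

A test can assert on the verdict while logging the agreement share. Cases where the judge keeps splitting its vote are good candidates for quarantine or a rewritten judge prompt.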

The checklist you actually need

Before any RAG feature ships, I want to see:

Retrieval golden set: at least 20 (query, expected_chunk_ids) pairs, covering your main use cases
Staleness test: verifies the index reflects a known recent update
Chunking boundary test: at least one case where the answer spans a chunk boundary
Context overflow test: documents what happens when you exceed the context limit — truncation strategy is explicit, not accidental
Refusal test: the model declines when no relevant context is retrieved
Groundedness eval: a sample of queries checked nightly for hallucination
Prompt snapshot: any change to the system prompt is a reviewed diff, not a silent edit

That's not a huge surface area. But it's the difference between a RAG system you can iterate on confidently and one you're afraid to touch.

The retrieval layer especially tends to get zero test coverage — it feels like infrastructure, so it gets treated like infrastructure. But it's application logic. It makes decisions about what your LLM gets to see. Test it like it matters, because it does.
