RAG System Testing: Retrieval Accuracy, Grounding & Hallucination

TL;DR

A RAG system has two failure points, so test both. Measure retrieval quality (context precision, recall, hit-rate@k) separately from generation quality (faithfulness/groundedness to the retrieved context). Build a golden eval set of questions with known sources, score with retrieval metrics + an LLM-as-judge for groundedness, and gate every chunking, embedding, prompt, or model change in CI.

Retrieval-augmented generation (RAG) powers many production LLM features today: chatbots over your docs, support assistants, internal search. When a RAG answer is wrong, the cause is usually one of two things: the retriever fetched the wrong context, or the model ignored the right context and made something up. Testing RAG means measuring those two stages independently, then end-to-end.

Why can’t you test a RAG system with a single accuracy score?

A single end-to-end score hides where the failure is. If answers are wrong, you need to know whether to fix the retriever (chunking, embeddings, top-k) or the generator (prompt, model, grounding instructions). A good RAG test suite separates the two so each can be tuned and regression-gated on its own.
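As a rough illustration of that separation, the sketch below labels each evaluated question by the stage that most likely caused the result. The `EvalRecord` fields and the bucket names are assumptions made for this example, not part of any particular framework.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated question (hypothetical schema for illustration)."""
    question: str
    retrieved_ids: list[str]  # chunk ids the retriever returned
    expected_ids: set[str]    # chunk ids labelled as containing the answer
    answer_correct: bool      # end-to-end verdict from a human or a judge


def triage(record: EvalRecord) -> str:
    """Attribute a result to the stage that most likely caused it."""
    retrieval_hit = any(cid in record.expected_ids for cid in record.retrieved_ids)
    if record.answer_correct:
        # A correct answer without the labelled context suggests parametric
        # memory or a gap in the labels; worth reviewing either way.
        return "ok" if retrieval_hit else "correct_without_context"
    return "generation_failure" if retrieval_hit else "retrieval_failure"


def triage_counts(records: list[EvalRecord]) -> Counter:
    """Count how many records fall into each bucket."""
    return Counter(triage(r) for r in records)
```

A suite dominated by `retrieval_failure` points at chunking, embeddings, or top-k; one dominated by `generation_failure` points at the prompt, the model, or the grounding instructions.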

How do you measure retrieval quality?

Start with a labelled set of questions mapped to the document chunks that should answer them, then score the retriever before the LLM is involved:

Metric	What it measures
Context precision	Of the chunks retrieved, how many are actually relevant
Context recall	Of the relevant chunks that exist, how many were retrieved
Hit-rate@k	Did at least one correct chunk appear in the top-k
MRR / NDCG	How highly the correct chunks are ranked (mean reciprocal rank, normalized discounted cumulative gain)

Low recall points to chunking or embedding problems; low precision points to a top-k that is too large or a noisy index.
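The metrics in the table can be sketched in a few lines of Python for a single query. This is a minimal, set-based version with binary relevance; the function names are chosen for this example, and some evaluation frameworks define context precision with rank weighting, so their numbers may differ.

```python
import math


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved chunks that are labelled relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for c in top if c in relevant) / len(top)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the labelled relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> bool:
    """True if at least one relevant chunk appears in the top-k."""
    return any(c in relevant for c in retrieved[:k])


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain relative to an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, c in enumerate(retrieved[:k], start=1) if c in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Averaging `reciprocal_rank` over all questions in the labelled set gives MRR; the other functions are averaged the same way to get suite-level scores.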

How do you measure generation quality and catch hallucination?

Once the right context is retrieved, grade the answer against it:

Faithfulness / groundedness: every claim in the answer must be supported by the retrieved context. An LLM-as-judge or NLI model checks each statement against the sources.
Answer relevance: does the response actually address the question?
Context utilisation: did the model use the retrieved context rather than its own parametric memory?
Citation accuracy: if the system cites sources, do the citations match the claims?
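A claim-level faithfulness score can be sketched as the share of answer statements that a judge marks as supported by the retrieved context. In the sketch below the judge is a pluggable function; `overlap_judge` is only a token-overlap placeholder so the example runs offline, and it is an assumption of this example, not a substitute for a real LLM-as-judge or NLI model.

```python
import re
from typing import Callable

# A judge takes one claim and the retrieved contexts, and returns True
# if the claim is supported. In practice this wraps an LLM or NLI model.
Judge = Callable[[str, list[str]], bool]


def split_claims(answer: str) -> list[str]:
    """Naive sentence split; real pipelines often extract atomic claims with an LLM."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def faithfulness(answer: str, contexts: list[str], judge: Judge) -> tuple[float, list[str]]:
    """Return (share of supported claims, list of unsupported claims)."""
    claims = split_claims(answer)
    unsupported = [c for c in claims if not judge(c, contexts)]
    score = 1.0 if not claims else 1 - len(unsupported) / len(claims)
    return score, unsupported


def overlap_judge(claim: str, contexts: list[str], threshold: float = 0.6) -> bool:
    """PLACEHOLDER judge: token overlap is not a real entailment check."""
    tokens = set(re.findall(r"[a-z0-9]+", claim.lower()))
    if not tokens:
        return True
    for ctx in contexts:
        ctx_tokens = set(re.findall(r"[a-z0-9]+", ctx.lower()))
        if len(tokens & ctx_tokens) / len(tokens) >= threshold:
            return True
    return False
```

Returning the unsupported claims alongside the score makes failures reviewable: a human can check whether the judge or the generator was wrong.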

This groundedness discipline is the RAG-specific extension of the evaluation methods in “How to Test LLM-Powered Features”.

What does a RAG regression suite look like in CI?

Freeze a golden eval set: representative questions, the expected source chunks, and reference answers.
On every change to chunking, embeddings, retriever, prompt, or model, re-run the set.
Score retrieval metrics + faithfulness, and fail the build if either drops below threshold.
Track scores over time so a silent model update can’t quietly degrade grounding.

Common RAG testing pitfalls

Testing only end-to-end and never isolating retrieval vs generation.
Evaluating on a handful of cherry-picked questions instead of a representative set.
Ignoring negative cases: the system should say “I don’t know” when the answer isn’t in the corpus.
Never re-testing after the knowledge base changes.
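Negative cases can be covered with a small set of questions that are deliberately absent from the corpus, scored by how often the system declines to answer. The sketch below assumes a hypothetical `answer_fn` that wraps the RAG pipeline, and detects refusals with a short list of phrases; a production suite would more likely use a judge model, since phrase matching misses paraphrases.

```python
import re
from typing import Callable

# Phrases treated as an abstention (assumed list, extend for your system).
ABSTAIN_PATTERNS = [
    r"\bi (?:don't|do not) know\b",
    r"\b(?:could not|couldn't|cannot|can't) find\b",
    r"\bnot (?:in|covered by) the (?:documents|context|knowledge base)\b",
]


def is_abstention(answer: str) -> bool:
    """True if the answer matches one of the known refusal phrases."""
    text = answer.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(re.search(pattern, text) for pattern in ABSTAIN_PATTERNS)


def abstention_rate(unanswerable: list[str], answer_fn: Callable[[str], str]) -> float:
    """Share of out-of-corpus questions the system correctly declines."""
    if not unanswerable:
        return 0.0
    declined = sum(1 for q in unanswerable if is_abstention(answer_fn(q)))
    return declined / len(unanswerable)
```

Re-running this set after every knowledge-base update also catches the opposite problem: a question that used to be unanswerable may now have a source and should move into the main golden set.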

Frequently asked questions

Q1. What is RAG testing?
RAG (retrieval-augmented generation) testing validates both halves of a RAG system: retrieval quality (did it fetch the right context?) and generation quality (did the answer stay grounded in that context, without hallucinating?). You measure them separately and end-to-end.

Q2. How do you measure retrieval quality in RAG?
Use information-retrieval metrics on a labelled question-to-document set: context precision, context recall, and hit-rate@k. These tell you whether the retriever surfaces the right chunks before the LLM ever sees them.

Q3. How do you detect hallucination in RAG answers?
Score faithfulness/groundedness: every claim in the answer should be supported by the retrieved context. An LLM-as-judge or NLI model checks each statement against the sources; unsupported claims are flagged as hallucinations.

Q4. Can RAG systems be regression-tested in CI?
Yes. Freeze a golden evaluation set of questions with expected sources and answers, run it on every change to chunking, embeddings, retriever, prompt, or model, and gate the build on retrieval and faithfulness thresholds.

