TL;DR
A RAG system has two failure points, so test both. Measure retrieval quality (context precision, recall, hit-rate@k) separately from generation quality (faithfulness/groundedness to the retrieved context). Build a golden eval set of questions with known sources, score with retrieval metrics + an LLM-as-judge for groundedness, and gate every chunking, embedding, prompt, or model change in CI.
Retrieval-augmented generation (RAG) powers many production LLM features today: chatbots over your docs, support assistants, internal search. When a RAG answer is wrong, the cause is usually one of two things: the retriever fetched the wrong context, or the model ignored the right context and made something up. Testing RAG means measuring those two stages independently, then end-to-end.
Why can’t you test a RAG system with a single accuracy score?
A single end-to-end score hides where the failure is. If answers are wrong, you need to know whether to fix the retriever (chunking, embeddings, top-k) or the generator (prompt, model, grounding instructions). A good RAG test suite separates the two so each can be tuned and regression-gated on its own.
How do you measure retrieval quality?
Start with a labelled set of questions mapped to the document chunks that should answer them, then score the retriever before the LLM is involved:
| Metric | What it measures |
|---|---|
| Context precision | Of the chunks retrieved, how many are actually relevant |
| Context recall | Of the relevant chunks that exist, how many were retrieved |
| Hit-rate @ k | Did at least one correct chunk appear in the top-k |
| MRR (mean reciprocal rank) / NDCG (normalised discounted cumulative gain) | How highly the correct chunks are ranked |
Low recall points to chunking or embedding problems; low precision points to a top-k that is too large or a noisy index.
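To make the table concrete, here is a minimal sketch of the set-based versions of these metrics for a single question. The function names and the chunk IDs (`c2`, `c7`, and so on) are illustrative assumptions, not part of any particular library; some evaluation tools use rank-weighted variants of context precision, and NDCG is left out to keep the example short.

```python
from typing import List, Sequence, Set, Tuple


def context_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Of the chunks retrieved, the fraction that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)


def context_recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Of the relevant chunks that exist, the fraction that were retrieved."""
    if not relevant:
        return 1.0  # assumption: nothing to find counts as full recall
    return len(set(retrieved) & relevant) / len(relevant)


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> bool:
    """True if at least one relevant chunk appears in the top-k results."""
    return any(chunk in relevant for chunk in retrieved[:k])


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per question."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical example: the retriever returned four chunks, ranked best-first.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4", "c5"}  # labelled ground truth for this question

print(context_precision(retrieved, relevant))  # 2 of 4 retrieved are relevant
print(context_recall(retrieved, relevant))     # 2 of 3 relevant were retrieved
print(hit_at_k(retrieved, relevant, k=1))      # top result is not relevant
print(reciprocal_rank(retrieved, relevant))    # first relevant chunk is at rank 2
```

In this example the missing chunk `c5` lowers recall, while the two irrelevant chunks `c7` and `c9` lower precision, so the two numbers point at different fixes.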
How do you measure generation quality and catch hallucination?
Once the right context is retrieved, grade the answer against it:
- Faithfulness / groundedness: every claim in the answer must be supported by the retrieved context. An LLM-as-judge or NLI model checks each statement against the sources.
- Answer relevance: does the response actually address the question?
- Context utilisation: did the model use the retrieved context rather than its own parametric memory?
- Citation accuracy: if the system cites sources, do the citations match the claims?
This groundedness discipline is the RAG-specific extension of the evaluation methods in How to Test LLM-Powered Features.
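A claim-level faithfulness check can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the `Judge` type, `split_claims`, and `keyword_judge` are hypothetical names, the sentence splitter is deliberately naive, and `keyword_judge` is only a toy stand-in for the LLM-as-judge or NLI model you would plug in.

```python
import re
from typing import Callable, List, Tuple

# A judge takes (claim, context) and returns True when the context supports the claim.
# In practice this would wrap an LLM-as-judge prompt or an NLI model (assumption).
Judge = Callable[[str, str], bool]


def split_claims(answer: str) -> List[str]:
    """Naive split on sentence punctuation; breaks on decimals and abbreviations."""
    return [s.strip() for s in re.split(r"[.!?]+", answer) if s.strip()]


def faithfulness(answer: str, context: str, judge: Judge) -> Tuple[float, List[str]]:
    """Return (share of supported claims, list of unsupported claims)."""
    claims = split_claims(answer)
    if not claims:
        return 1.0, []  # assumption: an empty answer makes no unsupported claims
    unsupported = [claim for claim in claims if not judge(claim, context)]
    return 1 - len(unsupported) / len(claims), unsupported


def keyword_judge(claim: str, context: str) -> bool:
    """Toy judge for demonstration only: every word of the claim must occur in the context."""
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    return all(word in context_words for word in re.findall(r"[a-z0-9]+", claim.lower()))


# Hypothetical retrieved context and model answer.
context = "Refunds are issued within 14 days of purchase. Shipping is free over 50 GBP."
answer = "Refunds are issued within 14 days. Refunds include a 10 GBP bonus."

score, unsupported = faithfulness(answer, context, keyword_judge)
print(score)        # share of claims the judge accepted
print(unsupported)  # claims to flag as possible hallucinations
```

Keeping the judge as a parameter means the scoring logic stays the same when you swap the toy judge for a real model, and the list of unsupported claims gives reviewers something specific to inspect rather than a bare score.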
What does a RAG regression suite look like in CI?
- Freeze a golden eval set: representative questions, the expected source chunks, and reference answers.
- On every change to chunking, embeddings, retriever, prompt, or model, re-run the set.
- Score retrieval metrics and faithfulness, and fail the build if either drops below its threshold.
- Track scores over time so a silent model update can’t quietly degrade grounding.
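The gating step in the list above could look something like this sketch. The metric names, the threshold values, and the shape of the per-question results are all assumptions chosen for illustration; set your own thresholds from a baseline run of your golden set.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical minimum scores; not recommended values.
THRESHOLDS: Dict[str, float] = {"context_recall": 0.80, "faithfulness": 0.95}


def aggregate(per_question: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric across all questions in the golden set."""
    if not per_question:
        raise ValueError("golden set produced no results")
    return {name: mean(row[name] for row in per_question) for name in per_question[0]}


def gate(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        score = scores.get(name)
        if score is None:
            failures.append(f"{name}: missing")
        elif score < minimum:
            failures.append(f"{name}: {score:.3f} < {minimum:.3f}")
    return failures


def run_gate(per_question: List[Dict[str, float]]) -> int:
    """Print failures and return a process exit code (0 = pass, 1 = fail)."""
    failures = gate(aggregate(per_question), THRESHOLDS)
    for message in failures:
        print("FAIL", message)
    return 1 if failures else 0


# Hypothetical per-question scores produced by an earlier evaluation step.
results = [
    {"context_recall": 1.0, "faithfulness": 1.0},
    {"context_recall": 0.5, "faithfulness": 1.0},
]
exit_code = run_gate(results)  # a CI script would pass this to sys.exit()
```

Returning an exit code rather than exiting directly keeps the gate easy to unit-test; the CI entry point only needs to call `sys.exit(run_gate(results))`. Writing the aggregated scores to a file on each run is one simple way to track them over time.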
Common RAG testing pitfalls
- Testing only end-to-end and never isolating retrieval vs generation.
- Evaluating on a handful of cherry-picked questions instead of a representative set.
- Ignoring negative cases — the system should say “I don’t know” when the answer isn’t in the corpus.
- Never re-testing after the knowledge base changes.
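Negative cases are easy to automate once the golden set includes questions whose answers are deliberately absent from the corpus. The sketch below assumes a simple phrase-matching check; the patterns and function names are hypothetical, and a production suite would more likely use an LLM-as-judge or a structured "no answer" flag from the system itself.

```python
import re
from typing import Iterable

# Hypothetical refusal phrasings; extend these to match your system's wording.
ABSTAIN_PATTERNS = [
    r"\bi don'?t know\b",
    r"\bi (?:could not|couldn't|cannot|can't) find\b",
    r"\bnot (?:in|covered by) the (?:documentation|knowledge base)\b",
]


def is_abstention(answer: str) -> bool:
    """True if the answer contains one of the known refusal phrasings."""
    text = answer.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(re.search(pattern, text) for pattern in ABSTAIN_PATTERNS)


def abstention_rate(answers: Iterable[str]) -> float:
    """Share of answers that abstain; run this only on out-of-corpus questions."""
    answers = list(answers)
    if not answers:
        return 0.0
    return sum(is_abstention(a) for a in answers) / len(answers)


# Hypothetical answers to two questions the corpus cannot answer.
answers = [
    "I couldn't find that in the knowledge base.",
    "The warranty lasts five years.",  # confident answer with no source: a failure
]
print(abstention_rate(answers))
```

On out-of-corpus questions the abstention rate should be close to 1.0; on in-corpus questions the same check catches the opposite failure, where the system refuses although the answer is available.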
Frequently asked questions
Q1. What is RAG testing?
RAG (retrieval-augmented generation) testing validates both halves of a RAG system: retrieval quality (did it fetch the right context?) and generation quality (did the answer stay grounded in that context, without hallucinating?). You measure them separately and end-to-end.
Q2. How do you measure retrieval quality in RAG?
Use information-retrieval metrics on a labelled question-to-document set: context precision, context recall, and hit-rate@k. These tell you whether the retriever surfaces the right chunks before the LLM ever sees them.
Q3. How do you detect hallucination in RAG answers?
Score faithfulness/groundedness: every claim in the answer should be supported by the retrieved context. An LLM-as-judge or NLI model checks each statement against the sources; unsupported claims are flagged as hallucinations.
Q4. Can RAG systems be regression-tested in CI?
Yes. Freeze a golden evaluation set of questions with expected sources and answers, run it on every change to chunking, embeddings, retriever, prompt, or model, and gate the build on retrieval and faithfulness thresholds.
Further reading
- Agentic Testing: The Complete Guide to AI-Powered Software Testing in 2026
- How to Test LLM-Powered Features: Evaluation, Hallucination Checks, and Regression
- Testing AI Agents: Tool-Use, Multi-Step Reasoning and Guardrails
- Prompt Injection & the OWASP LLM Top 10: A Security-Testing Guide
Akbar Shaikh — CTO, VTEST
Akbar is the CTO at VTEST and leads QA transformation engagements for enterprise clients across the UK, UAE, India, the US, and Singapore. He specialises in modernising legacy testing practices and implementing AI-augmented quality assurance at scale.