RAG Evaluation Guide: Metrics, Methods, and Production Monitoring
Written by Max Zeshut
Founder at Agentmelt · Last updated Apr 26, 2026
RAG-powered AI agents fail in two distinct ways: the retrieval surfaces the wrong context, or the generation produces a poor answer from correct context. Most teams measure RAG quality with end-to-end accuracy alone, which conflates these failure modes and makes debugging nearly impossible. A rigorous RAG evaluation framework separates retrieval quality from generation quality, monitors both continuously in production, and catches regressions before users notice. This guide covers the full evaluation methodology used by teams running RAG agents reliably at scale.
Why end-to-end accuracy is not enough
The standard "did the agent answer correctly?" metric is necessary but insufficient. Consider a support agent answering "What is your refund policy?":
- Retrieval correctly returns the refund policy article. Generation correctly summarizes it. → Good answer.
- Retrieval returns the wrong article (e.g., the shipping policy). Generation faithfully summarizes what it received. → Bad answer—but the generation step worked correctly.
- Retrieval correctly returns the refund policy article. Generation hallucinates a 60-day return window when the policy says 30 days. → Bad answer—but retrieval worked correctly.
- Retrieval correctly returns the refund policy article. Generation refuses to answer despite having the right context. → Bad user experience: retrieval worked, but generation failed to use the context it was given.
End-to-end accuracy captures the bad outcomes but tells you nothing about which component to fix. A RAG evaluation framework needs to assess each component separately and the system as a whole.
Retrieval metrics
Retrieval quality is the foundation. If retrieval returns the wrong context, no amount of prompt engineering can recover. Three metrics matter:
1. Recall@K
Recall@K measures whether the relevant document is in the top K retrieved results. For each query in your evaluation set, you mark which documents should have been retrieved (the "ground truth"), then check how often they appear in the top K results returned by the system.
For most production systems, target Recall@5 above 95%—meaning the right document is in the top 5 results for at least 95% of queries. Below 90%, retrieval is the limiting factor on agent quality and should be addressed before any other optimization.
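As a minimal sketch of the calculation described above, the function below counts a query as a hit when at least one ground-truth document appears in the top K. The names `recall_at_k`, `retrieved`, and `relevant` are illustrative, not taken from any particular library, and the input shape (query ID mapped to a ranked list of document IDs) is an assumption.

```python
from typing import Dict, List, Set


def recall_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Fraction of labeled queries with a ground-truth document in the top k."""
    if not relevant:
        return 0.0
    hits = 0
    for query_id, gold_docs in relevant.items():
        # A query missing from `retrieved` is treated as having no results.
        top_k = retrieved.get(query_id, [])[:k]
        if gold_docs.intersection(top_k):
            hits += 1
    return hits / len(relevant)
```

When a query has several ground-truth documents, a stricter variant would divide the number of relevant documents found by the number that exist; which definition fits depends on whether the agent needs one source or all of them.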
2. MRR (Mean Reciprocal Rank)
MRR measures where in the result list the correct document appears. If the right document is always first, MRR is 1.0. If it is always second, MRR is 0.5. MRR captures something Recall@K misses: when retrieval is right, is it confidently right?
A high Recall@5 with low MRR (say 0.3) means the right document is usually in the top 5 but rarely first, a sign that ranking needs improvement. A reranking model that reorders the top K results can raise MRR substantially while leaving Recall@K unchanged.
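MRR can be computed from the same labeled data as Recall@K. The sketch below uses hypothetical names and assumes each query maps to a ranked list of document IDs; a query whose correct document never appears contributes 0.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
) -> float:
    """Average of 1/rank of the first ground-truth document per query."""
    if not relevant:
        return 0.0
    total = 0.0
    for query_id, gold_docs in relevant.items():
        for rank, doc_id in enumerate(retrieved.get(query_id, []), start=1):
            if doc_id in gold_docs:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(relevant)
```

With the correct document first for one query and second for another, this returns (1.0 + 0.5) / 2 = 0.75.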
3. NDCG (Normalized Discounted Cumulative Gain)
NDCG handles cases where multiple relevant documents exist with varying degrees of relevance. It rewards systems that put the most relevant documents first and the somewhat-relevant documents lower. NDCG matters most for queries where the answer requires synthesizing multiple sources.
For agents that should cite multiple sources (legal research, technical documentation, scientific Q&A), monitor NDCG@10 alongside Recall@K. For agents serving single-source answers, NDCG matters less.
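A short sketch of NDCG@K for a single query follows. It assumes graded relevance labels (for example 0 to 3) supplied by your labelers and uses the linear-gain form of DCG; some implementations use an exponential gain instead, so scores are not directly comparable across tools. Function and parameter names are illustrative.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(gain / math.log2(index + 2) for index, gain in enumerate(gains))


def ndcg_at_k(ranked_doc_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

A ranking that lists documents in descending relevance scores 1.0; pushing the most relevant document lower reduces the score.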
Building the evaluation set
Retrieval evaluation requires a labeled dataset—queries paired with the correct documents to retrieve. Building this dataset is the hardest part of RAG evaluation:
- Start with real user queries. Sample actual production queries (anonymized as needed) rather than synthetic ones. Real queries reveal patterns synthetic queries miss—colloquial phrasing, typos, ambiguity, multi-part questions.
- Have subject matter experts label. SMEs identify which documents in your corpus contain the answer to each query. Multiple SMEs labeling the same query expose subjective edge cases.
- Include hard cases. Don't just sample easy queries. Deliberately include queries with no good answer (the system should retrieve nothing or a "we don't have this" document) and queries where the answer requires synthesizing multiple documents.
- Maintain over time. As your corpus and user base evolve, the evaluation set must evolve. Add new queries quarterly; review and re-label existing queries annually.
A practical evaluation set has 200-500 labeled queries—small enough to maintain, large enough to detect meaningful changes.
Generation metrics
With retrieval evaluated, the next step is evaluating generation quality given correct retrieval. Three metrics dominate:
1. Faithfulness
Faithfulness measures whether the generated answer is supported by the retrieved context. A faithful answer makes claims that the context actually contains. An unfaithful answer adds claims not supported by context (hallucination) or contradicts the context.
Modern faithfulness evaluation uses LLM-as-judge: an evaluator LLM compares each claim in the generated answer to the retrieved context and scores faithfulness on a 0-1 scale. Aim for faithfulness above 0.95 for production agents in high-stakes domains and above 0.90 for general use.
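The claim-by-claim scoring can be sketched as below. This assumes the answer has already been split into individual claims upstream, and it takes the judge as a plain callable so that no specific LLM provider or API is implied. `substring_judge` is a toy stand-in used only to make the example runnable; a real judge would call an evaluator model.

```python
from typing import Callable, List

# Hypothetical judge signature: (claim, context) -> True if the context supports the claim.
Judge = Callable[[str, str], bool]


def faithfulness_score(claims: List[str], context: str, judge: Judge) -> float:
    """Share of claims the judge marks as supported by the context (0 to 1)."""
    if not claims:
        # Assumption: an answer with no factual claims is treated as faithful.
        return 1.0
    supported = sum(1 for claim in claims if judge(claim, context))
    return supported / len(claims)


def substring_judge(claim: str, context: str) -> bool:
    """Toy judge: case-insensitive containment. Not a substitute for an LLM judge."""
    return claim.lower() in context.lower()


context = "Refunds are accepted within 30 days of purchase. Items must be unused."
claims = [
    "refunds are accepted within 30 days",
    "items must be unused",
    "refunds are accepted within 60 days",  # unsupported claim
]
score = faithfulness_score(claims, context, substring_judge)
```

Here two of the three claims are supported, so `score` is 2/3, well below either target above.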
2. Answer relevance
Answer relevance measures whether the generated answer addresses the query. An answer can be faithful (everything it says is in the context) but irrelevant (it answers a different question than was asked).
This is again an LLM-as-judge task: the evaluator scores how directly the answer addresses the intent of the query. Low relevance often indicates the agent is over-grounding in retrieved content, repeating documentation rather than answering the question.
3. Context utilization
Context utilization measures how much of the retrieved context the answer actually uses. Low utilization means the agent is ignoring useful context—often because the context is poorly formatted or the prompt does not direct the agent to ground its answer.
Calculate utilization as the percentage of retrieved chunks referenced in the generated answer. If only 20% of retrieved context is used, you are paying for inference on context that adds no value.
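A minimal sketch of that percentage, assuming you already know which chunk IDs the answer references (for example from inline citations or a judge pass; how that set is obtained is outside this example). Names are illustrative.

```python
from typing import List, Set


def context_utilization(retrieved_chunk_ids: List[str], referenced_chunk_ids: Set[str]) -> float:
    """Fraction of distinct retrieved chunks that the answer references."""
    unique_retrieved = set(retrieved_chunk_ids)
    if not unique_retrieved:
        return 0.0
    # Referenced IDs that were never retrieved are ignored here.
    used = unique_retrieved & referenced_chunk_ids
    return len(used) / len(unique_retrieved)
```

For five retrieved chunks of which one is referenced, the function returns 0.2, the 20% case described above.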
End-to-end metrics
With component metrics in place, end-to-end metrics confirm the system works as a whole:
1. Answer correctness
Compare the generated answer to a reference answer (golden answer) per query. LLM-as-judge or human evaluators score correctness—does the answer convey the right information?
Correctness depends on both retrieval and generation working. Decompose each failure by asking whether the right context was retrieved: if it was, generation failed; if it was not, retrieval failed. This decomposition guides where to invest improvement effort.
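The decomposition can be automated once each evaluated query records two booleans. The sketch below assumes a hypothetical `EvalResult` record; the field names are illustrative.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class EvalResult:
    answer_correct: bool      # judged against the golden answer
    gold_doc_retrieved: bool  # ground-truth document appeared in the retrieved set


def classify(result: EvalResult) -> str:
    """Attribute an incorrect answer to retrieval or generation."""
    if result.answer_correct:
        return "ok"
    return "generation_failure" if result.gold_doc_retrieved else "retrieval_failure"


def failure_breakdown(results: Iterable[EvalResult]) -> Counter:
    """Count outcomes across an evaluation run."""
    return Counter(classify(result) for result in results)
```

One caveat of this simple split: a correct answer produced without the ground-truth document is still counted as "ok", although it may deserve a manual look.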
2. Citation accuracy
For agents that cite sources, citation accuracy verifies the citations are correct. Hallucinated citations (made-up document IDs, page numbers, or quotes) are particularly damaging because they create the appearance of grounding while undermining trust.
Citation accuracy should be 100%—every cited fact must be verifiable in the cited source. Anything less requires immediate investigation.
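One way to check this programmatically is sketched below. It assumes each citation carries a document ID and a quoted span, and that the text of the retrieved documents is available at serving time; the `Citation` type and function names are hypothetical. Matching is exact after whitespace and case normalization, so paraphrased citations would need a softer check.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Citation:
    doc_id: str
    quote: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return " ".join(text.lower().split())


def invalid_citations(citations: List[Citation], retrieved_docs: Dict[str, str]) -> List[Citation]:
    """Citations whose document is unknown or whose quote is not in that document."""
    bad = []
    for citation in citations:
        source = retrieved_docs.get(citation.doc_id)
        if source is None or normalize(citation.quote) not in normalize(source):
            bad.append(citation)
    return bad


def citation_accuracy(citations: List[Citation], retrieved_docs: Dict[str, str]) -> float:
    """Share of citations that verify; 1.0 when the answer cites nothing."""
    if not citations:
        return 1.0
    return 1.0 - len(invalid_citations(citations, retrieved_docs)) / len(citations)
```

Running `invalid_citations` before a response is served lets the system block or regenerate any answer where the returned list is non-empty.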
3. Latency and cost
Quality is one dimension; the system must also be operationally viable. Track:
- P50 and P99 latency end-to-end and per component
- Cost per query broken down by retrieval (embedding calls, vector DB) and generation (LLM tokens)
- Cache hit rate for retrieval and prompt caching
Quality improvements often come at the expense of latency or cost. Track all three to ensure changes are net positive.
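For the latency side, a small sketch of P50 and P99 per component follows. It uses the nearest-rank percentile definition, which is one of several conventions; the component names and the input shape (component mapped to a list of latencies in milliseconds) are assumptions.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """P50 and P99 for each component, e.g. 'retrieval' and 'generation'."""
    return {
        component: {"p50": percentile(values, 50), "p99": percentile(values, 99)}
        for component, values in samples_ms.items()
    }
```

With few samples the P99 is simply the maximum, so per-component tail latency only becomes meaningful once a component has at least a few hundred measurements.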
Production monitoring
Evaluation does not end at deployment. Production data reveals issues that pre-launch testing misses—new query types, corpus drift, model updates, and edge cases. Continuous monitoring catches these.
Sampling-based evaluation
Run a sample of production queries through full evaluation continuously:
- 1-5% sampling rate for high-volume agents (1,000+ queries per day)
- 10-20% sampling rate for moderate-volume agents (100-1,000 per day)
- Full evaluation for low-volume agents (fewer than 100 queries per day)
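Each sampled query is scored on the metrics that need no ground-truth labels: faithfulness, answer relevance, and citation accuracy. Retrieval metrics such as Recall@K and MRR require labeled documents, so they stay with the offline evaluation set. Aggregate weekly; alert on metric degradation.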
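The volume tiers above can be turned into a deterministic sampler, sketched here. The specific rates (the low end of each range) and the choice to treat exactly 1,000 queries per day as high-volume are assumptions for illustration. Hashing the query ID means the same query is always either in or out of the sample, which keeps reruns reproducible.

```python
import hashlib


def sampling_rate(queries_per_day: int) -> float:
    """Pick a rate from the volume tiers; values are the low end of each range."""
    if queries_per_day >= 1000:
        return 0.01  # high volume: 1-5%
    if queries_per_day >= 100:
        return 0.10  # moderate volume: 10-20%
    return 1.0       # low volume: evaluate everything


def should_sample(query_id: str, rate: float) -> bool:
    """Deterministically map a query ID to [0, 1) and compare against the rate."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

A typical call would be `should_sample(query_id, sampling_rate(daily_volume))` in the request path, with sampled queries pushed to an evaluation queue.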
User feedback signals
Implicit and explicit feedback supplements sampled evaluation:
- Thumbs up/down on responses. Correlate negative feedback to specific failure modes—is the system retrieving wrong content? Generating unfaithful answers? Refusing valid queries?
- Follow-up queries. When users immediately ask the same question with rephrased wording, the original answer was insufficient.
- Escalations. Queries that result in human handoff are particularly valuable—the agent failed in a meaningful way.
- Time on page / engagement. For longer answers, engagement patterns (full read vs. immediate dismissal) indicate quality.
User feedback is noisy but high-volume. Combined with structured evaluation on samples, it provides a rich signal of production quality.
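As one example of turning an implicit signal into a metric, the sketch below flags a follow-up query as a likely rephrase of the previous one. It relies on character-level similarity from the standard library's `difflib`, which catches small rewordings but not semantic paraphrases; an embedding-based comparison would be a natural upgrade. The thresholds (`max_gap_seconds`, `min_similarity`) are assumed defaults to tune on your own data.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def is_rephrase(
    previous_query: str,
    next_query: str,
    seconds_between: float,
    max_gap_seconds: float = 120.0,
    min_similarity: float = 0.6,
) -> bool:
    """True when the next query arrives quickly and closely resembles the previous one."""
    if seconds_between > max_gap_seconds:
        return False
    a, b = normalize(previous_query), normalize(next_query)
    if a == b:
        return True
    return SequenceMatcher(None, a, b).ratio() >= min_similarity
```

The rate of flagged rephrases per session can then be tracked alongside thumbs-down and escalation rates.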
Drift detection
Three types of drift threaten RAG quality:
- Corpus drift. New documents added to your knowledge base may have different formats, terminology, or quality. Monitor retrieval quality metrics by document age and source.
- Query drift. Users start asking new types of questions—new product features, new policies, seasonal patterns. Cluster production queries periodically and alert on new clusters.
- Model drift. When LLM providers update models (often without explicit version changes), behavior shifts subtly. Run a regression test suite on a fixed evaluation set after any model change.
Each drift type degrades quality slowly enough to escape casual monitoring. Automated drift detection is critical for catching them.
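A simple form of automated detection is a regression check against a stored baseline, run on the fixed evaluation set after any model, prompt, or corpus change. The sketch below assumes every tracked metric is higher-is-better and uses an absolute tolerance of 0.02, which is an illustrative default rather than a recommended value.

```python
from typing import Dict, List


def regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Names of metrics that dropped by more than `tolerance` or went missing."""
    flagged = []
    for metric, baseline_value in baseline.items():
        value = current.get(metric)
        # A metric absent from the current run is flagged so gaps are not silent.
        if value is None or baseline_value - value > tolerance:
            flagged.append(metric)
    return sorted(flagged)
```

A non-empty result can fail a CI job or page the owning team, depending on how the check is wired into your pipeline.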
Common failure modes
Frequent issues we see in production RAG systems:
| Symptom | Likely Cause | Fix |
|---|---|---|
| Confident wrong answers | Faithfulness is scored but no threshold is enforced | Add faithfulness threshold; refuse low-confidence answers |
| Many "I don't know" responses | Recall@K below 90% | Improve retrieval (better embeddings, hybrid search, reranking) |
| Long, repetitive answers | Low answer relevance (over-grounding in retrieved context) | Restructure prompt to focus on answering, not summarizing context |
| Hallucinated citations | No citation verification | Validate citations programmatically before serving |
| Slow responses | Too many retrieved chunks in context | Use reranking to surface top results; reduce context size |
| Inconsistent answers to same query | Temperature too high or context retrieval not deterministic | Lower temperature; use deterministic retrieval (or seed) |
Implementation checklist
Before launching any RAG agent in production:
- Labeled evaluation set of 200+ queries with ground truth documents and answers
- Retrieval metrics (Recall@K, MRR) tracked and meeting targets
- Generation metrics (faithfulness, relevance) tracked and meeting targets
- End-to-end correctness measured against golden answers
- Citation accuracy verified at 100% for cited claims
- Latency budgets defined and met
- Cost per query measured and within budget
- Sampling-based production evaluation pipeline running
- User feedback collection mechanism deployed
- Drift detection alerts configured
- Regression test suite running on every model or prompt change
For broader agent evaluation patterns, see AI agent evaluation testing. For RAG fundamentals, see Agentic RAG explained.