RAG Evaluation Guide: Metrics, Methods, and Production Monitoring

RAG-powered AI agents fail in two distinct ways: the retrieval surfaces the wrong context, or the generation produces a poor answer from correct context. Most teams measure RAG quality with end-to-end accuracy alone, which conflates these failure modes and makes debugging nearly impossible. A rigorous RAG evaluation framework separates retrieval quality from generation quality, monitors both continuously in production, and catches regressions before users notice. This guide covers the full evaluation methodology used by teams running RAG agents reliably at scale.

Why end-to-end accuracy is not enough

The standard "did the agent answer correctly?" metric is necessary but insufficient. Consider a support agent answering "What is your refund policy?":

Retrieval correctly returns the refund policy article. Generation correctly summarizes it. → Good answer.
Retrieval returns the wrong article (e.g., the shipping policy). Generation faithfully summarizes what it received. → Bad answer—but the generation step worked correctly.
Retrieval correctly returns the refund policy article. Generation hallucinates a 60-day return window when the policy says 30 days. → Bad answer—but retrieval worked correctly.
Retrieval correctly returns the refund policy article. Generation refuses to answer despite having the right context. → Bad user experience: retrieval worked, but generation failed to use the result.

End-to-end accuracy captures the bad outcomes but tells you nothing about which component to fix. A RAG evaluation framework needs to assess each component separately and the system as a whole.

Retrieval metrics

Retrieval quality is the foundation. If retrieval returns the wrong context, no amount of prompt engineering can recover. Three metrics matter:

1. Recall@K

Recall@K measures whether the relevant document is in the top K retrieved results. For each query in your evaluation set, you mark which documents should have been retrieved (the "ground truth"), then check how often they appear in the top K results returned by the system.

For most production systems, target Recall@5 above 95%—meaning the right document is in the top 5 results for at least 95% of queries. Below 90%, retrieval is the limiting factor on agent quality and should be addressed before any other optimization.
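As a minimal sketch of the calculation, assuming retrieval results and labels are stored as plain dictionaries keyed by query text (the document IDs and queries below are hypothetical), Recall@K can be computed like this. When a query has exactly one relevant document, the per-query score is simply 1 if that document is in the top K and 0 otherwise.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant document")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int
) -> float:
    """Average Recall@K over every labeled query."""
    scores = [recall_at_k(runs[query], labels[query], k) for query in labels]
    return sum(scores) / len(scores)


# Hypothetical ground-truth labels and ranked retrieval output for three queries.
labels = {
    "refund policy": {"doc-refunds"},
    "shipping time": {"doc-shipping"},
    "cancel order": {"doc-cancel", "doc-refunds"},
}
runs = {
    "refund policy": ["doc-refunds", "doc-shipping", "doc-faq"],
    "shipping time": ["doc-faq", "doc-returns", "doc-pricing"],
    "cancel order": ["doc-cancel", "doc-faq", "doc-refunds"],
}

print(mean_recall_at_k(runs, labels, k=3))
```

Here two of the three queries have all of their relevant documents in the top 3, so the mean is two thirds.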

2. MRR (Mean Reciprocal Rank)

MRR measures where in the result list the correct document appears: it averages, across queries, the reciprocal of the rank of the first correct result. If the right document is always first, MRR is 1.0. If it is always second, MRR is 0.5. MRR captures something Recall@K misses: when retrieval is right, is it confidently right?

A high Recall@5 with low MRR (say 0.3) means the right document is usually in the top 5 but rarely first—a sign that ranking needs improvement. Reranking models often dramatically improve MRR without changing Recall@K.
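A short sketch of the MRR calculation follows, again assuming dictionary-shaped results and labels with hypothetical document IDs. A query whose relevant document never appears contributes 0.

```python
from typing import Dict, List, Set


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Dict[str, List[str]], labels: Dict[str, Set[str]]
) -> float:
    """Average reciprocal rank over every labeled query."""
    scores = [reciprocal_rank(runs[query], labels[query]) for query in labels]
    return sum(scores) / len(scores)


# Hypothetical data: the right document is first, second, and missing.
mrr_labels = {"q1": {"doc-a"}, "q2": {"doc-b"}, "q3": {"doc-c"}}
mrr_runs = {
    "q1": ["doc-a", "doc-x", "doc-y"],
    "q2": ["doc-x", "doc-b", "doc-y"],
    "q3": ["doc-x", "doc-y", "doc-z"],
}

print(mean_reciprocal_rank(mrr_runs, mrr_labels))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```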

3. NDCG (Normalized Discounted Cumulative Gain)

NDCG handles cases where multiple relevant documents exist with varying degrees of relevance. It rewards systems that put the most relevant documents first and the somewhat-relevant documents lower. NDCG matters most for queries where the answer requires synthesizing multiple sources.

For agents that should cite multiple sources (legal research, technical documentation, scientific Q&A), monitor NDCG@10 alongside Recall@K. For agents serving single-source answers, NDCG matters less.
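The sketch below shows one common way to compute NDCG@K, assuming graded relevance labels (for example 3 for highly relevant, 1 for marginally relevant) and a linear gain. Other formulations use an exponential gain, so treat the exact numbers as dependent on that choice. The grades and document IDs are hypothetical.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(gain / math.log2(index + 2) for index, gain in enumerate(gains))


def ndcg_at_k(retrieved: List[str], grades: Dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    actual = [grades.get(doc_id, 0.0) for doc_id in retrieved[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(actual) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded labels for a single query.
grades = {"doc-a": 3, "doc-b": 2, "doc-c": 1}

best_order = ndcg_at_k(["doc-a", "doc-b", "doc-c"], grades, k=10)
worst_order = ndcg_at_k(["doc-c", "doc-b", "doc-a"], grades, k=10)
print(best_order, worst_order)
```

The same three documents are retrieved in both calls. Only the order differs, which is exactly the effect NDCG is designed to expose.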

Building the evaluation set

Retrieval evaluation requires a labeled dataset—queries paired with the correct documents to retrieve. Building this dataset is the hardest part of RAG evaluation:

Start with real user queries. Sample actual production queries (anonymized as needed) rather than synthetic ones. Real queries reveal patterns synthetic queries miss—colloquial phrasing, typos, ambiguity, multi-part questions.
Have subject matter experts label. SMEs identify which documents in your corpus contain the answer to each query. Multiple SMEs labeling the same query expose subjective edge cases.
Include hard cases. Don't just sample easy queries. Deliberately include queries with no good answer (the system should retrieve nothing or a "we don't have this" document) and queries where the answer requires synthesizing multiple documents.
Maintain over time. As your corpus and user base evolve, the evaluation set must evolve. Add new queries quarterly; review and re-label existing queries annually.

A practical evaluation set has 200-500 labeled queries—small enough to maintain, large enough to detect meaningful changes.
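To get a feel for what "large enough to detect meaningful changes" means, the sketch below estimates the sampling noise on a pass-rate metric such as Recall@5. It assumes queries are independent and uses the normal approximation to the binomial, which is rough near 0% or 100%, so read the result as an order-of-magnitude guide.

```python
import math


def margin_of_error(rate: float, n_queries: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return z * math.sqrt(rate * (1.0 - rate) / n_queries)


for n_queries in (200, 500):
    print(n_queries, round(margin_of_error(0.95, n_queries), 3))
```

At a measured rate of 95%, this gives roughly plus or minus 3 percentage points with 200 queries and roughly plus or minus 2 points with 500. A one-point change between two runs on a set of this size is therefore within the noise.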

Generation metrics

With retrieval evaluated, the next step is evaluating generation quality given correct retrieval. Three metrics dominate:

1. Faithfulness

Faithfulness measures whether the generated answer is supported by the retrieved context. A faithful answer makes claims that the context actually contains. An unfaithful answer adds claims not supported by context (hallucination) or contradicts the context.

Modern faithfulness evaluation uses LLM-as-judge: an evaluator LLM compares each claim in the generated answer to the retrieved context and scores faithfulness on a 0-1 scale. Aim for faithfulness above 0.95 for production agents in high-stakes domains and above 0.90 for general use.
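The scoring step can be sketched as the share of claims the judge marks as supported. In the sketch below the judge is a pluggable callable. The `substring_judge` stand-in is a hypothetical placeholder so the example runs without a model, and a real setup would replace it with a call to an evaluator LLM. Claim extraction is also assumed to have happened upstream.

```python
from typing import Callable, List

# A judge takes (claim, context) and returns True if the context supports the claim.
ClaimJudge = Callable[[str, str], bool]


def faithfulness_score(claims: List[str], context: str, judge: ClaimJudge) -> float:
    """Share of answer claims that the judge finds supported by the context."""
    if not claims:
        return 1.0  # assumption: an answer with no factual claims counts as faithful
    supported = sum(1 for claim in claims if judge(claim, context))
    return supported / len(claims)


def substring_judge(claim: str, context: str) -> bool:
    """Hypothetical stand-in for an LLM judge: exact, case-insensitive match only."""
    return claim.lower() in context.lower()


context = "Refunds are accepted within 30 days of purchase. Items must be unused."
claims = [
    "refunds are accepted within 30 days",
    "items must be unused",
    "refunds are accepted within 60 days",  # contradicts the context
]

print(faithfulness_score(claims, context, substring_judge))
```

Two of the three claims are supported, so this answer scores about 0.67 and would fall well below either target.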

2. Answer relevance

Answer relevance measures whether the generated answer addresses the query. An answer can be faithful (everything it says is in the context) but irrelevant (it answers a different question than was asked).

This is again an LLM-as-judge task: the evaluator scores how directly the answer addresses the query intent. Low relevance often indicates the agent is over-grounding in retrieved content, repeating documentation rather than answering the question.

3. Context utilization

Context utilization measures how much of the retrieved context the answer actually uses. Low utilization means the agent is ignoring useful context—often because the context is poorly formatted or the prompt does not direct the agent to ground its answer.

Calculate utilization as the percentage of retrieved chunks referenced in the generated answer. If only 20% of retrieved context is used, you are paying for inference on context that adds no value.
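A minimal sketch of that percentage, assuming the agent's answer carries the IDs of the chunks it referenced (the chunk IDs are hypothetical):

```python
from typing import Iterable


def context_utilization(retrieved_ids: Iterable[str], referenced_ids: Iterable[str]) -> float:
    """Share of distinct retrieved chunks that the answer actually referenced."""
    retrieved = set(retrieved_ids)
    if not retrieved:
        return 0.0
    used = retrieved & set(referenced_ids)
    return len(used) / len(retrieved)


retrieved_chunks = ["chunk-1", "chunk-2", "chunk-3", "chunk-4", "chunk-5"]
referenced_chunks = ["chunk-2"]

print(context_utilization(retrieved_chunks, referenced_chunks))  # 1 of 5 chunks = 0.2
```

References to chunks that were never retrieved are ignored here. Those are a citation accuracy problem and are better counted separately.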

End-to-end metrics

With component metrics in place, end-to-end metrics confirm the system works as a whole:

1. Answer correctness

Compare the generated answer to a reference answer (golden answer) per query. LLM-as-judge or human evaluators score correctness—does the answer convey the right information?

Correctness depends on both retrieval and generation working. Decompose each failure: was the right context retrieved? If yes, generation is at fault. If no, retrieval is. This decomposition shows where to invest improvement effort.
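The decomposition can be sketched as a small attribution function. It assumes each evaluated query already has two boolean results, one from the correctness judge and one from the retrieval check. The enum and function names are illustrative.

```python
from collections import Counter
from enum import Enum
from typing import List, Tuple


class Outcome(Enum):
    CORRECT = "correct"
    RETRIEVAL_FAILURE = "retrieval_failure"
    GENERATION_FAILURE = "generation_failure"


def attribute(answer_correct: bool, right_context_retrieved: bool) -> Outcome:
    """Assign a wrong answer to the component that most likely caused it."""
    if answer_correct:
        return Outcome.CORRECT
    if not right_context_retrieved:
        return Outcome.RETRIEVAL_FAILURE
    return Outcome.GENERATION_FAILURE


# Hypothetical results: (answer_correct, right_context_retrieved) per query.
results: List[Tuple[bool, bool]] = [
    (True, True),
    (False, False),
    (False, True),
    (False, True),
]

breakdown = Counter(attribute(correct, retrieved) for correct, retrieved in results)
print({outcome.value: count for outcome, count in breakdown.items()})
```

In this toy sample generation failures outnumber retrieval failures two to one, which would point improvement effort at the prompt or model first.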

2. Citation accuracy

For agents that cite sources, citation accuracy verifies the citations are correct. Hallucinated citations (made-up document IDs, page numbers, or quotes) are particularly damaging because they create the appearance of grounding while undermining trust.

Citation accuracy should be 100%—every cited fact must be verifiable in the cited source. Anything less requires immediate investigation.
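One way to check this programmatically is sketched below. It assumes each citation carries a document ID and a verbatim quote, and that source texts are available in a dictionary. The `Citation` type and the document contents are hypothetical. Paraphrased citations would need a fuzzier match or an LLM judge.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Citation:
    doc_id: str
    quote: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(text.lower().split())


def invalid_citations(citations: List[Citation], documents: Dict[str, str]) -> List[Citation]:
    """Return citations whose document is unknown or whose quote is not in it."""
    invalid = []
    for citation in citations:
        source = documents.get(citation.doc_id)
        if source is None or normalize(citation.quote) not in normalize(source):
            invalid.append(citation)
    return invalid


documents = {"doc-refunds": "Refunds are accepted within 30 days of purchase."}
citations = [
    Citation("doc-refunds", "within 30 days"),
    Citation("doc-refunds", "within 60 days"),  # quote not in the source
    Citation("doc-missing", "within 30 days"),  # document does not exist
]

print(invalid_citations(citations, documents))
```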

3. Latency and cost

Quality is one dimension; the system must also be operationally viable. Track:

P50 and P99 latency end-to-end and per component
Cost per query broken down by retrieval (embedding calls, vector DB) and generation (LLM tokens)
Cache hit rate for retrieval and prompt caching

Quality improvements often come at the expense of latency or cost. Track all three to ensure changes are net positive.
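A small sketch of the latency and cost tracking described above. It uses the nearest-rank definition of a percentile, which is one of several conventions, and the latency samples and dollar amounts are made-up illustrations, not real prices.

```python
import math
from dataclasses import dataclass
from typing import List


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


@dataclass
class QueryCost:
    retrieval_usd: float   # embedding calls plus vector DB
    generation_usd: float  # LLM tokens

    @property
    def total_usd(self) -> float:
        return self.retrieval_usd + self.generation_usd


# Hypothetical end-to-end latencies in milliseconds.
latencies_ms = [120, 130, 150, 180, 200, 220, 250, 300, 900, 2500]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))

# Hypothetical per-query costs.
costs = [QueryCost(0.0002, 0.004), QueryCost(0.0002, 0.006)]
mean_cost = sum(cost.total_usd for cost in costs) / len(costs)
print(round(mean_cost, 4))
```

The sample shows why P99 is tracked alongside P50: the median is 200 ms, while the slowest request takes 2,500 ms.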

Production monitoring

Evaluation does not end at deployment. Production data reveals issues that pre-launch testing misses—new query types, corpus drift, model updates, and edge cases. Continuous monitoring catches these.

Sampling-based evaluation

Run a sample of production queries through full evaluation continuously:

1-5% sampling rate for high-volume agents (1,000+ queries per day)
10-20% sampling rate for moderate-volume agents (100-1,000 per day)
Full evaluation for low-volume agents

Score each sampled query on the metrics that need no ground-truth labels: faithfulness, relevance, and citation accuracy. Aggregate weekly and alert on metric degradation.
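One possible way to pick the sample is to hash a stable query ID, so the same query is always either in or out of the sample regardless of which server handles it. This is a sketch under that assumption. The ID format and the 5% rate are illustrative.

```python
import hashlib


def should_sample(query_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of all query IDs."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(query_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return bucket < rate


# Hypothetical query IDs; about 5% of them should be selected.
query_ids = [f"query-{number}" for number in range(10_000)]
sampled = [query_id for query_id in query_ids if should_sample(query_id, 0.05)]
print(len(sampled))
```

Hash-based selection also makes re-runs reproducible, which helps when a weekly aggregate needs to be recomputed.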

User feedback signals

Implicit and explicit feedback supplement sampled evaluation:

Thumbs up/down on responses. Correlate negative feedback to specific failure modes—is the system retrieving wrong content? Generating unfaithful answers? Refusing valid queries?
Follow-up queries. When users immediately re-ask the same question in different words, the original answer was probably insufficient.
Escalations. Queries that result in human handoff are particularly valuable—the agent failed in a meaningful way.
Time on page / engagement. For longer answers, engagement patterns (full read vs. immediate dismissal) indicate quality.

User feedback is noisy but high-volume. Combined with structured evaluation on samples, it provides a rich signal of production quality.

Drift detection

Three types of drift threaten RAG quality:

Corpus drift. New documents added to your knowledge base may have different formats, terminology, or quality. Monitor retrieval quality metrics by document age and source.
Query drift. Users start asking new types of questions—new product features, new policies, seasonal patterns. Cluster production queries periodically and alert on new clusters.
Model drift. When LLM providers update models (often without explicit version changes), behavior shifts subtly. Run a regression test suite on a fixed evaluation set after any model change.

Each drift type degrades quality slowly enough to escape casual monitoring, so automated drift detection is critical for catching it.
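For the regression run on a fixed evaluation set, a simple comparison against stored baseline scores may be enough to start with. The sketch below assumes higher is better for every metric and uses a hypothetical absolute tolerance of 0.02. A sensible tolerance depends on the size of the evaluation set.

```python
from typing import Dict


def find_regressions(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.02
) -> Dict[str, float]:
    """Return each metric that dropped by more than `tolerance`, with the size of the drop."""
    regressions = {}
    for name, baseline_value in baseline.items():
        # A metric missing from the current run is treated as 0.0, so it is flagged.
        drop = baseline_value - current.get(name, 0.0)
        if drop > tolerance:
            regressions[name] = drop
    return regressions


# Hypothetical scores before and after a model update.
baseline = {"recall@5": 0.96, "mrr": 0.80, "faithfulness": 0.95}
current = {"recall@5": 0.955, "mrr": 0.81, "faithfulness": 0.90}

print(find_regressions(baseline, current))
```

Retrieval metrics are essentially unchanged while faithfulness fell by about five points. That pattern is what a generation-side change such as a model update would be expected to produce.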

Common failure modes

Frequent issues we see in production RAG systems:

Symptom	Likely Cause	Fix
Confident wrong answers	Faithfulness threshold not enforced	Add faithfulness threshold; refuse low-confidence answers
Many "I don't know" responses	Recall@K below 90%	Improve retrieval (better embeddings, hybrid search, reranking)
Long, repetitive answers	Low answer relevance (over-grounding in retrieved content)	Restructure prompt to focus on answering, not summarizing context
Hallucinated citations	No citation verification	Validate citations programmatically before serving
Slow responses	Too many retrieved chunks in context	Use reranking to surface top results; reduce context size
Inconsistent answers to same query	Temperature too high or context retrieval not deterministic	Lower temperature; use deterministic retrieval (or seed)

Implementation checklist

Before launching any RAG agent in production:

Labeled evaluation set of 200+ queries with ground truth documents and answers
Retrieval metrics (Recall@K, MRR) tracked and meeting targets
Generation metrics (faithfulness, relevance) tracked and meeting targets
End-to-end correctness measured against golden answers
Citation accuracy verified at 100% for cited claims
Latency budgets defined and met
Cost per query measured and within budget
Sampling-based production evaluation pipeline running
User feedback collection mechanism deployed
Drift detection alerts configured
Regression test suite running on every model or prompt change

For broader agent evaluation patterns, see AI agent evaluation testing. For RAG fundamentals, see Agentic RAG explained.

