
Evaluation

Measure whether your RAG system is actually useful

RAG quality is not just answer fluency. It includes retrieval quality, source support, citations, refusal behavior, cost, latency, and maintainability.

Evaluation layers

  • Retrieval: did the system find the right evidence?
  • Grounding: is the answer supported by the retrieved evidence?
  • Citation: can users verify claims?
  • Safety: does it refuse questions the evidence cannot answer?
  • Operations: is it fast, affordable, and monitored?

Retrieval recall

Did the system find the evidence needed to answer?

Use labeled questions with expected source passages or expert-reviewed relevant documents.
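
As a minimal sketch of this check, recall@k can be computed from labeled passage IDs. The ID format (`doc-a#3`), the function names, and the sample data are assumptions for illustration, not part of any specific tool.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], expected: Set[str], k: int) -> float:
    """Fraction of the labeled passages that appear in the top-k results."""
    if not expected:
        raise ValueError("expected must contain at least one labeled passage")
    top_k = set(retrieved[:k])
    return len(top_k & expected) / len(expected)


def mean_recall_at_k(
    runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every labeled question; missing runs score 0."""
    scores = [recall_at_k(runs.get(q, []), exp, k) for q, exp in labels.items()]
    return sum(scores) / len(scores)


# Hypothetical labels and retrieval output, keyed by question ID.
labels = {"q1": {"doc-a#3", "doc-b#1"}, "q2": {"doc-c#7"}}
runs = {"q1": ["doc-a#3", "doc-x#2", "doc-b#1"], "q2": ["doc-y#1", "doc-z#4"]}

print(mean_recall_at_k(runs, labels, k=3))  # 0.5
```

Here q1 finds both labeled passages and q2 finds none, so the mean is 0.5.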

Retrieval precision

Were the retrieved passages actually relevant?

Review top results, track irrelevant chunks, and compare before/after chunking or reranking changes.
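
A before/after comparison can be sketched with precision@k. The passage IDs and the two result lists below are made up; in practice they would come from reviewer judgments and two runs of your retriever.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k retrieved passages that reviewers marked relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for pid in top_k if pid in relevant) / len(top_k)


# Hypothetical reviewer judgments for one question.
relevant = {"p1", "p2", "p5"}

# Hypothetical results before and after a reranking change.
before = ["p1", "p9", "p8", "p2"]
after = ["p1", "p2", "p5", "p9"]

delta = precision_at_k(after, relevant, 4) - precision_at_k(before, relevant, 4)
print(delta)  # 0.25
```

This sketch divides by the number of passages actually returned rather than by k, so a retriever that returns fewer than k results is not penalized for the shortfall. Divide by k instead if you want short result lists to count against precision.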

Answer faithfulness

Is the answer supported by the retrieved context?

Check generated claims against sources with human review and RAG-specific evaluation tools.
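
One way to prioritize human review is a crude lexical triage that flags answer sentences sharing few words with any retrieved passage. This is only a rough proxy under assumed settings (the 0.6 threshold and the tokenization are arbitrary choices); it does not detect paraphrased or subtly wrong claims and is not a replacement for reviewers or dedicated evaluation tools.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def flag_unsupported(
    answer: str, passages: List[str], threshold: float = 0.6
) -> List[str]:
    """Return answer sentences whose best token overlap with any passage
    falls below the threshold. Flagged sentences are candidates for review,
    not confirmed errors."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    passage_tokens = [_tokens(p) for p in passages]
    flagged = []
    for sentence in sentences:
        words = _tokens(sentence)
        if not words:
            continue
        best = max((len(words & p) / len(words) for p in passage_tokens), default=0.0)
        if best < threshold:
            flagged.append(sentence)
    return flagged


# Hypothetical answer and retrieved context.
passages = ["Refunds are issued within 14 days of a return."]
answer = "Refunds are issued within 14 days. Shipping is always free worldwide."

print(flag_unsupported(answer, passages))
# ['Shipping is always free worldwide.']
```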

Citation usefulness

Can users verify important claims quickly?

Inspect whether citations point to the right source, page, section, or passage.
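
If citations are available as structured data, the inspection can be partly automated. The sketch below assumes each citation carries a source ID and a page number; the `Citation` type and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Citation:
    source_id: str
    page: int


def citation_accuracy(cited: List[Citation], expected: Set[Tuple[str, int]]) -> float:
    """Share of emitted citations that point to a labeled (source, page) pair.
    An answer with no citations scores 0."""
    if not cited:
        return 0.0
    hits = sum(1 for c in cited if (c.source_id, c.page) in expected)
    return hits / len(cited)


# Hypothetical labels: the claim is supported on page 4 of the policy document.
expected = {("policy.pdf", 4)}
cited = [Citation("policy.pdf", 4), Citation("policy.pdf", 9)]

print(citation_accuracy(cited, expected))  # 0.5
```

A score like this only shows that a citation points to the right place. Whether a reader can verify the claim quickly still needs a manual look at the cited passage.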

No-answer behavior

Does the system refuse or ask for clarification when evidence is missing?

Test ambiguous, adversarial, and out-of-scope questions.
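
A refusal test can be sketched as a small harness around whatever function produces answers. Everything here is an assumption for illustration: `fake_rag` stands in for your system, and the marker phrases must be replaced with the wording your system actually uses when it declines or asks for clarification.

```python
from typing import Callable, Iterable

# Assumed refusal phrases; adjust to match your system's real templates.
REFUSAL_MARKERS = (
    "i don't have enough information",
    "could you clarify",
    "outside the scope",
)


def is_refusal(answer: str) -> bool:
    """True if the answer contains any known refusal or clarification phrase."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rate(questions: Iterable[str], answer_fn: Callable[[str], str]) -> float:
    """Share of questions that the system refused or asked to clarify."""
    answers = [answer_fn(q) for q in questions]
    if not answers:
        raise ValueError("need at least one question")
    return sum(is_refusal(a) for a in answers) / len(answers)


def fake_rag(question: str) -> str:
    """Stand-in for a real RAG system, used only to make the sketch runnable."""
    if "weather" in question.lower():
        return "That is outside the scope of these documents."
    return "The policy allows 14 days."


should_refuse = ["What is the weather tomorrow?", "Who won the match last night?"]
print(refusal_rate(should_refuse, fake_rag))  # 0.5
```

On a set of questions that should all be refused, the target rate is 1.0. The 0.5 here shows the stand-in system answering one out-of-scope question it should have declined.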

Operational quality

Is the system fast, affordable, monitored, and maintainable?

Track latency, cost, error rates, trace coverage, data freshness, and update success.
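
Several of these numbers can be summarized from request logs. The sketch below assumes a hypothetical `RequestLog` record with latency, cost, and a failure flag, and uses the nearest-rank method for percentiles.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestLog:
    latency_ms: float
    cost_usd: float
    failed: bool


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of
    the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(logs: List[RequestLog]) -> Dict[str, float]:
    """Latency percentiles, mean cost per request, and error rate."""
    latencies = [r.latency_ms for r in logs]
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "mean_cost_usd": sum(r.cost_usd for r in logs) / len(logs),
        "error_rate": sum(r.failed for r in logs) / len(logs),
    }


# Hypothetical request logs.
logs = [
    RequestLog(100, 0.01, False),
    RequestLog(200, 0.01, False),
    RequestLog(300, 0.01, True),
    RequestLog(400, 0.01, False),
]
print(summarize(logs))
```

Percentiles are used instead of the mean because a few slow requests can dominate user experience while barely moving the average.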

Evaluation workflow

  1. Create representative questions
  2. Label expected sources
  3. Run retrieval-only tests
  4. Review generated answers
  5. Score citations and faithfulness
  6. Inspect traces and failures
  7. Fix ingestion, retrieval, prompts, or sources
  8. Repeat before release
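
The final step of the workflow can be enforced with a simple release gate that compares measured scores to minimum values. The metric names and thresholds below are illustrative assumptions, not recommended targets.

```python
from typing import Dict, List


def failed_checks(metrics: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """List every metric that is missing or below its required minimum."""
    failures = []
    for name, floor in minimums.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < floor:
            failures.append(f"{name}: {value:.2f} < {floor:.2f}")
    return failures


# Hypothetical scores from the latest evaluation run and assumed minimums.
metrics = {"recall_at_5": 0.82, "faithfulness": 0.70}
minimums = {"recall_at_5": 0.80, "faithfulness": 0.90, "citation_accuracy": 0.85}

for failure in failed_checks(metrics, minimums):
    print("BLOCK RELEASE:", failure)
```

Treating a missing metric as a failure keeps a skipped evaluation step from passing the gate silently.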

Example evaluation set

A useful evaluation set should include normal questions, hard questions, exact-reference questions, ambiguous questions, and questions the system should not answer.

  • Known-answer question: the source contains one direct answer.
  • Multi-source question: the answer requires comparing two approved sources.
  • Exact-reference question: the query contains a regulation number, policy code, or product identifier.
  • Ambiguous question: the system should ask for clarification.
  • Out-of-scope question: the system should refuse or redirect.
  • Stale-source question: the system must prefer the current document over older copies.
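
These question types can be encoded so that coverage gaps are easy to spot. The sketch below is one possible layout; the `EvalCase` fields, the behavior labels, and the sample questions are assumptions for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Category(Enum):
    KNOWN_ANSWER = "known_answer"
    MULTI_SOURCE = "multi_source"
    EXACT_REFERENCE = "exact_reference"
    AMBIGUOUS = "ambiguous"
    OUT_OF_SCOPE = "out_of_scope"
    STALE_SOURCE = "stale_source"


@dataclass
class EvalCase:
    question: str
    category: Category
    # Passage or document IDs a correct answer should draw on (empty for refusals).
    expected_sources: List[str] = field(default_factory=list)
    # One of "answer", "clarify", or "refuse" (assumed labels).
    expected_behavior: str = "answer"


def missing_categories(cases: List[EvalCase]) -> List[Category]:
    """Categories with no test case yet, in declaration order."""
    present = {case.category for case in cases}
    return [category for category in Category if category not in present]


# Hypothetical cases; a real set would cover every category.
cases = [
    EvalCase("What is the refund window?", Category.KNOWN_ANSWER, ["policy#2"]),
    EvalCase("What is the weather tomorrow?", Category.OUT_OF_SCOPE,
             expected_behavior="refuse"),
]

print([c.value for c in missing_categories(cases)])
# ['multi_source', 'exact_reference', 'ambiguous', 'stale_source']
```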