Evaluation layers
- Retrieval: did the system find the right evidence?
- Grounding: is the answer supported?
- Citation: can users verify claims?
- Safety: does it refuse unsupported questions?
- Operations: is it fast, affordable, and monitored?
Evaluation
RAG quality is not just answer fluency. It includes retrieval quality, source support, citations, refusal behavior, cost, latency, and maintainability.
Did the system find the evidence needed to answer?
Use labeled questions with expected source passages or expert-reviewed relevant documents.
Were the retrieved passages actually relevant?
Review top results, track irrelevant chunks, and compare before/after chunking or reranking changes.
Is the answer supported by the retrieved context?
Check generated claims against sources with human review and RAG-specific evaluation tools.
Can users verify important claims quickly?
Inspect whether citations point to the right source, page, section, or passage.
Does the system refuse or ask for clarification when evidence is missing?
Test ambiguous, adversarial, and out-of-scope questions.
Is the system fast, affordable, monitored, and maintainable?
Track latency, cost, error rates, trace coverage, data freshness, and update success.
A useful evaluation set should include normal questions, hard questions, exact-reference questions, ambiguous questions, and questions the system should not answer.