Skip to main content
Intermediate45-90 minRagasTruLensPhoenix / ArizeLangfuse

Evaluate a RAG system

Build a practical evaluation loop for retrieval quality, answer faithfulness, and citations.

Prerequisites

  • At least 20 representative questions
  • Known authoritative sources
  • Access to logs or traces

Step-by-step tutorial

Step 1

Create a question set

Include easy, hard, ambiguous, out-of-scope, and adversarial questions.

  • Label expected sources
  • Include no-answer cases
  • Cover important personas
  • Version the set

Step 2

Test retrieval first

Score whether the right evidence appears before evaluating answer style.

  • Measure recall
  • Inspect top-k noise
  • Check filters
  • Compare reranking

Step 3

Test generation

Check whether answers are faithful, complete, well cited, and appropriately uncertain.

  • Review factual claims
  • Check citations
  • Flag unsupported synthesis
  • Track refusal quality

Next steps

  • Automate regression checks
  • Add human review
  • Track retrieval failures over time