# RAG evaluation

A RAG (retrieval-augmented generation) pipeline has more failure modes than a plain LLM call. Bad retrieval kills good models; great retrieval can't save a hallucinating one. RAG evaluation scores each stage so you know where the failure actually is.

## The shape of the work

1. **Define the three dimensions.**

* **Retrieval quality** — did we get back relevant chunks?
* **Faithfulness** — did the answer stay grounded in the retrieved chunks rather than inventing facts?
* **Answer quality** — did the final answer actually help the user?

2. **Pick scorers and judges.**

* Retrieval: classical IR metrics (precision@k, recall@k) as scorers, or an LLM judge that grades chunk relevance (a metric sketch follows this list).
* Faithfulness: an LLM judge that compares answer claims against the retrieved chunks; GEPA-optimize it against a labeled set (a prompt sketch follows this list).
* Answer quality: an LLM judge for end-to-end helpfulness, again GEPA-optimized.

3. **Run as a trace evaluation.** Your pipeline emits a trace with retrieval span + generation span. Stratix grades each.
4. **Compare configurations.** Try a different retriever, a different chunk size, a different model — re-run, see the score deltas per dimension.
5. **Lock in CI gates.** Faithfulness regressions are particularly costly; gate them (a minimal gate check is sketched after this list).
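
A minimal sketch of step 2's retrieval scorers: precision@k and recall@k need only the ordered list of retrieved chunk IDs and a labeled set of relevant IDs per query. Plain Python, independent of any SDK; the chunk IDs are illustrative.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for cid in top_k if cid in relevant_ids) / len(top_k)


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant chunks that show up in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = retrieved_ids[:k]
    return sum(1 for cid in top_k if cid in relevant_ids) / len(relevant_ids)


# Ground truth for this query: chunks c2 and c7 contain the answer.
retrieved = ["c2", "c5", "c7", "c9"]
relevant = {"c2", "c7"}
print(precision_at_k(retrieved, relevant, k=3))  # 2/3, ~0.67
print(recall_at_k(retrieved, relevant, k=3))     # 2/2, 1.0
```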
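
For step 2's faithfulness judge, the core move is to extract the answer's claims and check each one against the retrieved chunks. A rough sketch of the prompt and scoring loop: `call_llm` is a hypothetical stand-in for whatever model client you use (not a Stratix API), the sketch assumes the model returns well-formed JSON, and GEPA optimization would refine this starting prompt against your labeled set.

```python
import json

FAITHFULNESS_PROMPT = """\
You are grading faithfulness. Below are retrieved chunks and an answer.
List each factual claim in the answer and mark it "supported" if the chunks
back it up, otherwise "unsupported". Return JSON:
{{"claims": [{{"claim": "...", "verdict": "supported"}}]}}

Chunks:
{chunks}

Answer:
{answer}
"""


def judge_faithfulness(chunks, answer, call_llm):
    """Score = fraction of the answer's claims supported by the chunks."""
    prompt = FAITHFULNESS_PROMPT.format(chunks="\n---\n".join(chunks), answer=answer)
    claims = json.loads(call_llm(prompt))["claims"]
    if not claims:
        return 1.0  # no factual claims, nothing to hallucinate
    supported = sum(1 for c in claims if c["verdict"] == "supported")
    return supported / len(claims)
```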
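
For step 5, a gate is just a threshold check over the run's aggregate scores, wired into whatever CI you already have. A minimal, SDK-agnostic sketch; the score names and floors are illustrative, not Stratix defaults.

```python
import sys

# Illustrative floors; tune them to your own baseline.
THRESHOLDS = {"faithfulness": 0.85, "retrieval_recall_at_5": 0.70}


def gate(scores: dict) -> int:
    """Return a non-zero exit code if any score falls below its floor."""
    failed = False
    for name, floor in THRESHOLDS.items():
        value = scores.get(name, 0.0)
        if value < floor:
            print(f"GATE FAILED: {name}={value:.2f} < {floor:.2f}")
            failed = True
    return 1 if failed else 0


if __name__ == "__main__":
    # In CI, load these from your evaluation run's exported results.
    sys.exit(gate({"faithfulness": 0.82, "retrieval_recall_at_5": 0.74}))
```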

## Why it works on Stratix

* **Trace-first.** Multi-span traces are first-class; you can score retrieval and generation separately.
* **Judge optimization.** GEPA tunes judges to match human labels for faithfulness — usually the trickiest dimension.
* **Compare-models.** Try the same retrieval feeding three different generators; see which generator is most faithful to the same chunks.

## Tools you'll use

* [Stratix Premium — Agent Evaluation](/7.-observe-see-whats-happening/agent-evaluation.md) (RAG is a small agent)
* [Stratix Premium — Traces](/7.-observe-see-whats-happening/traces.md)
* [Stratix Premium — Judges](/8.-evaluate-score-the-outputs/judges.md)
* [SDK trace ingestion](/4.1-general-use-cases/general.md)

## Outcomes you should see

You'll know this is working when:

* **Faithfulness judge agreement with humans crosses 90%** after GEPA optimization.
* **Retrieval and answer scores diverge meaningfully** — when answer quality drops, you can tell whether retrieval or generation is at fault.
* **Hallucination rate trends downward** across consecutive releases instead of just bouncing around.
* **Chunking-strategy comparisons take less than a day** instead of multi-week experiments.

## Anti-patterns

* **Scoring only the final answer.** You'll know quality dropped; you won't know which stage caused it.
* **Faithfulness judge without optimization.** Out-of-the-box LLM judges over-credit confident-sounding hallucinations. GEPA-optimize against a labeled set.
* **No retrieval ground truth.** If you can't say "these are the right chunks," your retrieval scores are just plausibility scores.

## Where to next

* [Concept: Judges (faithfulness)](/8.-evaluate-score-the-outputs/judges-1.md)
* [Concept: Traces and spans](/6.-build-wire-your-code/traces-and-spans.md)
* [Cookbook: RAG recipes](/2.-get-started/all-cookbook-recipes.md)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.layerlens.ai/4.1-general-use-cases/rag-evaluation.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
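
For example, from Python, assuming the standard `requests` library (any HTTP client works):

```python
import requests

# Ask the docs a specific, self-contained question via the `ask` parameter.
resp = requests.get(
    "https://docs.layerlens.ai/4.1-general-use-cases/rag-evaluation.md",
    params={"ask": "How do I GEPA-optimize a faithfulness judge?"},
)
print(resp.text)  # direct answer plus relevant excerpts and sources
```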
