# Agentic evaluation

Agents fail differently from chat applications. They reach a wrong final state, take a wrong path to a right state, call a tool they shouldn't, or quietly regress on an edge case. Agentic evaluation is the **pre- and post-deployment** practice of catching these failures before they ship.

## The shape of the work

1. **Capture a representative trace set.** Your agent's actual runs — inputs, tool calls, outputs, every span. Real or synthetic.
2. **Define evaluation criteria.** Mix three types:

* **Natural-language assertions** — "the agent identified the customer's account tier"
* **Deterministic rules** — "the agent never called `admin_api.delete_*`"
* **LLM judges** — "rate the helpfulness of the final response 1-5"

3. **Run the evaluation.** Stratix runs all criteria over the trace set in one job; a minimal sketch of these steps follows the list.
4. **Read verdict + root-cause.** Each failed criterion ties to the trace, the span, and the decision that broke it.
5. **Detect regressions.** Compare to a baseline; surface newly failing criteria.
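
As a rough illustration of steps 2 through 5, here is a minimal Python sketch. Only `client.trace_evaluations.create()` is named by the SDK reference below; the import path, client constructor, parameter names, criterion fields, and result attributes are assumed placeholders, not the documented signature.

```python
# Hypothetical sketch of steps 2-5. Only trace_evaluations.create() is named
# in these docs; the import path, constructor, parameters, criterion fields,
# and result attributes are assumed placeholders.
from stratix import Client  # assumed import path

client = Client(api_key="...")  # assumed constructor

criteria = [
    # Natural-language assertion, checked against each trace
    {"type": "assertion",
     "statement": "The agent identified the customer's account tier."},
    # Deterministic rule over spans
    {"type": "rule",
     "name": "no_admin_deletes",
     "description": "The agent never called admin_api.delete_*",
     "severity": "CRITICAL"},
    # LLM judge scoring the final response
    {"type": "judge",
     "prompt": "Rate the helpfulness of the final response from 1 to 5."},
]

# One job runs every criterion over the captured trace set; passing a
# baseline enables regression detection (step 5).
evaluation = client.trace_evaluations.create(
    trace_set_id="ts_release_candidate",  # assumed identifier
    criteria=criteria,
    baseline_id="eval_last_release",      # assumed identifier
)

# Step 4: each failure ties back to a trace, a span, and a root cause.
for failure in evaluation.failed_criteria:  # assumed result shape
    print(failure.criterion, failure.trace_id, failure.root_cause_span_id)
```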

## Why it works on Stratix

* **One engine, three criteria types** — assertions, rules, and judges in the same evaluation
* **First-class trace and span access** — rules can inspect any field of any span (see the sketch after this list)
* **Judge engine + GEPA optimization** — your subjective bar gets sharper over time
* **Regression detection** — built-in baseline comparison
* **Pre- and post-deployment fit** — runs on your candidate change, not on live traffic
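
To make span-level rules concrete, here is a minimal sketch of two deterministic rules written as plain Python predicates over a trace. The `Span` and `Trace` dataclasses are illustrative stand-ins, not Stratix's trace schema.

```python
# Stand-in data model for illustration; Stratix's real span schema may differ.
from dataclasses import dataclass, field


@dataclass
class Span:
    tool_name: str
    status: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Trace:
    spans: list[Span] = field(default_factory=list)


def no_admin_deletes(trace: Trace) -> bool:
    """Deterministic rule: the agent never called admin_api.delete_*."""
    return not any(s.tool_name.startswith("admin_api.delete_") for s in trace.spans)


def all_tool_calls_succeeded(trace: Trace) -> bool:
    """Deterministic rule: no tool-call span ended in an error status."""
    return all(s.status != "error" for s in trace.spans)
```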

## Tools you'll use

* [Stratix Premium — Agent Evaluation](/7.-observe-see-whats-happening/agent-evaluation.md)
* [Stratix Premium — Traces](/7.-observe-see-whats-happening/traces.md)
* [Stratix Premium — Judges](/8.-evaluate-score-the-outputs/judges.md)
* [SDK: `client.trace_evaluations.create()`](/4.1-general-use-cases/general.md)

## Outcomes you should see

You'll know this is working when:

* **Zero CRITICAL deterministic-rule violations** in any release-gate run.
* **>95% pass rate on your assertion criteria** across the curated trace set.
* **Regression report names <2 newly failing criteria** per release.
* **Auditor questions about agent safety are answered live in-meeting** with the verdict + root-cause artifacts.

## Anti-patterns

* **Judge-only evaluations.** Judges are the slowest, most expensive criterion type. Anchor with deterministic rules and assertions; use judges for the residual subjective bar.
* **No regression detection.** Without a baseline, today's pass rate is meaningless; see the comparison sketch after this list.
* **Evaluating live traffic for pre-deploy gates.** Live traffic is for [continuous evaluation](/4.1-general-use-cases/continuous-evaluation.md). Pre-deploy uses a captured trace set.
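
A baseline comparison can be as simple as a set difference over criterion results. The sketch below is a hypothetical illustration of the idea, not Stratix's report format; the result shape (criterion name mapped to pass/fail) is assumed.

```python
# Hypothetical baseline comparison: which criteria pass in the baseline run
# but fail in the candidate run? The result shape (name -> passed) is assumed.
def newly_failing(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    return sorted(
        name for name, passed in candidate.items()
        if not passed and baseline.get(name, False)
    )


baseline_run = {"no_admin_deletes": True, "tier_identified": True, "helpful_response": True}
candidate_run = {"no_admin_deletes": True, "tier_identified": False, "helpful_response": True}

print(newly_failing(baseline_run, candidate_run))  # ['tier_identified']
```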

## Where to next

* [Concept: Agentic evaluation](/4.1-general-use-cases/agentic-evaluation.md)
* [Overview: Agentic evaluations](/4.1-general-use-cases/agentic-evals-overview.md)
* [Workflow: Evaluate](/9.-improve-tune-the-system/workflow.md)
* [Cookbook: agentic recipes](/2.-get-started/all-cookbook-recipes.md)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.layerlens.ai/4.1-general-use-cases/agentic-evaluation.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
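
For example, the query can be issued from Python with the `requests` library; the question text below is only an illustration.

```python
# Ask the docs a follow-up question; the response contains a direct answer
# plus relevant excerpts and sources.
import requests

response = requests.get(
    "https://docs.layerlens.ai/4.1-general-use-cases/agentic-evaluation.md",
    params={"ask": "Which span fields can deterministic rules inspect?"},
    timeout=30,
)
response.raise_for_status()
print(response.text)
```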
