# Agentic evaluations — overview

Most AI features today are not single-turn chat. They're **agents** — multi-step systems that plan, call tools, retrieve context, and decide when they're done. Single-output evaluation isn't enough for agents. You need to know whether the agent reached the right state, took legal steps to get there, and won't regress when the prompt changes next week.

That's what Agentic Evaluations does.

## The failure scenarios

Agents fail in ways simple chat doesn't. The Stratix Agentic Evaluations product is shaped around these specific failures:

* **Wrong final state.** The agent answered the question but updated the wrong record.
* **Right answer, wrong path.** It got there, but called an internal API it shouldn't have.
* **Catastrophic tool selection.** A `delete_user` call when the user asked to "deactivate."
* **Off-policy reasoning.** The agent's chain-of-thought reveals reasoning your team doesn't want shipped (PII, leaked instructions, harmful claims).
* **Silent regression.** A prompt change five PRs ago broke an edge case nobody is watching.

## Three evaluation criteria types, combined

Agentic evaluations let you mix **three criteria types** in one evaluation:

1. **Natural-language assertions.** Plain-English checks graded by an LLM. "The agent correctly identified the customer's account tier." Cheap, flexible, good for fuzzy correctness.
2. **Deterministic rules.** Code or schema checks. "The agent never called `admin_api.*` outside the allowed scope." Fast, cheap, exact.
3. **LLM judges.** Subjective dimensions evaluated by an optimized judge. "How helpful was the final response on a 1-5 scale?" Use them sparingly; most subjective quality bars decompose into combinations of assertions.

A typical agentic evaluation uses all three. Deterministic rules catch hard violations early. Assertions cover the bulk of correctness. Judges grade the residual subjective quality.
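
To make the mix concrete, here is a minimal sketch of the three criteria types declared side by side. The `Assertion`, `Rule`, and `Judge` classes are illustrative stand-ins, not the Stratix SDK's actual types, and the span fields the rule inspects are likewise assumptions.

```python
# Illustrative only: these classes and span fields are not the real Stratix SDK.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Assertion:          # natural-language check, graded by an LLM
    text: str

@dataclass
class Rule:               # deterministic check, runs as code over trace spans
    name: str
    check: Callable[[dict], bool]

@dataclass
class Judge:              # subjective dimension, scored by an optimized judge
    question: str
    scale: tuple[int, int]

criteria = [
    Assertion("The agent correctly identified the customer's account tier."),
    Rule(
        name="no-admin-api-outside-scope",
        # assumes each span records the tool name and an allowed-scope flag
        check=lambda span: not (
            span["tool"].startswith("admin_api.") and not span.get("in_allowed_scope")
        ),
    ),
    Judge(question="How helpful was the final response?", scale=(1, 5)),
]
```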

## The pre- and post-deployment workflow

The Stratix Agentic Evaluations workflow is built for **pre- and post-deployment** checks: you run it on demand over a captured trace set, typically to vet a candidate change before it ships, rather than continuously over live production traffic.

1. **Capture a representative trace set.** Real or synthetic runs: the agent's inputs, its outputs, and every span in between.
2. **Define your evaluation criteria.** A mix of assertions, rules, and judges in an evaluation space.
3. **Run the evaluation.** Stratix replays the criteria over the trace set; a toy sketch follows this list.
4. **Read the verdict and root-cause.** Each failed criterion ties back to the trace, the span, the input, and the agent decision that broke it.
5. **Detect regressions.** Compare the run to a baseline; surface the criteria that newly failed.
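
As a toy illustration of steps 3-5 for the deterministic-rule portion only: replay each rule over every span in the trace set, record which trace, span, and input broke each criterion, then diff the failing criteria against a baseline run. The trace shape, rule, and field names below are invented for the sketch; the real engine also grades assertions and judges with an LLM.

```python
# Toy sketch of steps 3-5; trace/span shapes and rule names are invented.
def no_admin_api(span: dict) -> bool:
    """Deterministic rule: admin_api.* calls must stay inside the allowed scope."""
    return not (span["tool"].startswith("admin_api.") and not span.get("in_allowed_scope"))

rules = {"no-admin-api-outside-scope": no_admin_api}

trace_set = [
    {
        "trace_id": "t-001",
        "input": "Please deactivate my account",
        "spans": [
            {"span_id": "s-3", "tool": "admin_api.delete_user", "in_allowed_scope": False},
        ],
    },
]

# Steps 3-4: run every rule over every span and keep a pointer back to what broke it.
failures = []
for trace in trace_set:
    for span in trace["spans"]:
        for name, rule in rules.items():
            if not rule(span):
                failures.append({
                    "criterion": name,
                    "trace_id": trace["trace_id"],
                    "span_id": span["span_id"],
                    "input": trace["input"],
                })

# Step 5: regression check = criteria failing now that did not fail in the baseline run.
baseline_failures = set()  # e.g. the failing criteria recorded for the previous release
newly_failing = {f["criterion"] for f in failures} - baseline_failures
print(failures)
print(newly_failing)
```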

The output artifacts are:

* A **verdict** (pass/fail and severity per criterion)
* A **root-cause report** (which span, which input, which decision)
* A **regression report** (what newly fails compared to baseline)

You wire those into a CI gate, a release-readiness review, or a Slack alert.
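
For the CI-gate side, the simplest version reads the regression report and fails the build when anything newly fails against the baseline. The file name and JSON shape here are assumptions for the sketch, not a documented Stratix export format.

```python
# Minimal CI gate sketch; "regression_report.json" and its fields are assumed, not documented.
import json
import sys

report = json.load(open("regression_report.json"))
newly_failing = report.get("newly_failing", [])

if newly_failing:
    print("New failures vs. baseline:")
    for criterion in newly_failing:
        print(f"  - {criterion}")
    sys.exit(1)  # non-zero exit fails the CI job and blocks the release gate

print("No regressions against the baseline; gate passed.")
```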

## What you build vs what you get

You build:

* The trace set (your data, your agent runs)
* The evaluation criteria (your assertions, your rules, your judges)
* The CI/CD wiring

Stratix gives you:

* The evaluation engine that runs all three criteria types in one job
* The judge engine and the GEPA optimizer
* The deterministic-rule executor with first-class span access
* The verdict, root-cause, and regression reports
* The Premium UI for browsing failures and the SDK for automation

## When NOT to use Agentic Evaluations

* For **single-turn chat with no tools or chain**, the Model Evaluations product is simpler. Pick that.
* For **post-deployment monitoring of live traffic**, [trace evaluations](/8.-evaluate-score-the-outputs/trace-evaluations.md) is the right shape — same engine, recurring schedule, applied to real production traces.

## Where to next

* [Use case: Agentic evaluation](/4.1-general-use-cases/agentic-evaluation.md)
* [Concept: Agentic evaluation](/4.1-general-use-cases/agentic-evaluation.md)
* [Workflow: Evaluate](/9.-improve-tune-the-system/workflow.md)
* [Stratix Premium: Agent Evaluation](/7.-observe-see-whats-happening/agent-evaluation.md)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.layerlens.ai/4.1-general-use-cases/agentic-evals-overview.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
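
For example, a concrete request might look like this (the question text is just an illustration):

```
GET https://docs.layerlens.ai/4.1-general-use-cases/agentic-evals-overview.md?ask=How%20do%20I%20compare%20an%20evaluation%20run%20to%20a%20baseline%3F
```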

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
