# Agentic evaluation

Agentic evaluation is the pre-deployment practice of grading multi-step agents on a captured trace set. It combines three criteria types into one evaluation: natural-language assertions, deterministic rules, and LLM judges.

## Why agents need their own evaluation shape

Single-output evaluation (input → output → score) misses the path. Agents fail in ways simple chat doesn't:

* They reach a wrong final state
* They take a wrong path to a right state
* They call a tool that should have been off-limits
* Their chain-of-thought reveals reasoning your team doesn't want shipped
* They quietly regress on an edge case

You need to grade the whole trace — every span, every decision — not just the final output.
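As a deliberately minimal sketch of what "grade the whole trace" implies: every criterion gets the full list of spans, not just the final output. The `Span`/`Trace` shapes and field names below are illustrative assumptions, not the schema of any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Span:
    kind: str                      # e.g. "llm_call" or "tool_call"
    name: str                      # e.g. "billing_api.get_account"
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)

@dataclass
class Trace:
    spans: list[Span] = field(default_factory=list)
    final_output: str = ""

# Grading the whole trace: every criterion sees every span,
# not just trace.final_output.
def grade(trace: Trace, criteria: dict[str, Callable[[Trace], bool]]) -> dict[str, bool]:
    return {name: check(trace) for name, check in criteria.items()}
```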

## The three criteria types

### 1. Natural-language assertions

Plain-English checks the LLM grades. "The agent correctly identified the customer's account tier." Cheap, flexible, good for fuzzy correctness.
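A sketch of how one such assertion might be graded, assuming an OpenAI-style chat client; the model name and prompt wording are illustrative, not a prescribed setup.

```python
from openai import OpenAI

client = OpenAI()

def grade_assertion(assertion: str, trace_text: str) -> bool:
    """Ask a grading LLM for a strict PASS/FAIL verdict on one plain-English check."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of grading model
        messages=[{
            "role": "user",
            "content": (
                f"Agent trace:\n{trace_text}\n\n"
                f"Assertion: {assertion}\n"
                "Does the assertion hold for this trace? Reply PASS or FAIL only."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```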

### 2. Deterministic rules

Code or schema checks. "The agent never called `admin_api.delete_*`." Fast, cheap, exact.
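The same check, written as plain code against the trace sketch from earlier. The rule itself is the document's example; the function is an illustrative sketch.

```python
import fnmatch

# Deterministic rule: scan tool-call spans for a forbidden tool-name pattern.
# `trace` is any object whose spans carry `kind` and `name` fields.
def no_forbidden_tools(trace, pattern: str = "admin_api.delete_*") -> bool:
    return not any(
        span.kind == "tool_call" and fnmatch.fnmatch(span.name, pattern)
        for span in trace.spans
    )
```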

### 3. LLM judges

Subjective dimensions evaluated by an optimized judge. "How helpful was the final response on a 1-5 scale?" Use sparingly; anchor with assertions and rules.
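A minimal judge sketch for that 1-5 helpfulness question, again assuming an OpenAI-style client; the bare prompt (no rubric, no few-shot anchors) is a simplification.

```python
from openai import OpenAI

client = OpenAI()

def judge_helpfulness(final_response: str) -> int:
    """Score the final response 1-5; higher means more helpful."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{
            "role": "user",
            "content": (
                "Rate how helpful the following response is on a 1-5 scale. "
                f"Reply with a single digit only.\n\n{final_response}"
            ),
        }],
    )
    return int(resp.choices[0].message.content.strip())
```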

## A typical agentic evaluation

* **70% deterministic rules** — hard correctness and policy-violation checks
* **20% natural-language assertions** — fuzzy correctness
* **10% LLM judges** — residual subjective bar

This shape keeps evaluations fast and predictable.
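In code, that shape is just a criteria set dominated by cheap checks. A sketch reusing the hypothetical helpers from the snippets above; `str(t)` stands in for real trace serialization.

```python
# Illustrative mix; names and helper functions come from the earlier sketches.
criteria = {
    # ~70% deterministic rules: exact, fast, run first
    "no_admin_deletes": no_forbidden_tools,
    # ~20% natural-language assertions: fuzzy correctness
    "tier_identified": lambda t: grade_assertion(
        "The agent correctly identified the customer's account tier.", str(t)),
    # ~10% LLM judges: residual subjective bar, thresholded to pass/fail
    "helpful_enough": lambda t: judge_helpfulness(t.final_output) >= 4,
}
```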

## Pre-deployment, not post-deployment

Agentic evaluations target **candidate changes**: a new prompt, a new model, a new tool. They run on a captured trace set, not on live traffic. (For live traffic, see [Continuous evaluation](/7.-observe-see-whats-happening/continuous-evaluation.md).)
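A sketch of that offline loop, under two assumptions: captured cases live in a JSONL file, and a hypothetical `run_candidate` replays each case through the changed agent.

```python
import json

def evaluate_candidate(trace_file: str, run_candidate, criteria) -> list[dict]:
    """Re-run captured cases through the candidate agent and grade each trace."""
    verdicts = []
    with open(trace_file) as f:
        for line in f:
            case = json.loads(line)               # one captured case per line
            trace = run_candidate(case["input"])  # new prompt/model/tool under test
            verdicts.append({name: check(trace) for name, check in criteria.items()})
    return verdicts
```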

## Output artifacts

* **Verdict** — pass/fail and severity per criterion
* **Root-cause report** — which trace, which span, which decision broke
* **Regression report** — what newly fails compared to the baseline (sketched below)
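Of these, the regression report is the most mechanical to produce. A sketch, assuming verdicts are dicts mapping criterion name to pass/fail:

```python
def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Criteria that passed on the baseline but fail on the candidate."""
    return [name for name, passed in baseline.items()
            if passed and not candidate.get(name, False)]
```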

## Where to use it

* CI gates on agent code/prompt changes
* Release-readiness reviews
* Auditor-friendly evidence in regulated industries

## Where to next

* [Continuous evaluation](/7.-observe-see-whats-happening/continuous-evaluation.md)
* [Stratix Premium — Agent Evaluation](/7.-observe-see-whats-happening/agent-evaluation.md)
* [Overview: Agentic evaluations](/4.1-general-use-cases/agentic-evals-overview.md)

