# Evaluation anatomy

A reference for what an evaluation contains end to end: the fields set at creation, what happens during execution, and what comes back at completion.

## At creation

| Field                          | Required                  | Description                       |
| ------------------------------ | ------------------------- | --------------------------------- |
| `name`                         | yes                       | Human-readable name               |
| `model_id` or `model`          | yes (if not a comparison) | The model under test              |
| `benchmark_id` or `dataset_id` | yes                       | The benchmark or dataset to run the model against |
| `scorers`                      | optional                  | List of scorer IDs                |
| `judges`                       | optional                  | List of judge IDs                 |
| `tags`                         | optional                  | For filtering and audit           |
| `parent_run_id`                | optional                  | Link to the previous baseline run |

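A minimal creation sketch in Python. The endpoint URL, auth header, and response handling below are assumptions made for illustration, not the documented API; only the payload field names come from the table above.

```python
import requests

# Hypothetical REST call; the payload fields mirror the creation table above.
payload = {
    "name": "support-tickets benchmark run",
    "model_id": "mdl_123",               # the model under test
    "benchmark_id": "bench_456",         # or "dataset_id" for a custom dataset
    "scorers": ["exact_match"],          # optional: scorer IDs
    "judges": ["helpfulness"],           # optional: judge IDs
    "tags": ["release-candidate"],       # optional: for filtering and audit
    "parent_run_id": "eval_789",         # optional: link to the previous baseline run
}

resp = requests.post(
    "https://api.layerlens.ai/v1/evaluations",   # hypothetical endpoint URL
    json=payload,
    headers={"Authorization": "Bearer <API_KEY>"},
    timeout=30,
)
resp.raise_for_status()
evaluation = resp.json()
print(evaluation["id"])   # a new run starts in the `queued` state
```
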
## During execution

States: `queued` → `running` → `completed` (or `failed`, `cancelled`).

Per-row work (see the sketch after this list):

1. Construct the prompt from the dataset row
2. Call the model
3. Apply each scorer to the (input, output, expected) triple
4. Apply each judge to the (input, output) pair
5. Record the row's verdicts, latency, cost

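A minimal sketch of this per-row loop in Python. The helper names (`build_prompt`, `call_model`) and the verdict shapes are illustrative stand-ins, not the platform's actual internals.

```python
import time

def build_prompt(row):
    # Stand-in: real prompt construction is defined by the benchmark/dataset.
    return row["input"]

def call_model(model, prompt):
    # Stand-in: a real model call happens here.
    return f"[{model}] answer to: {prompt}"

def run_rows(rows, model, scorers, judges):
    results = []
    for row in rows:
        prompt = build_prompt(row)                        # 1. construct the prompt
        start = time.monotonic()
        output = call_model(model, prompt)                # 2. call the model
        latency_ms = (time.monotonic() - start) * 1000
        results.append({
            # 3. scorers see (input, output, expected)
            "scorer_verdicts": {name: fn(row["input"], output, row["expected"])
                                for name, fn in scorers.items()},
            # 4. judges see (input, output)
            "judge_verdicts": {name: fn(row["input"], output)
                               for name, fn in judges.items()},
            # 5. record latency (cost tracking omitted from this sketch)
            "latency_ms": latency_ms,
        })
    return results

rows = [{"input": "2 + 2 = ?", "expected": "4"}]
scorers = {"exact_match": lambda inp, out, expected: expected in out}
judges = {"coherent": lambda inp, out: True}
print(run_rows(rows, "demo-model", scorers, judges))
```
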
## At completion

| Field                        | Description                               |
| ---------------------------- | ----------------------------------------- |
| `id`                         | Stable identifier                         |
| `score`                     | Top-line score (value depends on the aggregation used) |
| `pass_rate`                  | Pass rate over rows                       |
| `judge_results[judge_id]`    | Per-judge aggregated verdict              |
| `scorer_results[scorer_id]`  | Per-scorer aggregated verdict             |
| `rows`                       | Per-row details                           |
| `latency_ms_p50/p95/p99`     | Latency percentiles in milliseconds       |
| `cost_total`                 | Total ECU consumed                        |
| `created_at`, `completed_at` | Timestamps                                |

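A sketch of reading these fields from a completed run, assuming a hypothetical `GET /v1/evaluations/{id}` endpoint; the field names themselves mirror the table above.

```python
import requests

resp = requests.get(
    "https://api.layerlens.ai/v1/evaluations/eval_789",  # hypothetical endpoint URL
    headers={"Authorization": "Bearer <API_KEY>"},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()

print(result["score"], result["pass_rate"])
print(result["latency_ms_p95"], result["cost_total"])
for judge_id, verdict in result["judge_results"].items():
    print(judge_id, verdict)       # per-judge aggregated verdict
for row in result["rows"]:
    print(row)                     # per-row verdicts, latency, cost
```
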
## Artifacts

Each evaluation produces:

* The result object above
* A row-level results dataset (browsable and exportable; see the export sketch after this list)
* Optional: a comparison artifact if part of a compare-models run
* Optional: a regression artifact if a baseline run is linked via `parent_run_id`

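A sketch of exporting the row-level results dataset to CSV. The rows endpoint and the flat row shape are assumptions; only the existence of a browsable, exportable row-level dataset comes from this page.

```python
import csv
import requests

# Hypothetical rows endpoint; assumes it returns a non-empty list of flat dicts.
rows = requests.get(
    "https://api.layerlens.ai/v1/evaluations/eval_789/rows",
    headers={"Authorization": "Bearer <API_KEY>"},
    timeout=30,
).json()

with open("eval_789_rows.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=sorted(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```
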
## See also

* [Evaluations](/8.-evaluate-score-the-outputs/evaluations-1.md)
* [Stratix Premium — Evaluations](/8.-evaluate-score-the-outputs/evaluations.md)
* [SDK: evaluations](/8.-evaluate-score-the-outputs/evaluations-1.md)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available on this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.layerlens.ai/8.-evaluate-score-the-outputs/evaluation-anatomy.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present on the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.

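For example, in Python (the page URL and `ask` parameter are as documented above; the sample question is made up):

```python
import requests
from urllib.parse import quote

# URL-encode the question and append it as the `ask` query parameter.
question = "Which aggregation methods can produce the top-line score?"
url = (
    "https://docs.layerlens.ai/8.-evaluate-score-the-outputs/"
    "evaluation-anatomy.md?ask=" + quote(question)
)
print(requests.get(url, timeout=30).text)
```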