# Judges

A **judge** is an LLM that grades dimensions of an output — helpfulness, faithfulness, tone, safety, correctness. Judges are versioned, tunable against labeled examples (GEPA), and can evaluate either evaluation rows or traces directly.

Judges and [scorers](/8.-evaluate-score-the-outputs/scorers-1.md) share the LLM-evaluation surface (both are model + prompt). They're separated by **lifecycle**:

|               | Scorer                   | Judge                                     |
| ------------- | ------------------------ | ----------------------------------------- |
| Versioning    | Immutable                | Versioned with execution history          |
| Where it runs | Inside an evaluation run | Standalone, against traces or evaluations |
| Optimization  | n/a                      | GEPA-tunable against labeled examples     |

## Anatomy

Every judge has:

* **Name and description**
* **Output type** — binary (pass/fail), score (e.g., 1-5), or labeled (multi-class)
* **Judging model** — the LLM that runs the rubric
* **Rubric** — the prompt (the judge's `evaluation_goal`)
* **Versions** — every rubric edit creates a new version; execution history is recorded per version
* **Labeled examples** (optional) — used by GEPA optimization to tune the rubric
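Conceptually, these pieces fit together as shown in the sketch below. The field names and data structures are hypothetical Python for illustration only, not the LayerLens SDK or API.

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical shape of a judge; names are illustrative, not the LayerLens API.
@dataclass
class LabeledExample:
    output: str          # the model output being graded
    human_verdict: str   # the ground-truth label a human assigned

@dataclass
class JudgeVersion:
    version: int
    rubric: str          # the evaluation_goal prompt for this version

@dataclass
class Judge:
    name: str
    description: str
    output_type: Literal["binary", "score", "labeled"]
    judging_model: str   # the LLM that runs the rubric
    versions: list[JudgeVersion] = field(default_factory=list)
    labeled_examples: list[LabeledExample] = field(default_factory=list)

    def current_rubric(self) -> str:
        # Every rubric edit appends a new version; the latest one is active.
        return self.versions[-1].rubric
```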

## When to use a judge (vs a scorer)

| Use a scorer when...                                                 | Use a judge when...                                           |
| -------------------------------------------------------------------- | ------------------------------------------------------------- |
| The rubric is stable and you won't iterate on it with labels         | You expect to tune the rubric over time with labeled examples |
| You're applying it across many benchmarks as part of evaluation runs | You're evaluating traces (live or imported)                   |
| You don't need version history                                       | You need version history for audit / rollback                 |
| You don't need GEPA tuning                                           | You need GEPA tuning to push agreement-with-humans up         |

For **deterministic** dimensions (exact match, regex, JSON schema validity, Flesch-Kincaid grade, fairness math), use a **code grader** — these don't fit either the scorer or the judge surface; they run as separate deterministic checks in the evaluation runtime.
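To make the contrast concrete, here is a minimal sketch of what deterministic code graders look like — plain Python, not tied to any particular evaluation runtime. There is no rubric, no judging model, and nothing to tune:

```python
import json
import re

# Deterministic checks: the same input always yields the same verdict.
def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

def matches_pattern(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None

def is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```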

## Judge model selection

The judging model is a knob. Stronger models grade more reliably but cost more. Most rubrics work fine with a balanced choice; only use the strongest model when the dimension is genuinely subtle.

## Judge optimization (GEPA)

GEPA tunes the judge's rubric prompt against a labeled ground-truth set: it iterates prompt variations, picks the variation that agrees most with the human labels, and repeats.

### What GEPA actually does (algorithm sketch)

GEPA is an **evolutionary prompt search**, not a hand-tuning helper. Each iteration:

1. **Score the current rubric** against the labeled set; record per-example agreement.
2. **Identify systematic disagreements** — examples where the judge's verdict consistently misses the human label.
3. **Generate candidate rubric variations** targeting those disagreements (a meta-LLM proposes rubric edits).
4. **Score each candidate** against the same labeled set.
5. **Promote** the highest-scoring candidate.
6. **Stop** when score plateaus across a window of iterations or the iteration budget is reached.

Convergence typically happens within 15-25 iterations on a 30-100-example labeled set. The optimized rubric is stored as a versioned artifact alongside the original — you can roll back if it regresses on a held-out set. GEPA does not retrain the judging LLM; only the prompt changes. That keeps optimization cheap (tens of dollars rather than tens of thousands) and reproducible (the rubric is a string, not a model checkpoint).
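In pseudocode, the loop above looks roughly like the sketch below. The helpers `run_judge` and `propose_rubric_edits`, the data layout of the labeled set, and the plateau check are assumptions for illustration; the actual implementation lives behind the GEPA optimizer.

```python
def agreement(rubric, labeled_set, run_judge) -> float:
    """Fraction of labeled examples where the judge's verdict matches the human label."""
    verdicts = [run_judge(rubric, ex["output"]) for ex in labeled_set]
    return sum(v == ex["human_label"] for v, ex in zip(verdicts, labeled_set)) / len(labeled_set)

def gepa_optimize(rubric, labeled_set, run_judge, propose_rubric_edits,
                  max_iters=25, plateau_window=5):
    best, best_score = rubric, agreement(rubric, labeled_set, run_judge)
    history = [best_score]
    for _ in range(max_iters):
        # 1-2. Score the current rubric and collect systematic disagreements.
        misses = [ex for ex in labeled_set
                  if run_judge(best, ex["output"]) != ex["human_label"]]
        # 3. A meta-LLM proposes rubric edits targeting those misses.
        candidates = propose_rubric_edits(best, misses)
        # 4-5. Score each candidate; promote the highest-scoring one.
        for cand in candidates:
            score = agreement(cand, labeled_set, run_judge)
            if score > best_score:
                best, best_score = cand, score
        history.append(best_score)
        # 6. Stop when agreement plateaus across the window.
        if len(history) > plateau_window and best_score <= history[-plateau_window - 1]:
            break
    return best, best_score  # the optimized rubric is just a string
```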

### Why GEPA matters

Out-of-the-box LLM judges agree with humans about 60-70% of the time on subtle dimensions — usable but flaky. GEPA-optimized judges typically reach 85-95% agreement. The gap matters when you're using judges as CI gates.

### When to GEPA-optimize

* You have ≥30 labeled examples
* The judge is being used in a CI gate or in continuous evaluation
* Out-of-the-box agreement isn't good enough for your bar

### When NOT to GEPA-optimize

* You're prototyping — eyeballing 5 outputs is fine
* You don't have labels — without labels GEPA has nothing to optimize against
* The dimension is well-served by a code grader (deterministic match, schema, math)

[More: Judge Optimization (GEPA)](/9.-improve-tune-the-system/judge-optimization.md)

## System judges

LayerLens ships system judges as starting points: helpfulness, faithfulness, safety, tone, brevity, structured-output validity. Clone and customize for your team.

## Common judge dimensions

* **Helpfulness** — does the response advance the user's goal?
* **Faithfulness** — does the response ground its claims in retrieved or supplied context?
* **Safety** — does the response avoid harmful content?
* **Tone** — does the response match the desired tone?
* **Brevity** — is the response appropriately concise?
* **Structured-output validity** — for output formats that can't be checked with a JSON schema, did the model follow the required structure?
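For a sense of what a rubric looks like in practice, here is an illustrative faithfulness rubric for a binary judge. The wording is an example only (shown as a Python string purely for formatting); write your own to match your team's bar.

```python
# Illustrative rubric for a binary (pass/fail) faithfulness judge.
FAITHFULNESS_RUBRIC = """\
You are grading whether a response is faithful to the supplied context.

Pass if every factual claim in the response is supported by the context.
Fail if the response asserts anything the context does not support, or
contradicts the context.

Respond with exactly one word: PASS or FAIL.
"""
```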

## Where to next

* [Judge optimization (GEPA)](/9.-improve-tune-the-system/judge-optimization.md)
* [Scorers](/8.-evaluate-score-the-outputs/scorers-1.md)
* [Stratix Premium — Judges](/8.-evaluate-score-the-outputs/judges.md)
* [Tutorial: Build your first judge](/8.-evaluate-score-the-outputs/02-first-judge.md)
* [Tutorial: Optimize a judge with GEPA](/9.-improve-tune-the-system/05-gepa-optimize.md)

