# System judges

{% hint style="info" %}
**Available in Stratix Premium.** System judges are part of the logged-in workspace at [stratix.layerlens.ai](https://stratix.layerlens.ai). Stratix Public users can read about them here but cannot run them.
{% endhint %}

System judges are LLM-as-Judge rubrics that ship pre-built with every Premium workspace. They cover the dimensions teams most often need before they have authored any judges of their own. You can use them as-is, clone them as the starting point for your own variant, or run [GEPA optimization](/9.-improve-tune-the-system/judge-optimization.md) against them with your labeled examples to improve their agreement with your human reviewers.

System judges are identified in the dashboard with a "System" badge and live in the same Judges catalog as your org's custom judges. As system records they are **read-only**; cloning is how you customize them.

### How system judges differ from custom judges

|                   | System judge                                                                          | Custom judge                    |
| ----------------- | ------------------------------------------------------------------------------------- | ------------------------------- |
| Author            | LayerLens                                                                             | Your org                        |
| `is_system` flag  | `true`                                                                                | `false`                         |
| `organization_id` | null                                                                                  | your org's id                   |
| Editing           | clone-then-edit                                                                       | edit in place                   |
| Versioning        | versioned by LayerLens releases                                                       | versioned per edit by your team |
| GEPA optimization | run against your own labeled set; the result is a versioned variant in your org's catalog | runs directly on the judge       |

### The shipping set

The following system judges are available in every Premium workspace. Each judge is a complete LLM rubric in the `judges` collection — name, `evaluation_goal`, `model_id`, and `is_system: true`.
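
For orientation, here is a hedged sketch of how a system-judge record might look when fetched via the SDK. Only the fields named above (plus `id` and `organization_id`, which appear elsewhere on this page) are documented; the example values are placeholders.

```python
# Hypothetical shape of a system-judge record, assembled from the
# fields documented on this page. All values are placeholders.
faithfulness_record = {
    "id": "judge_abc123",           # printed as j.id in the SDK example below
    "name": "Faithfulness",
    "evaluation_goal": "Every factual claim in OUTPUT is supported by CONTEXT",
    "model_id": "claude-opus",      # "Claude Opus class (frontier)" default
    "is_system": True,              # read-only; clone to customize
    "organization_id": None,        # system judges belong to no org
}
```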

#### Faithfulness

**What it scores:** Whether every factual claim in the model's OUTPUT is supported by the supplied CONTEXT (retrieved documents, source-of-truth data). Returns a 0.0–1.0 score plus a list of unsupported claims.

**When to use:** Any RAG-pattern application. Customer-support assistants with knowledge bases. Technical Q\&A grounded in docs. Healthcare / legal / financial summarization of supplied source material.

**Inputs:** `{{output}}`, `{{context}}`.

**Default model:** Claude Opus class (frontier).

**Tuning notes:** Penalize confident-sounding hallucinations the hardest. Refusal *when the context is genuinely insufficient* should not score low. See the [Faithfulness implementation reference](https://github.com/LayerLens/gitbook-full/blob/main/industry/judges/system/faithfulness-implementation.md) for the full rubric.
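
A minimal sketch of consuming a Faithfulness result, assuming the `client` from the SDK example later on this page; the `score` and `unsupported_claims` accessors are assumed names based on the description above, and the threshold is illustrative.

```python
# Hedged sketch: run the Faithfulness judge against a trace and triage
# the result. `score` and `unsupported_claims` are assumed field names.
result = client.trace_evaluations.create(
    trace_id=trace_id,
    judge_id=faithfulness_judge_id,
)
if result.score < 0.8:  # illustrative threshold, not a recommendation
    for claim in result.unsupported_claims:
        print("unsupported claim:", claim)
```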

#### Hallucination Detector

**What it scores:** Inverse of faithfulness — surfaces specific hallucinated claims and ranks severity (`none`, `minor`, `material`, `dangerous`). Complements faithfulness when you need actionable failure-mode breakdowns rather than a single score.

**When to use:** When you need severity triage on top of an overall score — clinical reasoning, news summarization, legal research, fraud case notes.

**Inputs:** `{{output}}`, `{{context}}` (optional).

**Default model:** Claude Opus class.

#### Refusal Quality

**What it scores:** Whether the model answered or refused **appropriately**, **over-refused** (refused when the context supported a useful answer), or **under-refused** (answered without sufficient grounding). Returns one of four verdicts: `appropriate_answer`, `appropriate_refusal`, `over_refusal`, `under_refusal`.

**When to use:** Customer-service assistants where dismissive refusals are a CX failure; clinical / financial advice where under-refusal is a safety failure; benefits eligibility where the rule is "advisory only."

**Inputs:** `{{prompt}}`, `{{output}}`, `{{context}}` (optional), `{{scope}}` (optional).
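
The four verdicts lend themselves to routing logic. A minimal sketch, assuming the evaluation result exposes the verdict as a string field; `result.verdict`, the judge-id variable, and the review helpers are hypothetical names.

```python
# Hedged sketch: route on the Refusal Quality verdict. The four verdict
# strings come from this page; everything else is an assumed name.
result = client.trace_evaluations.create(
    trace_id=trace_id,
    judge_id=refusal_quality_judge_id,
)
if result.verdict == "over_refusal":
    flag_for_cx_review(result)      # dismissive refusal is a CX failure
elif result.verdict == "under_refusal":
    flag_for_safety_review(result)  # ungrounded answer is a safety failure
# "appropriate_answer" and "appropriate_refusal" need no follow-up
```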

#### Citation Accuracy

**What it scores:** Whether citations in the OUTPUT actually support the claims they're attached to. Returns a 0.0–1.0 score plus a list of citation failures categorized as `not_in_source`, `misquoted`, or `wrong_section`.

**When to use:** Legal research, scientific summarization, regulatory filings, citation-grounded customer support. **Always pair with the deterministic Citation Existence code grader** — existence is a database lookup; this judge evaluates support after existence is verified.

**Inputs:** `{{output}}`, `{{context}}`.
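
The two-stage pattern might be wired as follows; the existence-grader call and the judge-id variable are hypothetical names, shown only to make the ordering concrete.

```python
# Hedged sketch of the recommended pairing: the deterministic existence
# check (a database lookup) runs first, and the Citation Accuracy judge
# evaluates support only for citations that exist.
existence = run_citation_existence_grader(trace_id)  # hypothetical code-grader call
if existence.all_citations_exist:
    support = client.trace_evaluations.create(
        trace_id=trace_id,
        judge_id=citation_accuracy_judge_id,
    )
```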

#### Tone & Register

**What it scores:** Whether the output matches an expected register (formal / clinical / warm CSR / professional / playful). Sub-scores for warmth, formality, brevity, jargon.

**When to use:** Customer-facing assistants where brand voice or service register matters. Healthcare patient comms. Education student support.

**Inputs:** `{{output}}`, `{{expected_register}}`.

#### De-escalation Quality

**What it scores:** How well the model handles a stress turn: an angry customer, a complex disruption, a sensitive topic. Rates acknowledgement, the offered resolution pathway, and tone in combination. Sub-booleans surface the failure mode.

**When to use:** Customer service across industries; insurance claims intake; travel disruption; healthcare patient triage; education student support.

**Inputs:** `{{context}}` (conversation), `{{output}}` (final turn).

#### Plain-language Quality

**What it scores:** Whether the output is understandable to its intended audience — concept-unpacking, actionability, jargon explanation. Pairs with the deterministic Flesch-Kincaid code grader.

**When to use:** Government citizen services; healthcare patient communications; education student-facing; insurance policy Q\&A; consumer-finance disclosures.

**Inputs:** `{{output}}`, `{{audience}}`.

#### Explanation Quality

**What it scores:** Whether the OUTPUT's reasoning is followable by a non-technical reviewer — typically a regulator, auditor, or judge. Sub-scores for clarity, completeness, and traceability.

**When to use:** Adverse-action notices (ECOA / FCRA / FHA), insurance underwriting rationale, benefits eligibility, fraud SIU referrals, clinical decision support, FOIA exemption rulings.

**Inputs:** `{{decision}}`, `{{output}}` (the reasoning), `{{reviewer_profile}}`.

#### Multilingual Parity

**What it scores:** Whether a translated output preserves the accuracy, tone, and required disclosures of the source. Detects the "English-first quality, degraded in other languages" failure mode.

**When to use:** Government citizen services (LEP populations); retail customer service; healthcare patient communications; travel booking; telecom CSR; insurance policy Q\&A.

**Inputs:** `{{source_output}}`, `{{output}}` (translation), `{{target_language}}`, `{{glossary}}` (optional).

#### Editorial Judgment

**What it scores:** Whether the OUTPUT meets newsroom standards — accuracy of framing, attribution, headline-vs-content alignment, absence of inserted bias.

**When to use:** News summarization; media recommendation row-titles; podcast / video copy generation; education content with editorial voice.

**Inputs:** `{{context}}` (sources), `{{output}}`.

#### Industry-specialty system judges

The following system judges target specific high-stakes verticals. They ship with the shape described below; tenants in those industries can clone and tune.

| Judge                          | Industry           | What it scores                                                                              |
| ------------------------------ | ------------------ | ------------------------------------------------------------------------------------------- |
| Contract Clause Interpretation | Legal, real estate | Interpretation matches the clause's plain meaning + governing law                           |
| Mata Citation Verifier         | Legal              | Filed-output citations exist, are good law, and support the claim                           |
| Privilege Leak Detector        | Legal              | Output does not leak attorney-client privileged content                                     |
| Clinical Reasoning Soundness   | Healthcare         | Differential diagnosis appropriate; red-flag features identified                            |
| Medical-coding Justification   | Healthcare         | ICD-10-CM / CPT / HCPCS codes are supported by the documentation, with no unbundling or upcoding |
| Switching-step Faithfulness    | Energy & utilities | Grid switching sequence matches approved procedures with safety preconditions               |
| Network Mitigation Rationale   | Telecom, energy    | Recommended incident mitigation is grounded in the runbook                                  |
| Fare-rule Paraphrase Quality   | Travel             | Fare-rule paraphrase preserves conditions, dollar amounts, time windows                     |
| Steering-language Detector     | Real estate        | Listing copy doesn't subtly steer buyers/renters on protected-class basis                   |
| Pedagogical Scaffolding        | Education          | AI tutor guides rather than gives outright answers                                          |

Each industry-specialty judge has the same fields as a general system judge (`name`, `evaluation_goal`, `model_id`, `is_system: true`). The full rubric for each is documented at [docs/industry/judges/system/](https://github.com/LayerLens/gitbook-full/blob/main/industry/judges/system/README.md).

### Using a system judge

#### From the dashboard

1. Open **Judges** in the Premium navigation
2. Filter by **Type: System**
3. Click the judge to see its rubric, default model, and `evaluation_goal`
4. To run against a trace, open the trace and add the judge from the trace-evaluation panel
5. To clone for editing, click **Clone** — a copy is created with `organization_id` set to your org and `is_system: false`

#### From the SDK

```python
from layerlens import Stratix

client = Stratix()

# List the system judges available in the workspace
system_judges = client.judges.get_many(type="system")
for j in system_judges:
    print(j.id, j.name)

# Run a system judge against a trace; trace_id and
# faithfulness_judge_id are placeholders for your own ids
result = client.trace_evaluations.create(
    trace_id=trace_id,
    judge_id=faithfulness_judge_id,
)
```
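
Cloning is documented above as a dashboard action; if your SDK version exposes it, the call might look like this sketch, where `client.judges.clone` is an assumed method rather than a confirmed endpoint.

```python
# Hedged sketch: clone a system judge into your org's catalog.
# Per this page, the copy gets your organization_id and is_system=False.
my_faithfulness = client.judges.clone(judge_id=faithfulness_judge_id)
assert my_faithfulness.is_system is False  # editable in place from here
```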

### Cloning + GEPA-optimizing a system judge

If a system judge's out-of-the-box agreement with your human reviewers does not meet your bar:

1. **Clone** the system judge — creates an editable copy in your org's catalog
2. **Gather labels** — ≥ 30 paired examples covering the verdict space (more for multi-class)
3. **Run GEPA optimization** against your labeled set — see [Judge Optimization (GEPA)](/9.-improve-tune-the-system/judge-optimization.md)
4. **Validate** on a held-out 20% slice before deployment

GEPA-optimized variants are versioned in your org's catalog separately from the system judge they descended from. You can roll back if a new variant regresses.
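
Put together, the four steps might look like the following sketch; the `clone` method and the `run_gepa_optimization` / `measure_agreement` helpers are assumed names, and the documented flow lives on the GEPA page linked above.

```python
# Hedged end-to-end sketch of clone -> label -> optimize -> validate.
variant = client.judges.clone(judge_id=system_judge_id)   # 1. clone (assumed method)
labels = load_labeled_examples("labels.jsonl")            # 2. >= 30 paired examples
split = int(len(labels) * 0.8)
train, holdout = labels[:split], labels[split:]           # hold out 20% for validation
optimized = run_gepa_optimization(variant, train)         # 3. GEPA run (hypothetical helper)
agreement = measure_agreement(optimized, holdout)         # 4. validate before deploying
```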

### Where to next

* [Judges (concept)](/8.-evaluate-score-the-outputs/judges-1.md)
* [Judges (Premium surface)](/8.-evaluate-score-the-outputs/judges.md)
* [Judge Optimization (GEPA)](/9.-improve-tune-the-system/judge-optimization.md)
* [SDK reference — Judges](/8.-evaluate-score-the-outputs/judges.md)
* [System judge implementation reference](https://github.com/LayerLens/gitbook-full/blob/main/industry/judges/system/README.md)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available on this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.layerlens.ai/8.-evaluate-score-the-outputs/system-judges.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
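
For example, a minimal call with Python's `requests` library; only the URL and the `ask` parameter are documented here, and the question itself is illustrative.

```python
import requests

# Ask this page's documentation endpoint a natural-language question.
resp = requests.get(
    "https://docs.layerlens.ai/8.-evaluate-score-the-outputs/system-judges.md",
    params={"ask": "Which system judges require a context input?"},
)
print(resp.text)  # direct answer plus relevant excerpts and sources
```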
