# Pattern: citizen-services chatbot

A state or federal agency operates an AI chatbot that helps citizens find services, check eligibility, and complete applications (benefits, permits, tax forms, licenses). The agency's source-of-truth policy documents update on their own cadence — and a stale chatbot answer can deny benefits citizens are entitled to.

This pattern shows how to evaluate citizen-services chatbots for policy accuracy, accessibility, and equity.

## What's at stake

| Risk dimension                                       | Magnitude                               | Framework                                 |
| ---------------------------------------------------- | --------------------------------------- | ----------------------------------------- |
| Eligibility-misstatement civil-rights exposure       | Class-action and Title VI claims        | Title VI / ADA / state agency rules       |
| Section 508 accessibility violations                 | Per-violation penalties                 | Section 508 (federal) / state equivalents |
| Public-trust impact from incorrect official guidance | Long-tail agency-credibility damage     | Public agency-performance research        |
| Congressional / legislative inquiry                  | Hearing time, executive-branch response | Public oversight records                  |

## The evaluation pattern

A **policy-grounded evaluation** runs against a versioned source-of-truth.

1. **Custom code grader (policy-document-hash check)** — every chatbot answer must reference the active policy document version; if the active document's hash differs from the version the answer cites, the answer is stale and the check fails.
2. **Faithfulness judge** (GEPA-tuned against ≥50 caseworker-labeled examples — scored output) — claims about eligibility and procedures are grounded in the cited policy section.
3. **Reading-level scorer** (Flesch-Kincaid or grade-level equivalent) — citizen-facing answers must be at or below the configured grade level (commonly 6th-8th grade for citizen services).
4. **Multilingual parity scorer** (custom code) — per-language accuracy must be within 5 percentage points of the primary-language baseline.
5. **Disclaimer-presence scorer** — the chatbot must include the agency's official "this is informational, not a determination" disclaimer where required.
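
Step 4's parity rule can be sketched as a plain function. This is a minimal sketch, not Stratix SDK API: the function name, the accuracy-dict input shape, and the `PARITY_TOLERANCE_PP` constant are all illustrative assumptions.

```python
PARITY_TOLERANCE_PP = 5.0  # max allowed gap vs. baseline, in percentage points

def parity_result(accuracy_by_language: dict[str, float], primary: str = "en") -> dict:
    """Flag any language whose accuracy trails the primary-language
    baseline by more than the tolerance."""
    baseline = accuracy_by_language[primary]
    gaps = {
        lang: round(baseline - acc, 2)
        for lang, acc in accuracy_by_language.items()
        if lang != primary
    }
    failing = {lang: gap for lang, gap in gaps.items() if gap > PARITY_TOLERANCE_PP}
    return {"passed": not failing, "baseline": baseline, "failing": failing}
```

Note the scorer compares each language to one primary-language baseline rather than pairwise, which keeps the result a single pass/fail per language.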

> Don't have labels yet? See [Bootstrap a judge before GEPA](https://github.com/LayerLens/gitbook-full/blob/main/08-evaluate/guides/bootstrap-judges.md) for the week-1 setup.

**Continuous trace evaluation:** sampled hourly during business hours. Policy-document-hash scorer runs on every trace (cheap and deterministic). Threshold alerts route to the agency's program managers.
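
The hash check stays cheap and deterministic because it is a pure byte comparison: hash the active document's content once, compare against the hash the answer cites. A minimal sketch — SHA-256 is an assumption here; the page does not specify the hash algorithm or registry shape:

```python
import hashlib

def policy_hash(document_text: str) -> str:
    """Deterministic content hash of one policy-document version.
    Any byte-level change to the document yields a new hash."""
    return hashlib.sha256(document_text.encode("utf-8")).hexdigest()

def is_stale(cited_hash: str, active_document_text: str) -> bool:
    """True when the answer cites a hash other than the active version's."""
    return cited_hash != policy_hash(active_document_text)
```

Because the comparison needs no model call, it can run on every trace rather than a sample.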

## Configuration in code

```python
# Python (SDK)
from layerlens import Stratix
client = Stratix()

policy_hash = client.scorers.create_code(
    name="policy-document-hash",
    code="""
cited = output.get('policy_version_hash')
active = policy_registry.active_hash(output['program'])
result = {'passed': cited == active, 'cited': cited, 'active': active}
""",
)

reading_level = client.scorers.create_code(
    name="reading-level",
    code="result = {'passed': flesch_kincaid_grade(output['text']) <= 8}",
)

faithfulness = client.judges.create(
    name="policy-faithfulness",
    evaluation_goal="Eligibility and procedure claims must be grounded in the cited policy section.",
)

trace_eval = client.trace_evaluations.create(
    trace_set={"tags": {"feature": "citizen-chatbot"}},
    scorers=[policy_hash.id, reading_level.id],
    judges=[faithfulness.id],
    schedule="hourly",
)
```
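
The reading-level scorer above calls a `flesch_kincaid_grade` helper whose source isn't shown (whether the platform provides one is not specified here). A self-contained sketch using the standard Flesch-Kincaid grade formula — the vowel-group syllable counter is a rough heuristic, not a dictionary-backed count:

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; production scorers use a
    pronunciation dictionary instead."""
    count = len(re.findall(r"[aeiouy]+", word.lower()))
    if word.lower().endswith("e") and count > 1:
        count -= 1  # drop a silent trailing 'e'
    return max(count, 1)

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59"""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59
```

Short words in short sentences score low; polysyllabic bureaucratic prose scores well above the 8th-grade threshold the config enforces.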

```typescript
// TypeScript (REST)
const r = await fetch("https://stratix.layerlens.ai/api/v1/trace-evaluations", {
  method: "POST",
  headers: {
    "X-API-Key": process.env.LAYERLENS_STRATIX_API_KEY!,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    trace_set: { tags: { feature: "citizen-chatbot" } },
    scorers: [policyHashId, readingLevelId],
    judges: [faithfulnessId],
    schedule: "hourly",
  }),
});
```
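
Step 5's disclaimer-presence scorer is not shown in either config block. As plain logic it reduces to a substring check; in this sketch the disclaimer wording, function name, and the `required` flag are placeholder assumptions — real agencies mandate exact wording per program.

```python
# Placeholder wording; substitute the agency's mandated disclaimer text.
DISCLAIMER = "this is informational, not a determination"

def disclaimer_result(answer_text: str, required: bool) -> dict:
    """Pass when the disclaimer is present, or when this answer
    type does not require one (e.g. pure navigation help)."""
    present = DISCLAIMER in answer_text.lower()
    return {"passed": present or not required, "present": present}
```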

## What you get

* Stale-policy answers detected within hours, not weeks.
* Multilingual accuracy parity is measured per language, not assumed.
* Auditor-ready evaluation history for civil-rights and accessibility audits.
* A pre-publication block prevents a citizen-facing release when evaluation scores regress.

## Stratix capabilities used

* [Custom code graders](https://github.com/LayerLens/gitbook-full/blob/main/08-evaluate/cookbook/custom-code-scorer.md) — document-hash, multilingual parity, reading-level
* [Judges with GEPA optimization](/8.-evaluate-score-the-outputs/judges-1.md) — policy faithfulness
* [Trace evaluations](/8.-evaluate-score-the-outputs/trace-evaluations.md) — continuous sampled evaluation
* [Notifications](https://github.com/LayerLens/gitbook-full/blob/main/13-reference/sdk-python/notifications.md) — program-manager routing

## Replicate this

**Get started:** [Cookbook: catch hallucinations](https://github.com/LayerLens/gitbook-full/blob/main/08-evaluate/cookbook/catch-hallucinations.md) is the closest runnable starter (policy faithfulness shape).

* [Industry → Government and public sector](/4.2-industry-use-cases/government-public-sector.md)
* [Concept: Continuous evaluation](/7.-observe-see-whats-happening/continuous-evaluation.md)
* [Workflow: Govern](/9.-improve-tune-the-system/workflow.md)

