# Benchmark-driven development

Most teams treat benchmarks as a one-time activity at the start of a project. Pick a model, glance at MMLU, get on with it. **Benchmark-driven development** flips that — every meaningful change to your AI feature runs against the benchmarks that matter, and every result lives on a shared dashboard.

## The shape of the work

1. **Pick your benchmarks.** Public benchmarks for general capability + your own private benchmark suite for task-specific quality.
2. **Wire it into your dev loop.** Local: `pytest`-style harness running a small slice (see the sketch after this list). CI: full benchmark suite on every PR. Nightly: full suite + production-traffic slice.
3. **Track the score over time.** Stratix's evaluation history page shows the score curve per benchmark.
4. **Block regressions in CI.** A score that drops below baseline blocks the PR.
5. **Report wins to the team.** When a change improves a benchmark, share it.

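A minimal local version of steps 2 and 4 can be a plain `pytest` check: score a small slice of your private benchmark and fail whenever the mean drops below the last known-good baseline. Everything in this sketch (the JSONL path, the `call_model()` stub, the exact-match grader, and the 0.85 baseline) is a placeholder to adapt, not part of the Stratix SDK or CLI.

```python
# Minimal pytest gate: score a small slice of the private benchmark and
# fail if the mean drops below the recorded baseline. All names below
# (dataset path, call_model stub, grader, baseline) are placeholders.
import json
from pathlib import Path

BASELINE = 0.85      # last known-good score for this suite (placeholder)
SLICE_SIZE = 25      # a small slice keeps the local loop fast

def call_model(prompt: str) -> str:
    # Replace with your real model call (SDK, HTTP endpoint, etc.).
    return ""

def grade(expected: str, actual: str) -> float:
    # Placeholder grader: exact match. Swap in your own grading config.
    return 1.0 if expected.strip() == actual.strip() else 0.0

def test_benchmark_gate():
    lines = Path("benchmarks/private.jsonl").read_text().splitlines()
    cases = [json.loads(line) for line in lines][:SLICE_SIZE]
    mean = sum(grade(c["expected"], call_model(c["prompt"])) for c in cases) / len(cases)
    assert mean >= BASELINE, f"score {mean:.2f} fell below baseline {BASELINE:.2f}"
```

Run the same test in CI with the slice limit removed and the PR is blocked automatically whenever the score regresses.
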
## Why it works on Stratix

* **52+ public benchmarks ready to use** — you don't have to host the data
* **Private benchmark suites** — upload your own dataset and grading config
* **Score history** — every run is recorded; trends are visible
* **CI gates** — easily wired via the SDK or CLI
* **Compare models** — when a benchmark gets stronger, see if a different model is now a better fit

## Tools you'll use

* [Stratix Public — Benchmarks catalog](/5.-select-pick-the-model/benchmarks-catalog.md)
* [Stratix Premium — Benchmarks](https://github.com/LayerLens/gitbook-full/blob/main/13-reference/cli/benchmarks.md)
* [SDK: `client.evaluations.create()`](/4.1-general-use-cases/general.md) (usage sketch below)
* [CLI: `layerlens evaluate`](/4.1-general-use-cases/general.md)

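If you drive evaluations from code, the SDK call linked above is the entry point. The sketch below is an assumption-heavy illustration: only `client.evaluations.create()` is named on this page, so the import path, authentication, and parameter names are guesses to verify against the SDK reference.

```python
# Hypothetical SDK usage. Only client.evaluations.create() is named on
# this page; the import, auth, and parameters below are assumptions --
# check them against the SDK reference before relying on this.
from layerlens import Client  # assumed package and constructor name

client = Client(api_key="YOUR_API_KEY")  # assumed auth style

run = client.evaluations.create(
    benchmark="my-private-suite",        # assumed parameter name
    model="provider/model-id",           # assumed parameter name
)
print(run)  # inspect the returned run/score object for your SDK version
```
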
## Outcomes you should see

You'll know this is working when:

* **Every PR touching AI runs a benchmark gate** — non-negotiable, not opt-in.
* **Your benchmark gating eval runs in <5 minutes** — fast enough that no one wants to skip it.
* **Score-over-time trends are visible to the whole team**, not buried in one engineer's terminal.
* **Net-new benchmarks reach 80% of your team within a week** — adding a new signal is cheap and immediately shared.

## Anti-patterns

* **Benchmark inflation.** Adding 30 benchmarks because "more signal is better." Pick 3-5 that matter and run them often.
* **Treating public scores as your scores.** Public scores are a leading indicator. Your private benchmark on your data is the verdict.
* **Skipping CI.** If benchmarks don't gate merges, they're decoration.

## Where to next

* [Tutorial: Wire CI/CD quality gates](/6.-build-wire-your-code/03-cicd-gates.md)
* [Workflow: Evaluate](/9.-improve-tune-the-system/workflow.md)
* [Concept: Models and benchmarks](/5.-select-pick-the-model/models-and-benchmarks.md)
* [Cookbook: CI/CD recipes](/2.-get-started/all-cookbook-recipes.md)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available on this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.layerlens.ai/4.1-general-use-cases/benchmark-driven-development.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question, along with relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present on the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
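
For example, from a Python script or agent tool, this is an ordinary GET request. The snippet below uses the `requests` library (our choice, not mandated by this page) and lets it URL-encode the question:

```python
# Ask this page a question via the `ask` query parameter.
import requests

PAGE = "https://docs.layerlens.ai/4.1-general-use-cases/benchmark-driven-development.md"
question = "How do I gate a pull request on a private benchmark score?"

resp = requests.get(PAGE, params={"ask": question}, timeout=30)
resp.raise_for_status()
print(resp.text)  # a direct answer plus relevant excerpts and sources
```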
