# Model evaluation

Picking a model is the most expensive decision you'll make as an AI team. The wrong choice locks in months of wasted iteration on a foundation that was never going to work. Stratix's model-evaluation use case is the disciplined version of that decision.

## The shape of the work

1. **Frame the question.** What does your feature actually need? Reasoning, structured output, multi-turn dialog, code, multilingual, multimodal? Don't pick a model on vibes.
2. **Find the right benchmarks.** Browse the [benchmarks catalog](/5.-select-pick-the-model/benchmarks-catalog.md). Pick 1-3 that map to your task. (Most teams pick too many; resist.)
3. **Browse the public leaderboard.** [Compare models](/5.-select-pick-the-model/compare-models.md) head-to-head. Get to a shortlist of 3-5.
4. **Run a private evaluation on your data.** Public scores tell you what's strong in general. Your data tells you what's strong for *you*. Stratix Premium runs a private evaluation against your dataset (see the SDK sketch after this list).
5. **Read the verdict, not the headline.** A model that scores 2% higher overall but 15% lower on your one critical dimension is the wrong pick.
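
A minimal sketch of steps 4-5, assuming a Python SDK that exposes the `client.evaluations.create()` method listed under "Tools you'll use" below. The import path, client constructor, parameter names, and result fields shown here are illustrative assumptions, not the documented API.

```python
# Hypothetical import path and client setup, for illustration only.
from stratix import Stratix

client = Stratix(api_key="YOUR_API_KEY")

# Step 4: run a private evaluation of each shortlisted model against your data.
shortlist = ["model-a", "model-b", "model-c"]
evaluations = [
    client.evaluations.create(
        model=model_id,
        dataset="my-support-tickets",  # your own data, not a public benchmark
        metrics=["accuracy", "latency", "cost"],
    )
    for model_id in shortlist
]

# Step 5: read the verdict, not the headline. Inspect the dimension you
# care about, not just the aggregate score.
for evaluation in evaluations:
    print(evaluation.id, evaluation.model, evaluation.scores)
```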

## Why it works on Stratix

* **Public catalog gives you the shortlist for free.** No multi-week vendor proofs of concept just to find candidate models.
* **Private evaluations on your data give you the verdict.** Public leaderboards don't see your prompts or your distribution.
* **Compare models is built in.** Side-by-side score tables with confidence intervals.
* **Costs are visible up-front.** ECU consumption is shown before you run.

## Tools you'll use

* [Stratix Public — Models catalog](/5.-select-pick-the-model/models-catalog.md)
* [Stratix Public — Compare models](/5.-select-pick-the-model/compare-models.md)
* [Stratix Premium — Evaluations](/8.-evaluate-score-the-outputs/evaluations.md)
* [Stratix Premium — Benchmarks](https://github.com/LayerLens/gitbook-full/blob/main/13-reference/cli/benchmarks.md)
* [SDK: `client.evaluations.create()`](/4.1-general-use-cases/general.md)

## Outcomes you should see

You'll know this is working when:

* **Model selection time drops from weeks to hours.** Public catalog narrows; private eval decides.
* **Every model decision has a citable evaluation ID.** No "we picked X because it felt better."
* **Re-evaluating a new candidate model takes under 1 hour.** Including standing up the eval against your dataset.
* **Cost and latency are part of the choice, not afterthoughts.** You can tell the team "we picked the cheaper model because at 95% of the quality it's 3× cheaper" (the sketch below shows the arithmetic).
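
The cost-versus-quality framing in the last bullet is simple arithmetic. A quick sketch, with invented figures chosen to match the "95% of the quality, 3× cheaper" example:

```python
# Illustrative figures only: quality as an eval score, cost per 1k requests.
incumbent = {"quality": 0.80, "cost_per_1k": 3.00}
candidate = {"quality": 0.76, "cost_per_1k": 1.00}  # 95% of the quality, 3x cheaper

def cost_per_quality_point(model):
    """Cost normalized by quality; lower is better for the same workload."""
    return model["cost_per_1k"] / model["quality"]

print(cost_per_quality_point(incumbent))           # 3.75
print(round(cost_per_quality_point(candidate), 2)) # 1.32
```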

## Anti-patterns

* **Benchmark cargo-culting.** "Everyone uses MMLU" is not why MMLU is right for your task.
* **One-shot picking.** The model frontier moves quarterly. Re-evaluate the candidates as new models drop.
* **Ignoring cost and latency.** A 2% accuracy gain at 4× cost is rarely worth shipping.

## Where to next

* [Concept: Models and benchmarks](/5.-select-pick-the-model/models-and-benchmarks.md)
* [Tutorial: First evaluation in 10 minutes](/8.-evaluate-score-the-outputs/01-first-evaluation.md)
* [Workflow: Compare](/9.-improve-tune-the-system/workflow.md)
* [Cookbook: model-selection recipes](/2.-get-started/all-cookbook-recipes.md)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.layerlens.ai/4.1-general-use-cases/model-evaluation.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present on the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
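
A minimal sketch of that request using Python's `requests` library; the question text is just an example.

```python
import requests

PAGE_URL = "https://docs.layerlens.ai/4.1-general-use-cases/model-evaluation.md"
question = "How do I run a private evaluation against my own dataset?"

# Passing the question via `params` URL-encodes it into the `ask` query parameter.
response = requests.get(PAGE_URL, params={"ask": question})
print(response.text)  # direct answer plus relevant excerpts and sources
```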
