# Benchmarks catalog

The Stratix Public Benchmarks catalog is the broadest openly browsable collection of LLM benchmarks: 52+ benchmarks, each with documented methodology, sample tasks, and per-model scores.

URL: [`stratix.layerlens.ai/benchmarks`](https://stratix.layerlens.ai/benchmarks)

## What you can see for each benchmark

* **Description** — what the benchmark measures
* **Methodology** — how scoring works, sample size, harness notes
* **Sample tasks** — representative inputs from the benchmark
* **Per-model scores** — every model evaluated against this benchmark
* **Top performers** — leaderboard for this specific benchmark
* **Score history** — how the frontier has moved on this benchmark over time
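
Taken together, these fields describe the shape of a catalog entry. The sketch below is purely illustrative — the field names and types are assumptions for this page, not the catalog's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkEntry:
    """Illustrative shape of a catalog entry; not the real Stratix schema."""
    name: str
    description: str           # what the benchmark measures
    methodology: str           # scoring, sample size, harness notes
    sample_tasks: list[str]    # representative inputs from the benchmark
    scores: dict[str, float]   # model name -> score on this benchmark
    score_history: dict[str, list[float]] = field(default_factory=dict)  # model -> scores over time

    def top_performers(self, n: int = 5) -> list[tuple[str, float]]:
        """Leaderboard for this benchmark: top-n models by score."""
        return sorted(self.scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```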

## Filtering

* Capability — reasoning, code, math, multilingual, vision, multi-turn
* Difficulty — easy / medium / hard
* Sample size — small / medium / large
* License — open / restricted

## Sorting

* Alphabetical
* By difficulty
* By number of models evaluated
* By recency
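
Filtering and sorting happen in the catalog UI, but the same facets are easy to apply to any local export of the metadata. A minimal sketch, assuming a list of dicts with `capability`, `difficulty`, `sample_size`, `license`, and `num_models_evaluated` keys — the field names and values here are made up for illustration, not the catalog's export format:

```python
# Hypothetical catalog export: field names and values are illustrative only.
catalog = [
    {"name": "HumanEval", "capability": "code", "difficulty": "medium",
     "sample_size": "small", "license": "open", "num_models_evaluated": 40},
    {"name": "MMLU", "capability": "reasoning", "difficulty": "hard",
     "sample_size": "large", "license": "open", "num_models_evaluated": 52},
    {"name": "GSM8K", "capability": "math", "difficulty": "medium",
     "sample_size": "large", "license": "open", "num_models_evaluated": 48},
]

# Filter: capability + license, mirroring the catalog's filter facets.
open_code_benchmarks = [
    b for b in catalog
    if b["capability"] == "code" and b["license"] == "open"
]

# Sort: by number of models evaluated, descending (one of the catalog's sort orders).
by_coverage = sorted(catalog, key=lambda b: b["num_models_evaluated"], reverse=True)

print([b["name"] for b in open_code_benchmarks])  # ['HumanEval']
print([b["name"] for b in by_coverage])           # ['MMLU', 'GSM8K', 'HumanEval']
```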

## Per-benchmark page

Each benchmark has a dedicated page with:

* Hero card (name, capability, sample size)
* Full methodology notes
* Sample tasks (a representative subset)
* Score table with every model's result
* Top-N leaderboard
* Score-history chart
* Linked public evaluations

## Picking a benchmark

The hardest part of using benchmarks is picking the right ones. Some heuristics:

* **Don't pick more than 3-5.** More benchmarks don't mean more signal.
* **Match capability to task** (see the sketch after this list). Code task → HumanEval/MBPP. Math task → MATH/GSM8K. General reasoning → MMLU/ARC.
* **Validate that the benchmark's distribution matches yours.** A model that scores high on a benchmark drawn from a totally different domain doesn't help you.
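
The second heuristic is mechanical enough to encode. The mapping below is taken from the bullet above; the function and its names are hypothetical helpers, not part of Stratix:

```python
# Illustrative only: mirrors the capability -> benchmark heuristic above.
CAPABILITY_TO_BENCHMARKS = {
    "code": ["HumanEval", "MBPP"],
    "math": ["MATH", "GSM8K"],
    "general_reasoning": ["MMLU", "ARC"],
}

def shortlist_benchmarks(task_capability: str, max_benchmarks: int = 5) -> list[str]:
    """Return a small benchmark shortlist for a task; 3-5 is usually enough."""
    picks = CAPABILITY_TO_BENCHMARKS.get(task_capability, [])
    return picks[:max_benchmarks]

print(shortlist_benchmarks("code"))  # ['HumanEval', 'MBPP']
```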

## Quarterly methodology updates

Each quarterly research report documents methodology changes for the quarter — new benchmarks added, scoring changes, harness updates. This is the most rigorous public discussion of benchmark methodology you'll find.

## Want to run your own benchmarks?

The Public catalog scores models against an open library of benchmarks. If your task is domain-specific — pricing in your tariff, citing your jurisdiction, answering against your documents — public benchmarks narrow the field but won't make the final call.

[**Stratix Premium**](/5.-select-pick-the-model/benchmarks.md) lets you author **custom benchmarks** from your own data and run the same candidate models against them. Custom benchmarks live in your workspace, are versioned, and rerun on demand whenever a new model lands in the catalog. See [Stratix Premium → Benchmarks](/5.-select-pick-the-model/benchmarks.md) for the custom-benchmark workflow.
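
What authoring a custom benchmark from your own data looks like is documented on the linked Premium page; the sketch below only illustrates the general idea of turning your own documents into prompt/expected-answer pairs. Nothing in it is a Stratix API — the file format, field names, and example tasks are all hypothetical:

```python
import json

# Hypothetical example: turn your own Q&A pairs into a benchmark task file.
# The prompts, expected answers, and JSONL format are illustrative assumptions.
custom_tasks = [
    {"prompt": "What is the surcharge for zone 3 under our 2024 tariff?",
     "expected": "4.2% on the base rate"},
    {"prompt": "Which clause of our MSA covers data retention?",
     "expected": "Clause 11.3"},
]

with open("custom_benchmark.jsonl", "w") as f:
    for task in custom_tasks:
        f.write(json.dumps(task) + "\n")
```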

## Where to next

* [Models catalog](/5.-select-pick-the-model/models-catalog.md)
* [Public evaluations](/5.-select-pick-the-model/public-evaluations.md)
* [Quarterly reports](/5.-select-pick-the-model/quarterly-reports.md)
* [Concept: Models and benchmarks](/5.-select-pick-the-model/models-and-benchmarks.md)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available on this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.layerlens.ai/5.-select-pick-the-model/benchmarks-catalog.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
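
For example, from Python — the URL and the `ask` parameter are the ones documented above; the question text is just a placeholder:

```python
import urllib.parse
import urllib.request

question = "Which benchmarks in the catalog cover multilingual capability?"
url = (
    "https://docs.layerlens.ai/5.-select-pick-the-model/benchmarks-catalog.md?"
    + urllib.parse.urlencode({"ask": question})
)

with urllib.request.urlopen(url) as resp:
    # The response contains a direct answer plus relevant excerpts and sources.
    print(resp.read().decode("utf-8"))
```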
