# From LangSmith

LangSmith and Stratix solve overlapping problems with different shapes. Migration is **not a byte-for-byte port**; it is a translation of evaluation intent across two data models. Most teams complete the cutover in 2–4 weeks while running both systems in parallel.

## Concept mapping

| LangSmith                                | Stratix                                                                                                                                                   | Notes                                                                            |
| ---------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
| **Project**                              | Project (within an Organization)                                                                                                                          | Both org-scoped                                                                  |
| **Dataset**                              | Custom benchmark (private dataset)                                                                                                                        | Stratix benchmarks are versioned and rerunnable against any model in the catalog |
| **Example** (input + reference output)   | Benchmark row                                                                                                                                             | One-to-one                                                                       |
| **Run**                                  | Trace                                                                                                                                                     | Stratix traces are OpenTelemetry-aligned; richer span model                      |
| **Trace** (LangSmith's hierarchical run) | Trace + spans                                                                                                                                             | Stratix natively models tool calls, retrieval steps, and sub-LLM calls as spans  |
| **Evaluation**                           | Evaluation (against a benchmark) **or** Trace evaluation (against captured traces)                                                                        | Stratix splits these because their lifecycles differ                             |
| **LLM-as-a-judge evaluator**             | [Judge](/8.-evaluate-score-the-outputs/judges-1.md)                                                                                                       | Versioned, GEPA-tunable, optionally trace-applied                                |
| **Programmatic / heuristic evaluator**   | Choose by implementation: LLM-prompted = [Scorer](/8.-evaluate-score-the-outputs/scorers-1.md); deterministic = **code grader** in the evaluation runtime |                                                                                  |
| **Pairwise evaluation**                  | Compare-evaluations workflow (`samples/core/compare_evaluations.py`)                                                                                      | Side-by-side run diffs at the row level                                          |
| **Feedback** (user thumbs / labels)      | Feedback API on a trace                                                                                                                                   | Feeds back into judge optimization labels                                        |
| **Annotation queue**                     | Labeling workflow on the trace evaluation surface                                                                                                         | Manual labels feed GEPA                                                          |
| **Prompt Hub**                           | Prompt management in Premium                                                                                                                              | Stratix prompt versions are first-class artifacts                                |

## What does NOT map cleanly

* **LangSmith's "evaluator" abstraction is overloaded.** It can be LLM-based or pure-code or a mix. Stratix splits these intentionally — scorers are LLM-prompt, judges are versioned LLM rubrics, code graders are deterministic. The migration step is to **decide, per evaluator, which Stratix surface it belongs on.**
* **LangSmith's eval CLI vs. SDK pattern** has no 1:1 equivalent in Stratix; everything is SDK-first with optional CLI wrappers. See [Cookbook: CI/CD gates](/6.-build-wire-your-code/integration-github-actions.md).
* **LangSmith Hub-published prompts** don't auto-port; re-publish through Stratix Prompt management.

## Migration steps (phased cutover)

### Phase 1 — Inventory (Day 1)

1. List every LangSmith **dataset**, **evaluator**, and **active project**.
2. For each evaluator, classify: **LLM-judge**, **LLM-scorer**, or **code grader**.
3. Identify projects with active CI/CD gates — these are the most critical to port cleanly.
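
A minimal inventory sketch for step 1, assuming the LangSmith Python SDK (`langsmith.Client` with `list_datasets` and `list_projects`); evaluators usually live in your own evaluation code, so that part of the inventory stays manual.

```python
from langsmith import Client  # LangSmith Python SDK

ls = Client()

# Datasets to port as Stratix custom benchmarks (Phase 2)
for dataset in ls.list_datasets():
    print(f"dataset: {dataset.name} ({dataset.id})")

# Active projects whose traces and CI/CD gates need a Stratix home (Phase 5)
for project in ls.list_projects():
    print(f"project: {project.name}")

# Evaluators are defined in your own eval code; classify each one by hand
# as LLM-judge, LLM-scorer, or code grader (step 2).
```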

### Phase 2 — Port datasets (Days 2–3)

1. Export each LangSmith dataset to JSONL using `client.list_examples(dataset_id)` and `json.dumps` (see the export sketch after this list).
2. Upload to Stratix as a custom benchmark via the SDK:

```python
from layerlens import Stratix

client = Stratix()

# Create a private benchmark from the JSONL exported in step 1
benchmark = client.benchmarks.create_custom(
    name="legal-citation-bench",
    description="Ported from LangSmith dataset id=...",
    rows_jsonl_path="legal-citation-bench.jsonl",
)
```

3. Verify row count and schema match.
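
A sketch of the step 1 export and the step 3 sanity check, assuming the LangSmith Python SDK; the JSONL row shape (`input` / `expected_output` keys) is an assumption here, so match it to the schema your Stratix benchmark expects.

```python
import json

from langsmith import Client  # LangSmith Python SDK

ls = Client()
dataset_id = "..."  # the LangSmith dataset being ported

# Step 1: export every example to JSONL
row_count = 0
with open("legal-citation-bench.jsonl", "w") as f:
    for example in ls.list_examples(dataset_id=dataset_id):
        row = {
            "input": example.inputs,
            "expected_output": example.outputs or {},  # adjust keys to your benchmark schema
        }
        f.write(json.dumps(row) + "\n")
        row_count += 1

# Step 3: compare this count (and a spot-check of the schema) against the uploaded benchmark
print(f"exported {row_count} rows")
```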

### Phase 3 — Re-author evaluators (Days 4–8)

**For LLM judges:**

* Open the LangSmith evaluator prompt
* Create a Stratix judge via `client.judges.create(name=, evaluation_goal=, model_id=)` (or the dashboard's New Judge wizard); a sketch follows this list
* Paste the rubric text into `evaluation_goal`, adjusting input-variable placeholders (`{{output}}`, `{{context}}`, etc.)
* If you have ≥ 30 labeled examples, run [GEPA optimization](/9.-improve-tune-the-system/judge-optimization.md) to raise agreement with human labels
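
A minimal sketch of that judge re-creation using the `client.judges.create` call named above; the judge name, rubric text, and `model_id` are placeholders, not real identifiers.

```python
from layerlens import Stratix

client = Stratix()

# Rubric text pasted from the LangSmith evaluator prompt, with placeholders
# renamed to Stratix's {{...}} input variables
rubric = (
    "Score 1 if every citation in {{output}} appears verbatim in {{context}}; "
    "otherwise score 0 and name the citation that is missing."
)

judge = client.judges.create(
    name="citation-faithfulness",
    evaluation_goal=rubric,
    model_id="...",  # pick a judge model from your Stratix catalog
)
```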

**For LLM-prompted scorers:**

* Create via `client.scorers.create(name=, description=, model_id=, prompt=)`
* Apply in evaluation runs via `custom_scorer_ids=[...]`
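
The same pattern for an LLM-prompted scorer, again with placeholder names; the evaluation-run call that accepts `custom_scorer_ids` is covered in the SDK reference, so only the creation step is sketched here.

```python
from layerlens import Stratix

client = Stratix()

scorer = client.scorers.create(
    name="legal-register-check",
    description="Flags responses that drift out of a formal legal register",
    model_id="...",  # scorer model from your Stratix catalog
    prompt="Rate the tone of {{output}} from 1 (informal) to 5 (formal legal register).",
)

# When launching an evaluation run, pass custom_scorer_ids=[scorer.id]
# (the .id field is an assumption; confirm it in the SDK reference).
```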

**For deterministic / code evaluators:**

* These don't fit the Stratix `scorers` surface (which is LLM-prompt-driven)
* Author as a code grader in the evaluation runtime; pattern documented at [Custom code grader recipe](https://github.com/LayerLens/gitbook-full/blob/main/08-evaluate/cookbook/custom-code-scorer.md)
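
Deterministic graders are plain functions; how they are registered with the evaluation runtime is covered in the linked recipe, so this sketch shows only illustrative grading logic and an assumed `output -> {score, reason}` shape.

```python
import re

def grade_citation_format(output: str) -> dict:
    """Pass if the output contains at least one citation in 'Name v. Name, YYYY' form."""
    citations = re.findall(r"[A-Z]\w+ v\. [A-Z]\w+, \d{4}", output)
    return {
        "score": 1.0 if citations else 0.0,
        "reason": f"found {len(citations)} well-formed citation(s)",
    }
```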

### Phase 4 — Dual-run (Days 9–14)

1. Run both LangSmith and Stratix in parallel on the same data for one full cycle (1–2 weeks).
2. Compare scores per row; investigate any divergence > 5%.
3. Adjust Stratix judges where the divergence reflects judge quality rather than data quality.
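
One way to run the step 2 comparison, assuming per-row scores from both systems keyed by row id and normalized to the same 0–1 scale (the scores below are made up for illustration).

```python
# Per-row scores from each system, keyed by row id, on the same 0-1 scale
langsmith_scores = {"row-1": 0.90, "row-2": 0.40, "row-3": 0.75}
stratix_scores = {"row-1": 0.88, "row-2": 0.55, "row-3": 0.74}

THRESHOLD = 0.05  # "divergence > 5%" from step 2

for row_id in sorted(langsmith_scores.keys() & stratix_scores.keys()):
    diff = abs(langsmith_scores[row_id] - stratix_scores[row_id])
    if diff > THRESHOLD:
        print(f"{row_id}: LangSmith={langsmith_scores[row_id]:.2f} "
              f"Stratix={stratix_scores[row_id]:.2f} (diff {diff:.2f})")
```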

### Phase 5 — Cut over (Day 15+)

1. Move CI/CD gates from LangSmith evaluators to Stratix evaluations.
2. Update trace ingestion to point at Stratix (or use Stratix's Langfuse integration if you're using LangSmith via Langfuse).
3. Archive LangSmith projects; document the cut date for audit.

## Common cutover gotchas

* **Evaluator output formats differ.** LangSmith evaluators often return `{key: score}` dicts; Stratix scorers/judges return structured outputs per their prompt. Re-format on either side (a small adapter sketch follows this list).
* **Feedback semantics differ.** LangSmith feedback is free-form per run; Stratix feedback is structured against traces and feeds GEPA labels.
* **LangSmith's "in-line" evaluator inheritance** doesn't exist in Stratix; each judge/scorer is a first-class artifact with its own permissions.

## See also

* [Cookbook: port a LangSmith eval](https://github.com/LayerLens/gitbook-full/blob/main/06-build/migration/recipes/migration-langsmith.md)
* [Concept: Judges](/8.-evaluate-score-the-outputs/judges-1.md)
* [Concept: Scorers](/8.-evaluate-score-the-outputs/scorers-1.md)
* [Custom code grader recipe](https://github.com/LayerLens/gitbook-full/blob/main/08-evaluate/cookbook/custom-code-scorer.md)
* [SDK reference: Benchmarks](https://github.com/LayerLens/gitbook-full/blob/main/13-reference/sdk-python/models-benchmarks.md)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.layerlens.ai/6.-build-wire-your-code/from-langsmith.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
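
For example, from Python with the `requests` package (the question text is illustrative):

```python
import requests

response = requests.get(
    "https://docs.layerlens.ai/6.-build-wire-your-code/from-langsmith.md",
    params={"ask": "How do I register a code grader in the evaluation runtime?"},
)
print(response.text)  # direct answer plus relevant excerpts and sources
```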
