Braintrust is a relatively new LLM observability platform that leads with evaluation as a first-class primitive. Built by ex-Stripe engineers, it treats LLM apps like regular software with regression testing, version control, and continuous integration for prompts and models.
You define evals as Python or TypeScript functions: each case pairs an input with an expected output, and scorers grade the model's actual output against your criteria. Braintrust runs these evals against prompt versions, model versions, or production traces. The dashboard shows regression charts, diff views between runs, and statistical significance tests. Prompt management is versioned, with deployment environments (dev, staging, prod).
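The eval-as-function pattern works roughly like this, sketched here as a toy harness in plain Python rather than Braintrust's actual SDK (`run_eval`, `exact_match`, and the upper-casing "task" are all illustrative names):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected: str

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if output == expected else 0.0

def run_eval(task: Callable[[str], str],
             cases: list[EvalCase],
             scorer: Callable[[str, str], float]) -> float:
    """Run the task over every case and return the mean score."""
    scores = [scorer(task(c.input), c.expected) for c in cases]
    return sum(scores) / len(scores)

# Stand-in "model": upper-cases its input.
task = lambda s: s.upper()
cases = [
    EvalCase("hi", "HI"),
    EvalCase("ok", "OK"),
    EvalCase("no", "no"),  # this case fails: task returns "NO"
]
print(run_eval(task, cases, exact_match))  # 2 of 3 cases match
```

Running the same harness against two prompt or model versions and diffing the scores is, in essence, what Braintrust's regression charts automate.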
Free tier: 10,000 eval runs and 100,000 traces per month. Pro: $249/month. Enterprise: custom pricing with SSO, audit logs, and dedicated support.
Teams that treat LLM apps as software: Brex, Airtable, Notion, Zapier, and other companies with a strong engineering culture around evals and regression testing.