> braintrust
You are an expert in Braintrust, the evaluation and observability platform for AI applications. You help developers run systematic evaluations, compare model versions, track experiments, log production traces, and measure quality metrics — with a focus on making AI development as rigorous as traditional software testing.
curl "https://skillshub.wtf/TerminalSkills/skills/braintrust?format=md"
Braintrust: AI Evaluation and Observability
Core Capabilities
import { Eval } from "braintrust";
import { Factuality, ClosedQA } from "autoevals";

// Eval reads the API key from the BRAINTRUST_API_KEY environment variable.

// Run evaluation
await Eval("support-chatbot", {
  data: () => [
    { input: "How do I reset my password?", expected: "Go to Settings > Security > Reset Password" },
    { input: "What's the pricing?", expected: "Plans start at $29/month" },
    { input: "I need a refund", expected: "Contact support at help@example.com" },
  ],
  task: async (input) => {
    // callChatbot is a placeholder for your application's own function
    const response = await callChatbot(input);
    return response.text;
  },
  scores: [
    // Built-in scorers from autoevals
    Factuality, // Does output match expected facts?
    ClosedQA, // Is the answer correct given context?
    // Custom scorer: receives a single object with input, output, and expected
    ({ output, expected }) => {
      const containsKey = expected.toLowerCase().split(" ")
        .some((word) => output.toLowerCase().includes(word));
      return { name: "keyword_match", score: containsKey ? 1 : 0 };
    },
  ],
});
// Results visible in Braintrust dashboard with diffs, regressions, improvements
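# Python
from braintrust import Eval
from autoevals import Factuality

# test_pairs and rag_pipeline are placeholders for your own data and application code.
# Relevance stands for a relevance scorer you supply; autoevals does not appear to
# export a scorer under that exact name, so define or import it before running.
Eval(
    "rag-pipeline",
    data=lambda: [{"input": q, "expected": a} for q, a in test_pairs],
    task=lambda input: rag_pipeline.query(input),
    scores=[Factuality, Relevance],
)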
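The custom scorer in the TypeScript example returns 1 as soon as any single word of the expected answer appears in the output, which is a loose check. A graded variant might look like the sketch below. It assumes that Braintrust accepts a plain function taking `input`, `output`, and `expected` and returning a number between 0 and 1 as a scorer; the name `keyword_overlap` and the tokenization rule are illustrative choices, not part of the SDK.

```python
import re


def keyword_overlap(input, output, expected):
    """Fraction of distinct words in `expected` that also appear in `output`."""
    # Lowercase and keep only alphanumeric runs so punctuation is ignored.
    expected_words = set(re.findall(r"[a-z0-9]+", expected.lower()))
    output_words = set(re.findall(r"[a-z0-9]+", output.lower()))
    if not expected_words:
        # Nothing to match against; return 0 instead of dividing by zero.
        return 0.0
    return len(expected_words & output_words) / len(expected_words)
```

If the assumption about scorer signatures holds, the function could be listed next to the built-in scorers, for example `scores=[Factuality, keyword_overlap]`.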
Installation
npm install braintrust autoevals
# or
pip install braintrust autoevals
Best Practices
- Eval-driven development: Write evals first, then iterate on prompts and models; measure before optimizing.
- Built-in scorers: Use Factuality, ClosedQA, and the relevance scorers from autoevals for LLM-based quality scoring.
- Custom scorers: Add domain-specific metrics and combine them with the built-in scorers for broader coverage.
- Experiments: Each eval run is recorded as an experiment; compare runs side by side in the dashboard.
- Production logging: Use braintrust.traced() for production observability; traces appear in the same dashboard as evals.
- CI integration: Run evals in CI and fail builds on quality regressions.
- Dataset management: Store test datasets in Braintrust; version them and share them across the team.
- A/B comparison: Compare two model versions on the same dataset; statistical significance reported.
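The CI integration practice comes down to comparing an eval run's average scores against agreed minimums and returning a non-zero exit code when any falls short. The sketch below shows only that gating logic. The `average_scores` dictionary is hand-written here; how those numbers are extracted from an eval run is left out because it depends on the SDK version. The names `find_regressions`, `main`, and `THRESHOLDS` are hypothetical.

```python
# Hypothetical minimum acceptable average per scorer (0 to 1).
THRESHOLDS = {"Factuality": 0.80, "keyword_match": 0.90}


def find_regressions(average_scores, thresholds):
    """Return {scorer: (actual, minimum)} for every scorer below its minimum.

    A scorer missing from `average_scores` is treated as 0.0 so that a
    silently dropped metric fails the build instead of passing it.
    """
    failures = {}
    for name, minimum in thresholds.items():
        actual = average_scores.get(name, 0.0)
        if actual < minimum:
            failures[name] = (actual, minimum)
    return failures


def main(average_scores):
    """Print one line per failing scorer and return a process exit code."""
    failures = find_regressions(average_scores, THRESHOLDS)
    for name, (actual, minimum) in failures.items():
        print(f"FAIL {name}: {actual:.2f} < {minimum:.2f}")
    return 1 if failures else 0


# Placeholder numbers; in CI these would come from the eval run,
# and the result would be passed to sys.exit().
exit_code = main({"Factuality": 0.85, "keyword_match": 0.75})
```

Keeping the thresholds in version control next to the evals means a change to the quality bar shows up in code review like any other change.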