> clade-performance-tuning
Optimize Anthropic API latency: streaming, prompt caching, model selection, connection reuse, and parallel requests. Use when working with performance-tuning patterns. Trigger with "anthropic slow", "claude latency", "speed up anthropic", "anthropic performance", "claude response time".
curl "https://skillshub.wtf/jeremylongshore/claude-code-plugins-plus-skills/clade-performance-tuning?format=md"
Anthropic Performance Tuning
Overview
Claude response latency has two components: time to first token (TTFT) and generation time, which is governed by output tokens per second (TPS). Different strategies target each.
Latency Benchmarks (approximate)
| Model | TTFT (p50) | TTFT (p95) | Output TPS |
|---|---|---|---|
| Claude Haiku 4.5 | 200ms | 600ms | ~150 |
| Claude Sonnet 4 | 400ms | 1.2s | ~90 |
| Claude Opus 4 | 800ms | 2.5s | ~40 |
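To make the two components concrete, total response time can be approximated as TTFT plus output tokens divided by TPS. The sketch below is a rough model that uses the approximate p50 figures from the table; `BENCHMARKS` and `estimateLatencyMs` are hypothetical names, and real latency also depends on input size, load, and network conditions.

```typescript
// Approximate p50 figures copied from the benchmark table above.
interface ModelBenchmark {
  ttftMs: number;    // time to first token, in milliseconds
  outputTps: number; // output tokens per second
}

const BENCHMARKS: Record<string, ModelBenchmark> = {
  haiku: { ttftMs: 200, outputTps: 150 },
  sonnet: { ttftMs: 400, outputTps: 90 },
  opus: { ttftMs: 800, outputTps: 40 },
};

// Rough estimate: total = TTFT + (output tokens / tokens per second).
function estimateLatencyMs(model: string, outputTokens: number): number {
  const b = BENCHMARKS[model];
  if (!b) throw new Error(`Unknown model key: ${model}`);
  return b.ttftMs + (outputTokens / b.outputTps) * 1000;
}

console.log(Math.round(estimateLatencyMs('sonnet', 500))); // 5956
```

Under these figures a 500-token Sonnet reply takes roughly 6 seconds, of which only 0.4 seconds is TTFT, which is why streaming and shorter outputs matter as much as model choice.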
Optimization Strategies
Instructions
Step 1: Always Stream
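import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Streaming delivers the first token as soon as it is generated, so the user
// sees output right away instead of waiting for the full response
async function* streamText(messages: Anthropic.MessageParam[]): AsyncGenerator<string> {
  const stream = client.messages.stream({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    messages,
  });
  // First token arrives in ~400ms (Sonnet)
  // Full response may take 5-10s, but user sees progress immediately
  for await (const event of stream) {
    // Only text deltas carry a .text field; skip other delta types
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      yield event.delta.text;
    }
  }
}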
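To check whether streaming is actually delivering the expected TTFT, it can help to time the first text delta. The sketch below is one way to do that over any async iterable of events; the `StreamEvent` type is a simplified assumption modelled on the `content_block_delta` events used in the streaming loop above (not the SDK's full event union), and `measureStream` and `fakeStream` are hypothetical helpers.

```typescript
// Simplified event shape (an assumption, not the SDK's full union).
type StreamEvent =
  | { type: 'content_block_delta'; delta: { type: 'text_delta'; text: string } }
  | { type: 'message_stop' };

interface StreamTiming {
  text: string;
  ttftMs: number | null; // null if no text was ever received
  totalMs: number;
}

// Consumes the events and records when the first text chunk arrived.
async function measureStream(events: AsyncIterable<StreamEvent>): Promise<StreamTiming> {
  const start = Date.now();
  let ttftMs: number | null = null;
  let text = '';
  for await (const event of events) {
    if (event.type === 'content_block_delta') {
      if (ttftMs === null) ttftMs = Date.now() - start;
      text += event.delta.text;
    }
  }
  return { text, ttftMs, totalMs: Date.now() - start };
}

// Hypothetical stand-in for a real stream, so the sketch runs without network access.
async function* fakeStream(): AsyncGenerator<StreamEvent> {
  for (const chunk of ['Hello', ', ', 'world']) {
    await Promise.resolve();
    yield { type: 'content_block_delta', delta: { type: 'text_delta', text: chunk } };
  }
  yield { type: 'message_stop' };
}
```

In a real application the same function could be pointed at the SDK stream, provided its events are narrowed to text deltas first; logging `ttftMs` per request gives a p50/p95 picture to compare against the benchmark table.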
Step 2: Prompt Caching — Faster TTFT
// Cached prompts skip re-processing, which can substantially lower TTFT for large system prompts
const message = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  system: [{
    type: 'text',
    text: largeSystemPrompt, // 10K+ tokens
    cache_control: { type: 'ephemeral' },
  }],
  messages,
});
// No beta header is required: prompt caching is generally available
// TTFT drops from ~2s to ~500ms on cache hit with large prompts
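Whether a request actually hit the cache can be read from the response's `usage` object, which reports `cache_read_input_tokens` and `cache_creation_input_tokens`. The following is a minimal sketch of that check; the `CacheUsage` interface is a trimmed-down assumption covering only the fields used here, and `cacheStatus` is a hypothetical helper name.

```typescript
// Subset of the `usage` object on a Messages API response (only fields used here).
interface CacheUsage {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

type CacheStatus = 'hit' | 'write' | 'none';

// 'hit'   : part of the prompt was served from the cache
// 'write' : the prompt was cached by this request (the next call may hit)
// 'none'  : caching did not apply
function cacheStatus(usage: CacheUsage): CacheStatus {
  if ((usage.cache_read_input_tokens ?? 0) > 0) return 'hit';
  if ((usage.cache_creation_input_tokens ?? 0) > 0) return 'write';
  return 'none';
}

console.log(cacheStatus({ input_tokens: 42, cache_read_input_tokens: 10_000 })); // "hit"
```

A persistent `'none'` on a prompt that carries `cache_control` often means the prefix is too short to be cacheable or changes between requests.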
Step 3: Use Haiku for Speed-Critical Paths
// Per the benchmarks above, Haiku is roughly 2x faster than Sonnet on TTFT and
// ~1.7x faster on output TPS, with quality that is sufficient for many tasks
// Use for: classification, extraction, simple Q&A, routing decisions
const route = await client.messages.create({
  model: 'claude-haiku-4-5-20251001', // 200ms TTFT
  max_tokens: 10,
  system: 'Classify the intent. Reply with exactly one word: search, create, update, delete.',
  messages: [{ role: 'user', content: userInput }],
});
// Then use Sonnet/Opus for the actual task
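The classifier's reply still has to be turned into a routing decision, and a model will not always return the bare word. Below is a small sketch of defensive parsing for the four intents named in the system prompt; `firstText` and `parseIntent` are hypothetical helpers, and `ContentBlock` is a simplified assumption about the response's content blocks.

```typescript
const INTENTS = ['search', 'create', 'update', 'delete'] as const;
type Intent = (typeof INTENTS)[number];

// Simplified content block shape (an assumption for this sketch).
interface ContentBlock {
  type: string;
  text?: string;
}

// Returns the text of the first text block, or '' if there is none.
function firstText(content: ContentBlock[]): string {
  const block = content.find((b) => b.type === 'text');
  return block?.text ?? '';
}

// Normalizes case, whitespace, and punctuation; returns null when the reply
// is not exactly one of the expected words.
function parseIntent(raw: string): Intent | null {
  const word = raw.trim().toLowerCase().replace(/[^a-z]/g, '');
  return (INTENTS as readonly string[]).includes(word) ? (word as Intent) : null;
}

console.log(parseIntent(' Search.\n')); // "search"
```

A `null` result is a signal to fall back to a default path rather than guess; in the example above that would be something like `parseIntent(firstText(route.content))`.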
Step 4: Reuse Client Instance
// BAD: creates a new connection pool per request
app.get('/api/chat', async (req, res) => {
  const client = new Anthropic(); // DON'T
  // ...
});
// GOOD: single client shared across requests
const client = new Anthropic(); // Module-level singleton
app.get('/api/chat', async (req, res) => {
  const message = await client.messages.create({ ... });
  // ...
});
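Where a module-level client is awkward (for example, when the API key is only available after startup), a lazily created singleton gives the same reuse. This is a generic sketch rather than anything SDK-specific; `lazySingleton` is a hypothetical helper.

```typescript
// Wraps a factory so it runs at most once; later calls return the cached value.
function lazySingleton<T>(factory: () => T): () => T {
  let instance: T | undefined;
  let created = false;
  return () => {
    if (!created) {
      instance = factory();
      created = true;
    }
    return instance as T;
  };
}

// Usage sketch (assumes the SDK import from the earlier steps):
//   const getClient = lazySingleton(() => new Anthropic());
//   app.get('/api/chat', async (req, res) => { const client = getClient(); /* ... */ });
```

The separate `created` flag keeps the helper correct even when the factory legitimately returns `undefined` or `null`.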
Step 5: Parallel Requests
// When you need multiple independent Claude calls, fire them in parallel
const [summary, sentiment, entities] = await Promise.all([
  client.messages.create({ model: 'claude-haiku-4-5-20251001', max_tokens: 200,
    messages: [{ role: 'user', content: `Summarize: ${text}` }] }),
  client.messages.create({ model: 'claude-haiku-4-5-20251001', max_tokens: 20,
    messages: [{ role: 'user', content: `Sentiment (positive/negative/neutral): ${text}` }] }),
  client.messages.create({ model: 'claude-haiku-4-5-20251001', max_tokens: 200,
    messages: [{ role: 'user', content: `Extract named entities from: ${text}` }] }),
]);
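`Promise.all` rejects as soon as any one call fails, and firing a large batch at once may run into rate limits. One possible refinement, sketched below, caps the number of in-flight calls and collects a per-item outcome instead of failing fast. `mapWithConcurrency` is a hypothetical helper, not part of the SDK, and the right limit depends on your account's rate limits.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight and
// returns one settled result per item, in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<PromiseSettledResult<R>[]> {
  const results: PromiseSettledResult<R>[] = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // no race: nothing else runs between the check and the increment
      try {
        results[index] = { status: 'fulfilled', value: await worker(items[index], index) };
      } catch (reason) {
        results[index] = { status: 'rejected', reason };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

For the three-call example above plain `Promise.all` is fine; a helper like this becomes useful when the number of independent prompts grows into the dozens.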
Step 6: Minimize Output Tokens
// Fewer output tokens = faster response
system: 'Be extremely concise. Use bullet points, not paragraphs.',
// Set tight max_tokens
max_tokens: 256, // Don't use 4096 for short answers
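A tight `max_tokens` can cut a reply off mid-sentence, so it is worth checking for truncation. The Messages API reports `stop_reason: 'max_tokens'` when the limit was reached; the sketch below wraps that check, with `MessageLike` as a trimmed-down assumption of the response shape and `wasTruncated` as a hypothetical helper.

```typescript
// Subset of a Messages API response relevant to truncation.
interface MessageLike {
  stop_reason: string | null;
  usage: { output_tokens: number };
}

// True when generation stopped because it hit the max_tokens limit.
function wasTruncated(message: MessageLike): boolean {
  return message.stop_reason === 'max_tokens';
}

console.log(wasTruncated({ stop_reason: 'max_tokens', usage: { output_tokens: 256 } })); // true
```

If truncation shows up often for a given task, raising the limit for that task alone keeps the other paths fast.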
Output
- Streaming enabled for all user-facing responses (first token in ~400ms with Sonnet)
- Prompt caching reducing TTFT for large system prompts
- Model routing to Haiku for speed-critical classification/routing tasks
- Client instance reused across requests (no per-request connection overhead)
- Parallel requests firing independent Claude calls concurrently
Error Handling
| Issue | Cause | Fix |
|---|---|---|
| TTFT > 3s | Large uncached prompt | Enable prompt caching |
| Slow output | Using Opus for simple tasks | Downgrade to Haiku/Sonnet |
| Timeouts | Long generation + default timeout | new Anthropic({ timeout: 120_000 }) |
| 529 overloaded | API capacity | SDK auto-retries; add fallback model |
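The "add fallback model" fix for 529 responses can be sketched as a small wrapper that tries models in order. This assumes the thrown error exposes a numeric `status` property (as HTTP client errors commonly do); `withModelFallback` and the injected `call` function are hypothetical, and any retrying the SDK does would happen inside `call` before this wrapper sees the error.

```typescript
// Tries each model in order, moving to the next only when the error looks
// like an overload (HTTP 529). Any other error is rethrown immediately.
async function withModelFallback<R>(
  models: readonly string[],
  call: (model: string) => Promise<R>,
): Promise<R> {
  let lastError: unknown = new Error('No models provided');
  for (const model of models) {
    try {
      return await call(model);
    } catch (error) {
      const status = (error as { status?: number } | null)?.status;
      if (status !== 529) throw error;
      lastError = error; // overloaded: fall through to the next model
    }
  }
  throw lastError;
}

// Usage sketch (assumes a shared `client` and a `messages` array):
//   const message = await withModelFallback(
//     ['claude-sonnet-4-20250514', 'claude-haiku-4-5-20251001'],
//     (model) => client.messages.create({ model, max_tokens: 1024, messages }),
//   );
```

Falling back to a smaller model trades some quality for availability, so it suits paths where a weaker answer beats an error page.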
Examples
See Latency Benchmarks table and six numbered strategy sections above, each with complete TypeScript code examples.
Next Steps
See clade-deploy-integration for production deployment patterns.
Prerequisites
- Completed clade-install-auth
- User-facing application where latency matters
- Understanding of streaming and async patterns