> clickhouse-prod-checklist
Production readiness checklist for ClickHouse — server tuning, backup, monitoring, and deployment verification. Use when launching a ClickHouse deployment, doing go-live reviews, or auditing production readiness. Trigger: "clickhouse production", "clickhouse go-live", "clickhouse launch checklist", "production clickhouse", "clickhouse prod ready".
# ClickHouse Production Checklist
## Overview
Comprehensive go-live checklist for ClickHouse covering server tuning, schema design, backup configuration, monitoring, and operational readiness.
## Prerequisites
- ClickHouse instance provisioned (Cloud or self-hosted)
- Application integration code tested in staging
## Checklist
### 1. Schema & Engine Design

- Tables use `MergeTree`-family engines (not `Memory`, `Log`, or `TinyLog`)
- `ORDER BY` columns match primary filter/group patterns
- `PARTITION BY` is coarse (monthly or weekly, never by ID)
- `TTL` configured for the data retention policy
- `LowCardinality(String)` used for low-cardinality columns
- `CODEC(ZSTD)` applied to large String/JSON columns
- `ReplacingMergeTree` used with `FINAL` or dedup logic if upserts are needed
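The items above can be sketched as a single DDL statement. The `analytics.events` table and its columns are illustrative placeholders, not part of the original checklist:

```sql
-- Illustrative schema applying the checklist (table and columns are hypothetical)
CREATE TABLE analytics.events
(
    event_time  DateTime,
    user_id     UInt64,
    event_type  LowCardinality(String),        -- low-cardinality dimension
    payload     String CODEC(ZSTD(3))          -- large JSON blob, compressed
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)              -- coarse, monthly partitions
ORDER BY (event_type, user_id, event_time)     -- matches primary filter patterns
TTL event_time + INTERVAL 12 MONTH;            -- retention policy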
### 2. Server Configuration (Self-Hosted)

```xml
<!-- Key production settings in config.xml / users.xml -->

<!-- Memory: cap server usage at ~80% of available RAM -->
<max_server_memory_usage_to_ram_ratio>0.8</max_server_memory_usage_to_ram_ratio>

<!-- Query limits -->
<max_concurrent_queries>150</max_concurrent_queries>
<max_memory_usage>10000000000</max_memory_usage> <!-- 10 GB per query -->
<max_execution_time>300</max_execution_time> <!-- 5-minute timeout -->

<!-- Merge settings -->
<background_pool_size>16</background_pool_size>
<background_schedule_pool_size>16</background_schedule_pool_size>

<!-- Logging -->
<query_log>
    <database>system</database>
    <table>query_log</table>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_log>
```
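To confirm the limits actually took effect, they can be read back from a running server. A sketch; `system.server_settings` is available on recent ClickHouse releases only:

```sql
-- Session/profile-level limits (users.xml)
SELECT name, value
FROM system.settings
WHERE name IN ('max_memory_usage', 'max_execution_time');

-- Server-wide limits (config.xml), on recent releases
SELECT name, value
FROM system.server_settings
WHERE name IN ('max_concurrent_queries', 'max_server_memory_usage_to_ram_ratio');
```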
### 3. Backup Configuration

```sql
-- ClickHouse native BACKUP to S3
BACKUP TABLE analytics.events
TO S3(
    'https://my-bucket.s3.us-east-1.amazonaws.com/backups/events',
    'ACCESS_KEY',
    'SECRET_KEY'
)
SETTINGS compression_method = 'zstd';

-- Incremental backup (base + delta)
BACKUP TABLE analytics.events
TO S3('s3://my-bucket/backups/events-incremental')
SETTINGS base_backup = S3('s3://my-bucket/backups/events-base');
```
ClickHouse Cloud: Backups are automatic. Configure retention and frequency in the Cloud console under Service Settings.
- Backup schedule configured (daily minimum)
- Backup restore tested and documented
- Point-in-time recovery possible (incremental backups)
- Backup stored in different region/account from primary
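Restore drills can use the mirror of the `BACKUP` statement. A sketch reusing the bucket from the example above; the `analytics_verify` scratch database is a placeholder:

```sql
-- Restore into a scratch database to verify the backup is usable
RESTORE TABLE analytics.events AS analytics_verify.events
FROM S3(
    'https://my-bucket.s3.us-east-1.amazonaws.com/backups/events',
    'ACCESS_KEY',
    'SECRET_KEY'
);
```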
### 4. Monitoring & Alerting

```sql
-- Key metrics to monitor:
-- 1. Query latency (p50/p95/p99)
-- 2. Active parts per table (alert when > 300)
-- 3. Merge queue depth
-- 4. Memory usage
-- 5. Free disk space
-- 6. Replication lag (if clustered)

-- Health check query (use for load balancer probes)
SELECT 1;

-- More thorough health check
SELECT
    (SELECT count() FROM system.processes) AS running_queries,
    (SELECT sum(value) FROM system.metrics WHERE metric = 'MemoryTracking') AS memory_bytes,
    (SELECT count() FROM system.merges) AS active_merges;
```
- Prometheus endpoint configured (`/metrics` via ClickHouse Exporter or the Cloud endpoint)
- Grafana dashboard with key panels (QPS, latency, memory, parts, merges)
- Alerts on: error rate > 5%, p95 latency > 5 s, parts > 300, free disk < 20%
- Query log monitoring for slow/failed queries
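The parts alert in the list above can be backed by a query along these lines, using the 300-part threshold from the checklist:

```sql
-- Tables approaching the too-many-parts threshold
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
HAVING active_parts > 300
ORDER BY active_parts DESC;
```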
### 5. Security

- Default user password changed or user disabled
- Application user has minimal privileges (see `clickhouse-security-basics`)
- TLS enabled (HTTPS on port 8443)
- IP allowlist configured
- Secrets stored in environment variables or a secret manager (never in code)
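A minimal-privilege application user might be created along these lines. The user name, database, and grant set are illustrative assumptions, not prescribed by the checklist:

```sql
-- Hypothetical least-privilege app user
CREATE USER app_writer IDENTIFIED WITH sha256_password BY 'REPLACE_ME';
GRANT SELECT, INSERT ON analytics.* TO app_writer;
-- No DDL grants, no access to other databases
```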
### 6. Application Integration

- Connection pooling configured (`max_open_connections`)
- Graceful shutdown calls `client.close()`
- Insert batching in place (10K+ rows per INSERT)
- Retry logic for transient errors (see `clickhouse-rate-limits`)
- Health check endpoint includes a ClickHouse ping
```javascript
// Health check endpoint
app.get('/health', async (req, res) => {
  try {
    const { success } = await client.ping();
    res.json({ status: success ? 'healthy' : 'degraded', clickhouse: success });
  } catch {
    res.status(503).json({ status: 'unhealthy', clickhouse: false });
  }
});
```
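If the application cannot batch to 10K+ rows client-side, ClickHouse's server-side async inserts are one alternative. A sketch; the buffer threshold is an assumption to tune per workload:

```sql
-- Let the server buffer small INSERTs into larger parts
SET async_insert = 1,
    wait_for_async_insert = 1,              -- ack only after data is flushed
    async_insert_max_data_size = 10485760;  -- ~10 MiB buffer before flush
```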
### 7. Operational Readiness

- Incident runbook documented (see `clickhouse-incident-runbook`)
- On-call escalation path defined
- Key rotation procedure documented
- Schema migration process in place (see `clickhouse-migration-deep-dive`)
- Load testing completed at expected peak traffic
### 8. Verification Queries

```sql
-- Verify table sizes and part health
SELECT database, table, count() AS parts, sum(rows) AS rows,
       formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;

-- Verify no pending mutations
SELECT * FROM system.mutations WHERE NOT is_done;

-- Verify no replication lag
SELECT database, table, queue_size, inserts_in_queue
FROM system.replicas
WHERE queue_size > 0;
```
## Error Handling
| Issue | Detection | Action |
|---|---|---|
| Parts > 300 | Monitoring alert | Review insert patterns, wait for merges |
| Disk > 80% | Disk alert | Add storage, drop old partitions |
| Query p95 > 5s | Latency alert | Check system.query_log for slow queries |
| Replication lag | Replica check | Investigate network, merge backlog |
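For the p95 latency action in the table above, slow queries can be pulled from the query log like this (the one-hour window is an illustrative choice):

```sql
-- Ten slowest queries from the last hour
SELECT event_time, query_duration_ms, read_rows,
       formatReadableSize(memory_usage) AS mem, query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;
```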
## Next Steps

For SDK version upgrades, see `clickhouse-upgrade-migration`.