> clickhouse-prod-checklist

Production readiness checklist for ClickHouse — server tuning, backup, monitoring, and deployment verification. Use when launching a ClickHouse deployment, doing go-live reviews, or auditing production readiness. Trigger: "clickhouse production", "clickhouse go-live", "clickhouse launch checklist", "production clickhouse", "clickhouse prod ready".

fetch

$curl "https://skillshub.wtf/jeremylongshore/claude-code-plugins-plus-skills/clickhouse-prod-checklist?format=md"

SKILL.md•clickhouse-prod-checklist

ClickHouse Production Checklist

Overview

Comprehensive go-live checklist for ClickHouse covering server tuning, schema design, backup configuration, monitoring, and operational readiness.

Prerequisites

ClickHouse instance provisioned (Cloud or self-hosted)
Application integration code tested in staging

Checklist

1. Schema & Engine Design

Tables use MergeTree family engines (not Memory, Log, or TinyLog)
ORDER BY columns match primary filter/group patterns
PARTITION BY is coarse (monthly or weekly, never by ID)
TTL configured for data retention policy
LowCardinality(String) used for low-cardinality columns
CODEC(ZSTD) applied to large String/JSON columns
ReplacingMergeTree used with FINAL or dedup logic if upserts needed

2. Server Configuration (Self-Hosted)

<!-- Key production settings in config.xml / users.xml -->

<!-- Memory: set to ~80% of available RAM -->
<max_server_memory_usage_to_ram_ratio>0.8</max_server_memory_usage_to_ram_ratio>

<!-- Query limits -->
<max_concurrent_queries>150</max_concurrent_queries>
<max_memory_usage>10000000000</max_memory_usage>  <!-- 10GB per query -->
<max_execution_time>300</max_execution_time>       <!-- 5 min timeout -->

<!-- Merge settings -->
<background_pool_size>16</background_pool_size>
<background_schedule_pool_size>16</background_schedule_pool_size>

<!-- Logging -->
<query_log>
    <database>system</database>
    <table>query_log</table>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_log>

3. Backup Configuration

-- ClickHouse native BACKUP to S3
BACKUP TABLE analytics.events
    TO S3(
        'https://my-bucket.s3.us-east-1.amazonaws.com/backups/events',
        'ACCESS_KEY',
        'SECRET_KEY'
    )
    SETTINGS compression_method = 'zstd';

-- Incremental backup (base + delta)
BACKUP TABLE analytics.events
    TO S3('s3://my-bucket/backups/events-incremental')
    SETTINGS base_backup = S3('s3://my-bucket/backups/events-base');

ClickHouse Cloud: Backups are automatic. Configure retention and frequency in the Cloud console under Service Settings.

Backup schedule configured (daily minimum)
Backup restore tested and documented
Point-in-time recovery possible (incremental backups)
Backup stored in different region/account from primary

4. Monitoring & Alerting

-- Key metrics to monitor
-- 1. Query latency (p50/p95/p99)
-- 2. Active parts per table (alert > 300)
-- 3. Merge queue depth
-- 4. Memory usage
-- 5. Disk space free
-- 6. Replication lag (if clustered)

-- Health check query (use for load balancer probes)
SELECT 1;

-- More thorough health check
SELECT
    (SELECT count() FROM system.processes) AS running_queries,
    (SELECT sum(value) FROM system.metrics WHERE metric = 'MemoryTracking') AS memory_bytes,
    (SELECT count() FROM system.merges) AS active_merges;

Prometheus endpoint configured (/metrics via ClickHouse Exporter or Cloud endpoint)
Grafana dashboard with key panels (QPS, latency, memory, parts, merges)
Alerts on: error rate > 5%, p95 latency > 5s, parts > 300, disk < 20%
Query log monitoring for slow/failed queries

5. Security

Default user password changed or user disabled
Application user has minimal privileges (see clickhouse-security-basics)
TLS enabled (HTTPS on port 8443)
IP allowlist configured
Secrets in environment variables or secret manager (not code)

6. Application Integration

Connection pooling configured (max_open_connections)
Graceful shutdown calls client.close()
Insert batching in place (10K+ rows per INSERT)
Retry logic for transient errors (see clickhouse-rate-limits)
Health check endpoint includes ClickHouse ping

// Health check endpoint
app.get('/health', async (req, res) => {
  try {
    const { success } = await client.ping();
    res.json({ status: success ? 'healthy' : 'degraded', clickhouse: success });
  } catch {
    res.status(503).json({ status: 'unhealthy', clickhouse: false });
  }
});

7. Operational Readiness

Incident runbook documented (see clickhouse-incident-runbook)
On-call escalation path defined
Key rotation procedure documented
Schema migration process in place (see clickhouse-migration-deep-dive)
Load testing completed at expected peak traffic

8. Verification Queries

-- Verify table sizes and part health
SELECT database, table, count() AS parts, sum(rows) AS rows,
       formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts WHERE active
GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC;

-- Verify no pending mutations
SELECT * FROM system.mutations WHERE NOT is_done;

-- Verify no replication lag
SELECT database, table, queue_size, inserts_in_queue
FROM system.replicas WHERE queue_size > 0;

Error Handling

Issue	Detection	Action
Parts > 300	Monitoring alert	Review insert patterns, wait for merges
Disk > 80%	Disk alert	Add storage, drop old partitions
Query p95 > 5s	Latency alert	Check `system.query_log` for slow queries
Replication lag	Replica check	Investigate network, merge backlog

Resources

Next Steps

For SDK version upgrades, see clickhouse-upgrade-migration.

> related_skills --same-repo

> fathom-cost-tuning

Optimize Fathom API usage and plan selection. Trigger with phrases like "fathom cost", "fathom pricing", "fathom plan".

> fathom-core-workflow-b

Sync Fathom meeting data to CRM and build automated follow-up workflows. Use when integrating Fathom with Salesforce, HubSpot, or custom CRMs, or creating automated post-meeting email summaries. Trigger with phrases like "fathom crm sync", "fathom salesforce", "fathom follow-up", "fathom post-meeting workflow".

> fathom-core-workflow-a

Build a meeting analytics pipeline with Fathom transcripts and summaries. Use when extracting insights from meetings, building CRM sync, or creating automated meeting follow-up workflows. Trigger with phrases like "fathom analytics", "fathom meeting pipeline", "fathom transcript analysis", "fathom action items sync".

> fathom-common-errors

Diagnose and fix Fathom API errors including auth failures and missing data. Use when API calls fail, transcripts are empty, or webhooks are not firing. Trigger with phrases like "fathom error", "fathom not working", "fathom api failure", "fix fathom".

┌ stats

installs/wk0

░░░░░░░░░░

github stars2.1K

██████████

first seenMar 23, 2026

└────────────

┌ repo

jeremylongshore/claude-code-plugins-plus-skills

by jeremylongshore

└────────────