
Design ClickHouse schemas with MergeTree engines, ORDER BY keys, and partitioning. Use when creating new tables, choosing engines, designing sort keys, or modeling data for analytical workloads. Trigger: "clickhouse schema design", "clickhouse table design", "clickhouse ORDER BY", "clickhouse partitioning", "MergeTree table".


# ClickHouse Schema Design (Core Workflow A)

## Overview

Design ClickHouse tables with correct engine selection, ORDER BY keys, partitioning, and codec choices for analytical workloads.

## Prerequisites

- `@clickhouse/client` connected (see clickhouse-install-auth)
- Understanding of your query patterns (what you filter and group on)

## Instructions

### Step 1: Choose the Right Engine

| Engine | Best For | Dedup? | Example |
|---|---|---|---|
| MergeTree | General analytics, append-only logs | No | Clickstream, IoT |
| ReplacingMergeTree | Mutable rows (upserts) | Yes (on merge) | User profiles, state |
| SummingMergeTree | Pre-aggregated counters | Sums numerics | Page view counts |
| AggregatingMergeTree | Materialized view targets | Merges states | Dashboards |
| CollapsingMergeTree | Stateful row updates | Collapses ±1 pairs | Shopping carts |

ClickHouse Cloud uses SharedMergeTree as a drop-in replacement for MergeTree; you do not need to change your DDL.
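As a quick illustration of the SummingMergeTree row above, a sketch for the page-view counter case (table and column names are hypothetical):

```sql
CREATE TABLE analytics.page_views (
    date   Date,
    page   String,
    views  UInt64
)
ENGINE = SummingMergeTree(views)   -- numeric column summed when parts merge
ORDER BY (page, date);

-- Merges happen asynchronously, so still aggregate at query time:
SELECT page, sum(views) AS views
FROM analytics.page_views
GROUP BY page;
```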

### Step 2: Design the ORDER BY (Sort Key)

The ORDER BY clause is the single most important schema decision. It defines:

- **Primary index** — a sparse index over sort-key granules (8192 rows per granule by default)
- **Data layout on disk** — rows are physically sorted by these columns
- **Query speed** — queries filtering on a prefix of the ORDER BY columns read fewer granules

Rules of thumb:

  1. Put low-cardinality filter columns first (`event_type`, `status`)
  2. Then high-cardinality columns you filter on (`user_id`, `tenant_id`)
  3. End with a time column if you use range filters (`created_at`)
  4. Do NOT put high-cardinality columns you never filter on in ORDER BY
```sql
-- Good: filter by tenant, then by time ranges
ORDER BY (tenant_id, event_type, created_at)

-- Bad: UUID first means every query scans the full index
ORDER BY (event_id, created_at)  -- event_id is a random UUID
```
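One way to check whether a query actually prunes granules is `EXPLAIN indexes = 1`, which reports how many granules the primary key selects (a sketch, assuming the `analytics.events` table from Step 3):

```sql
EXPLAIN indexes = 1
SELECT count()
FROM analytics.events
WHERE tenant_id = 42 AND event_type = 'page_view';
-- The PrimaryKey section shows selected vs. total granules; a good
-- ORDER BY prefix should select only a small fraction of them.
```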

### Step 3: Schema Examples

#### Event Analytics Table

```sql
CREATE TABLE analytics.events (
    event_id     UUID DEFAULT generateUUIDv4(),
    tenant_id    UInt32,
    event_type   LowCardinality(String),
    user_id      UInt64,
    session_id   String,
    properties   String CODEC(ZSTD(3)),  -- JSON blob, compresses well
    url          String CODEC(ZSTD(1)),
    ip_address   IPv4,
    country      LowCardinality(FixedString(2)),
    created_at   DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
ORDER BY (tenant_id, event_type, toDate(created_at), user_id)
PARTITION BY toYYYYMM(created_at)
TTL created_at + INTERVAL 1 YEAR
SETTINGS index_granularity = 8192;
```

#### User Profile Table (Upserts)

```sql
CREATE TABLE analytics.users (
    user_id      UInt64,
    email        String,
    plan         LowCardinality(String),
    mrr_cents    UInt32,
    properties   String CODEC(ZSTD(3)),
    updated_at   DateTime DEFAULT now()
)
ENGINE = ReplacingMergeTree(updated_at)   -- keeps latest row per ORDER BY key
ORDER BY user_id;

-- Query with FINAL to get deduplicated results
SELECT * FROM analytics.users FINAL WHERE user_id = 42;
```
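`FINAL` forces deduplication at read time and can be expensive on large tables; a common alternative (a sketch, not the only option) is to deduplicate in the query with `argMax`:

```sql
-- Latest value per user_id without FINAL: argMax picks the value
-- associated with the greatest updated_at
SELECT
    user_id,
    argMax(email, updated_at) AS email,
    argMax(plan,  updated_at) AS plan
FROM analytics.users
GROUP BY user_id;
```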

#### Daily Aggregation Table

```sql
CREATE TABLE analytics.daily_stats (
    date         Date,
    tenant_id    UInt32,
    event_type   LowCardinality(String),
    event_count  SimpleAggregateFunction(sum, UInt64),  -- summed on merge; a plain UInt64 would not be
    unique_users AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, event_type, date);
```
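`AggregateFunction` columns are written with `-State` combinators and read with `-Merge`. A sketch of a materialized view feeding this table, plus a query against it (the view name `daily_stats_mv` is ours, and it assumes the `analytics.events` table from above):

```sql
-- Populate via a materialized view; uniqState writes an aggregate state
CREATE MATERIALIZED VIEW analytics.daily_stats_mv TO analytics.daily_stats AS
SELECT
    toDate(created_at) AS date,
    tenant_id,
    event_type,
    count()            AS event_count,
    uniqState(user_id) AS unique_users
FROM analytics.events
GROUP BY date, tenant_id, event_type;

-- Read with the matching -Merge combinator (and sum partial counts)
SELECT
    date,
    event_type,
    sum(event_count)        AS events,
    uniqMerge(unique_users) AS unique_users
FROM analytics.daily_stats
GROUP BY date, event_type;
```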

### Step 4: Partitioning Guidelines

| Partition Expression | Typical Use | Parts Per Partition |
|---|---|---|
| `toYYYYMM(date)` | Most common — monthly | Target 10-1000 |
| `toMonday(date)` | Weekly rollups | More parts, finer drops |
| `toYYYYMMDD(date)` | Daily TTL drops | Many parts — use carefully |
| None | Small tables (<1M rows) | Fine |
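As a mental model, `toYYYYMM` collapses every timestamp in a calendar month to one integer partition key; a small TypeScript sketch of the same bucketing (the helper is ours, not a ClickHouse API):

```typescript
// Mimics ClickHouse's toYYYYMM(): year * 100 + month, so all rows from
// one calendar month share a single partition key.
function toYYYYMM(d: Date): number {
  return d.getUTCFullYear() * 100 + (d.getUTCMonth() + 1);
}

console.log(toYYYYMM(new Date(Date.UTC(2024, 2, 15)))); // → 202403
```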

**Warning:** Each partition creates separate parts on disk. Over-partitioning (e.g., by `user_id`) creates millions of tiny parts and kills performance.
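To spot over-partitioning on a live table, part counts can be checked in `system.parts` (sketch):

```sql
-- Active part count per partition; persistently high counts suggest
-- too-fine partitioning or too-small insert batches
SELECT partition, count() AS parts
FROM system.parts
WHERE database = 'analytics' AND table = 'events' AND active
GROUP BY partition
ORDER BY parts DESC
LIMIT 10;
```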

### Step 5: Codecs and Compression

```sql
-- Column-level compression codecs
column1  UInt64   CODEC(Delta, ZSTD(3)),      -- time series / sequential IDs
column2  Float64  CODEC(Gorilla, ZSTD(1)),    -- floating point (similar values)
column3  String   CODEC(ZSTD(3)),             -- general text / JSON
column4  DateTime CODEC(DoubleDelta, ZSTD),   -- timestamps (near-sequential)
```
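To measure how well these codecs actually compress an existing table, `system.columns` exposes on-disk sizes (sketch):

```sql
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = 'analytics' AND table = 'events'
ORDER BY data_compressed_bytes DESC;
```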

### Applying Schema via Node.js

```typescript
import { createClient } from '@clickhouse/client';

const client = createClient({ url: process.env.CLICKHOUSE_HOST! });

async function applySchema(): Promise<void> {
  await client.command({ query: 'CREATE DATABASE IF NOT EXISTS analytics' });

  await client.command({
    query: `
      CREATE TABLE IF NOT EXISTS analytics.events (
        event_id   UUID DEFAULT generateUUIDv4(),
        tenant_id  UInt32,
        event_type LowCardinality(String),
        user_id    UInt64,
        payload    String CODEC(ZSTD(3)),
        created_at DateTime DEFAULT now()
      )
      ENGINE = MergeTree()
      ORDER BY (tenant_id, event_type, created_at)
      PARTITION BY toYYYYMM(created_at)
    `,
  });

  console.log('Schema applied.');
}

// Run, report failures, and always release the connection
applySchema()
  .catch((err) => { console.error(err); process.exitCode = 1; })
  .finally(() => client.close());
```

## Error Handling

| Error | Cause | Solution |
|---|---|---|
| `ORDER BY expression not in primary key` | PRIMARY KEY != ORDER BY | Remove the explicit PRIMARY KEY, or make it a prefix of ORDER BY |
| `Too many parts (300+)` | Over-partitioning | Use a coarser partition expression |
| `Cannot convert String to UInt64` | Wrong data type on insert | Match insert types to the schema |
| `TTL expression type mismatch` | TTL on a non-date column | TTL must reference a Date/DateTime column |

## Resources

## Next Steps

For inserting and querying data, see clickhouse-core-workflow-b.

Source: jeremylongshore/claude-code-plugins-plus-skills