> databricks-reference-architecture

Implement Databricks reference architecture with best-practice project layout. Use when designing new Databricks projects, reviewing architecture, or establishing standards for Databricks applications. Trigger with phrases like "databricks architecture", "databricks best practices", "databricks project structure", "how to organize databricks", "databricks layout".

fetch

$curl "https://skillshub.wtf/jeremylongshore/claude-code-plugins-plus-skills/databricks-reference-architecture?format=md"

SKILL.md•databricks-reference-architecture

Databricks Reference Architecture

Overview

Production-ready lakehouse architecture with Unity Catalog, Delta Lake, and the medallion pattern. Covers workspace organization, three-level namespace governance, compute strategy, CI/CD with Asset Bundles, and project structure for team collaboration.

Prerequisites

Databricks workspace with Unity Catalog enabled
Understanding of medallion architecture (bronze/silver/gold)
Databricks CLI configured
Terraform or Asset Bundles for infrastructure

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    UNITY CATALOG                                  │
│                                                                   │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌───────────┐  │
│  │  Bronze    │  │  Silver    │  │   Gold     │  │ ML Models │  │
│  │  Catalog   │─▶│  Catalog   │─▶│  Catalog   │  │ (MLflow)  │  │
│  │  (raw)     │  │  (clean)   │  │  (curated) │  │           │  │
│  └────────────┘  └────────────┘  └────────────┘  └───────────┘  │
│       ▲                                    │                      │
│  ┌────────────┐                   ┌────────────────┐             │
│  │ Auto Loader│                   │ Model Serving  │             │
│  │ Ingestion  │                   │ Endpoints      │             │
│  └────────────┘                   └────────────────┘             │
├─────────────────────────────────────────────────────────────────┤
│  Compute: Job Clusters │ SQL Warehouses │ Instance Pools        │
├─────────────────────────────────────────────────────────────────┤
│  Security: Row Filters │ Column Masks │ Secret Scopes │ SCIM   │
├─────────────────────────────────────────────────────────────────┤
│  CI/CD: Asset Bundles │ GitHub Actions │ dev/staging/prod       │
└─────────────────────────────────────────────────────────────────┘

Project Structure

databricks-platform/
├── src/
│   ├── ingestion/
│   │   ├── bronze_raw_events.py       # Auto Loader streaming
│   │   ├── bronze_api_data.py         # REST API batch ingestion
│   │   └── bronze_file_uploads.py     # Manual file uploads
│   ├── transformation/
│   │   ├── silver_clean_events.py     # Cleansing + dedup
│   │   ├── silver_schema_enforce.py   # Schema validation
│   │   └── silver_scd2.py            # Slowly changing dimensions
│   ├── aggregation/
│   │   ├── gold_daily_metrics.py      # Business KPIs
│   │   ├── gold_user_features.py      # ML feature engineering
│   │   └── gold_reporting.py          # BI-ready views
│   └── ml/
│       ├── training/
│       │   └── train_churn_model.py
│       └── inference/
│           └── batch_scoring.py
├── tests/
│   ├── conftest.py                    # Spark fixtures
│   ├── unit/                          # Local Spark tests
│   └── integration/                   # Databricks Connect tests
├── resources/
│   ├── etl_jobs.yml                   # ETL job definitions
│   ├── ml_jobs.yml                    # ML pipeline definitions
│   └── maintenance.yml                # OPTIMIZE/VACUUM schedules
├── databricks.yml                     # Asset Bundle root config
├── pyproject.toml
└── requirements.txt

Instructions

Step 1: Unity Catalog Hierarchy

-- One catalog per environment (or shared with schema isolation)
CREATE CATALOG IF NOT EXISTS dev_catalog;
CREATE CATALOG IF NOT EXISTS prod_catalog;

-- Medallion schemas per catalog
CREATE SCHEMA IF NOT EXISTS prod_catalog.bronze;
CREATE SCHEMA IF NOT EXISTS prod_catalog.silver;
CREATE SCHEMA IF NOT EXISTS prod_catalog.gold;
CREATE SCHEMA IF NOT EXISTS prod_catalog.ml_features;
CREATE SCHEMA IF NOT EXISTS prod_catalog.ml_models;

-- Permissions: engineers write bronze/silver, analysts read gold
GRANT USAGE ON CATALOG prod_catalog TO `data-engineers`;
GRANT CREATE, MODIFY, SELECT ON SCHEMA prod_catalog.bronze TO `data-engineers`;
GRANT CREATE, MODIFY, SELECT ON SCHEMA prod_catalog.silver TO `data-engineers`;
GRANT SELECT ON SCHEMA prod_catalog.gold TO `data-engineers`;

GRANT USAGE ON CATALOG prod_catalog TO `data-analysts`;
GRANT SELECT ON SCHEMA prod_catalog.gold TO `data-analysts`;

Step 2: Asset Bundle Configuration

# databricks.yml
bundle:
  name: data-platform

workspace:
  host: ${DATABRICKS_HOST}

include:
  - resources/*.yml

variables:
  catalog:
    default: dev_catalog
  alert_email:
    default: dev@company.com

targets:
  dev:
    default: true
    mode: development
    workspace:
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/dev

  staging:
    variables:
      catalog: staging_catalog

  prod:
    mode: production
    variables:
      catalog: prod_catalog
      alert_email: oncall@company.com
    workspace:
      root_path: /Shared/.bundle/${bundle.name}/prod

Step 3: Compute Strategy

# resources/etl_jobs.yml
resources:
  jobs:
    daily_etl:
      name: "daily-etl-${bundle.target}"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
      max_concurrent_runs: 1

      tasks:
        - task_key: bronze
          notebook_task:
            notebook_path: src/ingestion/bronze_raw_events.py
          job_cluster_key: etl

        - task_key: silver
          depends_on: [{task_key: bronze}]
          notebook_task:
            notebook_path: src/transformation/silver_clean_events.py
          job_cluster_key: etl

        - task_key: gold
          depends_on: [{task_key: silver}]
          notebook_task:
            notebook_path: src/aggregation/gold_daily_metrics.py
          job_cluster_key: etl

      job_clusters:
        - job_cluster_key: etl
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "i3.xlarge"
            autoscale:
              min_workers: 1
              max_workers: 4
            aws_attributes:
              availability: SPOT_WITH_FALLBACK
              first_on_demand: 1
            spark_conf:
              spark.databricks.delta.optimizeWrite.enabled: "true"
              spark.databricks.delta.autoCompact.enabled: "true"

Step 4: Medallion Pipeline Pattern

# src/ingestion/bronze_raw_events.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name

spark = SparkSession.builder.getOrCreate()

# Bronze: Auto Loader for incremental file ingestion
raw = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/bronze/events/schema")
    .option("cloudFiles.inferColumnTypes", "true")
    .load("s3://data-lake/raw/events/")
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", input_file_name())
)

(raw.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/bronze/events/data")
    .toTable("prod_catalog.bronze.raw_events"))

Step 5: Table Maintenance Schedule

# resources/maintenance.yml
resources:
  jobs:
    weekly_optimize:
      name: "maintenance-optimize-${bundle.target}"
      schedule:
        quartz_cron_expression: "0 0 2 ? * SUN"
        timezone_id: "UTC"
      tasks:
        - task_key: optimize_tables
          notebook_task:
            notebook_path: src/maintenance/optimize_tables.py
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "m5.xlarge"
            num_workers: 1

# src/maintenance/optimize_tables.py
tables_to_optimize = [
    ("prod_catalog.silver.orders", ["order_date", "region"]),
    ("prod_catalog.silver.events", ["event_date"]),
    ("prod_catalog.gold.daily_metrics", []),
]

for table, z_cols in tables_to_optimize:
    if z_cols:
        spark.sql(f"OPTIMIZE {table} ZORDER BY ({', '.join(z_cols)})")
    else:
        spark.sql(f"OPTIMIZE {table}")
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")
    print(f"Maintained: {table}")

Output

Unity Catalog hierarchy with env-isolated catalogs and medallion schemas
Asset Bundle with dev/staging/prod targets and variable overrides
Medallion pipeline (Auto Loader > MERGE > aggregations)
RBAC grants separating engineer write from analyst read-only
Table maintenance schedule (weekly OPTIMIZE + VACUUM)

Error Handling

Issue	Cause	Solution
Schema evolution failure	New source columns	Auto Loader handles with `schemaEvolutionMode`
Permission denied on schema	Missing `USAGE` on parent catalog	`GRANT USAGE ON CATALOG` first
Concurrent write conflict	Multiple jobs writing same table	`max_concurrent_runs: 1` in job config
Cluster timeout	Long-running tasks	Set `timeout_seconds` per task

Examples

Validate Data Flow

SELECT 'bronze' AS layer, COUNT(*) AS rows FROM prod_catalog.bronze.raw_events
UNION ALL SELECT 'silver', COUNT(*) FROM prod_catalog.silver.events
UNION ALL SELECT 'gold', COUNT(*) FROM prod_catalog.gold.daily_metrics;

Resources

> related_skills --same-repo

> fathom-cost-tuning

Optimize Fathom API usage and plan selection. Trigger with phrases like "fathom cost", "fathom pricing", "fathom plan".

> fathom-core-workflow-b

Sync Fathom meeting data to CRM and build automated follow-up workflows. Use when integrating Fathom with Salesforce, HubSpot, or custom CRMs, or creating automated post-meeting email summaries. Trigger with phrases like "fathom crm sync", "fathom salesforce", "fathom follow-up", "fathom post-meeting workflow".

> fathom-core-workflow-a

Build a meeting analytics pipeline with Fathom transcripts and summaries. Use when extracting insights from meetings, building CRM sync, or creating automated meeting follow-up workflows. Trigger with phrases like "fathom analytics", "fathom meeting pipeline", "fathom transcript analysis", "fathom action items sync".

> fathom-common-errors

Diagnose and fix Fathom API errors including auth failures and missing data. Use when API calls fail, transcripts are empty, or webhooks are not firing. Trigger with phrases like "fathom error", "fathom not working", "fathom api failure", "fix fathom".

┌ stats

installs/wk0

░░░░░░░░░░

github stars2.4K

██████████

first seenMar 23, 2026

└────────────

┌ repo

jeremylongshore/claude-code-plugins-plus-skills

by jeremylongshore

└────────────