> coreweave-observability

Set up GPU monitoring and observability for CoreWeave workloads. Use when implementing GPU metrics dashboards, configuring alerts, or tracking inference latency and throughput. Trigger with phrases like "coreweave monitoring", "coreweave observability", "coreweave gpu metrics", "coreweave grafana".

CoreWeave Observability

GPU Metrics (DCGM Exporter)

CoreWeave Kubernetes Service (CKS) clusters come with the NVIDIA DCGM exporter pre-installed. Key metrics:

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (%) |
| DCGM_FI_DEV_FB_USED | GPU framebuffer memory used (MiB) |
| DCGM_FI_DEV_FB_FREE | GPU framebuffer memory free (MiB) |
| DCGM_FI_DEV_POWER_USAGE | Power consumption (W) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
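
To read these metrics outside a dashboard, one option is the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The sketch below builds the query URL and parses a response of the documented vector shape. The Prometheus address, the `Hostname` and `gpu` label names, and the sample numbers are assumptions for illustration; check the labels your exporter actually emits.

```python
from __future__ import annotations

import json
from urllib.parse import urlencode

# Hypothetical in-cluster Prometheus address; replace with your own.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload: str) -> dict[tuple[str, str], float]:
    """Map (Hostname, gpu) label pairs to the sample value of each series."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    samples = {}
    for series in body["data"]["result"]:
        labels = series["metric"]
        # Label names are assumed; missing labels fall back to "".
        key = (labels.get("Hostname", ""), labels.get("gpu", ""))
        # Each instant-vector value is [unix_timestamp, "value as string"].
        samples[key] = float(series["value"][1])
    return samples


# Made-up payload in the shape the API returns for an instant vector.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL",
                        "Hostname": "node-a", "gpu": "0"},
             "value": [1700000000, "87"]},
            {"metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL",
                        "Hostname": "node-a", "gpu": "1"},
             "value": [1700000000, "12"]},
        ],
    },
})

if __name__ == "__main__":
    print(instant_query_url(PROMETHEUS_URL, "DCGM_FI_DEV_GPU_UTIL"))
    print(parse_vector(SAMPLE))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the response text to `parse_vector` would give per-GPU readings; the network call is left out so the sketch runs without a cluster.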

Prometheus Alert Rules

groups:
  - name: coreweave-gpu
    rules:
      - alert: GPUUtilizationLow
        expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "GPU utilization below 20% for 30min -- consider scaling down"

      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "GPU memory >95% -- risk of OOM"

      - alert: InferencePodDown
        expr: kube_deployment_status_replicas_available{deployment=~".*inference.*"} == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Inference deployment has 0 available replicas for 2min"
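
To sanity-check the thresholds before deploying the rules, the three expressions can be mirrored in plain Python and run against sample readings. This is only an illustrative sketch: the function names are hypothetical, the sample values are made up, and the `for:` hold durations are not modeled.

```python
from statistics import mean


def gpu_utilization_low(util_percent: list[float], threshold: float = 20.0) -> bool:
    """Mirror of avg(DCGM_FI_DEV_GPU_UTIL) < 20.

    The average is taken over all GPUs, so a single idle GPU among
    busy ones does not trigger the condition.
    """
    if not util_percent:
        return False  # no series, so the expression returns nothing
    return mean(util_percent) < threshold


def gpu_memory_high(used_mib: float, free_mib: float, threshold: float = 0.95) -> bool:
    """Mirror of FB_USED / (FB_USED + FB_FREE) > 0.95 for one GPU."""
    total = used_mib + free_mib
    return total > 0 and used_mib / total > threshold


def inference_pod_down(available_replicas: int) -> bool:
    """Mirror of kube_deployment_status_replicas_available == 0."""
    return available_replicas == 0


if __name__ == "__main__":
    print(gpu_utilization_low([5.0, 90.0, 95.0]))  # one idle GPU, busy fleet
    print(gpu_utilization_low([5.0, 10.0, 15.0]))  # whole fleet idle
    print(gpu_memory_high(78000, 2000))            # 97.5% of framebuffer used
    print(inference_pod_down(0))
```

Because `GPUUtilizationLow` averages across every GPU that reports the metric, a per-node or per-pod view would need a grouped expression instead; that change is outside what the rules above do.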

Resources

Next Steps

For incident response, see coreweave-incident-runbook.

Repo: jeremylongshore/claude-code-plugins-plus-skills (by jeremylongshore)