Run distributed GPU training jobs on CoreWeave with multi-node PyTorch. Use when training models across multiple GPUs, setting up distributed training, or running fine-tuning jobs on CoreWeave H100 clusters. Trigger with phrases like "coreweave training", "coreweave multi-gpu", "distributed training coreweave", "fine-tune on coreweave".


CoreWeave Core Workflow: GPU Training

Overview

Run distributed GPU training on CoreWeave: single-node multi-GPU and multi-node training with PyTorch DDP, Slurm-on-Kubernetes, and shared storage.

Prerequisites

  • CKS cluster with multi-GPU node pools (8xA100 or 8xH100)
  • Shared storage (CoreWeave PVC or NFS)
  • Training container with PyTorch and NCCL

Instructions

Step 1: Single-Node Multi-GPU Training

# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: ghcr.io/myorg/trainer:latest
          command: ["torchrun"]
          args:
            - "--nproc_per_node=8"
            - "train.py"
            - "--model_name=meta-llama/Llama-3.1-8B"
            - "--batch_size=4"
            - "--epochs=3"
          resources:
            limits:
              nvidia.com/gpu: "8"
              memory: 512Gi
              cpu: "64"
          volumeMounts:
            - name: data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: training-data
        - name: checkpoints
          persistentVolumeClaim:
            claimName: model-checkpoints
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu.nvidia.com/class
                    operator: In
                    # Adjust to the node classes available in your region
                    values: ["A100_NVLINK", "A100_SXM4_80GB"]
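torchrun launches eight copies of train.py and hands each one its rank through environment variables, which the script reads before initializing the process group. A minimal sketch of that bootstrap (the helper names are illustrative, not part of CoreWeave's tooling; the gloo fallback lets the same script start on CPU-only machines):

```python
import os

def dist_env():
    """Read the env vars torchrun injects into every worker.
    Defaults make the helper safe for plain `python train.py` runs."""
    return {
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
    }

def init_distributed():
    """DDP bootstrap train.py would run at start-up. torch is imported
    lazily so dist_env() stays usable without it installed."""
    env = dist_env()
    import torch
    import torch.distributed as dist
    if torch.cuda.is_available():
        torch.cuda.set_device(env["local_rank"])
        dist.init_process_group(backend="nccl")
    else:
        dist.init_process_group(backend="gloo")
    return env
```

After init, the script would wrap the model in DistributedDataParallel and give its DataLoader a DistributedSampler so each of the 8 ranks sees a distinct shard.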

Step 2: Persistent Storage for Training Data

# storage.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 500Gi
  storageClassName: shared-hdd-ord1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-checkpoints
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 200Gi
  storageClassName: shared-ssd-ord1
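The 200Gi checkpoint claim can be sanity-checked with a rule of thumb: a full training checkpoint stores weights plus optimizer state, roughly 14 bytes per parameter for bf16 weights with fp32 master weights and Adam moments (an assumption; the exact figure depends on your trainer and optimizer). A quick sizing sketch:

```python
def checkpoint_gib(n_params: int, bytes_per_param: int = 14) -> float:
    """Approximate on-disk size of one full checkpoint, in GiB.
    14 B/param assumes bf16 weights (2) + fp32 master copy (4)
    + fp32 Adam moments (8); adjust for your setup."""
    return n_params * bytes_per_param / 2**30
```

For the 8B model in the job above, checkpoint_gib(8_000_000_000) comes to about 104 GiB, so a 200Gi claim holds one full checkpoint with headroom; keep more than one epoch's worth only if you prune older saves.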

Step 3: Monitor Training Progress

# Watch training logs
kubectl logs -f job/llm-finetune

# Check GPU utilization
kubectl exec -it $(kubectl get pod -l job-name=llm-finetune -o name) -- nvidia-smi

# Check training metrics
kubectl exec $(kubectl get pod -l job-name=llm-finetune -o name) -- \
  tail -n 5 /checkpoints/training_log.json
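If the trainer writes metrics as JSON lines (assumed here; adapt the parsing to your log format), a small helper can summarize the most recent records instead of eyeballing raw tail output:

```python
import json
from pathlib import Path

def tail_metrics(log_path: str, n: int = 5) -> list:
    """Parse the last n records of a line-delimited JSON training log,
    skipping blank lines."""
    lines = Path(log_path).read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    return records[-n:]
```

Running this against a copy of /checkpoints/training_log.json (e.g. fetched with kubectl cp) gives structured records you can plot or diff across runs.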

Error Handling

Error                    Cause                        Solution
NCCL timeout             Network issue between GPUs   Use NVLink nodes (SXM4/SXM5)
OOMKilled                Batch size too large         Reduce batch size or use gradient accumulation
Checkpoint save failed   PVC full                     Increase storage or prune old checkpoints
Job evicted              Preemption                   Use on-demand nodes for training
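For the OOMKilled case, gradient accumulation keeps the effective global batch constant while lowering per-GPU memory. The required step count is simple arithmetic (a sketch; the function name is illustrative):

```python
import math

def accum_steps(target_global_batch: int, per_gpu_batch: int,
                world_size: int) -> int:
    """Micro-batches to accumulate per optimizer step so that
    per_gpu_batch * world_size * steps >= target_global_batch."""
    return math.ceil(target_global_batch / (per_gpu_batch * world_size))
```

With the job above (batch_size=4 on 8 GPUs, global batch 32), halving to batch_size=2 needs accum_steps(32, 2, 8) == 2 accumulation steps to preserve the same effective batch.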

Next Steps

For troubleshooting, see coreweave-common-errors.


Source: jeremylongshore/claude-code-plugins-plus-skills