> k8s-incident

Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.


Kubernetes Incident Response

Runbooks and diagnostic workflows for common Kubernetes incidents.

When to Apply

Use this skill when:

  • User mentions: "incident", "outage", "emergency", "down", "not working"
  • Operations: emergency response, production issues, service degradation
  • Keywords: "urgent", "broken", "fix", "restore", "recover"

Priority Rules

| Priority | Rule | Impact | Tools |
|----------|------|--------|-------|
| 1 | Check control plane first | CRITICAL | get_pods(namespace="kube-system") |
| 2 | Assess node health | CRITICAL | get_nodes |
| 3 | Gather events before changes | HIGH | get_events |
| 4 | Document timeline | HIGH | Manual notes |
| 5 | Rollback if safe | MEDIUM | rollback_deployment |

Quick Reference

| Incident | First Tool | Next Steps |
|----------|------------|------------|
| Pod failure | get_pod_logs(previous=True) | describe_pod, get_events |
| Node down | describe_node | Check kubelet logs |
| Service unreachable | get_endpoints | get_network_policies |
| Control plane | get_pods(namespace="kube-system") | Check API server logs |

Incident Triage

Quick Health Check

get_nodes()
get_pods(namespace="kube-system")
get_events(namespace)

Severity Assessment

| Indicator | Severity | Action |
|-----------|----------|--------|
| Multiple nodes NotReady | Critical | Escalate immediately |
| kube-system pods failing | Critical | Control plane issue |
| Single pod CrashLoop | Medium | Debug pod |
| High latency | Medium | Check resources |
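
The severity table above can be read as an ordered set of checks. The following is a minimal sketch of that reading, not part of the skill itself: `ClusterSnapshot` and `assess_severity` are hypothetical names, and the counts are assumed to have been gathered by hand from the quick health check output.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """Hypothetical summary of the quick health check results."""
    not_ready_nodes: int = 0
    failing_kube_system_pods: int = 0
    crashlooping_pods: int = 0
    high_latency: bool = False


def assess_severity(snapshot: ClusterSnapshot) -> tuple[str, str]:
    """Return (severity, action) following the severity table.

    Checks run from most to least severe, so the first match wins.
    """
    if snapshot.not_ready_nodes > 1:
        return ("Critical", "Escalate immediately")
    if snapshot.failing_kube_system_pods > 0:
        return ("Critical", "Control plane issue")
    # Assumption: the table lists only the single-pod case; several
    # crash-looping pods are treated the same way in this sketch.
    if snapshot.crashlooping_pods >= 1:
        return ("Medium", "Debug pod")
    if snapshot.high_latency:
        return ("Medium", "Check resources")
    # Anything else (for example one NotReady node) is not in the table.
    return ("Unclassified", "Not covered by the table; assess manually")
```

Ordering the checks from critical to medium means a cluster with both a failing control plane and a crash-looping pod is reported as critical rather than medium.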

Runbook: Pod Failures

CrashLoopBackOff

get_pod_logs(name, namespace, previous=True)
describe_pod(name, namespace)
get_events(namespace, field_selector="involvedObject.name=<pod>")
get_pod_metrics(name, namespace)
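
The `field_selector` argument takes a Kubernetes field selector, which is a comma-separated list of `field=value` pairs. A small sketch of building one follows; `build_field_selector` is a hypothetical helper, not one of the skill's tools, and whether the tool accepts multiple comma-separated pairs is an assumption based on the Kubernetes API.

```python
def build_field_selector(fields: dict[str, str]) -> str:
    """Join field/value pairs into a Kubernetes field selector string.

    Pairs are combined with commas, which the API server treats as AND.
    """
    return ",".join(f"{key}={value}" for key, value in fields.items())


# Narrow events to one pod so events for an unrelated object that happens
# to share the name are excluded.
selector = build_field_selector({
    "involvedObject.kind": "Pod",
    "involvedObject.name": "web-1",  # hypothetical pod name
})
```

The resulting string can then be passed as `field_selector=selector` in place of the hand-written value.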

Common Causes:

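  • OOMKilled → Container exceeded its memory limit; raise the limit or reduce memory use
  • Exit code 1 → Application error; check the previous container's logs
  • Exit code 137 → Terminated by SIGKILL (128 + 9), often the OOM killer
  • Exit code 143 → Terminated by SIGTERM (128 + 15), usually a requested shutdown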
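
Exit codes above 128 follow the shell convention of 128 plus the signal number, which is where 137 and 143 come from. The sketch below decodes a code along those lines; `explain_exit_code` is a hypothetical helper, and the signal table deliberately covers only the two signals this runbook mentions.

```python
# Standard Linux signal numbers; only the ones this runbook mentions.
SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Describe a container exit code in the terms used by this runbook."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        # Convention: exit code = 128 + number of the terminating signal.
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    return "application error; check the previous container's logs"
```

A SIGKILL result does not by itself prove an out-of-memory kill; the pod's last state reason (OOMKilled) in the describe output is what confirms it.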

ImagePullBackOff

describe_pod(name, namespace)
get_secrets(namespace)

Pending Pod

describe_pod(name, namespace)
get_nodes()
get_events(namespace)

Runbook: Node Issues

Node NotReady

describe_node(name)
get_events(namespace="", field_selector="involvedObject.name=<node>")
node_logs_tool(name, "kubelet")

Node DiskPressure

describe_node(name)
get_pods(field_selector="spec.nodeName=<node>")
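
Both node runbooks come down to reading the node's conditions. As an illustration, the sketch below flags conditions that deviate from their healthy value. It assumes the conditions have already been extracted into a list of dictionaries shaped like the `status.conditions` field of a Node object; how `describe_node` formats its output is not specified here, and `unhealthy_conditions` is a hypothetical helper.

```python
# Healthy status for each standard node condition type.
HEALTHY_STATUS = {
    "Ready": "True",
    "MemoryPressure": "False",
    "DiskPressure": "False",
    "PIDPressure": "False",
    "NetworkUnavailable": "False",
}


def unhealthy_conditions(conditions: list[dict]) -> list[str]:
    """Return the condition types whose status differs from the healthy value.

    Each entry needs at least "type" and "status" keys. A status of
    "Unknown" counts as unhealthy because it differs from the expected value.
    """
    problems = []
    for cond in conditions:
        expected = HEALTHY_STATUS.get(cond["type"])
        if expected is not None and cond["status"] != expected:
            problems.append(cond["type"])
    return problems
```

For example, a node reporting `Ready=True` and `DiskPressure=True` yields `["DiskPressure"]`, which points to the DiskPressure runbook rather than the NotReady one.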

Runbook: Network Issues

Service Not Accessible

get_services(namespace)
get_endpoints(namespace)
get_pods(namespace, label_selector="<service-selector>")
get_network_policies(namespace)
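
A frequent reason for a Service having no endpoints is a selector that matches no pod labels, which is why the steps above compare the Service with the pods it should select. The sketch below shows that comparison on plain dictionaries; `selector_matches` and `pods_backing_service` are hypothetical helpers, and the label data is assumed to have been copied from the tool output.

```python
def selector_matches(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """True if every key/value pair in the Service selector is on the pod."""
    if not selector:
        # A Service without a selector gets no automatically managed endpoints.
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())


def pods_backing_service(
    selector: dict[str, str], pods: dict[str, dict[str, str]]
) -> list[str]:
    """Return the names of pods whose labels satisfy the selector."""
    return [name for name, labels in pods.items() if selector_matches(selector, labels)]
```

An empty result means the endpoints list will be empty too, so the fix is in the labels or the selector rather than in network policies.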

DNS Resolution Failures

get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
get_pod_logs("coredns-xxx", "kube-system")

With Cilium

cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)

With Istio

istio_analyze_tool(namespace)
istio_proxy_status_tool()

Runbook: Storage Issues

PVC Pending

describe_pvc(name, namespace)
get_storage_classes()
get_events(namespace)

Pod Stuck in ContainerCreating

describe_pod(name, namespace)
get_pvc(namespace)
get_events(namespace)

Runbook: Control Plane Issues

API Server Unavailable

get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
get_events(namespace="kube-system")

etcd Issues

get_pods(namespace="kube-system", label_selector="component=etcd")
get_pod_logs("etcd-xxx", "kube-system")

Emergency Actions

Force Delete Pod

# Skips graceful termination; use only when a pod is stuck in Terminating
delete_pod(name, namespace, grace_period=0, force=True)

Rollback Deployment

# revision=0 is assumed to select the previous revision, as with kubectl rollout undo
rollback_deployment(name, namespace, revision=0)

Helm Rollback

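# revision=1 targets release revision 1; substitute the last known-good revision from the release history
rollback_helm_release(name, namespace, revision=1)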

Diagnostic Collection Script

For comprehensive incident diagnostics, see scripts/collect-diagnostics.py.

Multi-Cluster Incident Response

Check all clusters:

for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)
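
The loop above discards its results and would stop at the first unreachable cluster. One way to keep per-cluster results and continue past failures is sketched below. `check_clusters` is a hypothetical helper; the checks are passed in as plain callables so the sketch does not assume how the skill's tools are invoked from Python.

```python
from typing import Any, Callable


def check_clusters(
    contexts: list[str],
    checks: dict[str, Callable[[str], Any]],
) -> dict[str, dict[str, Any]]:
    """Run every check against every context and keep the results per cluster.

    A failing check is recorded as its error text instead of aborting the
    loop, so one unreachable cluster does not hide the state of the others.
    """
    report: dict[str, dict[str, Any]] = {}
    for context in contexts:
        results: dict[str, Any] = {}
        for check_name, check in checks.items():
            try:
                results[check_name] = check(context)
            except Exception as exc:  # keep going during an incident
                results[check_name] = f"ERROR: {exc}"
        report[context] = results
    return report
```

A check would typically wrap one tool call, for example `{"nodes": lambda ctx: get_nodes(context=ctx)}`, assuming the tools are callable in that form.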

Post-Incident

Document Timeline

  1. When did the incident start?
  2. What was the impact?
  3. What was the root cause?
  4. What fixed it?

Prevent Recurrence

  • Add monitoring/alerting
  • Improve resource limits
  • Add readiness probes
  • Document runbook

Repository: rohitg00/kubectl-mcp-server (by rohitg00)
