etcd Is the Brain of Your Cluster — Here's My 10-Minute Backup Routine

Tags: kubernetes, etcd, backups, production, cka

Before we get to the 10-minute setup, you need to understand why this backup matters more than any other in your cluster.

Why etcd matters more than anything else

etcd stores the desired state of your entire cluster. Every kubectl apply, kubectl delete, and kubectl edit you've ever run — the resulting state lives in etcd. It's the single source of truth for everything Kubernetes knows about your cluster.

If it's lost or corrupted, your cluster doesn't have that information anymore. It can't process new requests. It doesn't know what should be running or where.

So how do you make sure you're never in that situation? Snapshots.

The snapshot command

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=<trusted-ca-file> --cert=<cert-file> --key=<key-file> \
  snapshot save <backup-file-location>

The cacert, cert, and key file paths can be pulled from the etcd pod spec — kubectl describe pod -n kube-system <etcd-pod> will show them.
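
For example, on a kubeadm cluster the defaults usually look like this (adjust the paths to whatever your pod spec actually shows):

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/backups/etcd-snapshot.db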

Takes about 10 minutes to set up properly the first time. Seconds to run after that.
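
It's also worth sanity-checking a snapshot right after you take it. A quick check, assuming etcd v3.5+ where etcdutl ships alongside etcdctl:

# Prints the snapshot's hash, revision, total keys, and size
etcdutl snapshot status /var/backups/etcd-snapshot.db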

One thing to note: encrypt the snapshot

Unless you've configured encryption at rest, etcd snapshots contain every Kubernetes Secret in plaintext. Base64 is encoding, not encryption.

Don't leave them sitting on the same node as etcd or in an unencrypted bucket. Move them offsite — S3 with KMS, or whatever your equivalent is. A snapshot that doesn't survive a node failure isn't really a backup.
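
One possible sketch, assuming the AWS CLI and an existing KMS key (bucket name and key ID are placeholders):

# Ship the snapshot offsite with server-side encryption via KMS
aws s3 cp /var/backups/etcd-snapshot.db \
  s3://<backup-bucket>/etcd/etcd-snapshot-$(date +%F).db \
  --sse aws:kms --sse-kms-key-id <kms-key-id>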

How often?

Daily via cron is the baseline in production. For high-churn clusters with frequent deploys or lots of CRDs being installed, bump it to hourly.
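
A minimal sketch of the daily baseline, assuming a wrapper script at /usr/local/bin/etcd-backup.sh that runs the snapshot command above and ships the file offsite:

# /etc/cron.d/etcd-backup: daily snapshot at 02:00, run as root
0 2 * * * root /usr/local/bin/etcd-backup.sh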

Disk is cheap. Rebuilding your cluster from memory isn't.

The restore

A backup means nothing without knowing how to restore. Here's how:

etcdutl --data-dir <data-dir-location> snapshot restore snapshot.db

But be careful. Never restore while API servers are running.

The correct order is:

  1. Stop all API server instances
  2. Restore state in all etcd instances
  3. Restart the API server instances

Skipping this order causes data inconsistencies that are worse than the original problem.
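
On a single-node kubeadm control plane, where the components run as static pods, that sequence looks roughly like this (paths are the kubeadm defaults; yours may differ):

# 1. Stop the API server by moving its static pod manifest aside
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

# 2. Restore the snapshot into a fresh data directory
etcdutl --data-dir /var/lib/etcd-restored snapshot restore /var/backups/etcd-snapshot.db

# Then point etcd at the restored directory by editing the data
# hostPath in /etc/kubernetes/manifests/etcd.yaml to /var/lib/etcd-restored

# 3. Bring the API server back by restoring its manifest
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/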

High-availability clusters

In production clusters you typically run three or five etcd members for high availability. Because the keyspace is replicated, a snapshot from any single healthy member captures the full cluster state. The catch comes at restore time: every member must be re-initialized from that same snapshot, not just the one that failed.
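
A sketch of that re-initialization for one member of a three-member cluster, using etcd's documented restore flags (names and peer URLs are placeholders; repeat on each member with its own --name and peer URL):

etcdutl snapshot restore /var/backups/etcd-snapshot.db \
  --name etcd-0 \
  --initial-cluster etcd-0=https://10.0.0.1:2380,etcd-1=https://10.0.0.2:2380,etcd-2=https://10.0.0.3:2380 \
  --initial-advertise-peer-urls https://10.0.0.1:2380 \
  --data-dir /var/lib/etcd-restored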

This is the one backup that matters more than any other in the cluster.


Do you back up your etcd regularly? If you don't, you're one corrupted node away from starting over.

Originally shared on LinkedIn.