Writing
Blog
Three series. Real problems from production, what broke, and what held.
K8s with Divine
Upgrading Kubernetes in Production Without Downtime. The Order of Operations Is Everything.
The commands are the easy part. Sequencing them so workloads stay live is what matters. Pre-flight checks, control plane first, drain with PDBs, plus the gotchas managed-K8s docs leave out.
6 min read
AWS Daily with Divine
Your Secrets Manager Bill Has Email Addresses In It. Look Here First.
Most teams default to Secrets Manager for every config value. Parameter Store is free for most of that. The cost difference is roughly 40x per entry.
5 min read
K8s with Divine
I Built a Production-Shaped EKS Cluster with Terraform. Here's Everything That Bit Me.
From-scratch EKS with Terraform. Subnet placement, OIDC + IRSA, cross-account ENIs, and the two settings that hide until kubectl times out from your laptop.
7 min read
K8s with Divine
Kubernetes Service Accounts Should Be Boring. Most Teams Make Them a Risk.
Every pod gets a service account token mounted by default. That token is an identity, and identities can be escalated. Here's how to lock it down.
2 min read
AWS Daily with Divine
Cost Explorer Shows $800/Month in Data Transfer You Can't Explain. Look Here First.
Most teams pay NAT Gateway data-processing charges on S3 traffic without realizing it. The fix is an S3 Gateway Endpoint, and it's free.
3 min read
K8s with Divine
etcd Is the Brain of Your Cluster: Here's My 10-Minute Backup Routine
etcd is your cluster's source of truth: every Secret, Deployment, and config lives there. Here's the 10-minute backup routine I set once and never skip.
2 min read
K8s with Divine
Kubernetes Will Evict Your Pods in a Specific Order
Most engineers think it's random. It's not. Pod eviction order is determined by QoS class, and it decides what gets killed first when a node runs out of resources.
2 min read
Notes from Production
How I Think About Blast Radius Before I Ship Anything to Production
Four questions before every deploy: What fails if this breaks? Who is affected? How fast can we detect it? How fast can we recover?
2 min read
AWS Daily with Divine
RDS Multi-AZ Failover Took 6 Minutes. Your SLA Requires 2.
Multi-AZ promotes the standby in 60 to 120 seconds. DNS caching, connection pools, and missing retries quietly stretch recovery beyond your SLA.
2 min read
K8s with Divine
DaemonSets Aren't Just for Logging: Three Production Use Cases
Most engineers think DaemonSets are for logs. They're for any node-level concern: monitoring, network policy, runtime security.
2 min read
AWS Daily with Divine
CloudWatch Alarms Are Firing. You Open the Dashboard and See Nothing.
Three reasons your alarm fired without leaving evidence on the metric graph, and why ignoring them trains engineers to stop taking alarms seriously.
2 min read
K8s with Divine
We Didn't Set Resource Requests on Our Pods in Production. Here's Exactly What Happened.
Without requests, the scheduler places pods blindly and noisy neighbors take down everything on the node. Without limits, there's no ceiling.
2 min read
Notes from Production
From 0 to 7,500 Users on a WhatsApp Banking Platform: What Broke, What Held
Three months building Kira AI in production. The technical decisions mattered less than I expected. The product and operational decisions mattered more.
1 min read
AWS Daily with Divine
Auto Scaling Is Adding Instances. Response Times Are Still Climbing.
Scaling kicks in, new instances launch, but response times keep rising and you can't understand why. The gap between InService and actually ready is where this lives.
2 min read
AWS Daily with Divine
API Gateway Latency Spikes Every 30 Minutes Like Clockwork
If your latency spikes happen randomly, it's something else. If they happen every 25–30 minutes during low traffic, it's almost certainly Lambda cold starts.
2 min read
K8s with Divine
PVC to PV Is a One-to-One Relationship: Here's What That Means in Production
Two storage behaviors catch people completely off guard in production: the sizing trap, and the Released state trap. Know both before they bite you.
2 min read
AWS Daily with Divine
VPC Peering Configured. Route Tables Look Correct. Instances Still Can't Communicate.
Four things must be in place for VPC peering to actually work, and the most common culprit by far is the second one.
2 min read
K8s with Divine
The First Thing I Check When a Pod Is in CrashLoopBackOff. It's Not the Logs.
Logs only help if the app ran long enough to produce output. The exit code tells you why it died at the OS level, before any logs exist.
1 min read