Blog

Writing

Three series. Real problems from production, what broke, and what held.

AWS Daily with Divine

7 posts

K8s with Divine

9 posts

Notes from Production

2 posts

K8s with Divine
Upgrading Kubernetes in Production Without Downtime. The Order of Operations Is Everything.
May 14, 2026
The commands are the easy part. Sequencing them so workloads stay live is what matters. Pre-flight checks, control plane first, drain with PDBs, plus the gotchas managed-K8s docs leave out.
6 min read
AWS Daily with Divine
Your Secrets Manager Bill Has Email Addresses In It. Look Here First.
May 13, 2026
Most teams default to Secrets Manager for every config value. Parameter Store is free for most of that. The cost difference is roughly 40x per entry.
5 min read
K8s with Divine
I Built a Production-Shaped EKS Cluster with Terraform. Here's Everything That Bit Me.
May 12, 2026
From-scratch EKS with Terraform. Subnet placement, OIDC + IRSA, cross-account ENIs, and the two settings that hide until kubectl times out from your laptop.
7 min read
K8s with Divine
Kubernetes Service Accounts Should Be Boring. Most Teams Make Them a Risk.
May 12, 2026
Every pod gets a service account token mounted by default. That token is an identity, and identities can be escalated. Here's how to lock it down.
2 min read
AWS Daily with Divine
Cost Explorer Shows $800/Month in Data Transfer You Can't Explain. Look Here First.
May 9, 2026
Most teams pay NAT Gateway data-processing charges on S3 traffic without realizing it. The fix is an S3 Gateway Endpoint, and it's free.
3 min read
K8s with Divine
etcd Is the Brain of Your Cluster, Here's My 10-Minute Backup Routine
May 7, 2026
etcd is your cluster's source of truth: every Secret, Deployment, and config lives there. Here's the 10-minute backup routine I set once and never skip.
2 min read
K8s with Divine
Kubernetes Will Evict Your Pods in a Specific Order
Apr 15, 2026
Most engineers think it's random. It's not. QoS class determines pod eviction order, and with it what gets killed first when a node runs out of resources.
2 min read
Notes from Production
How I Think About Blast Radius Before I Ship Anything to Production
Apr 13, 2026
Four questions before every deploy: What fails if this breaks? Who is affected? How fast can we detect it? How fast can we recover?
2 min read
AWS Daily with Divine
RDS Multi-AZ Failover Took 6 Minutes. Your SLA Requires 2.
Apr 12, 2026
Multi-AZ promotes the standby in 60 to 120 seconds. DNS caching, connection pools, and missing retries quietly stretch recovery beyond your SLA.
2 min read
K8s with Divine
DaemonSets Aren't Just for Logging, Three Production Use Cases
Apr 11, 2026
Most engineers think DaemonSets are for logs. They're for any node-level concern: monitoring, network policy, runtime security.
2 min read
AWS Daily with Divine
CloudWatch Alarms Are Firing. You Open the Dashboard and See Nothing.
Apr 10, 2026
Three reasons your alarm fired without leaving evidence on the metric graph, and why ignoring them trains engineers to stop taking alarms seriously.
2 min read
K8s with Divine
We Didn't Set Resource Requests on Our Pods in Production. Here's Exactly What Happened.
Apr 9, 2026
Without requests, the scheduler places pods blindly and noisy neighbors take down everything on the node. Without limits, there's no ceiling.
2 min read
Notes from Production
From 0 to 7,500 Users on a WhatsApp Banking Platform, What Broke, What Held
Apr 8, 2026
Three months building Kira AI in production. The technical decisions mattered less than I expected. The product and operational decisions mattered more.
1 min read
AWS Daily with Divine
Auto Scaling Is Adding Instances. Response Times Are Still Climbing.
Apr 6, 2026
Scaling kicks in, new instances launch, but response times keep rising and you can't understand why. The gap between InService and actually ready is where this lives.
2 min read
AWS Daily with Divine
API Gateway Latency Spikes Every 30 Minutes Like Clockwork
Apr 4, 2026
If your latency spikes happen randomly, it's something else. If they happen every 25–30 minutes during low traffic, it's almost certainly Lambda cold starts.
2 min read
K8s with Divine
PVC to PV Is a One-to-One Relationship, Here's What That Means in Production
Apr 3, 2026
Two storage behaviors catch people completely off guard in production: the sizing trap, and the Released state trap. Know both before they bite you.
2 min read
AWS Daily with Divine
VPC Peering Configured. Route Tables Look Correct. Instances Still Can't Communicate.
Apr 2, 2026
Four things must be in place for VPC peering to actually work, and the most common culprit by far is the second one.
2 min read
K8s with Divine
The First Thing I Check When a Pod Is in CrashLoopBackOff. It's Not the Logs.
Mar 22, 2026
Logs only help if the app started long enough to produce output. The exit code tells you why it died at the OS level, before logs even existed.
1 min read
