Blog

Notes from production: architecture decisions, incident lessons and deep dives on the cloud-native stack.

May 14, 2026 · 2 min read

Prometheus at Scale: Hub-and-Spoke with Thanos

How we monitor a multi-cloud Kubernetes fleet with Prometheus agents, a central remote-write hub, and Thanos for HA and long-term storage, and how that cut mean time to detect (MTTD) by 90%.

#observability
#prometheus
#thanos
#kubernetes

Mar 2, 2026 · 2 min read

Kafka on Kubernetes: Lessons from Production

Rack-aware placement, tiered retention, consumer-lag SLOs and the failure modes nobody warns you about when you run Kafka for time-series AI workloads.

#kafka
#kubernetes
#distributed-systems
#streaming

Jan 20, 2026 · 2 min read

Feature-Branch Environments: Platform Engineering That Developers Actually Use

How we replaced 'works on my machine' with ephemeral per-branch environments — automated creation, dependency wiring, database lifecycle and teardown.

#platform-engineering
#kubernetes
#ci-cd
#developer-experience