Prometheus at Scale: Hub-and-Spoke with Thanos
How we monitor a multi-cloud Kubernetes fleet with Prometheus agents, a central remote-write hub, and Thanos for HA and long-term storage — and cut MTTD by 90%.
- #observability
- #prometheus
- #thanos
- #kubernetes
Every Kubernetes cluster you add multiplies your monitoring problem. One Prometheus per cluster is easy; querying across twelve of them, in three clouds, with HA and a year of retention is where teams start hurting.
Here's the architecture that has worked for us in production.
The shape: hub and spoke
Each workload cluster runs Prometheus in agent mode — it scrapes locally and remote-writes everything to a central hub. No local storage beyond the WAL, no local querying, nothing to page you about in the spokes.

    # spoke cluster: prometheus agent
    prometheus:
      prometheusSpec:
        mode: agent
        remoteWrite:
          - url: https://metrics-hub.internal/api/v1/receive
            headers:
              X-Scope-Cluster: prod-aws-mumbai
            queueConfig:
              maxSamplesPerSend: 5000
              capacity: 20000

The hub runs Prometheus as a remote-write receiver, fronted by a load balancer, with external labels identifying each tenant cluster.
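A minimal sketch of the hub side and the per-spoke label, assuming the same kube-prometheus-stack-style values layout as the spoke snippet (the chart is not named here). `enableRemoteWriteReceiver` is the Prometheus Operator field that turns on Prometheus's built-in receiver, which serves `/api/v1/write`, so the sketch assumes the load balancer maps the path the spokes call onto it. The `cluster` label name and the retention value are illustrative placeholders.

```yaml
# hub values (sketch): accept remote-write traffic from the spokes
prometheus:
  prometheusSpec:
    enableRemoteWriteReceiver: true   # turns on Prometheus's built-in receiver
    retention: 7d                     # placeholder; the hub only keeps days
---
# spoke values (sketch): stamp every outgoing series with its origin
prometheus:
  prometheusSpec:
    externalLabels:
      cluster: prod-aws-mumbai        # label name `cluster` is an assumption
```

With a label like this on every series, a hub query can filter or group by cluster without relying on the request header.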
Why not federation?
We tried it. Federation pulls aggregated series on a scrape interval, which means you lose granularity exactly when you need it — during an incident. Remote write streams raw samples continuously, and agent mode keeps the spoke footprint tiny (we run spokes with 512Mi limits).
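For contrast, a typical federation job on the hub looks roughly like the sketch below: a scrape against each spoke's `/federate` endpoint. The target address, the 60s interval, and the `match[]` selector are placeholders. The point is that only the series the selector matches arrive, and only once per scrape.

```yaml
# hub-side Prometheus scrape config for federation (illustrative only)
scrape_configs:
  - job_name: federate-prod-aws-mumbai
    scrape_interval: 60s
    honor_labels: true            # keep the spoke's own job/instance labels
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'  # pull only pre-aggregated recording rules
    static_configs:
      - targets:
          - prometheus.prod-aws-mumbai.internal:9090   # hypothetical address
```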
Thanos for the hard parts
The hub alone gives you a single pane of glass, but it's also a single point of failure. Thanos fixes the three remaining problems:
| Problem | Thanos component |
|---|---|
| HA / deduplication | Sidecar + Querier with replica labels |
| Long-term retention | Store Gateway over object storage |
| Downsampling old data | Compactor (5m/1h resolutions) |
Run two hub replicas with replica external labels, let the Querier
deduplicate, and ship blocks to S3/GCS/Azure Blob every two hours. Retention
on the hub drops to days; object storage handles the year.
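A minimal sketch of the flags behind that paragraph, written as plain argument lists rather than any particular chart's schema. The flag names are Thanos's own; the `replica` label name, the service addresses, the file paths, and the retention windows are assumptions chosen for illustration.

```yaml
# Thanos Querier: merge the two hub replicas into one deduplicated view
query:
  args:
    - query
    - --query.replica-label=replica      # must match the hubs' replica external label
    - --endpoint=thanos-sidecar.monitoring.svc:10901        # hypothetical address
    - --endpoint=thanos-storegateway.monitoring.svc:10901   # hypothetical address

# Thanos Compactor: downsample to 5m/1h and enforce retention in the bucket
compactor:
  args:
    - compact
    - --wait                             # run continuously instead of once
    - --data-dir=/var/thanos/compact
    - --objstore.config-file=/etc/thanos/objstore.yaml
    - --retention.resolution-raw=30d     # placeholder windows
    - --retention.resolution-5m=180d
    - --retention.resolution-1h=365d
```

Only one Compactor should run against a given bucket at a time.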

    thanos:
      objstoreConfig:
        type: S3
        config:
          bucket: metrics-longterm
          endpoint: s3.ap-south-1.amazonaws.com

Alerting: keep it close to the data
One thing we deliberately did not fully centralize: critical alerts. Rules like
KubeletDown or disk-pressure evaluate on the hub, but each spoke keeps a
tiny set of "is my pipeline alive?" rules with a dead-man's switch. If a spoke
stops remote-writing, the hub notices; if the hub dies, the spokes still
scream through a secondary Alertmanager path.
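The two halves of that arrangement can be sketched as Prometheus rule files. This is an illustration, not a production rule set: the `cluster` label, the ten-minute window, the alert names, and the severities are assumptions. Because agent mode does not evaluate rules, the spoke-side file is assumed to run in whatever small rule evaluator the spoke keeps next to the agent.

```yaml
# hub rules (sketch): fire when a spoke's series stop arriving
groups:
  - name: spoke-liveness
    rules:
      - alert: SpokeRemoteWriteSilent        # hypothetical alert name
        expr: absent_over_time(up{cluster="prod-aws-mumbai"}[10m])
        labels:
          severity: critical
        annotations:
          summary: No samples from prod-aws-mumbai for 10 minutes
---
# spoke rules (sketch): a dead-man's switch that always fires
groups:
  - name: pipeline-heartbeat
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
```

The always-firing heartbeat is meant to be routed over the secondary Alertmanager path to a receiver that raises an alarm when the heartbeat stops arriving, which is what makes it a dead-man's switch.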
Results
- MTTD down ~90% — one query surface, no "which Grafana do I check?"
- MTTR down ~50% — cross-cluster correlation in a single dashboard
- Spoke overhead small enough that nobody argues about running it everywhere
The pattern scales sideways: onboarding a new cluster is one Helm values file and one external label. That's the real win — monitoring stopped being a per-cluster project and became a platform capability.