From Improvement to Reinvention: Deconstructing the Three Philosophies and Selection Truths of Prometheus Monitoring Architecture

Looking back at the years spent navigating the observability space—especially around building metrics systems—it feels like a long architectural pilgrimage. From the early days of babysitting a standalone Prometheus and worrying about disk space, to introducing Thanos in an attempt to achieve “infinite storage,” and now rebuilding the entire monitoring hub with Mimir, these experiences are scattered in memory, with some details already starting to blur.

Recently, I took some time to systematically revisit the pitfalls I’ve hit and the technical decisions I’ve made over the years. It struck me that this isn’t just a story of technical iteration; it’s a series of philosophical choices made in response to pain points at different scales. What I once thought of as “upgrades” turned out to be fundamentally different species. This post is my attempt to salvage those fading experiences: it walks through what I see as three architectural patterns and explains why, at a certain scale, Mimir becomes the “right” choice.

Pattern 1: The Purist — Standalone Prometheus

[Figure: Standalone Prometheus architecture]

Architectural Philosophy

This is Prometheus’s original design philosophy: simple, independent, decentralized. Each Prometheus Server independently handles scraping (Pull), storage (Local TSDB), and querying.

Core Characteristics

  • Compute-Storage Coupling: Scraping and storage happen within the same process, making deployment extremely simple.
  • Data Autonomy: Each cluster’s data resides locally, with no dependency on external systems.

Limitations

This architecture does not scale horizontally. A single server is bounded by its local disk and memory, so as data volume grows, the local disk becomes the bottleneck. Furthermore, there is no global view: with multiple clusters, each Prometheus becomes an isolated data silo.
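The disk bottleneck can be estimated before it bites. The sketch below applies the rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The function names and the example workload (1 million active series, 15 s scrape interval, 15 days of retention) are illustrative assumptions, not figures from my own clusters.

```python
def samples_per_second(active_series: int, scrape_interval_seconds: float) -> float:
    """Ingestion rate if every active series is scraped once per interval."""
    return active_series / scrape_interval_seconds


def estimate_disk_bytes(
    retention_days: float,
    ingest_rate: float,
    bytes_per_sample: float = 2.0,  # upper end of the 1-2 bytes rule of thumb
) -> float:
    """Rough local TSDB size: retention * ingest rate * bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * ingest_rate * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical workload, chosen only to make the arithmetic concrete.
    rate = samples_per_second(active_series=1_000_000, scrape_interval_seconds=15)
    size = estimate_disk_bytes(retention_days=15, ingest_rate=rate)
    print(f"about {size / 1e9:.1f} GB of local disk")
```

The estimate grows linearly with both series count and retention, which is why "just keep more history" stops being an option on a single node fairly quickly.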

Pattern 2: The Reformist — Thanos Sidecar Mode

[Figure: Thanos Sidecar architecture]

To address the pain points of the standalone setup, Thanos emerged. It adopts a “non-invasive reform” philosophy: preserving Prometheus’s original architecture as much as possible while enhancing capabilities through sidecar components.

Architectural Essence: Pull-based Scaling

  • Sidecar Mechanism: Thanos is deployed as a Sidecar alongside Prometheus, uploading locally generated TSDB blocks to object storage (S3/GCS), enabling long-term retention.
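  • Federated Query: The Querier component acts as a gateway. It fans each query out in real time to the Prometheus Sidecars (for recent data) and to the Store Gateway (for historical blocks in object storage), then merges the results into a global view.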
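The fan-out-and-merge idea can be sketched in a few lines. This is a toy model, not Thanos code: each "store" is a hypothetical Python callable standing in for a Sidecar or a Store Gateway, and deduplication is reduced to "drop the replica label, first store wins on a timestamp collision". Thanos's real deduplication is more involved.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Sample = Tuple[int, float]                # (timestamp_ms, value)
Series = Tuple[Dict[str, str], List[Sample]]
Store = Callable[[], List[Series]]        # hypothetical stand-in for a Store API endpoint


def fan_out_query(stores: List[Store], replica_label: str = "replica"):
    """Query every store, then merge series that differ only by the replica label."""
    merged: Dict[tuple, Dict[int, float]] = defaultdict(dict)
    for store in stores:
        for labels, samples in store():
            # Series identity ignores the replica label so HA pairs collapse into one.
            key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
            for ts, value in samples:
                merged[key].setdefault(ts, value)  # simplification: first store wins
    return {key: sorted(points.items()) for key, points in merged.items()}


if __name__ == "__main__":
    def sidecar_a():
        return [({"__name__": "up", "cluster": "eu", "replica": "a"}, [(2000, 1.0), (3000, 1.0)])]

    def sidecar_b():
        return [({"__name__": "up", "cluster": "eu", "replica": "b"}, [(3000, 1.0), (4000, 0.0)])]

    def store_gateway():
        return [({"__name__": "up", "cluster": "eu"}, [(1000, 1.0)])]

    print(fan_out_query([sidecar_a, sidecar_b, store_gateway]))
```

Note that the result is only complete once every store has answered, which is where the latency trade-off of this pattern comes from.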

Advantages and Trade-offs

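  • Advantages: Smooth migration. Existing Prometheus servers keep scraping and serving local queries as before, and need only minor configuration changes (unique external labels, and local compaction disabled so the Sidecar can upload blocks safely). Fast querying of local data is preserved.
  • Trade-offs: High operational complexity. There are many components to run (Sidecar, Querier, Store Gateway, Compactor, and so on), and querying recent data depends on the network stability of every edge cluster. Query paths are long, and a single slow or unreachable Sidecar can drag out the whole query, so long-tail latency is hard to avoid. In addition, Sidecar mode does nothing to relieve Prometheus’s own memory pressure.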
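The long-tail effect follows from simple probability. As a rough sketch, assume each edge endpoint independently has a probability p of answering slowly (the independence assumption and the 1% figure are mine, purely for illustration). A query that must wait for all N endpoints is slow whenever at least one of them is:

```python
def prob_any_slow(fan_out: int, p_slow: float) -> float:
    """Probability that at least one of `fan_out` independent endpoints is slow."""
    return 1.0 - (1.0 - p_slow) ** fan_out


if __name__ == "__main__":
    # Hypothetical: every endpoint is slow 1% of the time (its own p99).
    for n in (1, 20, 100):
        print(f"fan-out {n:>3}: {prob_any_slow(n, 0.01):.1%} of queries hit a slow endpoint")
```

Under these assumptions, a 1% per-endpoint tail becomes roughly an 18% chance at a fan-out of 20 and roughly 63% at a fan-out of 100, so what is a rare event for one Sidecar becomes the common case for a global query.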

Pattern 3: Cloud-Native Rebuild — Mimir (Remote Write) Mode

[Figure: Mimir Remote Write architecture]

Unlike Thanos, Mimir (and its predecessor Cortex) chose a “rebuild from scratch” path, embracing a Push-based philosophy. Instead of enhancing Prometheus, it demotes Prometheus to a simple scraping agent.

Architectural Essence: Centralized Compute-Storage Separation

  • Remote Write Protocol: This is the cornerstone of Mimir’s architecture. Prometheus uses remote_write to push all data in real-time to the central Mimir cluster.
  • Fully Centralized Processing: Mimir takes over all storage, indexing, query computation, and alerting rules. Edge clusters become extremely lightweight and can even run a lighter collector, such as Prometheus in agent mode or Grafana Agent (since succeeded by Grafana Alloy).
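To make the push model concrete, here is a toy sketch of the batching that sits between scraping and sending. It is not Prometheus's implementation: the class is hypothetical, the `send` callback stands in for the HTTP push to the central cluster, and the parameter names `max_samples_per_send` and `batch_send_deadline` are borrowed from Prometheus's remote_write `queue_config` only for familiarity. The values used are arbitrary.

```python
import time
from typing import Callable, List, Tuple

Sample = Tuple[str, int, float]  # (series id, timestamp_ms, value)


class RemoteWriteQueue:
    """Buffers samples and pushes them in batches, by size or by deadline."""

    def __init__(
        self,
        send: Callable[[List[Sample]], None],  # stand-in for the HTTP push
        max_samples_per_send: int = 500,
        batch_send_deadline: float = 5.0,      # seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._max = max_samples_per_send
        self._deadline = batch_send_deadline
        self._clock = clock
        self._buffer: List[Sample] = []
        self._first_at = 0.0

    def append(self, sample: Sample) -> None:
        if not self._buffer:
            self._first_at = self._clock()  # start the deadline for this batch
        self._buffer.append(sample)
        if len(self._buffer) >= self._max:
            self.flush()

    def tick(self) -> None:
        """Call periodically: flushes a partial batch once it is old enough."""
        if self._buffer and self._clock() - self._first_at >= self._deadline:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)


if __name__ == "__main__":
    queue = RemoteWriteQueue(send=lambda b: print(f"pushed {len(b)} samples"),
                             max_samples_per_send=3)
    for i in range(7):
        queue.append(("up", i, 1.0))
    queue.flush()  # push the remainder
```

The point of the sketch is the shape of the data flow: the edge holds only a short-lived buffer, and everything durable lives in the central cluster.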

Advantages and Trade-offs

  • Advantages: Extreme horizontal scalability and multi-tenant isolation. The centralized architecture allows Mimir to perform fine-grained resource scheduling and optimization for both writes and queries.
  • Trade-offs: The architecture becomes heavier, demanding extremely high stability from the central cluster. Furthermore, Remote Write transmits real-time streaming data, which consumes more cross-region network bandwidth compared to Thanos’s approach of uploading compressed blocks.

Deep Dive: Why is Mimir a Cost Killer?

While both Thanos and Mimir leverage cheap object storage, Mimir shows clear cost advantages in ultra-large-scale scenarios (e.g., hundreds of millions of active series). This isn’t magic; it stems from fundamental design differences.

1. The Most Critical: Reducing I/O Operations

The “hidden cost” in cloud object storage bills is often not storage capacity, but API call count (PUT/GET requests).

  • Thanos: Each Sidecar uploads a block roughly every 2 hours. The frequency per instance is low, but the number of uploaded objects grows linearly with the number of Prometheus instances, and across many clusters those small blocks add up significantly.
  • Mimir: Its Ingester component buffers incoming samples in memory (backed by a write-ahead log). Instead of writing to object storage per request, it aggregates a massive number of small writes and ships them as TSDB blocks in batches. Because a shared pool of ingesters absorbs the writes of many Prometheus instances, the same data can land in fewer, larger objects, which reduces the number of PUT requests in large-scale write scenarios. Mimir’s Compactor then merges blocks in the background, further reducing the object count and the GET overhead of later queries. (Thanos has a Compactor too, so this last point is a difference of degree rather than of kind.)
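The request count can be estimated from the block cadence alone. The sketch below assumes every block producer ships one block per 2-hour window and that each block is uploaded as a handful of objects (a TSDB block contains at least a meta.json, an index, and one or more chunk segment files). The producer counts and `files_per_block` value are hypothetical inputs, not measurements.

```python
def blocks_per_day(producers: int, block_range_hours: float = 2.0) -> int:
    """Blocks shipped per day if each producer cuts one block per range."""
    return int(producers * (24 / block_range_hours))


def put_requests_per_day(producers: int, files_per_block: int,
                         block_range_hours: float = 2.0) -> int:
    """Object uploads per day; files_per_block is an assumed average."""
    return blocks_per_day(producers, block_range_hours) * files_per_block


if __name__ == "__main__":
    # Hypothetical fleet sizes, for comparison only.
    for producers in (10, 200, 2000):
        puts = put_requests_per_day(producers, files_per_block=4)
        print(f"{producers:>5} producers -> {puts:>6} uploads/day")
```

The count is linear in the number of producers, so the lever that matters is how many independent processes are cutting blocks, not how much data each block holds.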

2. Query Philosophy: Trading “Compute” for “Storage”

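  • Thanos: To accelerate long-range queries (e.g., querying a year of data), Thanos typically relies on downsampling, storing additional low-resolution copies of the data (5m and 1h). The downsampled blocks are kept alongside the raw data rather than replacing it, so this adds compute overhead for the Compactor and can substantially increase storage costs.
  • Mimir: It leans on a sharded query engine instead. The query-frontend splits a long-range query by time interval and shards it further, so a one-year query becomes many sub-queries that run in parallel and are merged at the end. (“Split-and-merge” is, strictly speaking, the name of Mimir’s compaction algorithm; the query side is usually described as query splitting and sharding.) This makes downsampling non-essential for fast long-range queries. In most scenarios you can store only the raw data and still get responsive queries, which by my rough estimate saves on the order of 50% of storage space compared with keeping every downsampled resolution.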
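The time-splitting half of this idea is easy to sketch. The code below is a toy model under my own assumptions, not Mimir's query-frontend: `query_fn` is a hypothetical callable that evaluates one sub-range, sub-ranges are not aligned to interval boundaries as a real implementation would do, and sharding by series is left out entirely.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Tuple

Range = Tuple[datetime, datetime]
Point = Tuple[datetime, float]


def split_range(start: datetime, end: datetime, interval: timedelta) -> List[Range]:
    """Cut [start, end) into consecutive sub-ranges of at most `interval`."""
    parts: List[Range] = []
    cursor = start
    while cursor < end:
        upper = min(cursor + interval, end)
        parts.append((cursor, upper))
        cursor = upper
    return parts


def run_split_query(start: datetime, end: datetime, interval: timedelta,
                    query_fn: Callable[[datetime, datetime], List[Point]],
                    max_workers: int = 8) -> List[Point]:
    """Evaluate each sub-range in parallel, then concatenate in time order."""
    parts = split_range(start, end, interval)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so partial results stay chronological.
        partials = list(pool.map(lambda part: query_fn(*part), parts))
    merged: List[Point] = []
    for partial in partials:
        merged.extend(partial)
    return merged


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    end = start + timedelta(days=365)
    # Hypothetical backend: returns one point per sub-range.
    result = run_split_query(start, end, timedelta(days=1),
                             lambda s, e: [(s, 1.0)])
    print(f"{len(result)} partial results merged")
```

Each sub-query touches only a small slice of blocks, which is what lets raw data stand in for downsampled data: the cost of a long range is spread across workers instead of being reduced up front.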

3. Extreme Compression in Storage Format

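Mimir also pays close attention to the TSDB index. Its blocks use the Prometheus TSDB format, but the Compactor merges and deduplicates them so that overlapping index data is stored once, and the store-gateway works from a compact index-header rather than loading each block’s full index. The result is less index data to store and read, which further trims capacity requirements; the exact savings depend on the workload.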


Selection Guide: It’s a Choice, Not an Upgrade

In summary, moving from Thanos to Mimir is not an inevitable upgrade path. It is a choice between two philosophies, and the right answer depends on your scale and on how you want to operate the system:

  1. Path A (Steady Reformist): If you manage medium-scale clusters, have a stable Prometheus setup, and want to avoid a disruptive architectural overhaul, or if network bandwidth between the edge and the center is expensive, Thanos remains a very strong choice. It is a mature and widely adopted open-source way to scale Prometheus.
  2. Path B (Aggressive Cloud-Native): If you face ultra-large-scale monitoring challenges (e.g., a single view needs to cover hundreds of millions of active series), need to build a multi-tenant monitoring PaaS platform with hard isolation, or want to push object storage costs as low as they will go, then Remote Write + Mimir is, in my view, the strongest option. It reflects a broader shift in monitoring architecture toward centralization and service-orientation.

Ultimately, the key is to choose the right “scalpel” based on your actual business pain points.

