# Re-order architectural documentation (#1007)
* Made fixes and improvements to the architecture doc
* Draft "About the architecture" section
* Add weighting to refactored architecture docs
* Remove reference to architecture in blocks storage
* Ensure title matches first heading

Signed-off-by: Jack Baldry <jack.baldry@grafana.com>
Co-authored-by: aldernero <vernon.w.miller@gmail.com>

Showing 23 changed files with 331 additions and 310 deletions.

@@ -0,0 +1,80 @@
---
title: "About the architecture"
description: "Overview of the architecture of Grafana Mimir."
weight: 1000
---

# About the architecture

Grafana Mimir has a service-based architecture.
The system has multiple horizontally scalable microservices that run separately and in parallel.

<!-- Diagram source at https://docs.google.com/presentation/d/1bHp8_zcoWCYoNU2AhO2lSagQyuIrghkCncViSqn14cU/edit -->

![](../images/architecture.png)

## Microservices

Most microservices are stateless and don't require any data persisted between process restarts.

Some microservices are stateful and rely on non-volatile storage to prevent data loss between process restarts.

A dedicated page describes each microservice in detail.

`{{< section >}}`

<!-- START from blocks-storage/_index.md -->

### The write path

**Ingesters** receive incoming samples from the distributors. Each push request belongs to a tenant, and the ingester appends the received samples to the specific per-tenant TSDB stored on the local disk. The received samples are kept in memory and written to a write-ahead log (WAL), which is used to recover the in-memory series if the ingester abruptly terminates. The per-tenant TSDB is lazily created in each ingester as soon as the first samples are received for that tenant.

The in-memory samples are periodically flushed to disk - and the WAL truncated - when a new TSDB block is created, which by default occurs every 2 hours. Each newly created block is then uploaded to the long-term storage and kept in the ingester until the configured `-blocks-storage.tsdb.retention-period` expires, in order to give [queriers](./querier.md) and [store-gateways](./store-gateway.md) enough time to discover the new block on the storage and download its index-header.

To effectively use the **WAL** and be able to recover the in-memory series if an ingester abruptly terminates, the WAL needs to be stored on a persistent disk that can survive an ingester failure (for example, an AWS EBS volume or a GCP persistent disk when running in the cloud). For example, if you're running the Mimir cluster in Kubernetes, you can use a StatefulSet with a persistent volume claim for the ingesters. The WAL is stored in the same filesystem location as the local TSDB blocks (compacted from the head), and the two cannot be decoupled. See also the [timeline of block uploads](production-tips/#how-to-estimate--querierquery-store-after) and [disk space estimate](production-tips/#ingester-disk-space).

#### Distributor series sharding and replication

Due to the replication factor N (typically 3), each time series is stored by N ingesters, and each ingester writes its own block to the long-term storage. The [compactor](./compactor.md) merges blocks from multiple ingesters into a single block and removes duplicate samples. After block compaction, storage utilization is significantly reduced.

For more information, see [Compactor](./compactor.md) and [Production tips](./production-tips.md).

### The read path

[Queriers](./querier.md) and [store-gateways](./store-gateway.md) periodically iterate over the storage bucket to discover blocks recently uploaded by ingesters.

For each discovered block, queriers only download the block's `meta.json` file (which contains metadata, including the minimum and maximum timestamp of the samples within the block), while store-gateways download the `meta.json` as well as the index-header, which is a small subset of the block's index used by the store-gateway to look up series at query time.

Queriers use the blocks' metadata to compute the list of blocks that need to be queried at query time and fetch matching series from the store-gateway instances holding the required blocks.
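
To illustrate how this metadata drives block selection, here is a minimal Go sketch of filtering blocks by a query's time range. The `BlockMeta` struct and its field names are assumptions modelled on the `meta.json` description above, not Mimir's actual types.

```go
package main

import "fmt"

// BlockMeta holds the subset of a block's meta.json that matters for block
// selection: the min and max sample timestamps (milliseconds since epoch).
// Field names here are illustrative, not Mimir's actual schema.
type BlockMeta struct {
	ULID    string
	MinTime int64
	MaxTime int64
}

// overlappingBlocks returns the blocks whose time range overlaps the query
// interval [from, through]. Only these blocks need to be queried.
func overlappingBlocks(blocks []BlockMeta, from, through int64) []BlockMeta {
	var out []BlockMeta
	for _, b := range blocks {
		if b.MinTime <= through && b.MaxTime >= from {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	blocks := []BlockMeta{
		{ULID: "01A", MinTime: 0, MaxTime: 7_200_000},          // 00:00-02:00
		{ULID: "01B", MinTime: 7_200_000, MaxTime: 14_400_000}, // 02:00-04:00
	}
	// Only the second block overlaps this query interval.
	fmt.Println(overlappingBlocks(blocks, 10_000_000, 12_000_000))
}
```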

For more information, refer to the dedicated [querier](./querier.md) and [store-gateway](./store-gateway.md) pages.

<!-- END from blocks-storage/_index.md -->

<!-- START from architecture.md -->

## The role of Prometheus

Prometheus instances scrape samples from various targets and then push them to Mimir using Prometheus' [remote write API](https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations). The remote write API sends batched [Snappy](https://google.github.io/snappy/)-compressed [Protocol Buffer](https://developers.google.com/protocol-buffers/) messages inside the body of an HTTP `POST` request.

Mimir requires that each HTTP request bear a header specifying a tenant ID for the request. Request authentication and authorization are handled by an external reverse proxy.
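
As a sketch only, the following shows a tiny reverse proxy that authenticates a request and injects a tenant ID header before forwarding it to Mimir. The `X-Scope-OrgID` header name and the `/api/v1/push` path are assumptions based on Cortex-derived conventions, and `authenticate` is a hypothetical placeholder for whatever mechanism your own proxy uses.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	mimirURL, _ := url.Parse("http://mimir-distributor:8080") // assumed distributor address
	proxy := httputil.NewSingleHostReverseProxy(mimirURL)

	http.HandleFunc("/api/v1/push", func(w http.ResponseWriter, r *http.Request) {
		tenantID := authenticate(r) // placeholder: real auth is up to your proxy
		if tenantID == "" {
			http.Error(w, "unauthenticated", http.StatusUnauthorized)
			return
		}
		// Tenant ID header assumed to be X-Scope-OrgID, as in Cortex-derived systems.
		r.Header.Set("X-Scope-OrgID", tenantID)
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8081", nil))
}

// authenticate is a stand-in for a real authentication mechanism.
func authenticate(r *http.Request) string {
	return r.Header.Get("X-Demo-Tenant") // hypothetical header, for demonstration only
}
```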

Incoming samples (writes from Prometheus) are handled by the [distributor](#distributor), while incoming reads (PromQL queries) are handled by the [querier](#querier) or optionally by the [query frontend](#query-frontend).

## Storage

The Mimir storage format is based on [Prometheus TSDB](https://prometheus.io/docs/prometheus/latest/storage/): each tenant's time series are stored in their own TSDB, which writes the series to on-disk blocks (by default with a 2h block range period). Each block is composed of a few files storing the chunks and the block index.

The TSDB block files contain samples for multiple series. The series inside the blocks are then indexed by a per-block index, which indexes metric names and labels to time series in the block files.

Mimir requires an object store for the block files, which can be:

- [Amazon S3](https://aws.amazon.com/s3)
- [Google Cloud Storage](https://cloud.google.com/storage/)
- [Microsoft Azure Storage](https://azure.microsoft.com/en-us/services/storage/)
- [OpenStack Swift](https://wiki.openstack.org/wiki/Swift)
- [Local Filesystem](https://thanos.io/storage.md/#filesystem) (single node only)

For more information, see [Blocks storage](./blocks-storage/_index.md).

<!-- END from architecture.md -->

@@ -0,0 +1,15 @@
---
title: "(Optional) Alertmanager"
description: "Overview of the alertmanager microservice."
weight: 80
---

# (Optional) Alertmanager

The **alertmanager** is an **optional service** responsible for accepting alert notifications from the [ruler]({{<relref "./ruler.md">}}), deduplicating and grouping them, and routing them to the correct notification channel, such as email, PagerDuty or OpsGenie.

The Mimir alertmanager is built on top of the [Prometheus Alertmanager](https://prometheus.io/docs/alerting/alertmanager/), adding multi-tenancy support. Like the [ruler]({{<relref "./ruler.md">}}), the alertmanager requires a database to store the per-tenant configuration.

The alertmanager is **semi-stateful**.
It persists information about silences and active alerts to its disk.
If all of the alertmanager nodes fail simultaneously, this data is lost.

@@ -0,0 +1,67 @@
---
title: "Distributor"
description: "Overview of the distributor microservice."
weight: 20
---

# Distributor

The **distributor** service is responsible for handling incoming samples from Prometheus. It's the first stop in the write path for series samples. Once the distributor receives samples from Prometheus, each sample is validated for correctness and checked against the configured tenant limits, falling back to default limits if none have been overridden for the specific tenant. Valid samples are then split into batches and sent to multiple [ingesters]({{<relref "./ingester.md">}}) in parallel.

The validation done by the distributor includes the following checks (a minimal sketch of these checks follows the list):

- The metric label names are formally correct
- The configured max number of labels per metric is respected
- The configured max length of a label name and value is respected
- The timestamp is not older or newer than the configured min/max time range
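
A minimal sketch of what these per-sample checks could look like; the limit values, the label-name regular expression, and the accepted time window below are illustrative assumptions, not Mimir's configured defaults.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// Illustrative limits; real limits are per-tenant and configurable.
const (
	maxLabelsPerSeries = 30
	maxLabelNameLen    = 1024
	maxLabelValueLen   = 2048
)

var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

type label struct{ Name, Value string }

// validateSample applies the checks listed above to a single sample.
func validateSample(labels []label, timestampMs int64, now time.Time) error {
	if len(labels) > maxLabelsPerSeries {
		return fmt.Errorf("too many labels: %d", len(labels))
	}
	for _, l := range labels {
		if !labelNameRE.MatchString(l.Name) {
			return fmt.Errorf("invalid label name %q", l.Name)
		}
		if len(l.Name) > maxLabelNameLen || len(l.Value) > maxLabelValueLen {
			return fmt.Errorf("label %q exceeds the configured length limit", l.Name)
		}
	}
	// Reject samples too far in the past or in the future (illustrative window).
	oldest := now.Add(-1 * time.Hour).UnixMilli()
	newest := now.Add(10 * time.Minute).UnixMilli()
	if timestampMs < oldest || timestampMs > newest {
		return fmt.Errorf("timestamp %d outside the accepted time window", timestampMs)
	}
	return nil
}

func main() {
	err := validateSample([]label{{"__name__", "up"}, {"job", "node"}}, time.Now().UnixMilli(), time.Now())
	fmt.Println(err) // <nil>
}
```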

Distributors are **stateless** and can be scaled up and down as needed.

## High Availability Tracker

The distributor features a **High Availability (HA) Tracker**. When enabled, the distributor deduplicates incoming samples from redundant Prometheus servers. This allows you to run multiple HA replicas of the same Prometheus servers that write the same series to Mimir, and have those series deduplicated in the Mimir distributor.

The HA Tracker deduplicates incoming samples based on a cluster and a replica label. The cluster label uniquely identifies the cluster of redundant Prometheus servers for a given tenant, while the replica label uniquely identifies the replica within the Prometheus cluster. Incoming samples are considered duplicates (and thus dropped) if they are received from any replica that is not the currently elected primary within a cluster.

The HA Tracker requires a key-value (KV) store to coordinate which replica is currently elected. The distributor only accepts samples from the currently elected replica. Samples carrying only one or neither of the cluster and replica labels are accepted by default and never deduplicated.
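
The acceptance rule can be sketched as follows. The KV-store lookup is stubbed out with a map and the election logic is simplified, so this is an illustration of the rule above rather than Mimir's implementation.

```go
package main

import "fmt"

// electedReplica maps "tenant/cluster" to the currently elected replica.
// In a real distributor this state lives in a KV store such as Consul or etcd.
var electedReplica = map[string]string{
	"tenant-1/prod": "prometheus-0",
}

// acceptSample returns true if the sample should be ingested, false if it
// should be dropped as a duplicate from a non-elected replica.
// cluster and replica are the values of the HA labels (empty if absent).
func acceptSample(tenant, cluster, replica string) bool {
	// Samples missing either HA label are accepted and never deduplicated.
	if cluster == "" || replica == "" {
		return true
	}
	elected, ok := electedReplica[tenant+"/"+cluster]
	if !ok {
		// No election yet; in practice the first replica seen would be elected.
		return true
	}
	return replica == elected
}

func main() {
	fmt.Println(acceptSample("tenant-1", "prod", "prometheus-0")) // true: elected replica
	fmt.Println(acceptSample("tenant-1", "prod", "prometheus-1")) // false: duplicate
	fmt.Println(acceptSample("tenant-1", "", "prometheus-1"))     // true: no cluster label
}
```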

The supported KV stores for the HA Tracker are:

- [Consul](https://www.consul.io)
- [Etcd](https://etcd.io)

Note: Memberlist is not supported. Memberlist-based KV stores propagate updates using a gossip protocol, which is too slow for HA tracking: different distributors might see a different Prometheus server elected as the primary replica, which is not desirable.

For more information, refer to [config for sending HA pairs data to Mimir](guides/ha-pair-handling.md) in the documentation.

## Hashing

Distributors use consistent hashing, in conjunction with a configurable replication factor, to determine which ingester instance(s) should receive a given series.

The hash is calculated using the metric name, labels and tenant ID.

There is a trade-off associated with including labels in the hash. Writes are more balanced across ingesters, but each query needs to talk to all ingesters, since a metric could be spread across multiple ingesters given different label sets.
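
As a rough illustration, the following sketch derives a 32-bit token from the tenant ID and the full label set (the metric name travels in the `__name__` label). FNV-1a is used here only for demonstration; it is an assumption, not necessarily the hash function Mimir uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// seriesToken hashes the tenant ID together with all label names and values
// into a 32-bit token that can be looked up on the ring.
func seriesToken(tenantID string, labels map[string]string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tenantID))

	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names) // hash labels in a deterministic order
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte(labels[name]))
	}
	return h.Sum32()
}

func main() {
	token := seriesToken("tenant-1", map[string]string{
		"__name__": "node_cpu_seconds_total",
		"instance": "node-1:9100",
		"mode":     "idle",
	})
	fmt.Println(token)
}
```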

### The hash ring

A hash ring (stored in a key-value store) is used to achieve consistent hashing for the series sharding and replication across the ingesters. All [ingesters]({{<relref "./ingester.md">}}) register themselves into the hash ring with a set of tokens they own; each token is a random unsigned 32-bit integer. Each incoming series is [hashed](#hashing) in the distributor and then pushed to the ingester that owns the token range for the series hash number, plus the N-1 subsequent ingesters in the ring, where N is the replication factor.

To do the hash lookup, distributors find the smallest appropriate token whose value is larger than the [hash of the series](#hashing). When the replication factor is larger than 1, the subsequent tokens (clockwise in the ring) that belong to different ingesters are also included in the result.

The effect of this hash setup is that each token an ingester owns is responsible for a range of hashes. If there are three tokens with values 0, 25, and 50, then a hash of 3 would be given to the ingester that owns the token 25; the ingester owning token 25 is responsible for the hash range of 1-25.
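
A minimal sketch of this lookup, assuming the ring's tokens are already sorted and that consecutive tokens belong to different ingesters (a real ring also has to skip tokens owned by ingesters already selected for the series):

```go
package main

import (
	"fmt"
	"sort"
)

type ringEntry struct {
	token    uint32
	ingester string
}

// lookup returns the ingesters that should receive a series with the given
// hash: the owner of the smallest token larger than the hash, plus the next
// replicationFactor-1 ingesters clockwise around the ring.
func lookup(ring []ringEntry, hash uint32, replicationFactor int) []string {
	// Find the first token strictly greater than the hash, wrapping around.
	i := sort.Search(len(ring), func(i int) bool { return ring[i].token > hash })
	var out []string
	for n := 0; n < replicationFactor && n < len(ring); n++ {
		out = append(out, ring[(i+n)%len(ring)].ingester)
	}
	return out
}

func main() {
	ring := []ringEntry{ // must be sorted by token
		{token: 0, ingester: "ingester-a"},
		{token: 25, ingester: "ingester-b"},
		{token: 50, ingester: "ingester-c"},
	}
	// As in the example above: a hash of 3 falls in the range owned by token 25.
	fmt.Println(lookup(ring, 3, 2)) // [ingester-b ingester-c]
}
```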

The supported KV stores for the hash ring are:

- [Consul](https://www.consul.io)
- [Etcd](https://etcd.io)
- Gossip [memberlist](https://github.com/hashicorp/memberlist)

#### Quorum consistency

Since all distributors share access to the same hash ring, write requests can be sent to any distributor, and you can set up a stateless load balancer in front of them.

To ensure consistent query results, Mimir uses [Dynamo-style](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) quorum consistency on reads and writes. This means that the distributor waits for a successful response from at least one half plus one of the ingesters it sent the sample to, before responding successfully to the Prometheus write request.
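
For example, with the typical replication factor of 3, the distributor needs at least 2 successful ingester responses. A small sketch of that arithmetic, assuming the quorum is computed as floor(N/2) + 1:

```go
package main

import "fmt"

// quorum returns the minimum number of successful ingester responses the
// distributor waits for before acknowledging a write: half plus one.
func quorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func main() {
	for _, rf := range []int{1, 3, 5} {
		fmt.Printf("replication factor %d -> quorum %d\n", rf, quorum(rf))
	}
	// With replication factor 3 the quorum is 2, so one ingester can fail
	// and the write still succeeds.
}
```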

## Load balancing across distributors

We recommend randomly load balancing write requests across distributor instances. For example, if you're running Mimir in a Kubernetes cluster, you could run the distributors as a Kubernetes [Service](https://kubernetes.io/docs/concepts/services-networking/service/).