Re-order architectural documentation (#1007)
* Made fixes and improvements to the architecture doc

Signed-off-by: Jack Baldry <jack.baldry@grafana.com>
Co-authored-by: aldernero <vernon.w.miller@gmail.com>

* Draft "About the architecture" section
* Add weighting to refactored architecture docs
* Remove reference to architecture in blocks storage
* Ensure title matches first heading

Signed-off-by: Jack Baldry <jack.baldry@grafana.com>
jdbaldry authored Feb 3, 2022
1 parent 1f504dc commit 0080ea8
Showing 23 changed files with 331 additions and 310 deletions.
8 changes: 4 additions & 4 deletions Makefile
@@ -310,12 +310,12 @@ web-build: web-pre
web-deploy:
./tools/website/web-deploy.sh

# Generates the config file documentation.
doc: ## Generates the config file documentation.
doc: clean-doc
go run ./tools/doc-generator ./docs/sources/configuration/config-file-reference.template > ./docs/sources/configuration/config-file-reference.md
go run ./tools/doc-generator ./docs/sources/blocks-storage/compactor.template > ./docs/sources/blocks-storage/compactor.md
go run ./tools/doc-generator ./docs/sources/blocks-storage/store-gateway.template > ./docs/sources/blocks-storage/store-gateway.md
go run ./tools/doc-generator ./docs/sources/blocks-storage/querier.template > ./docs/sources/blocks-storage/querier.md
go run ./tools/doc-generator ./docs/sources/architecture/compactor.template > ./docs/sources/architecture/compactor.md
go run ./tools/doc-generator ./docs/sources/architecture/store-gateway.template > ./docs/sources/architecture/store-gateway.md
go run ./tools/doc-generator ./docs/sources/architecture/querier.template > ./docs/sources/architecture/querier.md
go run ./tools/doc-generator ./docs/sources/operations/encrypt-data-at-rest.template > ./docs/sources/operations/encrypt-data-at-rest.md
embedmd -w docs/sources/configuration/prometheus-frontend.md
embedmd -w docs/sources/requests-mirroring-to-secondary-cluster.md
238 changes: 0 additions & 238 deletions docs/sources/architecture.md

This file was deleted.

80 changes: 80 additions & 0 deletions docs/sources/architecture/_index.md
@@ -0,0 +1,80 @@
---
title: "About the architecture"
description: "Overview of the architecture of Grafana Mimir."
weight: 1000
---

# About the architecture

Grafana Mimir has a service-based architecture.
The system has multiple horizontally scalable microservices that run separately and in parallel.

<!-- Diagram source at https://docs.google.com/presentation/d/1bHp8_zcoWCYoNU2AhO2lSagQyuIrghkCncViSqn14cU/edit -->

![Architecture of Grafana Mimir](../images/architecture.png)

## Microservices

Most microservices are stateless and don't require any data persisted between process restarts.

Some microservices are stateful and rely on non-volatile storage to prevent data loss between process restarts.

A dedicated page describes each microservice in detail.

{{< section >}}

<!-- START from blocks-storage/_index.md -->

### The write path

**Ingesters** receive incoming samples from the distributors. Each push request belongs to a tenant, and the ingester appends the received samples to the specific per-tenant TSDB stored on the local disk. The received samples are kept in memory and written to a write-ahead log (WAL), which is used to recover the in-memory series if the ingester abruptly terminates. The per-tenant TSDB is lazily created in each ingester as soon as the first samples are received for that tenant.
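
The flow can be sketched in Go as follows. This is a simplified illustration of the lazily created per-tenant TSDB and of the dual write to the WAL and the in-memory head, not Mimir's actual ingester code; every type and field name below is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// A highly simplified stand-in for a per-tenant Prometheus TSDB:
// an in-memory head plus a write-ahead log on local disk.
type tenantTSDB struct {
	head []sample // in-memory series samples, cut into a block every 2 hours
	wal  []sample // stand-in for the on-disk write-ahead log
}

type sample struct {
	labels    string
	timestamp int64 // milliseconds since the Unix epoch
	value     float64
}

// ingester holds one TSDB per tenant, lazily created on the first write.
type ingester struct {
	mtx   sync.Mutex
	tsdbs map[string]*tenantTSDB
}

// push appends the samples of one write request for one tenant.
func (i *ingester) push(tenantID string, samples []sample) {
	i.mtx.Lock()
	db, ok := i.tsdbs[tenantID]
	if !ok {
		// The per-tenant TSDB is created as soon as the first samples arrive.
		db = &tenantTSDB{}
		i.tsdbs[tenantID] = db
	}
	i.mtx.Unlock()

	for _, s := range samples {
		// Samples go to the WAL so the in-memory series can be recovered
		// after an abrupt termination...
		db.wal = append(db.wal, s)
		// ...and to the in-memory head, which is flushed to a new block
		// (and the WAL truncated) every 2 hours by default.
		db.head = append(db.head, s)
	}
}

func main() {
	ing := &ingester{tsdbs: map[string]*tenantTSDB{}}
	ing.push("tenant-1", []sample{{`up{job="api"}`, 1643846400000, 1}})
	fmt.Println(len(ing.tsdbs["tenant-1"].head), "sample(s) in tenant-1's head")
}
```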

The in-memory samples are periodically flushed to disk - and the WAL truncated - when a new TSDB block is created, which by default occurs every 2 hours. Each newly created block is then uploaded to the long-term storage and kept in the ingester until the configured `-blocks-storage.tsdb.retention-period` expires, in order to give [queriers](./querier.md) and [store-gateways](./store-gateway.md) enough time to discover the new block on the storage and download its index-header.

To effectively use the **WAL**, and to be able to recover the in-memory series after an ingester abruptly terminates, the WAL needs to be stored on a persistent disk that can survive an ingester failure (for example, an AWS EBS volume or a GCP persistent disk when running in the cloud). For example, if you're running the Mimir cluster in Kubernetes, you may use a StatefulSet with a persistent volume claim for the ingesters. The location on the filesystem where the WAL is stored is the same one where local TSDB blocks (compacted from the head) are stored, and it cannot be decoupled. See also the [timeline of block uploads](production-tips/#how-to-estimate--querierquery-store-after) and [disk space estimate](production-tips/#ingester-disk-space).

#### Distributor series sharding and replication

Due to the replication factor N (typically 3), each time series is stored by N ingesters, and each ingester writes its own block to the long-term storage. The [compactor](./compactor.md) merges blocks from multiple ingesters into a single block and removes duplicate samples. For example, with a replication factor of 3, each 2-hour period produces three overlapping ingester blocks per tenant, which the compactor merges back into one. After block compaction, the storage utilization is significantly reduced.

For more information, see [Compactor](./compactor.md) and [Production tips](./production-tips.md).

### The read path

[Queriers](./querier.md) and [store-gateways](./store-gateway.md) periodically iterate over the storage bucket to discover blocks recently uploaded by ingesters.

For each discovered block, queriers only download the block's `meta.json` file (containing some metadata, including the minimum and maximum timestamp of samples within the block), while store-gateways download the `meta.json` file as well as the index-header, which is a small subset of the block's index used by the store-gateway to look up series at query time.

Queriers use the block metadata to compute the list of blocks that need to be queried at query time, and fetch matching series from the store-gateway instances holding the required blocks.
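
A minimal sketch of that block selection, assuming a hypothetical `blockMeta` type that mirrors the minimum and maximum timestamps read from each block's `meta.json`:

```go
package main

import "fmt"

// blockMeta mirrors the fields of a block's meta.json that matter here: the
// minimum and maximum sample timestamps contained in the block, in
// milliseconds since the Unix epoch. Field names are illustrative.
type blockMeta struct {
	ULID    string
	MinTime int64
	MaxTime int64
}

// blocksForQuery returns the blocks whose time range overlaps the query's
// [start, end] range; only these need to be fetched from the store-gateways.
func blocksForQuery(all []blockMeta, start, end int64) []blockMeta {
	var out []blockMeta
	for _, b := range all {
		if b.MinTime <= end && b.MaxTime >= start {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	blocks := []blockMeta{
		{"01AAA", 0, 7_200_000},           // 00:00-02:00
		{"01BBB", 7_200_000, 14_400_000},  // 02:00-04:00
		{"01CCC", 14_400_000, 21_600_000}, // 04:00-06:00
	}
	// A query over 01:00-03:00 only needs the first two blocks.
	fmt.Println(blocksForQuery(blocks, 3_600_000, 10_800_000))
}
```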

For more information, refer to the dedicated sections describing each of these components.

<!-- END from blocks-storage/_index.md -->

<!-- START from architecture.md -->

## The role of Prometheus

Prometheus instances scrape samples from various targets and then push them to Mimir (using Prometheus' [remote write API](https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations)). That remote write API emits batched [Snappy](https://google.github.io/snappy/)-compressed [Protocol Buffer](https://developers.google.com/protocol-buffers/) messages inside the body of an HTTP `POST` request.

Mimir requires that each HTTP request bear a header specifying a tenant ID for the request. Request authentication and authorization are handled by an external reverse proxy.
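
A rough sketch of what such a request looks like on the wire. The tenant header name (`X-Scope-OrgID`) and the push URL are assumptions used for illustration; this page only states that some tenant ID header is required.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"

	"github.com/golang/snappy"
)

// buildRemoteWriteRequest wraps an already-marshaled remote-write protobuf
// payload the way Prometheus does: Snappy-compressed, in the body of an HTTP
// POST, with the standard remote-write headers. The X-Scope-OrgID tenant
// header name and the push URL used in main are assumptions for illustration.
func buildRemoteWriteRequest(url, tenantID string, protoPayload []byte) (*http.Request, error) {
	compressed := snappy.Encode(nil, protoPayload)

	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("Content-Encoding", "snappy")
	req.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")
	req.Header.Set("X-Scope-OrgID", tenantID) // assumed tenant ID header name
	return req, nil
}

func main() {
	payload := []byte("placeholder for a marshaled remote-write protobuf message")
	req, err := buildRemoteWriteRequest("http://mimir.example/api/v1/push", "tenant-1", payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL, "tenant:", req.Header.Get("X-Scope-OrgID"))
}
```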

Incoming samples (writes from Prometheus) are handled by the [distributor](#distributor) while incoming reads (PromQL queries) are handled by the [querier](#querier) or optionally by the [query frontend](#query-frontend).

## Storage

The Mimir storage format is based on [Prometheus TSDB](https://prometheus.io/docs/prometheus/latest/storage/): it stores each tenant's time series in its own TSDB, which writes the series to on-disk blocks (defaulting to a 2-hour block range period). Each block is composed of a few files storing the chunks and the block index.

The TSDB block files contain samples for multiple series. The series inside the blocks are then indexed by a per-block index, which indexes metric names and labels to time series in the block files.

Mimir requires an object store for the block files, which can be:

- [Amazon S3](https://aws.amazon.com/s3)
- [Google Cloud Storage](https://cloud.google.com/storage/)
- [Microsoft Azure Storage](https://azure.microsoft.com/en-us/services/storage/)
- [OpenStack Swift](https://wiki.openstack.org/wiki/Swift)
- [Local Filesystem](https://thanos.io/storage.md/#filesystem) (single node only)

For more information, see [Blocks storage](./blocks-storage/_index.md).

<!-- END from architecture.md -->
15 changes: 15 additions & 0 deletions docs/sources/architecture/alertmanager.md
@@ -0,0 +1,15 @@
---
title: "(Optional) Alertmanager"
description: "Overview of the alertmanager microservice."
weight: 80
---

# (Optional) Alertmanager

The **alertmanager** is an **optional service** responsible for accepting alert notifications from the [ruler]({{<relref "./ruler.md">}}), deduplicating and grouping them, and routing them to the correct notification channel, such as email, PagerDuty or OpsGenie.

The Mimir alertmanager is built on top of the [Prometheus Alertmanager](https://prometheus.io/docs/alerting/alertmanager/), adding multi-tenancy support. Like the [ruler]({{<relref "./ruler.md">}}), the alertmanager requires a database to store the per-tenant configuration.

Alertmanager is **semi-stateful**.
The Alertmanager persists information about silences and active alerts to its disk.
If all of the Alertmanager nodes failed simultaneously, there would be a loss of data.
@@ -1,12 +1,19 @@
---
title: "Compactor"
linkTitle: "Compactor"
weight: 4
slug: compactor
description: "Overview of the compactor microservice."
weight: 10
---

<!-- DO NOT EDIT THIS FILE - This file has been automatically generated from its .template -->

<!-- START from blocks storage -->

The **[compactor](./compactor.md)** is responsible for merging and deduplicating smaller blocks into larger ones, in order to reduce the number of blocks stored in the long-term storage for a given tenant and to query them more efficiently. It also keeps the [bucket index](./bucket-index.md) updated and, for this reason, it's a required component.

The `alertmanager` and `ruler` components can also use object storage to store the configurations and rules uploaded by users. In that case, a separate bucket should be created to store alertmanager configurations and rules: using the same bucket for the ruler/alertmanager and for blocks will cause issues with the **[compactor](./compactor.md)**.

<!-- END from blocks storage -->

The **compactor** is a service responsible for:

- Compacting multiple blocks of a given tenant into a single optimized larger block. This helps to reduce storage costs (deduplication, index size reduction) and to increase query speed (querying fewer blocks is faster).
@@ -25,7 +32,7 @@ The **vertical compaction** merges all the blocks of a tenant uploaded by ingest

The **horizontal compaction** triggers after the vertical compaction and compacts several blocks with adjacent 2-hour range periods into a single larger block. Even though the total size of block chunks doesn't change after this compaction, it may still significantly reduce the size of the index and the index-header kept in memory by store-gateways.

![Compactor - horizontal and vertical compaction](/images/blocks-storage/compactor-horizontal-and-vertical-compaction.png)
![Compactor - horizontal and vertical compaction](../images/compactor-horizontal-and-vertical-compaction.png)

<!-- Diagram source at https://docs.google.com/presentation/d/1bHp8_zcoWCYoNU2AhO2lSagQyuIrghkCncViSqn14cU/edit -->

@@ -48,7 +55,7 @@ Given the split blocks, the compactor then runs the **merge** stage for each sha

The merge stage is then run for subsequent compaction time ranges (e.g. 12h, 24h), compacting together blocks belonging to the same shard (_not shown in the picture below_).

![Compactor - split-and-merge compaction strategy](/images/blocks-storage/compactor-split-and-merge.png)
![Compactor - split-and-merge compaction strategy](../images/compactor-split-and-merge.png)

<!-- Diagram source at https://docs.google.com/presentation/d/1bHp8_zcoWCYoNU2AhO2lSagQyuIrghkCncViSqn14cU/edit -->
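
A sketch of how blocks can be grouped into aligned compaction ranges; the grouping key below is illustrative rather than Mimir's exact split-and-merge implementation:

```go
package main

import "fmt"

type block struct {
	ULID    string
	MinTime int64 // milliseconds since the Unix epoch
	MaxTime int64
}

// groupByRange buckets blocks into aligned windows of the given compaction
// range (for example, 12h expressed in milliseconds). Blocks falling into the
// same window are candidates to be merged into one larger block.
func groupByRange(blocks []block, rangeMillis int64) map[int64][]block {
	groups := map[int64][]block{}
	for _, b := range blocks {
		windowStart := (b.MinTime / rangeMillis) * rangeMillis
		groups[windowStart] = append(groups[windowStart], b)
	}
	return groups
}

func main() {
	const hour = int64(3_600_000)
	blocks := []block{
		{"01AAA", 0, 2 * hour},
		{"01BBB", 2 * hour, 4 * hour},
		{"01CCC", 12 * hour, 14 * hour},
	}
	// With a 12h compaction range, 01AAA and 01BBB fall into the same window
	// and would be merged into one block; 01CCC starts the next window.
	for windowStart, group := range groupByRange(blocks, 12*hour) {
		fmt.Printf("window starting at %dh: %d block(s)\n", windowStart/hour, len(group))
	}
}
```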

@@ -1,12 +1,19 @@
---
title: "Compactor"
linkTitle: "Compactor"
weight: 4
slug: compactor
description: "Overview of the compactor microservice."
weight: 10
---

{{ .GeneratedFileWarning }}

<!-- START from blocks storage -->

The **[compactor](./compactor.md)** is responsible for merging and deduplicating smaller blocks into larger ones, in order to reduce the number of blocks stored in the long-term storage for a given tenant and to query them more efficiently. It also keeps the [bucket index](./bucket-index.md) updated and, for this reason, it's a required component.

The `alertmanager` and `ruler` components can also use object storage to store the configurations and rules uploaded by users. In that case, a separate bucket should be created to store alertmanager configurations and rules: using the same bucket for the ruler/alertmanager and for blocks will cause issues with the **[compactor](./compactor.md)**.

<!-- END from blocks storage -->

The **compactor** is a service responsible for:

- Compacting multiple blocks of a given tenant into a single optimized larger block. This helps to reduce storage costs (deduplication, index size reduction) and to increase query speed (querying fewer blocks is faster).
@@ -25,7 +32,7 @@ The **vertical compaction** merges all the blocks of a tenant uploaded by ingest

The **horizontal compaction** triggers after the vertical compaction and compacts several blocks with adjacent 2-hour range periods into a single larger block. Even though the total size of block chunks doesn't change after this compaction, it may still significantly reduce the size of the index and the index-header kept in memory by store-gateways.

![Compactor - horizontal and vertical compaction](/images/blocks-storage/compactor-horizontal-and-vertical-compaction.png)
![Compactor - horizontal and vertical compaction](../images/compactor-horizontal-and-vertical-compaction.png)

<!-- Diagram source at https://docs.google.com/presentation/d/1bHp8_zcoWCYoNU2AhO2lSagQyuIrghkCncViSqn14cU/edit -->

@@ -48,7 +55,7 @@ Given the split blocks, the compactor then runs the **merge** stage for each sha

The merge stage is then run for subsequent compaction time ranges (e.g. 12h, 24h), compacting together blocks belonging to the same shard (_not shown in the picture below_).

![Compactor - split-and-merge compaction strategy](/images/blocks-storage/compactor-split-and-merge.png)
![Compactor - split-and-merge compaction strategy](../images/compactor-split-and-merge.png)

<!-- Diagram source at https://docs.google.com/presentation/d/1bHp8_zcoWCYoNU2AhO2lSagQyuIrghkCncViSqn14cU/edit -->

67 changes: 67 additions & 0 deletions docs/sources/architecture/distributor.md
@@ -0,0 +1,67 @@
---
title: "Distributor"
description: "Overview of the distributor microservice."
weight: 20
---

# Distributor

The **distributor** service is responsible for handling incoming samples from Prometheus. It's the first stop in the write path for series samples. Once the distributor receives samples from Prometheus, each sample is validated for correctness and checked against the configured tenant limits, falling back to the defaults if limits have not been overridden for the specific tenant. Valid samples are then split into batches and sent to multiple [ingesters]({{<relref "./ingester.md">}}) in parallel.

The validation done by the distributor includes the following checks (a simplified sketch follows the list):

- The metric label names are formally correct
- The configured max number of labels per metric is respected
- The configured max length of a label name and value is respected
- The timestamp is not older/newer than the configured min/max time range
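
A simplified sketch of these checks, with illustrative limits standing in for the per-tenant limit configuration:

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"time"
)

// Illustrative limits; in Mimir these come from the per-tenant limit
// configuration, falling back to defaults when not overridden.
const (
	maxLabelsPerSeries = 30
	maxLabelNameLen    = 1024
	maxLabelValueLen   = 2048
	maxSampleAge       = 14 * 24 * time.Hour // reject samples older than this
	maxFutureDelta     = 10 * time.Minute    // reject samples too far in the future
)

var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

type label struct{ Name, Value string }

// validateSeries applies the checks listed above to one incoming series.
func validateSeries(labels []label, timestampMs int64, now time.Time) error {
	if len(labels) > maxLabelsPerSeries {
		return errors.New("too many labels")
	}
	for _, l := range labels {
		if !labelNameRE.MatchString(l.Name) {
			return fmt.Errorf("invalid label name %q", l.Name)
		}
		if len(l.Name) > maxLabelNameLen || len(l.Value) > maxLabelValueLen {
			return errors.New("label name or value too long")
		}
	}
	ts := time.UnixMilli(timestampMs)
	if ts.Before(now.Add(-maxSampleAge)) || ts.After(now.Add(maxFutureDelta)) {
		return errors.New("sample timestamp outside the accepted time range")
	}
	return nil
}

func main() {
	err := validateSeries([]label{{"__name__", "up"}, {"job", "api"}}, time.Now().UnixMilli(), time.Now())
	fmt.Println("valid:", err == nil)
}
```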

Distributors are **stateless** and can be scaled up and down as needed.

## High Availability Tracker

The distributor features a **High Availability (HA) Tracker**. When enabled, the distributor deduplicates incoming samples from redundant Prometheus servers. This allows you to have multiple HA replicas of the same Prometheus servers that write the same series to Mimir, and to deduplicate these series in the Mimir distributor.

The HA Tracker deduplicates incoming samples based on a cluster and replica label. The cluster label uniquely identifies the cluster of redundant Prometheus servers for a given tenant, while the replica label uniquely identifies the replica within the Prometheus cluster. Incoming samples are considered duplicated (and thus dropped) if received by any replica which is not the current primary within a cluster.

The HA Tracker requires a key-value (KV) store to coordinate which replica is currently elected. The distributor will only accept samples from the current leader. Samples carrying only one or neither of the replica and cluster labels are accepted by default and never deduplicated.
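
A minimal sketch of the deduplication decision, assuming a hypothetical in-memory view of the elected replicas (in Mimir, the election state lives in the KV store):

```go
package main

import "fmt"

// acceptSample sketches the HA tracker's decision for one incoming sample.
// elected maps (tenant, cluster) to the replica currently elected; all names
// and shapes here are illustrative.
func acceptSample(elected map[[2]string]string, tenant, cluster, replica string) bool {
	// Samples carrying only one or neither of the cluster/replica labels are
	// accepted by default and never deduplicated.
	if cluster == "" || replica == "" {
		return true
	}
	// Only samples from the currently elected (primary) replica are accepted;
	// samples from any other replica of the same cluster are dropped.
	return elected[[2]string{tenant, cluster}] == replica
}

func main() {
	elected := map[[2]string]string{{"tenant-1", "prod"}: "prometheus-a"}
	fmt.Println(acceptSample(elected, "tenant-1", "prod", "prometheus-a")) // true: primary replica
	fmt.Println(acceptSample(elected, "tenant-1", "prod", "prometheus-b")) // false: duplicate
	fmt.Println(acceptSample(elected, "tenant-1", "", "prometheus-b"))     // true: no cluster label
}
```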

The supported KV stores for the HA tracker are:

- [Consul](https://www.consul.io)
- [Etcd](https://etcd.io)

Note: Memberlist is not supported. Memberlist-based KV stores propagate updates using the gossip protocol, which is very slow for HA purposes: the result is that different distributors may see a different Prometheus server elected as an HA replica, which is undesirable.

For more information, please refer to [config for sending HA pairs data to Mimir](guides/ha-pair-handling.md) in the documentation.

## Hashing

Distributors use consistent hashing, in conjunction with a configurable replication factor, to determine which ingester instance(s) should receive a given series.

The hash is calculated using the metric name, labels and tenant ID.

There is a trade-off associated with including labels in the hash. Writes are more balanced across ingesters, but each query needs to talk to all ingesters since a metric could be spread across multiple ingesters given different label sets.

### The hash ring

A hash ring (stored in a key-value store) is used to achieve consistent hashing for the series sharding and replication across the ingesters. All [ingesters]({{<relref "./ingester.md">}}) register themselves into the hash ring with a set of tokens they own; each token is a random unsigned 32-bit integer. Each incoming series is [hashed](#hashing) in the distributor and then pushed to the ingester which owns the token range for the series hash number plus N-1 subsequent ingesters in the ring, where N is the replication factor.

To do the hash lookup, distributors find the smallest appropriate token whose value is larger than the [hash of the series](#hashing). When the replication factor is larger than 1, the subsequent tokens (clockwise in the ring) that belong to different ingesters will also be included in the result.

The effect of this hash setup is that each token an ingester owns is responsible for a range of hashes. If there are three tokens with values 0, 25, and 50, then a hash of 3 would be given to the ingester that owns token 25; the ingester owning token 25 is responsible for the hash range of 1-25.
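
A toy sketch of the token lookup described above; the hashing and ring structures here are illustrative, not Mimir's actual ring implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// A minimal hash ring: each entry is a token owned by an ingester.
type ringEntry struct {
	token    uint32
	ingester string
}

// seriesToken hashes the tenant ID plus the series' labels to a 32-bit value,
// as a stand-in for the real series hashing.
func seriesToken(tenantID, seriesLabels string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	h.Write([]byte(seriesLabels))
	return h.Sum32()
}

// lookup returns replicationFactor distinct ingesters for the series: the
// owner of the smallest token larger than the hash, then the next tokens
// clockwise in the ring that belong to different ingesters.
func lookup(ring []ringEntry, hash uint32, replicationFactor int) []string {
	sort.Slice(ring, func(i, j int) bool { return ring[i].token < ring[j].token })
	start := sort.Search(len(ring), func(i int) bool { return ring[i].token > hash }) % len(ring)

	var out []string
	seen := map[string]bool{}
	for i := 0; len(out) < replicationFactor && i < len(ring); i++ {
		e := ring[(start+i)%len(ring)]
		if !seen[e.ingester] {
			seen[e.ingester] = true
			out = append(out, e.ingester)
		}
	}
	return out
}

func main() {
	ring := []ringEntry{
		{0, "ingester-1"}, {25, "ingester-2"}, {50, "ingester-3"}, {75, "ingester-1"},
	}
	// Keep the toy example within the 0-99 token range of the ring above.
	hash := seriesToken("tenant-1", `up{job="api"}`) % 100
	fmt.Println("hash:", hash, "-> ingesters:", lookup(ring, hash, 3))
}
```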

The supported KV stores for the hash ring are:

- [Consul](https://www.consul.io)
- [Etcd](https://etcd.io)
- Gossip [memberlist](https://github.com/hashicorp/memberlist)

#### Quorum consistency

Since all distributors share access to the same hash ring, write requests can be sent to any distributor, and you can set up a stateless load balancer in front of them.

To ensure consistent query results, Mimir uses [Dynamo-style](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) quorum consistency on reads and writes. This means that the distributor waits for a positive response from at least one half plus one of the ingesters it sends the sample to, before successfully responding to the Prometheus write request.
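
As a worked example of the quorum calculation (a sketch, not Mimir's actual write path):

```go
package main

import "fmt"

// quorum returns the minimum number of successful ingester responses the
// distributor waits for before acknowledging a write: one half of the
// replication factor plus one.
func quorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func main() {
	// With the typical replication factor of 3, a write is acknowledged after
	// 2 ingesters have accepted the sample; the third may still be in flight.
	for _, rf := range []int{1, 3, 5} {
		fmt.Printf("replication factor %d -> quorum %d\n", rf, quorum(rf))
	}
}
```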

## Load balancing across distributors

We recommend randomly load balancing write requests across distributor instances. For example, if you're running Mimir in a Kubernetes cluster, you could run the distributors behind a Kubernetes [Service](https://kubernetes.io/docs/concepts/services-networking/service/).