blog: Small fixes to Aiven article #6481

Merged 2 commits on Jun 28, 2023
16 changes: 8 additions & 8 deletions docs/blog/2023-06-08-thanos-at-aiven.md
@@ -1,7 +1,7 @@
---
title: Aiven’s Journey to Thanos from M3DB
date: "2023-06-27"
author: Jonah Kowall (https://github.com/jkowall), Michael Hoffmann (https://github.com/MichaHoffmann), Alexander Rickardsson (https://github.com/alxric)
---

## About Aiven
@@ -32,15 +32,15 @@ This is the high level overview of our current Thanos architecture.

![Aiven Thanos Architecture](img/aiven-thanos-architecture.png)

As you can see in our architecture, we use Telegraf because it can monitor the many technologies Aiven provides with a smaller footprint. Although we support Prometheus scraping for our users, internally we push metrics to M3DB via the Influx line protocol, alongside the other technologies shown in the diagram. We had several options for where to introduce Thanos into the mix. We decided to keep using Telegraf, but to send metrics directly via Prometheus remote write instead of the Influx protocol. This created one of our first challenges: because of the number of clouds we support, some of our metrics arrive delayed, and when a metric was written in the future the ingesters would crash. As it happened, someone was already fixing this upstream in [PR #6195](https://github.com/thanos-io/thanos/pull/6195).
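To make the shift concrete, here is a minimal sketch of what sending Telegraf metrics to a remote-write endpoint can look like, using Telegraf's `outputs.http` plugin with the `prometheusremotewrite` data format. The URL is a placeholder for a Thanos receive endpoint; our production configuration differs in its details.

```toml
[[outputs.http]]
  ## Placeholder URL; Thanos receive exposes remote write on /api/v1/receive.
  url = "https://thanos-receive.example.com/api/v1/receive"
  data_format = "prometheusremotewrite"

  [outputs.http.headers]
    Content-Type = "application/x-protobuf"
    Content-Encoding = "snappy"
    X-Prometheus-Remote-Write-Version = "0.1.0"
```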

The first step was setting up a single standalone Thanos system and testing how it handled a small percentage of our metric traffic. We ended up writing a Starlark script to do this sampling and configured a subset of our fleet (automated). We build a lot of test automation for any system we run in production, including chaos testing. Our automated tests include writing one sample, making cluster changes, and then verifying that the sample was persisted. This check initially failed because of an off-by-one issue in head compaction; the issue is mostly inconsequential to normal operation and was fixed in [PR #6183](https://github.com/thanos-io/thanos/pull/6183).
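As a rough illustration of that kind of sampling (not our production script), a Telegraf Starlark processor can keep a stable fraction of hosts; the `host` tag and the ten-percent bucket below are assumptions made for the example.

```toml
[[processors.starlark]]
  source = '''
def apply(metric):
    # Keep metrics only from hosts that hash into one of ten buckets,
    # giving a stable ~10% sample; dropping a metric means returning None.
    host = metric.tags.get("host", "")
    if hash(host) % 10 == 0:
        return metric
    return None
'''
```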

## Testing at scale

After we tackled these challenges, we decided to begin building a scaled-out implementation to see how it handled higher volumes of metric traffic. As part of this, we deployed additional components and scaled them out.

We take great care to manage the hash ring in a way that ensures no failed writes during a cluster failover. During the initial development phase we would encounter situations where, very briefly during startup, we had too few endpoints in the hash ring to satisfy the replication requirements. This caused Thanos to lock up during construction of the hash ring. We fixed this deadlock in [PR #6168](https://github.com/thanos-io/thanos/pull/6168), and also fixed our management tooling so that it no longer produces such hash ring configurations. Our services run as systemd units with a default stop timeout of 90 seconds. While this is sufficient for most of the databases we manage, it proved insufficient for Thanos ingesting receivers, which need to compact and upload decently sized TSDB heads on shutdown. We noticed this in our trial runs under production traffic and increased the limits as a consequence.
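Raising that limit is a small systemd drop-in; the unit name and the exact timeout below are illustrative, not our precise values.

```ini
# /etc/systemd/system/thanos-receive.service.d/override.conf
# (unit name and value are illustrative)
[Service]
# Give the ingesting receiver time to compact and upload its TSDB head
# before systemd escalates to SIGKILL; the default stop timeout is 90 seconds.
TimeoutStopSec=30min
```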

The community has been critical for us. One example from this portion of the implementation: we received help from other users and maintainers on the CNCF Slack. We also ran into issues where replication loops would cause crashes; we addressed this by moving to a routing-ingesting receiver topology, as suggested by the community.
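For readers unfamiliar with the topology: a stateless routing tier holds the hashring configuration and fans incoming remote writes out to ingesting receivers, which are the only instances that own a TSDB and upload blocks to object storage. A hedged sketch of the two roles follows; hostnames, ports, and file paths are placeholders, and the flag lists are abbreviated.

```shell
# Routing receivers: accept remote write, consult the hashring, forward.
thanos receive \
  --remote-write.address=0.0.0.0:19291 \
  --receive.hashrings-file=/etc/thanos/hashring.json \
  --receive.replication-factor=3

# Ingesting receivers: listed in hashring.json, own the TSDB and uploads.
thanos receive \
  --grpc-address=0.0.0.0:10901 \
  --receive.local-endpoint=ingestor-0.example.internal:10901 \
  --tsdb.path=/var/lib/thanos/receive \
  --objstore.config-file=/etc/thanos/objstore.yml
```

The hashring file itself is a small JSON document listing, per hashring, the endpoints of the ingesting receivers.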

@@ -78,13 +78,13 @@ We are also paying roughly 25% for the storage costs. M3DB has a total of 54TB o
* Thanos with 2 years retention: $27,955
* Thanos with 3 years retention: $33,423

As you can see, the cost savings are significant. There is also ongoing work on further cost optimizations in Thanos. For example, [Alexander Rickardsson](https://github.com/alxric) implemented availability-zone awareness upstream in [PR #6369](https://github.com/thanos-io/thanos/pull/6369), which let us reduce our replication factor from 3 to 2 on Google Cloud Platform.

## Performance Gains

Performance was also generally much better for our ongoing reporting and alerting needs. Today we use vmalert to drive our alerting pipeline, since M3DB is limited and has no Alertmanager integration. This brings us to another issue we found with vmalert: it would sometimes execute rules twice within the same group evaluation period. By default, vmalert realigns the result timestamp with the group evaluation start time, which produced samples with the same timestamp but different values and led to failed and rejected writes. We fixed this by disabling the query time alignment (`datasource.queryTimeAlignment`).
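For reference, the alignment is controlled by a single vmalert flag; the URLs below are placeholders, and the flag names reflect the vmalert versions we were running at the time.

```shell
vmalert \
  -rule="/etc/vmalert/rules/*.yml" \
  -datasource.url=http://thanos-query.example.internal:9090 \
  -remoteWrite.url=http://thanos-receive.example.internal:19291 \
  -notifier.url=http://alertmanager.example.internal:9093 \
  -datasource.queryTimeAlignment=false
```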

With vmalert, queries run actively at intervals, so we could compare the two systems as-is to see the performance differences. We found that most queries that did not use regex matchers were much faster on Thanos:

![Thanos vs M3DB Performance](img/thanosvsm3perf.png)
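To make the distinction concrete, the difference is between selectors like the following (the metric and label names are invented for the example):

```promql
# Exact matcher: the store can narrow the series selection cheaply.
sum(rate(http_requests_total{service="query-frontend"}[5m]))

# Regex matcher: forces a much broader series lookup before filtering.
sum(rate(http_requests_total{service=~"query.*"}[5m]))
```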

@@ -94,4 +94,4 @@ Additionally, we allowed unlimited retention of metrics since the cost of object

[Michael Hoffmann](https://github.com/MichaHoffmann) is working on other optimizations in the query engine that we will be contributing upstream. You can follow some of this work in the [promql-engine repository](https://github.com/thanos-io/promql-engine).

With such remarkable results, we are excited about the next step, which is to be the first Thanos service managed as a product on Aiven. We will also continue to contribute to Thanos, and we hope to have dedicated members of our OSPO working in the Thanos community in the future. It is a very vibrant and helpful community, and we have already made contributions to it. We will start on the productization work later this year, so stay tuned for our public beta next year. We are also seeking private beta testers who are interested in a cost-effective, truly open source Prometheus service that can run across 11 clouds and more than 160 regions; if that sounds interesting, please reach out to [me on Twitter](https://twitter.com/jkowall).