
proposal: increase default value for max_samples_per_send #5166

Closed
valyala opened this issue Jan 31, 2019 · 18 comments · Fixed by #5267

Comments

@valyala

valyala commented Jan 31, 2019

Proposal

The default value for max_samples_per_send (100) is too low for any non-idle Prometheus setup with remote_write enabled. It results in overly frequent requests to remote storage if Prometheus scrapes more than a few hundred metrics per second. The high request rate wastes resources on both the Prometheus and remote storage sides, so users end up increasing max_samples_per_send after their first attempt to write metrics to remote storage.

It would be great if the default value for max_samples_per_send were increased from 100 to 1000, or even 10K. This would simplify remote_write configuration for the majority of users.
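
For context, max_samples_per_send lives under queue_config in a remote_write block; a minimal sketch, assuming an illustrative endpoint URL and the proposed value:

    remote_write:
      - url: "https://remote-storage.example.com/api/v1/write"   # illustrative endpoint
        queue_config:
          max_samples_per_send: 1000   # proposed default; the shipped default is 100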

@cstyan
Member

cstyan commented Feb 1, 2019

The default value for max_samples_per_send (100) is too low for any non-idle Prometheus setup with remote_write enabled. It results in overly frequent requests to remote storage if Prometheus scrapes more than a few hundred metrics per second.

Can you elaborate? I don't think we're seeing any issues like this with any of our Prometheus instances.

@valyala
Author

valyala commented Feb 1, 2019

  • GitHub users usually use high values for max_samples_per_send, which suggests that the default value is too low.
  • See this and this issue; both suggest increasing max_samples_per_send to 1000 in order to fix performance issues.

@juliusv
Member

juliusv commented Feb 4, 2019

Yeah, 100 seems a bit low.

@bboreham
Member

bboreham commented Feb 4, 2019

Another datapoint: Weaveworks customer config is set to 1000.

@valyala a lot of those configs on GitHub also have max_shards: 10000 which suggests they haven't thought this through...

@valyala
Author

valyala commented Feb 4, 2019

Yeah, the default max_shards should be lowered to an appropriate value when increasing the default max_samples_per_send.

@beorn-

beorn- commented Apr 17, 2019

System CPU usage was quite high after we added a significant number of metrics.

We were about to scale the platform and noticed that system CPU usage was high. After profiling, we saw that the kernel was spending a very significant amount of time handling lookups in the established-TCP-connections kernel hash table.

After checking, I saw that in our setup we were pushing 150k datapoints/s, which, at 100 samples per send, meant 1500 TCP connections/s, and hence a serious number of TIME_WAIT connections (perfectly normal).

Load was through the roof (more than 30 on a 12-core server) and rule evaluation time exploded (30s instead of the usual milliseconds).
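
The rough arithmetic behind those numbers (assuming one HTTP request per batch of max_samples_per_send samples and, at the time, no connection reuse):

    150000 datapoints/s ÷   100 samples/request = 1500 requests/s   (old setting)
    150000 datapoints/s ÷ 10000 samples/request =   15 requests/s   (setting below)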

We ended up with:

    queue_config:
      capacity: 300000
      max_shards: 100
      max_samples_per_send: 10000

It now works fine with a load of 2-3 and very good rule evaluation times.

One problem still remains: it eats up more memory, if I'm not mistaken, and the Prometheus code is not really meant for big max_samples_per_send values. Maybe fixing keep-alive through HTTP/1.1 might be a quick win too?

@beorn-

beorn- commented Apr 17, 2019

About keep-alive connections, @elwinar and I came up with elwinar@a153ee9

It seems to fix the issue.

@bboreham
Member

FYI, “HTTP pipelining” means something different from keep-alive and is not relevant here. See https://en.m.wikipedia.org/wiki/HTTP_pipelining

This doesn’t impact your suggestion; I just like to keep the terminology clear.

@beorn-

beorn- commented Apr 17, 2019

I stand corrected. To avoid any unneeded confusion I have edited my past comments. Thanks @bboreham.

elwinar added a commit to elwinar/prometheus that referenced this issue Apr 18, 2019
From the documentation:
> The default HTTP client's Transport may not
> reuse HTTP/1.x "keep-alive" TCP connections if the Body is
> not read to completion and closed.

This effectively enables keep-alive for the fixed requests.

Signed-off-by: Romain Baugue <romain.baugue@elwinar.com>
brian-brazil pushed a commit that referenced this issue Apr 18, 2019
brian-brazil pushed a commit that referenced this issue Apr 24, 2019
@csmarchbanks
Member

@tomwilkie I am curious, what are the strong reasons for 1k shards that you mention here?

@cstyan cstyan mentioned this issue Nov 18, 2019
@csmarchbanks
Member

We came across this in our bug scrub today. I would happily change the defaults to a better value to help many people, but would like to find a value people can agree on so we avoid changing it repeatedly.

Generally, I agree that 100 is too low; however, I have seen systems that break when trying to send 1k+ samples per request. I would prefer defaults that work for most people over optimizing for some use cases. What would everyone think of a default of 500 samples per request, and max shards of 200 to keep total throughput similar?
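
If that proposal were adopted, the relevant part of the config would look roughly like this (a sketch of the proposed values, not what currently ships):

    queue_config:
      max_samples_per_send: 500   # proposed default (currently 100)
      max_shards: 200             # proposed default, to keep total throughput similar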

@valyala
Author

valyala commented Apr 20, 2020

What would everyone think of a default of 500 samples per request, and max shards of 200 to keep total throughput similar?

This sounds reasonable!

@beorn-

beorn- commented Apr 20, 2020

Based on a study I made for our needs: LTS services handle this (sometimes very) differently depending on their architecture.

If the LTS API handles batching well, then reducing max_shards and raising max_samples_per_send to at least 500-1000 was a minimum on all the setups I have tried: VictoriaMetrics, M3, Cortex, Metrictank (when it supported Prometheus remote_write), and another one I can't remember.

I ended up keeping VictoriaMetrics, even though it has some quirks. The configuration is:

    queue_config:
      capacity: 300000
      max_shards: 100
      max_samples_per_send: 10000

Not a single problem since.

@csmarchbanks
Member

@beorn- From your comment above, it looks like the 1500 req/s you mentioned would have been with 100 max samples per send; I would expect 500 to be much better. In addition, significant improvements to sharding have been implemented in the last year. I believe April of last year was when a bug was causing shards to constantly swap between min and max, causing all sorts of issues. Would you be willing to try a newer version of Prometheus with 500 max_samples_per_send to get more recent data?
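
For comparison, reusing the 150k datapoints/s figure from the earlier comment and assuming one request per batch:

    150000 datapoints/s ÷ 100 samples/request = 1500 requests/s
    150000 datapoints/s ÷ 500 samples/request =  300 requests/s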

@beorn-

beorn- commented Apr 21, 2020

Sure, I'll give it a try ASAP.

NB: we're currently pushing 225k datapoints/s per Prometheus instance.

@csmarchbanks
Member

@beorn- Any updates on trying 500 samples per send?

@beorn-

beorn- commented May 7, 2020

I'll try right away. It just slipped my mind, sorry!

@beorn-

beorn- commented May 7, 2020

Does not seem to have any effect whatsoever
