proposal: increase default value for max_samples_per_send
#5166
Comments
Can you elaborate? I don't think we're seeing any issues like this with any of our Prometheus instances.
Yeah, 100 seems a bit low.
Another datapoint: Weaveworks customer config is set to 1000. @valyala a lot of those configs on GitHub also have
Yeah, default
The system CPU usage was quite high after adding a significant amount of metrics. We were about to scale the platform when we noticed the system CPU usage was high. After profiling the system, we noticed that the kernel was spending a very significant amount of time handling lookups in the established-TCP-connections kernel hashtable. After checking things, I've seen what we were pushing in our setup. Load was through the roof (more than 30 on a 12-core server) and rule evaluation time exploded (30s instead of the usual milliseconds). We have ended up with
It now works fine with 2-3 load and very good rule evaluation times. One problem still remains: it eats up more memory, if I'm not mistaken, and the Prometheus code is not really meant for big
About keep-alive connections, @elwinar and I came up with elwinar@a153ee9. It seems to fix the issue.
FYI, “HTTP pipelining” means something different from keep-alive, and is not relevant here. See https://en.m.wikipedia.org/wiki/HTTP_pipelining. This doesn't impact your suggestion; I just like to keep the terminology clear.
I stand corrected. To avoid any unneeded confusion I have edited my past comments. Thanks @bboreham.
From the documentation:

> The default HTTP client's Transport may not reuse HTTP/1.x "keep-alive" TCP connections if the Body is not read to completion and closed.

This effectively enables keep-alive for the fixed requests.

Signed-off-by: Romain Baugue <romain.baugue@elwinar.com>
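For readers following along, here is a minimal sketch of the drain-and-close pattern the quoted net/http documentation implies, assuming nothing about the actual patch in elwinar@a153ee9. The function name, endpoint, and content type are illustrative, not taken from the Prometheus code.

```go
package remotewrite

import (
	"bytes"
	"errors"
	"io"
	"net/http"
)

// sendWriteRequest is a hypothetical helper showing the pattern: read the
// response body to completion and close it, so the default Transport can
// return the keep-alive TCP connection to its pool instead of dialing a new
// connection for every batch.
func sendWriteRequest(client *http.Client, url string, payload []byte) error {
	resp, err := client.Post(url, "application/x-protobuf", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Drain the body even when its contents are uninteresting; an unread
	// body can prevent HTTP/1.x keep-alive connection reuse.
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return err
	}

	if resp.StatusCode/100 != 2 {
		return errors.New("remote write failed: " + resp.Status)
	}
	return nil
}
```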
@tomwilkie I am curious, what are the strong reasons for the 1k shards you mention here?
We came to this in our bug scrub today. I would happily change the defaults to a better value to help many people, but would like to find a value people can agree on, to avoid continuously changing it. Generally, I agree that 100 is too low; however, I have seen systems that break trying to send 1k+ samples per request. I would prefer defaults that work for most people rather than ones optimized for some use cases. What would everyone think of a default of 500 samples per request, and max shards of 200 to keep total throughput similar?
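For illustration, a sketch of how the suggested values would look if a user set them explicitly today via queue_config; the endpoint URL is a placeholder.

```yaml
# prometheus.yml (sketch)
remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"  # placeholder endpoint
    queue_config:
      max_samples_per_send: 500  # suggested default above (currently 100)
      max_shards: 200            # suggested default above, to keep total throughput similar
```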
This sounds reasonable!
Based on a study I've made for our needs: LTS services handle this (sometimes very) differently based on their architecture. If the LTS API handles batching well, then reducing
I ended up keeping VictoriaMetrics, even though it has some quirks. The configuration is
Not a single problem ever since.
@beorn- From your comment above, it looks like your figure of 1500 req/s would have been using 100 max samples per send; I would expect 500 to be much better. In addition, significant improvements in sharding have been implemented in the last year. I believe April of last year was when a bug was causing shards to constantly swap between min and max, causing all sorts of issues. Would you be willing to try a newer version of Prometheus with 500 max_samples_per_send to get more recent data?
Sure, I'll give it a try ASAP. NB: we're currently pushing 225k datapoints/s per Prometheus instance.
@beorn- Any updates on trying 500 samples per send?
I'll try right away. It just slipped my mind, sorry!
Does not seem to have any effect whatsoever.
Proposal

The default value for `max_samples_per_send` (100) is too low for any non-idle Prometheus setup with `remote_write` enabled. It results in too frequent requests to remote storage if Prometheus scrapes more than a few hundred metrics per second. High request frequency wastes resources on both the Prometheus and remote storage sides, so users have to increase `max_samples_per_send` after their first attempt to write metrics to remote storage. It would be great if the default value for `max_samples_per_send` were increased from 100 to 1000 or even 10K. This would simplify `remote_write` configuration for the majority of users.