Datasource proxy returning "too many outstanding requests" #4613
Hello @applike-ss, thanks for raising the issue! More details here https://grafana.com/docs/grafana/latest/administration/configuration/#max_idle_connections. |
Thanks for the hint. I tried it and at first it looked like it worked, but after the next 5-10 requests (I changed the filter condition to use a different log stream) the same behaviour came back. Just from watching the behaviour, it looks like some resource is not being freed. It only seems to affect the time series graph. I'm not sure now whether it is actually Grafana or Loki. Do you know if Grafana can cause this error message at all?
👋 @applike-ss Can you share your Loki configuration? I think you're hitting a default limit somewhere.
Here's my current loki config:
It contains some helm placeholders, just FYI |
Increase max_outstanding_per_tenant.
@cyriltovena Thanks a lot! That one did it :-) |
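For anyone else landing here: a minimal sketch of where that limit lives in the Loki configuration (the value 2048 is illustrative, taken from later comments in this thread; which block applies depends on whether the query frontend queues requests itself or hands them off to a query scheduler):

frontend:
  # Per-tenant queue of requests waiting at the query frontend; requests beyond
  # this limit are rejected with HTTP 429 "too many outstanding requests".
  max_outstanding_per_tenant: 2048   # default is 100

query_scheduler:
  # If the query scheduler component is in use, its queue limit applies instead.
  max_outstanding_requests_per_tenant: 2048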
@trevorwhitney @slim-bean should we change the default of this?
@cyriltovena @slim-bean We hit this limit consistently with Grafana queries longer than 30m after updating to Loki 2.4.2. The default for max_outstanding_per_tenant (100) seems too low for this.
Sounds like a good first PR to update the default. |
👍 I feel good about changing the default as long as 2048 won't negatively impact a smaller cluster? |
After updating Loki to 2.4.2 I ran into this issue as well. Why is this happening just now, after an update? EDIT: After changing max_outstanding_per_tenant as suggested above, the issue went away.
@zLucPlayZ @trevorwhitney I had to tune a ton of options to get our single-instance Loki to respond to bigger queries (up to 3 months) from a Grafana dashboard. I tuned the variables until I ran into the proxy timeout limit; some defaults are way off. Below is our current loki-config.yml:
# Loki Config file
# based on https://github.com/grafana/loki/blob/master/cmd/loki/loki-docker-config.yaml
# Documentation: https://grafana.com/docs/loki/latest/configuration/
# The module to run Loki with. Supported values
# all, distributor, ingester, querier, query-frontend, table-manager.
# [target: <string> | default = "all"]
target: all
# Enables authentication through the X-Scope-OrgID header, which must be present
# if true. If false, the OrgID will always be set to "fake".
auth_enabled: false
# Configures the server of the launched module(s).
server:
http_listen_port: 3100
http_server_read_timeout: 60s # allow longer time span queries
http_server_write_timeout: 60s # allow longer time span queries
grpc_server_max_recv_msg_size: 33554432 # 32MiB (int bytes), default 4MB
grpc_server_max_send_msg_size: 33554432 # 32MiB (int bytes), default 4MB
# Log only messages with the given severity or above. Supported values [debug,
# info, warn, error]
# CLI flag: -log.level
log_level: info
# Configures the ingester and how the ingester will register itself to a
# key value store.
ingester:
lifecycler:
final_sleep: 0s
chunk_idle_period: 1h # Any chunk not receiving new logs in this time will be flushed
max_chunk_age: 1h # All chunks will be flushed when they hit this age, default is 1h
chunk_target_size: 1048576 # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first
chunk_retain_period: 30s # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
max_transfer_retries: 0 # Chunk transfers disabled
schema_config:
configs:
- from: 2020-11-01
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
storage_config:
boltdb:
directory: /data/loki/index
filesystem:
directory: /data/loki/chunks
boltdb_shipper:
active_index_directory: /data/loki/boltdb-shipper-active
cache_location: /data/loki/boltdb-shipper-cache
cache_ttl: 72h # Can be increased for faster performance over longer query periods, uses more disk space
shared_store: filesystem
compactor:
working_directory: /data/loki/boltdb-shipper-compactor
shared_store: filesystem
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
retention_delete_worker_count: 150
limits_config:
retention_period: 91d
enforce_metric_name: false
reject_old_samples: true
reject_old_samples_max_age: 168h
# Per-user ingestion rate limit in sample size per second. Units in MB.
# CLI flag: -distributor.ingestion-rate-limit-mb
ingestion_rate_mb: 8 # <float> | default = 4]
# Per-user allowed ingestion burst size (in sample size). Units in MB.
# The burst size refers to the per-distributor local rate limiter even in the
# case of the "global" strategy, and should be set at least to the maximum logs
# size expected in a single push request.
# CLI flag: -distributor.ingestion-burst-size-mb
ingestion_burst_size_mb: 16 # <int> | default = 6]
# Maximum byte rate per second per stream,
# also expressible in human readable forms (1MB, 256KB, etc).
# CLI flag: -ingester.per-stream-rate-limit
per_stream_rate_limit: 5MB # <string|int> | default = "3MB"
# Maximum burst bytes per stream,
# also expressible in human readable forms (1MB, 256KB, etc).
# This is how far above the rate limit a stream can "burst" before the stream is limited.
# CLI flag: -ingester.per-stream-rate-limit-burst
per_stream_rate_limit_burst: 15MB # <string|int> | default = "15MB"
# The limit to length of chunk store queries. 0 to disable.
# CLI flag: -store.max-query-length
max_query_length: 2165h # <duration> | default = 721h
# Limit how far back in time series data and metadata can be queried,
# up until lookback duration ago.
# This limit is enforced in the query frontend, the querier and the ruler.
# If the requested time range is outside the allowed range, the request will not fail,
# but will be modified to only query data within the allowed time range.
# The default value of 0 does not set a limit.
# CLI flag: -querier.max-query-lookback
max_query_lookback: 90d
# # no longer used by default. retention is done by compactor
# table_manager:
# retention_deletes_enabled: true
# retention_period: 91d
querier:
max_concurrent: 20
frontend:
# Maximum number of outstanding requests per tenant per frontend; requests
# beyond this error with HTTP 429.
# CLI flag: -querier.max-outstanding-requests-per-tenant
max_outstanding_per_tenant: 2048 # default = 100]
# Compress HTTP responses.
# CLI flag: -querier.compress-http-responses
compress_responses: true # default = false]
# Log queries that are slower than the specified duration. Set to 0 to disable.
# Set to < 0 to enable on all queries.
# CLI flag: -frontend.log-queries-longer-than
log_queries_longer_than: 20s
frontend_worker:
grpc_client_config:
# The maximum size in bytes the client can send.
# CLI flag: -<prefix>.grpc-max-send-msg-size
max_send_msg_size: 33554432 # 32MiB, default = 16777216]
max_recv_msg_size: 33554432
ingester_client:
grpc_client_config:
# The maximum size in bytes the client can send.
# CLI flag: -<prefix>.grpc-max-send-msg-size
max_send_msg_size: 33554432 # 32mb, default = 16777216]
max_recv_msg_size: 33554432
query_scheduler:
max_outstanding_requests_per_tenant: 2048
grpc_client_config:
# The maximum size in bytes the client can send.
# CLI flag: -<prefix>.grpc-max-send-msg-size
max_send_msg_size: 33554432 # 32mb, default = 16777216]
max_recv_msg_size: 33554432
query_range:
split_queries_by_interval: 0 # 720h # 30d
ruler:
storage:
type: local
local:
directory: /data/loki/rules # volume, directory to scan for rules
rule_path: /data/loki/rules-temp # volume, store temporary rule files
alertmanager_url: "https://alertmanager.example.com"
enable_alertmanager_v2: true
alertmanager_client:
basic_auth_username: "{{ loki_alertmanager_username }}"
basic_auth_password: "{{ loki_alertmanager_password }}"
# Common config to be shared between multiple modules.
# If a more specific config is given in other sections, the related config under this section
# will be ignored.
common:
path_prefix: /data/loki
# storage:
# filesystem:
# chunks_directory: /data/loki/chunks
# rules_directory: /data/loki/rules
replication_factor: 1
ring:
instance_addr: 127.0.0.1
kvstore:
store: inmemory
|
Thanks for sharing your configuration! I'm still a bit confused about what exactly is meant by split_queries_by_interval.
If I'm not mistaken, this splits the data fetching between the queriers if the query loads more data than fits into one interval.
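To make that concrete (a sketch assuming the 2.4.x defaults discussed in this thread): with a 30m split interval, a single 24h dashboard query becomes roughly 48 subqueries, and with parallelised shardable queries each of those can fan out further across index shards, so one busy dashboard can exhaust the default per-tenant queue of 100 almost immediately.

query_range:
  split_queries_by_interval: 30m       # a 24h query / 30m ≈ 48 subqueries
  parallelise_shardable_queries: true  # each subquery may fan out across shards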
The problem comes with the latest version 2.4.2: https://github.com/grafana/loki/pull/5077/files#diff-025adfc5a8f641b9f5a1869996e3297b6c17f13933f52354cd9b375548ad7970R399 |
@dfoxg yeah, we debated whether we should change that or not. My understanding is that it's not ideal for small or single-node clusters, but we deferred to optimizing production workloads, for which it would be an improvement. Just to make sure though: if we increase the default for max_outstanding_per_tenant, does that resolve this for you? @pgassmann I'm curious why you set split_queries_by_interval to 0?
@trevorwhitney after setting |
@trevorwhitney Without disabling split_queries_by_interval, longer queries from the Grafana dashboard kept failing with this error. We run Loki as a single Docker container with docker compose and an additional nginx proxy for authentication. Our Ansible role is public: https://github.com/teamapps-org/ansible-collection-teamapps-general/tree/main/roles/loki/templates Loki is installed with filesystem storage on a CPX31 instance on Hetzner Cloud with 4 CPUs, 8 GB RAM and a 160 GB disk. That should be enough resources for a lot of logs before we need a more complex setup. The amount of tuning required to make use of the available resources is not aligned with the simplicity of the setup.
I did not yet try it with setting |
Encountered this after upgrade to v2.4.2 on a multi-instance setup.
Perhaps this issue should be reopened or a new one created. I was seeing many errors in the journal of the form |
@pgassmann for a single instance setup, I think setting split_queries_by_interval to 0 is a reasonable workaround. @setpill Could you also try raising max_outstanding_per_tenant? The change we made to this default was aimed at larger production clusters, where it improves query performance.
@setpill are your queries failing when you see these log lines? If so, is the HTTP code you're seeing a 429?
+1 for this solution, works for me: |
@trevorwhitney In addition to or instead of |
@setpill in addition. |
@trevorwhitney What would I be hoping to accomplish with that? The system already works, courtesy of |
Oh, my misunderstanding, I thought you were still having issues. If everything is working I don't think you need to change anything. |
Hi, I'm having the same issue using Loki deployed with Helm (v2.10.1). Is there a way to set these config parameters using values.yaml?
My values.yml looks like the following:
|
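The exact keys depend on the chart, but here is a rough sketch of how such overrides are usually passed through Helm values, assuming a chart that exposes loki.structuredConfig (as recent grafana/loki charts do; older charts and loki-stack take a templated loki.config string instead), so check your chart version's values reference:

loki:
  structuredConfig:
    query_scheduler:
      max_outstanding_requests_per_tenant: 2048
    query_range:
      parallelise_shardable_queries: false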
Hi, I had this problem during my storage migration from filesystem to MinIO S3 in Loki 2.4.2. I got "vector_core::stream::driver: Service call failed. error=ServerError { code: 429 } request_id=5020".
@LinTechSo would you mind creating a new issue? You can reference this one, but as it's rather old I think you're more likely to get some help on a new issue.
@trevorwhitney |
I figure I can increase max_outstanding_requests_per_tenant, but I'd like a way to see how many outstanding requests there actually are.
@dpkirchner I am in a similar situation as you, did you find a way to see outstanding requests info? |
@MA-MacDonald No, unfortunately. I may need to move away from Loki because of this and a handful of other unsolvable issues so I probably won't be able to figure it out. |
@dpkirchner I will probably also start considering other solutions. This is so buggy, some of the definitions are not updated, and the last working version of the loki-stack is 2.1-something. I think the devs have no clue what's really going on out there with their software. How can it be possible that you get a "too many outstanding requests" error after an upgrade on a mainline, and even get comments from devs that it isn't a bug? Pretty hilarious and sad at the same time. For whoever reads this: don't invest your time trying to resolve this, I've invested about 5 days on it. We reinstalled almost every version to figure out what's working at the end of the day. You can't run old versions of Grafana, it has privilege escalation CVEs: https://www.cvedetails.com/product/47055/Grafana-Grafana.html?vendor_id=18548 If you read this and you're an existing user, put your Grafana behind a firewall at least, or use the ingress blocklist!
query_range and query_scheduler for the sake of grafana/loki#4613 'custom_config' is for all future sections which may be added and aren't available at the moment of publishing this role
We also had a lot of "403 too many outstanding requests" on loki 2.5.0 and 2.4.2. |
Hi, any updates?
Tried a lot of configs on v2.6.1. Nothing helped. |
Why is this closed? I tried this
and it seemed to work for a time, but now it doesn't anymore. EDIT: upping |
In the end, after a lot of back and forth and trying several configurations, what fixed the "too many outstanding requests" error for me was changing parallelise_shardable_queries from true to false. Excellent!!!!
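For reference, a sketch of where that switch lives; turning it off trades query parallelism for far fewer queued subqueries, which is usually an acceptable trade-off on small or single-instance setups:

query_range:
  parallelise_shardable_queries: false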
My god, I wish these configurations were explained better. It's unclear which configuration works hand in hand with which. It seems Loki has over 784 configuration options and 12 microservices (and counting). Isn't it a bit ironic that Loki set out to simplify the operational burden of logging infrastructure, yet created this monstrosity?
What happened:
We were trying out Loki and its integration into Grafana. When adding a dashboard with not only logs but also a time series visualization, we encountered "too many outstanding requests" (HTTP 429) responses from Grafana in the network monitor.
The exclamation mark symbol was shown in the top left of the panels with the same text, "too many outstanding requests".
However, a quick Google search for exactly that term in combination with loki OR grafana revealed nothing that seemed to be the same issue.
What you expected to happen:
I would expect to be able to configure how many requests can be processed simultaneously, and also to find the possible response codes in the documentation. There should be a way to troubleshoot the issue without diving into the code.
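For example, the knobs the comments above converged on look roughly like this (a sketch; the values are illustrative and the exact sections and defaults vary between Loki versions):

frontend:
  max_outstanding_per_tenant: 2048
query_scheduler:
  max_outstanding_requests_per_tenant: 2048
query_range:
  parallelise_shardable_queries: false   # or keep it true and raise the queue limits above
  split_queries_by_interval: 0           # lives under limits_config in newer Loki versions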
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment: