Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datasource proxy returning "too many outstanding requests" #4613

Closed
applike-ss opened this issue Nov 2, 2021 · 42 comments
Closed

Datasource proxy returning "too many outstanding requests" #4613

applike-ss opened this issue Nov 2, 2021 · 42 comments

Comments

@applike-ss
Copy link

What happened:
We were trying out loki and the integration into grafana. When adding a dashboard with not only logs but also time series visualization, we encountered "too many outstanding requests" (HTTP 429) responses from grafana in the network monitor.
The exclamation mark symbol was shown in the panels on the top left side with the same text "too many outstanding requests".
However a quick google search for exactly that term in combination with loki OR grafana reveiled nothing that seemed to be the same issue.

What you expected to happen:
I would expect to be able to configure how many requests can be processed simultaneously and also to find possible response codes in the documentation. There should be a possibility to troubleshoot the issue without diving into the code.

How to reproduce it (as minimally and precisely as possible):

  • Setup Prometheus stack with grafana
  • Setup Loki distributed
  • Ingest lots of events for ~24h
  • Add a dashboard with logs and time series visualization
  • Set timeframe to 24h or more

Anything else we need to know?:

Environment:

  • Grafana version: v8.1.5
  • Data source type & version: loki 2.3.0
  • OS Grafana is installed on: docker image grafana/grafana:8.1.5 on k8s
  • User OS & Browser: Linux / Chrome
  • Grafana plugins: Just core plugins
  • Others: -
@dsotirakis
Copy link

Hello @applike-ss, thanks for raising the issue!
Can you please try increasing max_idle_connections constant to see if it fixes your issue?

More details here https://grafana.com/docs/grafana/latest/administration/configuration/#max_idle_connections.

@applike-ss
Copy link
Author

Thanks for the hint. I tried it and at first it looked like it works, but after the next 5-10 requests (changed the filter condition for use a different log stream) the same behaviour came back. Just from watching the behaviour it looks like some resource is not being freed. It only seems to effect the time series graph. Not sure now if it is actually grafana or if it is loki. Do you know if grafana can cause this error message at all?

@cyriltovena
Copy link
Contributor

cyriltovena commented Nov 2, 2021

👋 @applike-ss Can you share your Loki configuration ? I think you're hitting a default limit somewhere.

@applike-ss
Copy link
Author

applike-ss commented Nov 2, 2021

Here's my current loki config:

auth_enabled: true

server:
  http_listen_port: 3100

distributor:
  ring:
    kvstore:
      store: memberlist

memberlist:
  join_members:
    - loki-loki-distributed-memberlist

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
  chunk_encoding: snappy
  chunk_idle_period: 15m
  chunk_block_size: 2621440
  chunk_retain_period: 1m
  max_chunk_age: 1h
  max_transfer_retries: 0

limits_config:
  enforce_metric_name: false
  max_cache_freshness_per_query: 10m
  max_query_length: 12000h
  max_query_parallelism: 256
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 30

schema_config:
  configs:
    - from: 2020-05-15
      store: aws
      object_store: s3
      schema: v11
      index:
        prefix: my-{{ .Values.environment }}-loki_
        period: 720h
        tags:
          Project: foo-project
          Environment: {{ .Values.environment }}
          Component: loki

storage_config:
  aws:
    s3: s3://eu-central-1/my-loki-bucket
    dynamodb:
      dynamodb_url: dynamodb://eu-central-1

chunk_store_config:
  # store for 1 year
  max_look_back_period: 8760h

table_manager:
  index_tables_provisioning:
    enable_ondemand_throughput_mode: true
    enable_inactive_throughput_on_demand_mode: true
  chunk_tables_provisioning:
    enable_ondemand_throughput_mode: true
    enable_inactive_throughput_on_demand_mode: true
  retention_deletes_enabled: true
  retention_period: 8640h

query_range:
  align_queries_with_step: true
  parallelise_shardable_queries: false
  max_retries: 5
  split_queries_by_interval: 15m
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_items: 1024
        validity: 24h

frontend_worker:
  frontend_address: loki-loki-distributed-query-frontend:9095
  parallelism: 2

frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  tail_proxy_url: http://loki-loki-distributed-querier:3100

compactor:
  shared_store: s3

ruler:
  storage:
    type: s3
    s3:
      s3: s3://eu-central-1/my-loki-bucket
  ring:
    kvstore:
      store: memberlist
  rule_path: /tmp/loki/scratch
  alertmanager_url: https://alertmanager.xx
  external_url: https://alertmanager.xx

It contains some helm placeholders, just FYI

@cyriltovena
Copy link
Contributor

cyriltovena commented Nov 2, 2021

Increase max_outstanding_per_tenant to 2048 see https://grafana.com/docs/loki/latest/configuration/#query_frontend_config

@cyriltovena cyriltovena transferred this issue from grafana/grafana Nov 2, 2021
@applike-ss
Copy link
Author

@cyriltovena Thanks a lot! That one did it :-)

@cyriltovena
Copy link
Contributor

@trevorwhitney @slim-bean should we change the default of this ?

@pgassmann
Copy link
Contributor

@cyriltovena @slim-bean We hit this limit consistently with a grafana query longer than 30m after update to loki 2.4.2

The default for max_outstanding_per_tenant is still on 100 in loki which is far away from the recommendation here to set it to 2048

@cyriltovena
Copy link
Contributor

Sounds like a good first PR to update the default.

@trevorwhitney
Copy link
Collaborator

👍 I feel good about changing the default as long as 2048 won't negatively impact a smaller cluster?

@LucHeart
Copy link

LucHeart commented Jan 21, 2022

After updating loki to 2.4.2 I ran into this issue aswell? Why is this happening after an update just now?

EDIT: After changing parallelise_shardable_queries to false, everything works again. Since parallelise should bring advantages with it, how would you fix it, or increase the limit, so I dont run into 429 too many outstanding requests

@pgassmann
Copy link
Contributor

pgassmann commented Jan 21, 2022

@zLucPlayZ @trevorwhitney I had to tune a ton of options to get our single instance loki to respond to bigger queries (up to 3 months) from a grafana dashboard. I tuned the variables until I run into the proxy timeout limit.

some defaults are way off.

Below is our current loki-config.yml:

# Loki Config file

# based on https://github.com/grafana/loki/blob/master/cmd/loki/loki-docker-config.yaml

# Documentation: https://grafana.com/docs/loki/latest/configuration/

# The module to run Loki with. Supported values
# all, distributor, ingester, querier, query-frontend, table-manager.
# [target: <string> | default = "all"]
target: all

# Enables authentication through the X-Scope-OrgID header, which must be present
# if true. If false, the OrgID will always be set to "fake".
auth_enabled: false

# Configures the server of the launched module(s).
server:
  http_listen_port: 3100
  http_server_read_timeout: 60s # allow longer time span queries
  http_server_write_timeout: 60s # allow longer time span queries
  grpc_server_max_recv_msg_size: 33554432 # 32MiB (int bytes), default 4MB
  grpc_server_max_send_msg_size: 33554432 # 32MiB (int bytes), default 4MB

  # Log only messages with the given severity or above. Supported values [debug,
  # info, warn, error]
  # CLI flag: -log.level
  log_level: info

# Configures the ingester and how the ingester will register itself to a
# key value store.
ingester:
  lifecycler:
    final_sleep: 0s
  chunk_idle_period: 1h       # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 1h           # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576  # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first
  chunk_retain_period: 30s    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0     # Chunk transfers disabled

schema_config:
  configs:
    - from: 2020-11-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb:
    directory: /data/loki/index

  filesystem:
    directory: /data/loki/chunks

  boltdb_shipper:
    active_index_directory: /data/loki/boltdb-shipper-active
    cache_location: /data/loki/boltdb-shipper-cache
    cache_ttl: 72h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem

compactor:
  working_directory: /data/loki/boltdb-shipper-compactor
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

limits_config:
  retention_period: 91d
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

  # Per-user ingestion rate limit in sample size per second. Units in MB.
  # CLI flag: -distributor.ingestion-rate-limit-mb
  ingestion_rate_mb: 8 # <float> | default = 4]

  # Per-user allowed ingestion burst size (in sample size). Units in MB.
  # The burst size refers to the per-distributor local rate limiter even in the
  # case of the "global" strategy, and should be set at least to the maximum logs
  # size expected in a single push request.
  # CLI flag: -distributor.ingestion-burst-size-mb
  ingestion_burst_size_mb: 16 # <int> | default = 6]

  # Maximum byte rate per second per stream,
  # also expressible in human readable forms (1MB, 256KB, etc).
  # CLI flag: -ingester.per-stream-rate-limit
  per_stream_rate_limit: 5MB # <string|int> | default = "3MB"

  # Maximum burst bytes per stream,
  # also expressible in human readable forms (1MB, 256KB, etc).
  # This is how far above the rate limit a stream can "burst" before the stream is limited.
  # CLI flag: -ingester.per-stream-rate-limit-burst
  per_stream_rate_limit_burst: 15MB # <string|int> | default = "15MB"

  # The limit to length of chunk store queries. 0 to disable.
  # CLI flag: -store.max-query-length
  max_query_length: 2165h # <duration> | default = 721h

  # Limit how far back in time series data and metadata can be queried,
  # up until lookback duration ago.
  # This limit is enforced in the query frontend, the querier and the ruler.
  # If the requested time range is outside the allowed range, the request will not fail,
  # but will be modified to only query data within the allowed time range.
  # The default value of 0 does not set a limit.
  # CLI flag: -querier.max-query-lookback
  max_query_lookback: 90d

# # no longer used by default. retention is done by compactor
# table_manager:
#   retention_deletes_enabled: true
#   retention_period: 91d

querier:
  max_concurrent: 20

frontend:
  # Maximum number of outstanding requests per tenant per frontend; requests
  # beyond this error with HTTP 429.
  # CLI flag: -querier.max-outstanding-requests-per-tenant
  max_outstanding_per_tenant: 2048 # default = 100]

  # Compress HTTP responses.
  # CLI flag: -querier.compress-http-responses
  compress_responses: true # default = false]

  # Log queries that are slower than the specified duration. Set to 0 to disable.
  # Set to < 0 to enable on all queries.
  # CLI flag: -frontend.log-queries-longer-than
  log_queries_longer_than: 20s

frontend_worker:
  grpc_client_config:
    # The maximum size in bytes the client can send.
    # CLI flag: -<prefix>.grpc-max-send-msg-size
    max_send_msg_size: 33554432 # 32MiB, default = 16777216]
    max_recv_msg_size: 33554432

ingester_client:
  grpc_client_config:
    # The maximum size in bytes the client can send.
    # CLI flag: -<prefix>.grpc-max-send-msg-size
    max_send_msg_size: 33554432 # 32mb, default = 16777216]
    max_recv_msg_size: 33554432

query_scheduler:
  max_outstanding_requests_per_tenant: 2048
  grpc_client_config:
    # The maximum size in bytes the client can send.
    # CLI flag: -<prefix>.grpc-max-send-msg-size
    max_send_msg_size: 33554432 # 32mb, default = 16777216]
    max_recv_msg_size: 33554432

query_range:
  split_queries_by_interval: 0 # 720h # 30d

ruler:
  storage:
    type: local
    local:
      directory: /data/loki/rules # volume, directory to scan for rules
  rule_path: /data/loki/rules-temp # volume, store temporary rule files
  alertmanager_url: "https://alertmanager.example.com"
  enable_alertmanager_v2: true
  alertmanager_client:
    basic_auth_username: "{{ loki_alertmanager_username }}"
    basic_auth_password: "{{ loki_alertmanager_password }}"


# Common config to be shared between multiple modules.
# If a more specific config is given in other sections, the related config under this section
# will be ignored.
common:
  path_prefix: /data/loki
  # storage:
  #   filesystem:
  #     chunks_directory: /data/loki/chunks
  #     rules_directory: /data/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

@LucHeart
Copy link

Thanks for sharing your configuration! Im still a bit confused what exactly is meant by split_queries_by_interval. Is this used to determine offloading work to other nodes in a cluster?

@applike-ss
Copy link
Author

If i'm not mistaken then this splits the data fetching between the queriers if the query loads more data then can fit into 1 interval.

@dfoxg
Copy link

dfoxg commented Jan 21, 2022

The problem comes with the latest version 2.4.2: https://github.com/grafana/loki/pull/5077/files#diff-025adfc5a8f641b9f5a1869996e3297b6c17f13933f52354cd9b375548ad7970R399

@trevorwhitney
Copy link
Collaborator

@dfoxg yeah, we debated on if we should change that or not. My understanding is that's it's not ideal for small, or single node clusters, but we we deferred to optimize production workloads in which it would be an improvement. Just to make sure though, if we increase the default for max_outstanding_per_tenant is this still an issue?

@pgassmann I'm curious why you set query_range.split_queries_by_interval is set to 0? That will result in much bigger query results and much less parallelization. I would instead think that reducing querier.max_concurrent but keeping the default split of 30m would yield better results. Are you running single binary? Is a single node handling the requests, or have you scaled this deployment? How big is the box you're running on?

@dfoxg
Copy link

dfoxg commented Jan 22, 2022

@trevorwhitney after setting max_outstanding_requests_per_tenant to 2048 the error goes away. at least on my system.

@pgassmann
Copy link
Contributor

@trevorwhitney Without disabling query_range.split_queries_by_interval, we immediately hit the too many outstanding requests limit, event with max_outstanding_requests_per_tenant max_outstanding_per_tenant set to 4096 (default 100!). Also, the documentation recommends to set this to a multiple of 24h, but the default is 30min 😕 .

We run loki as a single docker container with docker compose and an additional nginx proxy for authentication. Our ansible role is public: https://github.com/teamapps-org/ansible-collection-teamapps-general/tree/main/roles/loki/templates

Loki is installed with filesystem storage on a CPX31 instance on Hetzner Cloud with 4 cpus, 8gb ram, 160gb disk. That should be enough resources for a lot of logs before we need a more complex setup. The amount of tuning required to make use of the available resources is not aligned with the simplicity of the setup.

@pgassmann
Copy link
Contributor

I did not yet try it with setting parallelise_shardable_queries to false. Perhaps that would be a sensible setting for a single instance setup.

@setpill
Copy link

setpill commented Jan 25, 2022

Encountered this after upgrade to v2.4.2 on a multi-instance setup.

max_outstanding_requests_per_tenant: 2048 did not fix it.
parallelise_shardable_queries: false did fix it.

Perhaps this issue should be reopened or a new one created.

I was seeing many errors in the journal of the form level=error ts=[timestamp] caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"; often coming in bursts of several 10s at a time. These have completely disappeared since disabling parallelisation.

@trevorwhitney
Copy link
Collaborator

@pgassmann for a single instance setup, I think setting parallelise_shadable_queries to false is sensible, as well as setting -querier.split-queries-by-interval=0.

@setpill Could you also try setting -querier.split-queries-by-interval=0.

The change we made to set this to 30m by default again is likely preferable on bigger clusters, where there are a lot of query workers available to process the scheduler queue. In a smaller env, the 30m split may be causing your queue to back up faster than it can be processed. You can check to see if this is the problem by looking at the cortex_query_scheduler_queue_length metric if you're using the scheduler, or the cortex_query_frontend_queue_length metric if you're not.

@trevorwhitney
Copy link
Collaborator

@setpill are your queries failing when you see these log lines? If so, is the http code you're seeing a 499?

@kathy-lee
Copy link

+1 for this solution, works for me:
max_outstanding_requests_per_tenant: 2048 did not fix it.
parallelise_shardable_queries: false did fix it.

@setpill
Copy link

setpill commented Jan 27, 2022

@setpill Could you also try setting -querier.split-queries-by-interval=0.

@trevorwhitney In addition to or instead of parallelise_shardable_queries: false ?

@trevorwhitney
Copy link
Collaborator

@setpill in addition.

@setpill
Copy link

setpill commented Feb 1, 2022

@trevorwhitney What would I be hoping to accomplish with that? The system already works, courtesy of parallelise_shardable_queries: false. I'm happy to help debug, but please let me know what information you're hoping to find out this way.

@trevorwhitney
Copy link
Collaborator

Oh, my misunderstanding, I thought you were still having issues. If everything is working I don't think you need to change anything. parallelise_shardable_queries controls whether the query shards are processed in parallel, whereas split-queries-by-interval controls how big the shards are (the smaller the split, the bigger the query shard). Setting the latter to 0 means there's no sharding on the query. On some smaller systems, having bigger shards and therefore less items in the queue that need processing is sometimes better.

@c-seeger
Copy link

Hi i'm having the same issue using loki deployed with helm (v2.10.1). Is there a way to set these config parameters using values.yaml?

@dfoxg
Copy link

dfoxg commented Feb 19, 2022

My values.yml looks like following

config:
  query_scheduler:
    max_outstanding_requests_per_tenant: 2048

  query_range:
    parallelise_shardable_queries: false
    split_queries_by_interval: 0

@LinTechSo
Copy link
Contributor

Hi, I had this problem during my storage migration from filesystem to minio s3 in Loki 2.4.2

got "vector_core::stream::driver: Service call failed. error=ServerError { code: 429 } request_id=5020
" error from my log shipper during log transformation to Loki endpoint.
related to #5714
any idea plz ?
@pgassmann
@trevorwhitney
@dsotirakis
@zLucPlayZ

@trevorwhitney
Copy link
Collaborator

@LinTechSo would you mind creating a new issue, you can reference this one, but as it's rather old I think you're more likely to get some help on a new issue.

@LinTechSo
Copy link
Contributor

LinTechSo commented Apr 4, 2022

@trevorwhitney
Hi, thanks. i created a new issue
#5714

@dpkirchner
Copy link

I figure I can increase max_outstanding_per_tenant to some number, but is there a way for me to see what the outstanding requests are doing? I'm the only person that uses Loki here and I have just two windows open, so I'm curious what is using up all of the request slots. I am wildly guessing it's related to alerts.

@MA-MacDonald
Copy link

@dpkirchner I am in a similar situation as you, did you find a way to see outstanding requests info?

@dpkirchner
Copy link

@MA-MacDonald No, unfortunately. I may need to move away from Loki because of this and a handful of other unsolvable issues so I probably won't be able to figure it out.

@Alcatros
Copy link

@dpkirchner i will probably also start considering other solutions, this is so buggy, some of the definitions are not updated, the last working version of the loki-stack is 2.1 something. I think the devs have no clue whats really going on out there with their software

How can it be possible that you get a "too many outstanding" requests error after an upgrade on a mainline and get even comments from devs that it isn't a bug, pretty hilarious and sad at the same time

For whomever reads this, don't invest your time trying to resolve this, i've invested like 5 days on this. We reinstalled almost every version to figure our whats working at the end of the day. You cant run old versions of Grafana it has priviledge escalation CVE's https://www.cvedetails.com/product/47055/Grafana-Grafana.html?vendor_id=18548 if you read this and your an existing user, put your grafana behind a firewall at least or use the ingress blocklist!

weakcamel added a commit to weakcamel/ansible-role-loki that referenced this issue Jun 10, 2022
query_range and query_scheduler for the sake of
grafana/loki#4613

'custom_config' is for all future sections which may be added
and aren't available at the moment of publishing this role
weakcamel added a commit to weakcamel/ansible-role-loki that referenced this issue Jun 10, 2022
query_range and query_scheduler for the sake of
grafana/loki#4613

'custom_config' is for all future sections which may be added
and aren't available at the moment of publishing this role
@wuestkamp
Copy link

We also had a lot of "403 too many outstanding requests" on loki 2.5.0 and 2.4.2.
Moved back to loki 2.4.1 and problem is gone.

@LinTechSo
Copy link
Contributor

Hi, any updates ?
Loki version 2.5.0

@woloss
Copy link

woloss commented Sep 7, 2022

Tried a lot of configs on v2.6.1. Nothing helped.
Switched to 2.4.1.

@mattixpet
Copy link

mattixpet commented Feb 14, 2023

Why is this closed? I tried this

query_scheduler:
  max_outstanding_requests_per_tenant: 2048

query_range:
  parallelise_shardable_queries: false
  split_queries_by_interval: 0

and it seemed to work for a time, but now not anymore.

EDIT: upping max_outstanding_requests_per_tenant to 4096 seems to have fixed this for now.

@gabrieldear2008
Copy link

After updating loki to 2.4.2 I ran into this issue aswell? Why is this happening after an update just now?

EDIT: After changing parallelise_shardable_queries to false, everything works again. Since parallelise should bring advantages with it, how would you fix it, or increase the limit, so I dont run into 429 too many outstanding requests

Al final despues de dar muchas vueltas y probar varias configuraciones lo que me soluciono el error de "too many outstanding request", fue el cambio de parallelise_shardable_queries de true a false.

Excelente!!!!

@shinebayar-g
Copy link

shinebayar-g commented Dec 14, 2023

My god I wish these configurations were explained better. It's unclear which configuration works with which configuration hand in hand. It seems frontend.max_outstanding_per_tenant default value is increased to 2048. But still no luck. I am coming from helm scalable installation. Out of the box I can't even query for more than 2 days. It's running 10 pods. (simple setup btw).

It seems Loki has over 784 configuration options and 12 microservices (and counting ++).

Isn't it bit ironic that Loki tried to simplify operational burden of logging infrastructure, yet created this monstrosity.
In case if you're running to this error, this seems to be helping.

  • Increase limits_config.split_queries_by_interval from Helm chart's default 15m to original default 30m. Or go higher like 1h or 2h if you have low traffic log volume. Look at Grafana's This query will process approximately XXX MiB text. Unless you're querying >100MB for 30m, you don't really need to split those queries. Idk.
  • Increase query_scheduler.max_outstanding_requests_per_tenant from default 100 to 2048. Idk why it stayed at 100 when querier.max_outstanding_per_tenant default is at 2048, no idea. But it definitely helped.
  • Increase querier.max_concurrent from default 10 to 20. So it runs 20 parallel request per reader (If you're running 3x reader it's gonna make 60 requests in parallel?). You may skip this option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests