
use spawn_blocking for parsing #5235

Closed

Conversation

xuorig
Contributor

@xuorig xuorig commented May 24, 2024

Note

Superseded by #5582

I'm investigating an issue where it looks like query parsing / validation becomes extremely slow, leading the router to stop serving requests entirely.

Given that we frequently see parsing take more than a second (possibly something to investigate on its own), it seems wise not to block a worker while we do it. This PR uses spawn_blocking to allow the workers to serve other requests. Note that in similar scenarios this could lead us to exhaust the number of blocking threads for tokio, but at least the runtime would remain unblocked.
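
(For illustration only, not the actual patch: a minimal sketch of the shape of this change, where `Schema`, `ParsedDocument`, and `parse_and_validate` are hypothetical stand-ins for the router's types and parsing call.)

```rust
use std::sync::Arc;
use tokio::task;

// Hypothetical stand-ins for the router's schema and parsed-query types.
struct Schema;
struct ParsedDocument;

// Stand-in for the CPU-heavy parsing and validation work.
fn parse_and_validate(query: &str, _schema: &Schema) -> Result<ParsedDocument, String> {
    if query.is_empty() {
        return Err("empty query".to_string());
    }
    Ok(ParsedDocument)
}

// Run parsing on Tokio's blocking pool so the executor threads stay free
// to serve requests whose queries are already cached.
async fn parse_document(
    query: String,
    schema: Arc<Schema>,
) -> Result<ParsedDocument, String> {
    task::spawn_blocking(move || parse_and_validate(&query, &schema))
        .await
        // A panic in the blocking task resurfaces here as a JoinError.
        .expect("parsing task panicked")
}
```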

I've also swapped the tokio mutex for a std::sync::Mutex, as the async lock did not seem to help and wasn't needed.

A few other questions:

  • Is specific back pressure needed for the spawn_blocking at this level? I'm thinking this is fine for now; back pressure can happen as a concurrency limiter / rate limiter at ingress.
  • Would a wait map similar to planning make sense eventually here?

@router-perf

router-perf bot commented May 24, 2024

CI performance tests

  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • reload - Reload test over a long period of time at a constant rate of users
  • large-request - Stress test with a 1 MB request payload
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • const - Basic stress test that runs with a constant number of users
  • step - Basic stress test that steps up the number of users over time
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring enabled
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • no-graphos - Basic stress test, no GraphOS.
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled

Marc-Andre Giroux added 2 commits May 24, 2024 10:21
@Geal
Contributor

Geal commented May 24, 2024

Maybe don't change the mutex for now; we are looking at another solution in #5204, because it looks like the mutex is creating some queueing behaviour between queries.

spawn_blocking will by itself backpressure the queries that need to be parsed, because at some point all threads in the blocking pool are used, but the other executor threads will still be able to handle queries which are already present in the cache.
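
(A toy demonstration of that behaviour, using plain Tokio with sleeps standing in for slow parsing and a cached query; nothing here is router code.)

```rust
use std::time::{Duration, Instant};

#[tokio::main]
async fn main() {
    // Saturate the blocking pool (default cap: 512 threads) with slow
    // "parsing" jobs; the excess jobs queue up behind them.
    for _ in 0..600 {
        tokio::task::spawn_blocking(|| std::thread::sleep(Duration::from_secs(3)));
    }

    // An async "cache hit" path still completes promptly, because the
    // executor threads themselves are never blocked.
    let start = Instant::now();
    tokio::time::sleep(Duration::from_millis(10)).await;
    println!("cache-hit path completed after {:?}", start.elapsed());

    // Note: on exit, dropping the runtime waits for the queued blocking
    // jobs, so the process takes a few seconds to terminate.
}
```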

@xuorig xuorig force-pushed the spawn-blocking-parser branch from 42d813c to 70f1f1b May 24, 2024 14:32
@xuorig
Contributor Author

xuorig commented May 24, 2024

Maybe don't change the mutex for now; we are looking at another solution in #5204

ACK, changed it back

@xuorig
Contributor Author

xuorig commented May 24, 2024

There's parsing in warm-up as well, which could cause similar issues, but I guess it would rarely / never cause all workers to be blocked? Any opinion on whether we want to tackle this in the same PR?

@Geal
Contributor

Geal commented May 27, 2024

The warm-up process would only block one worker at a time, so it's a bit better, but it should probably be done too, because some deployments do not make a lot of CPU threads available for the router.

@Geal Geal requested a review from a team May 27, 2024 08:33
@xuorig
Contributor Author

xuorig commented May 28, 2024

Now handling warm_up as well, by using spawn_blocking within QueryAnalysisLayer::parse_document, which is used by both code paths.

@xuorig xuorig force-pushed the spawn-blocking-parser branch from 7182428 to b4625cc May 28, 2024 19:25
Contributor

@garypen garypen left a comment


I have some concerns about the impact of the change.

  1. max_blocking_threads is 512 by default. That may be hit very easily by a busy router with a lot of complex documents to parse. Maybe we should set a value that is a lot larger than 512? Maybe we need configuration?
  2. I don't like the expect(), but I guess it's no worse than the current situation. Maybe we could check is_panic and log an error. That might be an improvement on the current situation? I'm not sure about this, to be honest. Is it better to panic if the parsing panicked, or is it better to log an error?
  3. At line 301 in caching_query_planner.rs, won't we end up manufacturing a SpecError if a spawn_blocking fails? If that's legitimate, then maybe we should do the same thing in the query analysis layer?

I suppose, between 2 and 3, the important thing would be consistent treatment: either manufacture a spec error in both places, or do special-case logging if the Err is a JoinError that panicked.
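
(A sketch of what option 2 could look like, assuming a hypothetical `parse_and_validate` and the `tracing` crate for the error log; not the actual implementation.)

```rust
use tokio::task;

// Hypothetical stand-in for the real parsing call.
fn parse_and_validate(query: &str) -> Result<String, String> {
    Ok(query.trim().to_string())
}

async fn parse_document(query: String) -> Result<String, String> {
    match task::spawn_blocking(move || parse_and_validate(&query)).await {
        // The task ran to completion; pass its result through.
        Ok(result) => result,
        // The task panicked: log an error instead of propagating the panic.
        Err(join_err) if join_err.is_panic() => {
            tracing::error!("query parsing panicked");
            Err("query parsing failed".to_string())
        }
        // The only other JoinError case is cancellation.
        Err(_) => Err("query parsing was cancelled".to_string()),
    }
}
```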

@xuorig
Contributor Author

xuorig commented Jun 3, 2024

max_blocking_threads is 512 by default. That may be hit very easily by a busy router with a lot of complex documents to parse. Maybe we should set a value that is a lot larger than 512? Maybe we need configuration?

After 512, queuing starts happening, which arguably is still a lot better than the current state. In normal scenarios this queue should drain very quickly, but in worst-case scenarios like the one we're seeing here, at least already-parsed queries can still get through rather than blocking the runtime.
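
(For reference, the ceiling is already tunable when the runtime is built by hand; the value below is purely illustrative, not a recommendation.)

```rust
fn main() {
    // Tokio caps its blocking pool at 512 threads by default; a
    // hand-built runtime can raise (or lower) that cap.
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .max_blocking_threads(2048) // illustrative value only
        .enable_all()
        .build()
        .expect("failed to build Tokio runtime");

    runtime.block_on(async {
        // The router (or any async workload) would run here.
    });
}
```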

I don't like the expect(), but I guess it's no worse than the current situation. Maybe we could check is_panic and log an error. That might be an improvement on the current situation? I'm not sure about this, to be honest. Is it better to panic if the parsing panicked, or is it better to log an error?

No strong opinion on that one; the PR doesn't really change the failure mode of parsing panicking, as far as I know. Not sure if this is something that should be "caught"?

I guess it would panic on cancellation as well, which we could handle separately with is_panic as you mentioned.

At line 301 in caching_query_planner.rs, won't we end up manufacturing a SpecError if a spawn_blocking fails? If that's legitimate, then maybe we should do the same thing in the query analysis layer?

Might be missing something here, but I don't think spawn_blocking returns an error here? It either panics or returns the result from parsing?

Overall it might make more sense to put this work in something like rayon / a dedicated thread pool rather than spawn_blocking, but this seems like a good in-between improvement.
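
(A sketch of that rayon-style alternative, again with a hypothetical `parse_and_validate`; the parse runs on rayon's CPU pool and the result comes back to the async caller over a oneshot channel.)

```rust
use tokio::sync::oneshot;

// Hypothetical stand-in for the real parsing call.
fn parse_and_validate(query: &str) -> Result<String, String> {
    Ok(query.trim().to_string())
}

// Run parsing on rayon's CPU pool instead of Tokio's blocking pool.
async fn parse_on_rayon(query: String) -> Result<String, String> {
    let (tx, rx) = oneshot::channel();
    rayon::spawn(move || {
        // Ignore the send error: the receiver may already be gone.
        let _ = tx.send(parse_and_validate(&query));
    });
    rx.await.map_err(|_| "parsing task dropped".to_string())?
}
```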

@garypen
Contributor

garypen commented Jun 4, 2024

max_blocking_threads is 512 by default. That may be hit very easily by a busy router with a lot of complex documents to parse. Maybe we should set a value that is a lot larger than 512? Maybe we need configuration?

After 512, queuing starts happening, which arguably is still a lot better than the current state. In normal scenarios this queue should drain very quickly, but in worst-case scenarios like the one we're seeing here, at least already-parsed queries can still get through rather than blocking the runtime.

I agree that it's an improvement, but I was wondering if it could be improved even further.

I think it's ok to go with the 512 default and avoid the extra thinking about configuration, but thought it was worth mentioning.

I don't like the expect(), but I guess it's no worse than the current situation. Maybe we could check is_panic and log an error. That might be an improvement on the current situation? I'm not sure about this, to be honest. Is it better to panic if the parsing panicked, or is it better to log an error?

No strong opinion on that one; the PR doesn't really change the failure mode of parsing panicking, as far as I know. Not sure if this is something that should be "caught"?

I was wondering if we'd added a new failure mode (i.e. spawn_blocking() itself fails), but I guess, after thinking about the implementation, that we haven't. In which case, ignore this and the next comment.

I guess it would panic on cancellation as well, which we could handle separately with is_panic as you mentioned.

At line 301 in caching_query_planner.rs, won't we end up manufacturing a SpecError if a spawn_blocking fails? If that's legitimate, then maybe we should do the same thing in the query analysis layer?

Might be missing something here, but I don't think spawn_blocking returns an error here? It either panics or returns the result from parsing?

Overall it might make more sense to put this work in something like rayon / a dedicated thread pool rather than spawn_blocking, but this seems like a good in-between improvement.

Yup.

@garypen garypen self-requested a review June 4, 2024 09:58
@xuorig xuorig requested review from a team as code owners June 4, 2024 12:42
@xuorig xuorig changed the title from "use std mutex for query analysis cache, use spawn_blocking for parsing" to "use spawn_blocking for parsing" Jun 4, 2024
@xuorig
Contributor Author

xuorig commented Jun 4, 2024

Looks like the router is not starting in integration tests:

{"timestamp":"2024-05-30T20:17:25.620548594Z","level":"ERROR","message":"Not connected to GraphOS. In order to enable these features for a self-hosted instance of Apollo Router, the Router must be connected to a graph in GraphOS (using APOLLO_KEY and APOLLO_GRAPH_REF) that provides a license for the following features:\n\nConfiguration yaml:\n* Advanced telemetry\n  .telemetry..instruments\n\n* Advanced telemetry\n  .telemetry..graphql\n\nSee https://go.apollo.dev/o/elp for more information.","target":"apollo_router::state_machine","resource":{}}
{"timestamp":"2024-05-30T20:17:25.620756559Z","level":"INFO","message":"stopped","target":"apollo_router::state_machine","resource":{}}
{"timestamp":"2024-05-30T20:17:25.621150735Z","level":"ERROR","message":"license violation","target":"apollo_router::executable","resource":{}}

Related to GraphOS / license maybe? Same thing locally. Any ideas?

@xuorig
Contributor Author

xuorig commented Jun 4, 2024

Ah, right, I don't have a valid TEST_APOLLO_GRAPH_REF and probably don't have the right CircleCI env either.

@garypen
Contributor

garypen commented Jun 5, 2024

Ah, right, I don't have a valid TEST_APOLLO_GRAPH_REF and probably don't have the right CircleCI env either.

Recent changes to our CI testing strategy have broken tests for forked PRs. I hope this will be fixed soon.

@Geal
Contributor

Geal commented Jun 6, 2024

I think it's ok to go with the 512 default and avoid the extra thinking about configuration, but thought it was worth mentioning.

I'm ok with that too, but we should note that somewhere and think of a follow-up issue. The alternative right now, without this PR, is that validation hogs all of the executor threads, which amounts to the same result.

At line 301 in caching_query_planner.rs, won't we end up manufacturing a SpecError if a spawn_blocking fails? If that's legitimate, then maybe we should do the same thing in the query analysis layer?

In caching_query_planner.rs it is just the warm-up phase. If a query does not pass validation anymore (maybe the schema changed), then we can ignore it, because we won't make a plan for it. And the result of validation would still be recorded by the query analysis cache.

@Geal
Contributor

Geal commented Jun 10, 2024

Could you merge dev again? Apparently I can't do it from here, and dev now has this commit, which will make the tests pass.

@Geal
Contributor

Geal commented Jun 24, 2024

@xuorig can you merge dev?

@xuorig
Contributor Author

xuorig commented Jun 25, 2024

Done, just merged @Geal

@Geal Geal mentioned this pull request Jul 2, 2024
@BrynCooke
Contributor

Let's merge with dev again. I think there was another redis test that needed to be disabled.

@Geal
Contributor

Geal commented Jul 5, 2024

@BrynCooke continue on that one; you can push on it: #5582

@abernix
Member

abernix commented Jul 5, 2024

I'm going to close this in favor of #5582, where we can hopefully help move this along. 😄 🌴 I'll edit the description, too, to point that way. Thanks for opening this!

@abernix abernix closed this Jul 5, 2024
@abernix
Member

abernix commented Jul 5, 2024

Just as a note, there were some outstanding questions in the original PR body above — let's try to answer those on the other PR:

A few other questions:

  • Is specific back pressure needed for the spawn_blocking at this level? I'm thinking this is fine for now; back pressure can happen as a concurrency limiter / rate limiter at ingress.
  • Would a wait map similar to planning make sense eventually here?

@abernix
Member

abernix commented Jul 9, 2024

#5582 was landed.

@bnjjj bnjjj mentioned this pull request Jul 30, 2024
aaronArinder referenced this pull request in apollographql/rover Aug 1, 2024
[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [apollographql/router](https://togithub.com/apollographql/router) | minor | `v1.51.0` -> `v1.52.0` |

---

### Release Notes

<details>
<summary>apollographql/router (apollographql/router)</summary>

### [`v1.52.0`](https://togithub.com/apollographql/router/releases/tag/v1.52.0)

[Compare Source](https://togithub.com/apollographql/router/compare/v1.51.0-rc.0...v1.52.0-rc.0)

#### 🚀 Features

##### Provide helm support for when the router's health_check default path is not being used ([Issue #5652](https://togithub.com/apollographql/router/issues/5652))

When the Helm chart defines the liveness and readiness check probes, and
the router has been configured to use a non-default health_check path,
the chart now uses that path rather than the default (`/health`).

By [Jon Christiansen](https://togithub.com/theJC) in
[https://github.com/apollographql/router/pull/5653](https://togithub.com/apollographql/router/pull/5653)

##### Support new span and metrics formats for entity caching ([PR #5625](https://togithub.com/apollographql/router/pull/5625))

Metrics of the router's entity cache have been converted to the latest
format with support for custom telemetry.

The following example configuration shows the `cache` instrument,
the `cache` selector in the subgraph service, and the `cache` attribute
of a subgraph span:

```yaml
telemetry:
  instrumentation:
    instruments:
      default_requirement_level: none
      cache:
        apollo.router.operations.entity.cache:
          attributes:
            entity.type: true
            subgraph.name:
              subgraph_name: true
            supergraph.operation.name:
              supergraph_operation_name: string
      subgraph:
        only_cache_hit_on_subgraph_products:
          type: counter
          value:
            cache: hit
          unit: hit
          description: counter of subgraph request cache hit on subgraph products
          condition:
            all:
            - eq:
              - subgraph_name: true
              - products
            - gt:
              - cache: hit
              - 0
          attributes:
            subgraph.name: true
            supergraph.operation.name:
              supergraph_operation_name: string

```

To learn more, go to [Entity caching
docs](https://www.apollographql.com/docs/router/configuration/entity-caching).

By [@Geal](https://togithub.com/Geal) and
[@bnjjj](https://togithub.com/bnjjj) in
[https://github.com/apollographql/router/pull/5625](https://togithub.com/apollographql/router/pull/5625)

##### Helm: Support renaming key for retrieving APOLLO_KEY secret ([Issue #5661](https://togithub.com/apollographql/router/issues/5661))

A user of the router Helm chart can now rename the key used to retrieve
the value of the secret key referenced by `APOLLO_KEY`.

Previously, the router Helm chart hardcoded the key name to
`managedFederationApiKey`. This didn't support users whose
infrastructure required custom key names when getting secrets, such as
Kubernetes users who need to use specific key names to access a
`secretStore` or `externalSecret`. This change provides a user the
ability to control the name of the key to use in retrieving that value.

By [Jon Christiansen](https://togithub.com/theJC) in
[https://github.com/apollographql/router/pull/5662](https://togithub.com/apollographql/router/pull/5662)

#### 🐛 Fixes

##### Prevent Datadog timeout errors in logs ([Issue #2058](https://togithub.com/apollographql/router/issue/2058))

The router's Datadog exporter has been updated to reduce the frequency
of logged errors related to connection pools.

Previously, the connection pools used by the Datadog exporter frequently
timed out, and each timeout logged an error like the following:

```
2024-07-19T15:28:22.970360Z ERROR OpenTelemetry trace error occurred: error sending request for url (http://127.0.0.1:8126/v0.5/traces): connection error: Connection reset by peer (os error 54)
```

Now, the pool timeout for the Datadog exporter has been changed so that
timeout errors happen much less frequently.

By [@BrynCooke](https://togithub.com/BrynCooke) in
[https://github.com/apollographql/router/pull/5692](https://togithub.com/apollographql/router/pull/5692)

##### Allow service version overrides ([PR #5689](https://togithub.com/apollographql/router/pull/5689))

The router now supports configuration of `service.version` via YAML file
configuration. This enables users to produce custom versioned builds of
the router.

The following example overrides the version to be `1.0`:

```yaml
telemetry:
  exporters:
    tracing:
      common:
        resource:
          service.version: 1.0
```

By [@BrynCooke](https://togithub.com/BrynCooke) in
[https://github.com/apollographql/router/pull/5689](https://togithub.com/apollographql/router/pull/5689)

##### Populate Datadog `span.kind` ([PR #5609](https://togithub.com/apollographql/router/pull/5609))

Because Datadog traces use `span.kind` to differentiate between
different types of spans, the router now ensures that `span.kind` is
correctly populated using the OpenTelemetry span kind, which has a
one-to-one mapping to those set out in
[dd-trace](https://togithub.com/DataDog/dd-trace-go/blob/main/ddtrace/ext/span_kind.go).

By [@BrynCooke](https://togithub.com/BrynCooke) in
[https://github.com/apollographql/router/pull/5609](https://togithub.com/apollographql/router/pull/5609)

##### Remove unnecessary internal metric events from traces and spans ([PR #5649](https://togithub.com/apollographql/router/pull/5649))

The router no longer includes some internal metric events in traces and
spans that shouldn't have been included originally.

By [@bnjjj](https://togithub.com/bnjjj) in
[https://github.com/apollographql/router/pull/5649](https://togithub.com/apollographql/router/pull/5649)

##### Support Datadog span metrics ([PR #5609](https://togithub.com/apollographql/router/pull/5609))

When using the APM view in Datadog, the router now displays span metrics
for top-level spans or spans with the `_dd.measured` flag set.

The router sets the `_dd.measured` flag by default for the following
spans:

-   `request`
-   `router`
-   `supergraph`
-   `subgraph`
-   `subgraph_request`
-   `http_request`
-   `query_planning`
-   `execution`
-   `query_parsing`

To enable or disable span metrics for any span, configure `span_metrics`
for the Datadog exporter:

```yaml
telemetry:
  exporters:
    tracing:
      datadog:
        enabled: true
        span_metrics:
          # Disable span metrics for supergraph
          supergraph: false
          # Enable span metrics for my_custom_span
          my_custom_span: true
```

By [@BrynCooke](https://togithub.com/BrynCooke) in
[https://github.com/apollographql/router/pull/5609](https://togithub.com/apollographql/router/pull/5609)
and
[https://github.com/apollographql/router/pull/5703](https://togithub.com/apollographql/router/pull/5703)

##### Use spawn_blocking for query parsing and validation ([PR #5235](https://togithub.com/apollographql/router/pull/5235))

To prevent its executor threads from blocking on large queries, the
router now runs query parsing and validation in a Tokio blocking task.

By [@xuorig](https://togithub.com/xuorig) in
[https://github.com/apollographql/router/pull/5235](https://togithub.com/apollographql/router/pull/5235)

#### 🛠 Maintenance

##### chore: Update rhai to latest release (1.19.0) ([PR #5655](https://togithub.com/apollographql/router/pull/5655))

In Rhai 1.18.0, there were changes to how exceptions within functions
were created. For details see:
https://github.com/rhaiscript/rhai/blob/7e0ac9d3f4da9c892ed35a211f67553a0b451218/CHANGELOG.md?plain=1#L12

We've modified how we handle errors raised by Rhai to comply with this
change, which means error message output is affected. The change means
that errors in functions will no longer document which function the
error occurred in, for example:

```diff
-         "rhai execution error: 'Runtime error: I have raised an error (line 223, position 5)\nin call to function 'process_subgraph_response_string''"
+         "rhai execution error: 'Runtime error: I have raised an error (line 223, position 5)'"
```

Making this change allows us to keep up with the latest version (1.19.0)
of Rhai.

By [@garypen](https://togithub.com/garypen) in
[https://github.com/apollographql/router/pull/5655](https://togithub.com/apollographql/router/pull/5655)

##### Add version in the entity cache hash ([PR #5701](https://togithub.com/apollographql/router/pull/5701))

The hashing algorithm of the router's entity cache has been updated to
include the entity cache version.

> [!IMPORTANT]
> If you have previously enabled [entity
> caching](https://www.apollographql.com/docs/router/configuration/entity-caching),
> you should expect additional cache regeneration costs when updating to
> this version of the router while the new hashing algorithm comes into
> service.

By [@bnjjj](https://togithub.com/bnjjj) in
[https://github.com/apollographql/router/pull/5701](https://togithub.com/apollographql/router/pull/5701)

##### Improve testing by avoiding cache effects and redacting tracing details ([PR #5638](https://togithub.com/apollographql/router/pull/5638))

We've had some problems with flaky tests and this PR addresses some of
them.

The router executes in parallel and concurrently. Many of our tests use
snapshots to try and make assertions that functionality is continuing to
work correctly. Unfortunately, concurrent/parallel execution and static
snapshots don't co-operate very well. Results may appear in
pseudo-random order (compared to snapshot expectations) and so tests
become flaky and fail without obvious cause.

The problem becomes particularly acute with features which are
specifically designed for highly concurrent operation, such as batching.

This set of changes addresses some of the router testing problems by:

1. Making items in a batch test different enough that caching effects
are avoided.
2. Redacting various details so that sequencing is not as much of an
issue in the otel traces tests.

By [@garypen](https://togithub.com/garypen) in
[https://github.com/apollographql/router/pull/5638](https://togithub.com/apollographql/router/pull/5638)

#### 📚 Documentation

##### Update router naming conventions ([PR #5400](https://togithub.com/apollographql/router/pull/5400))

Renames our router product to distinguish between our non-commercial and
commercial offerings. Instead of referring to the **Apollo Router**, we
now refer to the following:

- **Apollo Router Core** is Apollo’s free-and-open (ELv2 licensed)
implementation of a routing runtime for supergraphs.
- **GraphOS Router** is based on the Apollo Router Core and fully
integrated with GraphOS. GraphOS Routers provide access to GraphOS’s
commercial runtime features.

By [@shorgi](https://togithub.com/shorgi) in
[https://github.com/apollographql/router/pull/5400](https://togithub.com/apollographql/router/pull/5400)

#### 🧪 Experimental

##### Enable Rust-based API schema implementation ([PR #5623](https://togithub.com/apollographql/router/pull/5623))

The router has transitioned to solely using a Rust-based API schema
generation implementation.

Previously, the router used a Javascript-based implementation. After
testing for a few months, we've validated the improved performance and
robustness of the new Rust-based implementation, so the router now only
uses it.

By [@goto-bus-stop](https://togithub.com/goto-bus-stop) in
[https://github.com/apollographql/router/pull/5623](https://togithub.com/apollographql/router/pull/5623)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] If you want to rebase/retry this PR, check this box

---

This PR was generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View the
[repository job
log](https://developer.mend.io/github/apollographql/rover).


Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>