Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add $APOLLO_ROUTER_COMPUTE_THREADS environment variable #6746

Merged
merged 7 commits into from
Feb 11, 2025

Conversation

SimonSapin
Copy link
Contributor

Router creates a thread pool dedicated to "compute jobs" to avoid blocking Tokio threads. By default this pool has as many threads as available CPUs. This can now be changed by setting the $APOLLO_ROUTER_COMPUTE_THREADS environment variable to an integer value. This feature is intentionally undocumented and should be considered "experimental".

Also rename the pre-existing (also undocumented) $APOLLO_ROUTER_NUM_CORES to $APOLLO_ROUTER_IO_THREADS to better reflect what it does.

Drive-by unrelated change: make cargo run default to the main router executable.


Manual testing instructions: in a terminal, run:

APOLLO_ROUTER_COMPUTE_THREADS=4 cargo run -- \
  -c router.yaml -s examples/graphql/supergraph.graphql --dev \
  --log error,apollo_router=info,apollo_router::compute_job=debug

After seeing a log line with GraphQL endpoint exposed at http://127.0.0.1:4000/, in another terminal, run:

curl -s -H 'accept: application/json' -H 'content-type: application/json' 'http://127.0.0.1:4000/' -d '{"query":"{__typename}"}'

(This step is needed because the thread pool is created lazily on first use.) The first terminal should now have a log line with Compute thread pool size: $APOLLO_ROUTER_COMPUTE_THREADS = 4.


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible1
    • Undocumented $APOLLO_ROUTER_NUM_CORES no longer works
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Tests added and passing3
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Router creates a thread pool dedicated to "compute jobs", separate from Tokio threads.
By default this pool has as many threads as available CPUs.
This can now be changed by setting this new variable to an integer value.
This feature is intentionally undocumented and should be considered "experimental".

Also rename the pre-existing (also undocumented) $APOLLO_ROUTER_NUM_CORES to $APOLLO_ROUTER_IO_THREADS to better reflect what it does.

Drive-by unrelated change: make `cargo run` default to the main router executable.
@SimonSapin SimonSapin requested review from a team as code owners February 7, 2025 11:01
Copy link
Contributor

github-actions bot commented Feb 7, 2025

@SimonSapin, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

@router-perf
Copy link

router-perf bot commented Feb 7, 2025

CI performance tests

  • connectors-const - Connectors stress test that runs with a constant number of users
  • const - Basic stress test that runs with a constant number of users
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • enhanced-signature - Enhanced signature enabled
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • extended-reference-mode - Extended reference mode enabled
  • large-request - Stress test with a 1 MB request payload
  • no-tracing - Basic stress test, no tracing
  • reload - Reload test over a long period of time at a constant rate of users
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-local-metrics - Field stats that are generated from the router rather than FTV1
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload

@svc-apollo-docs
Copy link
Collaborator

svc-apollo-docs commented Feb 7, 2025

✅ Docs preview has no changes

The preview was not built because there were no changes.

Build ID: 2cde78877e90d97611812265

@SimonSapin SimonSapin added the backport-1.x Backport this PR to 1.x label Feb 7, 2025
let size = std::thread::available_parallelism()
.expect("available_parallelism() failed")
.get();
tracing::debug!("Compute thread pool size: available_parallelism() = {size}");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log can also help makes sure cgroup limits affect available_parallelism() like we expect when benchmarking

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tracing::info!(size, type="compute", source="calculated", "thread pool");

Copy link
Contributor

@garypen garypen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting up strings in logs makes them harder to consume or work with from JSON logs.

.ok()
.and_then(|value| value.parse::<usize>().ok())
{
tracing::debug!("Compute thread pool size: $APOLLO_ROUTER_COMPUTE_THREADS = {size}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better format for JSON parseability:

tracing::info!(size, type="compute", source="env", "thread pool");

I'd make it info since it will not be flooding the logs and it seems like useful information

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

let size = std::thread::available_parallelism()
.expect("available_parallelism() failed")
.get();
tracing::debug!("Compute thread pool size: available_parallelism() = {size}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tracing::info!(size, type="compute", source="calculated", "thread pool");


// This environment variable is intentionally undocumented.
// See also APOLLO_ROUTER_COMPUTE_THREADS in apollo-router/src/compute_job.rs
if let Some(nb) = std::env::var("APOLLO_ROUTER_IO_THREADS")
.ok()
.and_then(|value| value.parse::<usize>().ok())
{
builder.worker_threads(nb);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth logging these details as well:

tracing::info!(size=nb, type="io", source="env", "thread pool");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can tracing be used before we create a Tokio runtime? (If not, I can move things around to log after we’re in async context)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I added this log it was captured in the snapshot for test pq_layer_freeform_graphql_with_safelist_log_unknown_true which made the outcome of the test dependent of number of available CPUs. Rather than spend time on figuring out how to make that test filter out that specific log I’ve removed the log for now.

@@ -677,6 +681,11 @@ impl Executable {
}
};

let threads = tokio::runtime::Handle::current().metrics().num_workers();
tracing::info!(threads, type="io", "thread pool");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done here after we’ve initialized enough of the logging infrastructure. Unlike for compute threads there is no source here. The source could be:

  • $APOLLO_ROUTER_IO_THREADS, if specified and apollo_router::executable::main is used
  • Anything, if a custom binary defines its own main creating a Tokio runtime
  • $TOKIO_WORKER_THREADS if specified and nothing set Tokio’s worker_threads()
  • available_parallelism() is the eventual default

@garypen garypen self-requested a review February 11, 2025 11:48
.expect("available_parallelism() failed")
.get()
// This environment variable is intentionally undocumented.
// See also APOLLO_ROUTER_IO_THREADS in apollo-router/src/executable.rs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// See also APOLLO_ROUTER_IO_THREADS in apollo-router/src/executable.rs

}

type Job = Box<dyn FnOnce() + Send + 'static>;

fn queue() -> &'static AgeingPriorityQueue<Job> {
pub(crate) fn queue() -> &'static AgeingPriorityQueue<Job> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deos this need to be pub(crate) now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points!

@SimonSapin SimonSapin enabled auto-merge (squash) February 11, 2025 11:52
Comment on lines 26 to 33
tracing::info!(threads, type="compute", source="env", "thread pool");
threads
} else {
let threads = std::thread::available_parallelism()
.expect("available_parallelism() failed")
.get();
tracing::info!(threads, type="compute", source="available_parallelism", "thread pool");
threads
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we emit a metric for the number of threads available and the number of threads being set by the env var?

@SimonSapin SimonSapin merged commit 768f6d4 into dev Feb 11, 2025
15 checks passed
@SimonSapin SimonSapin deleted the simon/ROUTER-951 branch February 11, 2025 14:43
SimonSapin added a commit that referenced this pull request Feb 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-1.x Backport this PR to 1.x
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants