Async Hook instrumentation significantly increases response time for Promise-heavy calls #1095
Comments
For context, here is a bit of history behind this issue since it's been problematic for a while now. Historically, context propagation APIs for Node have been using monkey patching on asynchronous code to track the asynchronous context. This approach has been used successfully by APM vendors for many years, even though it didn't support native code and could lose the context in some cases. When The main issue with Since the overhead comes from Node and not from the tracer, and we have to use this feature to support automatic context propagation, this is unfortunately an overhead we can't realistically avoid at the tracer level. It's also worth noting that in most cases, the overhead is low enough that it's not problematic. In order to fix this completely, we'll have to fix |
Progress for the v8 change can be tracked here.
Progress on the Node side can be tracked in nodejs/node#36394
In Node.js v16.2.0, significant performance improvements in |
@bengl I don't think there will be any improvement; even worse, I assume there will be some degradation. Before:
After:
Here you set the destroy hook
The |
Fwiw today we tried bumping to node 16.2.0 to get this perf fix, and are seeing errors in our app that does this ~admittedly somewhat esoteric pattern with

import { AsyncLocalStorage } from "async_hooks";
...
export const currentFlushSecret = new AsyncLocalStorage<{ flushSecret: number }>();
....
currentFlushSecret.run({ flushSecret: this.flushSecret }, async () => {
  // ...call various methods that do...
  const { flushSecret } = currentFlushSecret.getStore() || {};
  // ...and ensure flushSecret is the expected value
  if (flushSecret !== this.flushSecret) {
    throw new Error("invalid usage detected");
  }
});

Where the idea is that only code that is specifically ran within the Previously this "invalid usage detected" wasn't hit in our application's code, but it is being hit now in our test suite, either insinuating that a) our application code had a bug that node 16.1.0 wasn't catching because this

The ^ code is part of an open source project: https://github.com/stephenh/joist-ts/blob/main/packages/orm/src/EntityManager.ts#L202 Although our currently failing tests are in an internal project. If we run the joist-ts public tests on node 16.2.0, they work just fine (and we do have tests that specifically exercise the flush secret behavior), so we don't have a repro yet.

We're going to work more tomorrow on getting a repro done, in the public joist project, and verifying whether it is truly a regression in 16.2.0 or else just a bug/misuse of

So, I dunno, this is not a super-useful update yet, but mostly wanted to post as an FYI in case anyone on the datadog/nodejs side of things would look at the
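For readers unfamiliar with the pattern described above, here is a minimal sketch of an AsyncLocalStorage guard. It is not the joist-ts implementation: the names `currentSecret`, `withSecret`, `assertInsideFlush`, and `demo` are hypothetical and exist only for illustration. It assumes a Node version where the store is expected to survive an `await` inside `run()`.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

// Hypothetical store for illustration; holds an object like { secret: number }.
const currentSecret = new AsyncLocalStorage();

// Throws unless the caller is running inside withSecret() with the same secret.
function assertInsideFlush(expected) {
  const store = currentSecret.getStore();
  if (!store || store.secret !== expected) {
    throw new Error('invalid usage detected');
  }
  return true;
}

// Runs fn with the secret visible to everything it calls.
function withSecret(secret, fn) {
  return currentSecret.run({ secret }, fn);
}

async function demo() {
  return withSecret(42, async () => {
    assertInsideFlush(42);        // synchronous part of the callback
    await Promise.resolve();      // cross an async boundary
    return assertInsideFlush(42); // the store should still be attached here
  });
}
```

If the second check inside `demo()` throws while the first one passes, the context was lost somewhere across the `await`, which is the kind of symptom reported in this comment.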
cc @Qard who would have the most insight to determine at a glance if there could be a regression causing this issue.
Okay @rochdev and @Qard , we believe we have a reproduction in this PR: https://github.com/stephenh/joist-ts/pull/122/commits Notice that it passes on 16.1.0: But fails on 16.2.0: A few notes:
Specifically we log the intent to do an
On the very first line within that
And this works, it returns
And this one is broken, we get back So, at some point between line 715 and line 731, we have lost track of the

Happy to chat more, but does this give enough direction to reproduce on your side? Apologies that this project doesn't have the best "Getting Started"/etc documentation, so let me know if there are any gotchas in getting it running.

(Also, technically this is likely an issue in node itself, as this joist open source project doesn't use ddtrace; we just happened to notice the issue while getting our internal project that uses both ddtrace and joist onto node 16.2.0, to leverage the ddtrace perf wins in this ticket. Would you like us to open a new issue against nodejs directly? Happy to let you guys do that. Disclaimer: assuming we're not doing something dumb on our side.)

Thanks!
@stephenh I'm able to reproduce with the project you've shared, but without being able to debug and with a codebase I don't know it's difficult to try to find the issue. Do you think you'd be able to extract only a small part of the code that would still reproduce the issue? Or maybe find the exact line where the context is lost? I'm not super familiar with Jest either which definitely doesn't help 😅 |
@rochdev that's a fair ask! I think we were so pleased with ourselves for having isolated the repro that we didn't push further to make it as simple as possible. :-) It is terribly ironic you mention jest, because that might be the issue... I've created a minimal ~10 lines of code repro here: https://github.com/stephenh/async-local-storage-repro I'll defer to the readme for steps, but basically if you run code that does: 1) access The matrix is:
Possible cause of the new issue since 16.2.0: nodejs/node#38781
It looks like a fix should land in Node 16.3.0.
Closing as this seems to be resolved. If you disagree or there's anything I missed, let me know and I can reopen.
Sadly, I think this is still a problem with
When JIT is enabled, it significantly reduces the number of spans in our traces. As you can see from the results, enabling I was talking with @rochdev and he had me add a no-op

import { createHook } from "node:async_hooks";
createHook({ init() {}, before() {}, after() {}, destroy() {} }).enable();

Just having an It seems like maybe the
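To get a rough feel for the effect of a no-op hook like the one above, a small timing sketch along these lines may help. The function names `timePromiseChain` and `compare` and the default iteration count are assumptions made for illustration. Results vary a lot with Node version and machine, so treat the numbers as indicative only.

```javascript
const { createHook } = require('node:async_hooks');
const { performance } = require('node:perf_hooks');

// Awaits `count` already-resolved promises in sequence and returns elapsed ms.
async function timePromiseChain(count) {
  const start = performance.now();
  for (let i = 0; i < count; i++) {
    await Promise.resolve(i);
  }
  return performance.now() - start;
}

// Times the same promise-heavy loop without and then with a no-op hook enabled.
async function compare(count = 100000) {
  const baseline = await timePromiseChain(count);
  const hook = createHook({
    init() {}, before() {}, after() {}, destroy() {},
  }).enable();
  try {
    const withHook = await timePromiseChain(count);
    return { baseline, withHook };
  } finally {
    hook.disable(); // avoid leaving the hook active for the rest of the process
  }
}
```

Calling `compare().then(console.log)` prints both timings in milliseconds. A single run is noisy, so repeating it several times gives a fairer comparison.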
Yep, for exactly this reason there's been a bunch of discussion in Node.js core about a more performant replacement for AsyncLocalStorage which does not use async_hooks internally. There's unfortunately no alternative currently, so your options are to accept some overhead or have no observability. It's a lot less overhead than it used to be, but can still be noticeable depending on how heavily an application uses promises. I'm hopeful that we'll have something much better in future Node.js versions though. :)
Discussion about the above can be found in nodejs/node#46265
Describe the bug
We noticed a significant increase in response times for GraphQL queries that produce large payloads after we enabled dd-trace for our project. Here's what we've seen:
scope
Settingdd-trace
dd-trace
async_hooks
async_local_storage
async_local_storage
async_local_storage
noop
async_resource
branchasync_resource
const { createHook } = require('async_hooks'); createHook({ init() {} }).enable();
As you can see, it looks like async_hooks are the primary reason we see the slowdown, and not necessarily anything specific to dd-trace. The logic that triggers this especially pronounced behavior uses the following:
async/await
vs.Promise
The query itself produces many promises (I'm unsure of the exact number, but probably in the thousands), which we believe is exacerbating the problem.
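One way to put a number on "many promises" is to count PROMISE resources with an `init` hook while a piece of code runs. The sketch below is a hypothetical helper (`countPromisesCreated` is not part of dd-trace or Node). It only counts promises created synchronously inside `fn`, so an async workload would need the hook left enabled for the duration of the request.

```javascript
const { createHook } = require('node:async_hooks');

// Counts PROMISE resources created synchronously while fn runs.
function countPromisesCreated(fn) {
  let count = 0;
  const hook = createHook({
    init(asyncId, type) {
      if (type === 'PROMISE') count++;
    },
  }).enable();
  try {
    fn();
  } finally {
    hook.disable();
  }
  return count;
}
```

For example, `countPromisesCreated(() => { for (let i = 0; i < 1000; i++) new Promise(() => {}); })` reports how many promise resources the loop produced. Note that the counting hook itself adds the kind of overhead discussed in this issue, so it is best used as a diagnostic rather than in production.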
I was working synchronously with @rochdev and @stephenh on this issue via the Datadog Slack. He mentioned that he'll add some additional detail to this issue.
Environment
node:14.8.0 image running on macOS host