-
Notifications
You must be signed in to change notification settings - Fork 30.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
node-api: fix shutdown crashes #38492
Conversation
Refs: nodejs/node-addon-api#906 Ensure that finalization is not defered during shutdown. The env for the addon is deleted immediately after iterating the list of finalizers to be run. Defering causes crashes as the finalization uses the already deleted env. Signed-off-by: Michael Dawson <mdawson@devrus.com>
Converted to draft as I'm still validating that it fixes the crashes we saw with version 14.x On master it looked like it resolved the issue, but I only ran for 3-4 hours (544 runs). Last time I got about 16 failures in 1311 runs on 14.x and 2 failures in 386 runs on main. Want to run on 14.x and to run for longer to confirm. If it still looks good will move out of WIP. For the run on main, I saw no crashes but valgrind still report some different issues, those will be addressed in follow up PRs. We already discussed one additional fix in the node-api team meeting today that might address the remaining issue. It will be good to get this one in first, as I'm hoping it will resolve the intermittent failures we are seeing in the node-addon-api tests so that we can do some releases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the PR description is accurate, the proper fix would be calling Ref() when scheduling the SetImmediate call, and calling Unref() after it's contents have executed.
Either way, this is going to need tests as well
@gabrielschulhof you know the shutdown better than me. During env shutdown does it use the ref/unref or call the destructor directly ? I'd tried moving the iteration of the lists into the section with the ref counting but that seemed to cause it not to be run. |
Already failing on 14.x after not too long, going to try to pull in #38469 to see if that makes a difference. |
Seems like still failing on 14.x 40 crashes in 2125 runs. On quick inspection the stack traces still look like nodejs/node-addon-api#906 (comment). |
I guess my point is, more specifically, if the lifetime management of |
The original code looks like:
Once ~napi_env__() completes there should be nothing that was deferred and needs the env. The original suggestion we'd discussed in the Node-api meeting was moving the FinalizeAll into the Unref before the delete in addition to the changes I made. I had tried that but it broke the test that validated that the finalizer actually ran. My question to Gabriel was if he knew why that might be. I could be the right answer is that we add some Ref/Unrefs, but the code above would also have to be able to handle refs going to 0, running the finalizers, then only deleting the env if refs was still 0, and then possibly running the FinalizeAll again when refs reached 0 again. Gabriel suggested avoiding the SetImmediate during shutdown since we are not actually in GC so it's not necessarily needed. That would still leave issues if though if there were already Finalizers that had been deferred. I think the suggestion of removing the SetImmediate during env teardown was to make sure the finalizers being run all completed before the life of napi_env was over. |
@mhdawson Yeah, I think that's fair. Could you leave a comment in the source explaining why in this specific situation it's okay to skip the (I'll just leave the "request changes" here, because, as I mentioned, this should come with tests.) |
@addaleax will do. I'm still seeing failures on 14.x so want to figure out what the else we might need from main before pulling this out of WIP. |
Signed-off-by: Michael Dawson <mdawson@devrus.com>
Pushed another commit to cover a second case where Finalization was being deferred. |
src/node_api.cc
Outdated
}); | ||
}); | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you store the lambda above in a variable and either pass it to SetImmediate
or call it directly, instead of duplicating the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, just pushed a an update which looks like it resolved the crashes on master. Will look at refactoring once I confirm that it fixes the crashses we were seeing on master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is not quite the same anymore, but there is still some that should be able to be shared.
Ok, more details in nodejs/node-addon-api#906 but looks like the current version of this patch resolves the crash in main and 14.x (for 14.x I had also pulled back other changes to finalization as well). Now I'll look at refactoring and how we might test. |
@legendecas, @gabrielschulhof I've spent a number of hours trying to create a test and I'm not making any progress. I can trigger the paths where the changes have been made but not the combination where they are triggered and the env has already been deleted. I'm wondering if it takes the gc running at just the wrong time. It would be great of you two could take a look and confirm that the changes make sense to you and if you can give me any suggestions on how a test might trigger the problematic case. |
@mhdawson when I was working on #38000 I encountered the exact same issue. We are unable to produce a stable output with mere node-api c interfaces. What I am thinking about is that maybe we can create a different set of tests, in which we can facilitate both node-api c interface and node c++ interface, so that we can interleave both environments (node-api env and node Environment) to ensure the code running into the path we are expecting, i.e. directly calling into v8 for requesting gc without calling into js or picking any timing on teardown to run arbitrary native code with node Environment. As the new test set is testing on node's own specific implementation, for sure those tests should not be considered as part of the node-api behavior tests, which we can extract from node-core repo to test other node-api implementations as well. If you are thinking this is a good approach, I'd like to start working on it since we can still not test #38000 in a stable sense. |
@legendecas I had been working on a new test in test/js-native-api/test_finalization, copying code from the other tests to try to trigger the case. I think from an addon we can run both node-api code as well as node internals. In the extreme case the test.js file might just call one native method on the addon. Is that what you had in mind or something else ? |
Since I built up the PR to something that resolve the problem, I'm now validating what parts or combinations of parts are needed to resolve the problem. From my attempts to recreate, I'm not sure that the check for is terminating is needed, although it may still make sense. Similarly I'd added a Ref/Unref for both setImmediates not sure if they are both needed to resolve the issue, although may also still make sense. |
Will take a bit as each check after a change requires in the order of 6 hours running on 100 cpus to validate. |
Removed the check for the env teardown, no failures in about 450 runs. Next removed the Ref/UnRef in CallFinalizer in node-api.cc, 1 crash within the first 300 runs without it. So it is needed. Next adding back in the Ref/UnRef in CallFinalizer and commenting it out in the Buffer Finalizer section. 600 runs so far without a failure. Will let run overnight. If no failures that would mean that the minimum change to address the crashes we were seeing in master is the Ref/Unref in CallFinalizer. We'll see. Will discuss in the node-api meeting tomorrow as the other changes may still make sense even if they don' address the specific failure we were encountering. |
Yes. I believe there is no such constraints that we have to use single addon in one test right? In that case we can build two target, one in node-api and one with node internals. |
@legendecas the tests do already have addon tests that are with/without Node-api, so yes, but were you thinking they would be used together in a single test or test independently? I think we could also figure out testing that would use both node-api and node internal in a single addon. It's not something you would do in a real add-on but for a test we could make it work. |
At least for main it looks like only the change in CallFinalizer is required to address the crashes, 2529 runs overnight and no crashes seen. |
Yes, we can use them together in a single test.
IIRC, one single addon dynamic library can only have single add-on entry been valid and recognizable by Node.js. See https://github.com/nodejs/node/blob/master/src/node_binding.cc#L475-478 . So it doesn't seem to be quite feasible to put them two in a single binary? |
@addaleax From the comments above we don't have any ideas of how to add a test at this point, and it has a few approvals. Just wondering if your "request changes" is still mean to block this from landing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhdawson I think it’s okay, but it feels like it should be possible to add a cctest that just calls the method in the specfic order that leads to the crash if nothing else works
I'd really like to have a test:
Thanks for being willing to let it move forward. |
CI test since the last one had flakes and is a bit old: https://ci.nodejs.org/job/node-test-pull-request/38218/ |
Refs: nodejs/node-addon-api#906 Ensure that finalization is not defered during shutdown. The env for the addon is deleted immediately after iterating the list of finalizers to be run. Defering causes crashes as the finalization uses the already deleted env. Signed-off-by: Michael Dawson <mdawson@devrus.com> PR-URL: #38492 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Chengzhong Wu <legendecas@gmail.com> Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
Landed in c32c5f0 |
Cherry-picks cleanly on 16.x so we should not take extra work to get it there. For 14.x we'll need to backport a number of changes in the same area. I'll take a look at what we have PRs for/not and if this will land cleanly after those have. |
Just landed cleanly on v14.x-staging. |
Refs: nodejs/node-addon-api#906 Ensure that finalization is not defered during shutdown. The env for the addon is deleted immediately after iterating the list of finalizers to be run. Defering causes crashes as the finalization uses the already deleted env. Signed-off-by: Michael Dawson <mdawson@devrus.com> PR-URL: #38492 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Chengzhong Wu <legendecas@gmail.com> Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
@targos thanks for the update! |
Refs: nodejs/node-addon-api#906 Ensure that finalization is not defered during shutdown. The env for the addon is deleted immediately after iterating the list of finalizers to be run. Defering causes crashes as the finalization uses the already deleted env. Signed-off-by: Michael Dawson <mdawson@devrus.com> PR-URL: #38492 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Chengzhong Wu <legendecas@gmail.com> Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
Refs: nodejs/node-addon-api#906 Ensure that finalization is not defered during shutdown. The env for the addon is deleted immediately after iterating the list of finalizers to be run. Defering causes crashes as the finalization uses the already deleted env. Signed-off-by: Michael Dawson <mdawson@devrus.com> PR-URL: #38492 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Chengzhong Wu <legendecas@gmail.com> Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
Refs: nodejs/node-addon-api#906 Ensure that finalization is not defered during shutdown. The env for the addon is deleted immediately after iterating the list of finalizers to be run. Defering causes crashes as the finalization uses the already deleted env. Signed-off-by: Michael Dawson <mdawson@devrus.com> PR-URL: #38492 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Chengzhong Wu <legendecas@gmail.com> Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
Refs: nodejs/node-addon-api#906 Ensure that finalization is not defered during shutdown. The env for the addon is deleted immediately after iterating the list of finalizers to be run. Defering causes crashes as the finalization uses the already deleted env. Signed-off-by: Michael Dawson <mdawson@devrus.com> PR-URL: #38492 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Chengzhong Wu <legendecas@gmail.com> Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
Refs: nodejs/node-addon-api#906 Ensure that finalization is not defered during shutdown. The env for the addon is deleted immediately after iterating the list of finalizers to be run. Defering causes crashes as the finalization uses the already deleted env. Signed-off-by: Michael Dawson <mdawson@devrus.com> PR-URL: nodejs#38492 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Chengzhong Wu <legendecas@gmail.com> Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
Refs: nodejs/node-addon-api#906 Ensure that finalization is not defered during shutdown. The env for the addon is deleted immediately after iterating the list of finalizers to be run. Defering causes crashes as the finalization uses the already deleted env. Signed-off-by: Michael Dawson <mdawson@devrus.com> PR-URL: #38492 Backport-PR-URL: #42512 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Chengzhong Wu <legendecas@gmail.com> Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
Refs: nodejs/node-addon-api#906
Ensure that finalization is not defered during shutdown.
The env for the addon is deleted immediately after
iterating the list of finalizers to be run. Defering
causes crashes as the finalization uses the already
deleted env.
Signed-off-by: Michael Dawson mdawson@devrus.com