-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix wrong CONTEXT_UNWOUND_TO_CALL from special APC on ARM64 #103731
Fix wrong CONTEXT_UNWOUND_TO_CALL from special APC on ARM64 #103731
Conversation
While testing an unrelated change, I've noticed that some coreclr tests were crashing on arm64 Windows with GCStress 3 intermittently. In all the cases, the special user mode APC callback was on the stack. It turned out that the context that the callback gets from the OS has the CONTEXT_UNWOUND_TO_CALL flag set, which is incorrect, as the context is the location of where the execution was interrupted is not unwound to call. That flag resulted in adjustment of the PC to the previous instruction. In the crashing cases, it adjusted it to a location of a "ret", so the unwinding just loaded PC from LR, which got an incorrect result. The effect of it was that in this case GC stack walk ended up prematurely, which resulted in GC holes. The fix is to clear the flag on the context we get from Windows.
cc: @VSadov |
Tagging subscribers to this area: @mangod9 |
// Windows incorrectly set the CONTEXT_UNWOUND_TO_CALL in the flags of the context it passes to us. | ||
// That results in incorrect compensation of PC at some places and sometimes incorrect unwinding | ||
// and GC holes due to that. | ||
pContext->ContextFlags &= ~CONTEXT_UNWOUND_TO_CALL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to do the same on NativeAOT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also why only TARGET_ARM64 - because the issue was only ever seen on ARM64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see that flag used in NativeAOT, so I don't think so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also why only TARGET_ARM64 - because the issue was only ever seen on ARM64?
The flag is arm64 only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see that flag used in NativeAOT, so I don't think so.
We pass the context to the platform unwinder. If the flag is incorrectly set, could it possibly confuse the platform unwinder as well?
I can't find any good documentation on the flag's purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see any uses of this flag in the platform unwinder. This flag is only used by things like EH region matching and we seem to be able to do fine without it in NativeAOT.
@@ -5873,6 +5873,13 @@ void Thread::ApcActivationCallback(ULONG_PTR Parameter) | |||
ActivationReason reason = (ActivationReason)pData->Parameter; | |||
PCONTEXT pContext = pData->ContextRecord; | |||
|
|||
#if defined(TARGET_ARM64) | |||
// Windows incorrectly set the CONTEXT_UNWOUND_TO_CALL in the flags of the context it passes to us. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we report it as a bug to Windows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am planning to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
While testing an unrelated change, I've noticed that some coreclr tests were crashing on arm64 Windows with GCStress 3 intermittently. In all the cases, the special user mode APC callback was on the stack. It turned out that the context that the callback gets from the OS has the CONTEXT_UNWOUND_TO_CALL flag set, which is incorrect, as the context is the location of where the execution was interrupted is not unwound to call.
That flag resulted in adjustment of the PC to the previous instruction. In the crashing cases, it adjusted it to a location of a "ret", so the unwinding just loaded PC from LR, which got an incorrect result. The effect of it was that in this case GC stack walk ended up prematurely, which resulted in GC holes.
The fix is to clear the flag on the context we get from Windows.