Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release/8.0] Internal monitor impl not using coop mutex causing deadlocks on Android. #112374

Merged
merged 1 commit into from
Feb 12, 2025

Conversation

github-actions[bot]
Copy link
Contributor

@github-actions github-actions bot commented Feb 10, 2025

Backport of #112358 to release/8.0

/cc @steveisok @lateralusX

Customer Impact

  • Customer reported
  • Found internally

As reported in #111485, customers experienced deadlocks and ultimately crashes on their Android apps. This is due to certain code paths not using a cooperative mutex and ending up in GC code where another thread attempts to lock the same monitor init lock.

Regression

  • Yes
  • No

Normally, android operates in hybrid suspend, so a deadlock like this should never happen. It could be due to a native component they are using taking ownership of a signal handler we use for coop suspend and/or a bug in coop suspend itself.

Testing

Manual testing

Risk

Low

IMPORTANT: If this backport is for a servicing release, please verify that:

  • The PR target branch is release/X.0-staging, not release/X.0.

Package authoring no longer needed in .NET 9

IMPORTANT: Starting with .NET 9, you no longer need to edit a NuGet package's csproj to enable building and bump the version.
Keep in mind that we still need package authoring in .NET 8 and older versions.

On Android we have seen ANR issues, like the one described in
#111485. After investigating
several different dumps including all threads it turns out that we
could end up in a deadlock when init a monitor since that code path
didn't use a coop mutex and owner of lock could end up in GC code
while holding that lock, leading to deadlock if another thread was
about to lock the same monitor init lock. In several dumps we see
the following two threads:

Thread 1:

syscall+28
__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+14
NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+384
sgen_gc_lock+105
mono_gc_wait_for_bridge_processing_internal+70
sgen_gchandle_get_target+288
alloc_mon+358
ves_icall_System_Threading_Monitor_Monitor_wait+452
ves_icall_System_Threading_Monitor_Monitor_wait_raw+583

Thread 2:

syscall+28
__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144
NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+652
alloc_mon+105
ves_icall_System_Threading_Monitor_Monitor_wait+452
ves_icall_System_Threading_Monitor_Monitor_wait_raw+583

So in this scenario Thread 1 holds monitor_mutex that is not a coop
mutex and end up trying to take GC lock, since it calls,
mono_gc_wait_for_bridge_processing_internal, but since a GC is already
started (waiting on STW to complete), Thread 1 will block holding
monitor_mutex.

Thread 2 will try to lock monitor_mutex as well, and since its not a
coop mutex it will block on OS __futex_wait_ex without changing
Mono thread state to blocking, preventing the STW from processing.

Fix is to switch to coop aware implementation of monitor_mutex.

Normally this should have been resolved on Android since we run
hybrid suspend meaning we should be able to run a signal handler on the
blocking thread that would suspend it meaning that STW would continue,
but for some reason the signal can't have been executed in this case putting the app under coop suspend limitations.

This fix will take care of the deadlock, but if there are issues running
Signals on Android, then threads not attached to runtime using
coop attach methods could end up in similar situations
blocking STW.
Copy link
Member

@jeffschwMSFT jeffschwMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. please get a code review. we will take for consideration in 8.0.x

@jeffschwMSFT jeffschwMSFT added this to the 8.0.x milestone Feb 11, 2025
@leecow leecow added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels Feb 11, 2025
@leecow leecow modified the milestones: 8.0.x, 8.0.14 Feb 11, 2025
@steveisok steveisok changed the base branch from release/8.0 to release/8.0-staging February 12, 2025 13:58
@steveisok
Copy link
Member

/ba-g BA showing known tests

@steveisok steveisok merged commit bf8ffdf into release/8.0-staging Feb 12, 2025
117 of 128 checks passed
@steveisok steveisok deleted the backport/pr-112358-to-release/8.0 branch February 12, 2025 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-VM-meta-mono Servicing-approved Approved for servicing release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants