Android Not Responding with [split_config.arm64_v8a.apk!libmonosgen-2.0.so] mono_threads_attach_coop #111485
Comments
@BrzVlad - this seems GC related |
We have tried both 8.0 and 9.0, and the issue occurs on both. We are currently using .NET 9. |
To confirm, are you using either 9.0.1 or 8.0.12? |
We were previously using 9.0.0; we have now also tried 9.0.1, and the ANR is still observed. |
Can you try to symbolicate following the steps in #109921 (comment)? Either that or provide a repro. |
We have tried different GC-related configurations as well.
We have tried different gcLatency modes, and the issue is observed in all of them.
How can we check whether the configuration is actually applied? The only property we found is IsServerGC, and it is false even when we set it to true. (We checked the runtimeconfig.json created in the bin folder and it has this value, but when we printed IsServerGC in the log it displayed false.) The issue is observed with this config as well. Let me know if there are any other GC-related flags to try, and which GC configuration is ideal. |
The symbolicated trace is as follows: |
@steveisok Is there any update, or are any other details required from my side? |
No, this does seem different than the issue I linked to. Any chance you can provide a repro? |
No, we can't provide that directly, but can you check the configurations I have tried and let me know if there is any other area of the code I should look into, in case you think a particular component is causing this issue? |
@steveisok Let me know if there is something I can try at my end to find the root cause asap. |
We have exactly the same problems with our app. For us it has been happening since the .NET 9 release. We can't find the root cause. |
@lateralusX @kg can you please look into this and see what might be wrong? |
You mention sometimes the app crashes instead of ANRing. Do you have stack traces for any of those crashes? |
Yes, in one of our applications a crash is also reported multiple times, and here is the stack trace reported in the Play Console: pid: 0, tid: 24209 >>> com.matrix.essapp <<< backtrace: |
pid: 0, tid: 28274 >>> [Redacted] <<< backtrace: |
I symbolicated this callstack and came up with the following (looks like it's using 8.0.8):
0xd7714: try_prepare_objaddr_callvirt_optimization at /__w/1/s/src/mono/mono/mini/method-to-ir.c:0
A crash in try_prepare_objaddr_callvirt_optimization lines up with the fix for #109921 that should be included in 8.0.12 and 9.0.1. |
To successfully symbolicate this callstack I would need to know what version it's coming from; it doesn't seem to match 8.0.8 like the other stack traces in this issue. |
@lateralusX OK, we will check the crash reported in the Play Console by using 9.0.1 in that application.
@lateralusX Can you also check the symbolicated trace I provided for the ANR, as it is a major issue in our app? |
I also symbolicated the original stack traces reported in this issue for the application hang using 8.0.8, mainly looking at threads that had any Mono-related frames on their stack traces, and came up with the following three threads:
"id.satatyasight" tid=19235 Native syscall+28
"SGen worker" tid=19257 Native syscall+28
"Finalizer" tid=19259 Native syscall+28
At first glance this doesn't look suspicious for a regular snapshot, but if thread 19235 gets stuck in that location it means it is waiting for the runtime to resume it, since it probably hit a case where it attached while the runtime was doing a GC. There is also a similar issue reported in dotnet/android#9365; in those ANRs there are callstacks of the GC blocking on stop the world, but I couldn't find any stack trace in the above dump that indicates that. Have other ANRs included threads that block on sgen_stop_world?
For thread 19259, do you see similar waiting on the same lock in other ANRs as well? If that thread blocks on that lock (the GC lock) it normally means a GC is running, and that would explain why 19235 waits, but then we should have one thread running with a callstack similar to this: syscall+28
Do you get any output in any of the regular output logs? Is this a MAUI Android app, or one just using the Android SDK? Does the app use any specific features, like calling the runtime using unmanaged thunks, reverse delegates (managed delegates passed to native code) or managed functions marked as unmanaged callers only, attaching threads to the runtime using any of the embedding APIs, or calling any other mono_* embedding APIs? I am trying to understand where the call to mono_threads_attach_coop originates from. |
@lateralusX I have uploaded the full stack trace of the latest ANR reported in the Play Console here: https://drive.google.com/drive/folders/1NoteQYQPBRfCCFrmhuC3kzQ_q_y6MJp-?usp=drive_link Application details: Platform: .NET for Android. Components used for live view streaming:
|
@lateralusX Is there any update on this? |
@SejalH96, I started to look at your latest callstacks and I think I have a lead on something that could end up in a deadlock during GC; that seems to be validated by at least one of the latest dumps (I will look at the other one as well to see if it points in the same direction). |
@lateralusX According to our analysis, we also suspect the GC: the actual ANR is displayed when we touch the screen, but before that all processes are already stuck after a GC call. For example, as per the above log, the last call to GC is at 17:28:00.671 and after that there is no log until we touch the screen at 17:29:40.104, so it might point to the GC. |
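For reference, the size of the silent window between the two timestamps quoted above can be computed mechanically. The following is only a minimal sketch: the helper names `timestamp_to_ms` and `log_gap_ms` are hypothetical, and it assumes both timestamps use the `HH:MM:SS.mmm` shape and fall on the same day with the second one later than the first.

```c
#include <stdio.h>

/* Parse an "HH:MM:SS.mmm" timestamp into milliseconds since midnight.
 * Returns -1 if the string does not have that shape. */
long timestamp_to_ms(const char *ts)
{
    int h, m, s, ms;
    if (sscanf(ts, "%d:%d:%d.%d", &h, &m, &s, &ms) != 4)
        return -1;
    return ((h * 60L + m) * 60L + s) * 1000L + ms;
}

/* Gap in milliseconds between two same-day timestamps.
 * Assumes `later` is not before `earlier`; returns -1 on a parse error. */
long log_gap_ms(const char *earlier, const char *later)
{
    long a = timestamp_to_ms(earlier);
    long b = timestamp_to_ms(later);
    if (a < 0 || b < 0)
        return -1;
    return b - a;
}
```

Calling `log_gap_ms("17:28:00.671", "17:29:40.104")` gives 99433 ms, so the process logged nothing for roughly 99 seconds after the last GC entry, which is consistent with a hang that only surfaces as an ANR once input arrives.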
Yep, both logs you provided in the above Google Drive point to the same thing: a hang when we try to stop the world, which happens when we are about to start a new GC. I will have a fix for that scenario during the day. |
On Android we have seen ANR issues, like the one described in dotnet#111485. After investigating several different dumps including all threads, it turns out that we could end up in a deadlock when initializing a monitor, since that code path didn't use a coop mutex and the owner of the lock could end up in GC code while holding that lock, leading to a deadlock if another thread was about to lock the same monitor init lock. In several dumps we see the following two threads:
Thread 1:
syscall+28
__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+14
NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+384
sgen_gc_lock+105
mono_gc_wait_for_bridge_processing_internal+70
sgen_gchandle_get_target+288
alloc_mon+358
ves_icall_System_Threading_Monitor_Monitor_wait+452
ves_icall_System_Threading_Monitor_Monitor_wait_raw+583
Thread 2:
syscall+28
__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144
NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+652
alloc_mon+105
ves_icall_System_Threading_Monitor_Monitor_wait+452
ves_icall_System_Threading_Monitor_Monitor_wait_raw+583
So in this scenario Thread 1 holds monitor_mutex, which is not a coop mutex, and ends up trying to take the GC lock, since it calls mono_gc_wait_for_bridge_processing_internal; but since a GC has already started (waiting on STW to complete), Thread 1 will block holding monitor_mutex. Thread 2 will try to lock monitor_mutex as well, and since it's not a coop mutex it will block on OS __futex_wait_ex without changing its Mono thread state to blocking, preventing the STW from proceeding. The fix is to switch to a coop-aware implementation of monitor_mutex.
Normally this should have been resolved on Android since we run hybrid suspend, meaning we should be able to run a signal handler on the blocking thread that would suspend it so that STW could continue, but for some reason the signal could not be executed in this case, putting the app under coop suspend limitations. This fix will take care of the deadlock, but if there are issues running signals on Android, then threads not attached to the runtime using coop attach methods could end up in similar situations, blocking STW.
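The difference between a plain mutex and a coop-aware one, as described above, can be illustrated with a small sketch. This is not Mono's implementation: `thread_info_t`, `STATE_RUNNING`, `STATE_BLOCKING`, `coop_mutex_lock`, `plain_mutex_lock` and `stw_sees_thread_as_stopped` are all hypothetical names, and the per-thread state is reduced to a single field to show only the idea that a thread publishes a "blocking" state before it may sleep in the OS.

```c
#include <pthread.h>

/* Hypothetical, simplified thread states; not Mono's real state machine. */
typedef enum { STATE_RUNNING, STATE_BLOCKING } thread_state_t;

typedef struct {
    thread_state_t state;
    int blocking_transitions; /* how often the thread announced blocking */
} thread_info_t;

/* Coop-aware lock: publish STATE_BLOCKING before possibly sleeping in the
 * OS, so a stop-the-world request does not have to wait for this thread. */
void coop_mutex_lock(thread_info_t *self, pthread_mutex_t *m)
{
    self->state = STATE_BLOCKING;
    self->blocking_transitions++;
    pthread_mutex_lock(m);
    /* A real runtime would also check for a pending suspend request here
     * before returning to runtime code. */
    self->state = STATE_RUNNING;
}

/* Plain lock: the thread may sleep in the OS while still marked RUNNING,
 * which is the situation Thread 2 is in according to the dumps above. */
void plain_mutex_lock(pthread_mutex_t *m)
{
    pthread_mutex_lock(m);
}

/* Simplified STW view: a thread in STATE_BLOCKING counts as stopped. */
int stw_sees_thread_as_stopped(const thread_info_t *t)
{
    return t->state == STATE_BLOCKING;
}
```

In this simplified model, a thread parked inside `plain_mutex_lock` never reports `STATE_BLOCKING`, so a cooperative stop-the-world would keep waiting for it, while a thread parked inside `coop_mutex_lock` is already counted as stopped.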
…id. (#112358)
@lateralusX Thank you for the update. Could you let me know when it will be available for testing? |
@SejalH96 the fix was merged yesterday for main; the backport process has started but will take some time before it comes through all the way to the SDKs. If you have a local repro, you could rebuild the runtime locally and test the fix locally.
That said, the fix makes sure we handle the specific deadlock scenario you hit, but on Android that scenario should have been resolved when the runtime falls back to preemptive suspend mode for threads that are not responding, and that indicates that our capability to send specific signals to threads has been limited inside this app. Mono relies on two signals for preemptive suspend to work on Android, SIGPWR and SIGXCPU, and no other code in the process can replace our signal handlers with its own, or you will end up in deadlocks like this on Android in case the runtime has threads that won't correctly respond to cooperative suspend orchestration. If threads are attached to the runtime in a way compatible with the cooperative suspend model, then we would not need to revert to preemptive suspend, but there are still cases where the Android SDK as well as third-party code could attach threads to the runtime that might depend on preemptive suspend working, meaning that the thread's signal handler needs to point to the Mono signal handler for SIGPWR and SIGXCPU. I will do some further investigation on this scenario to make sure our hybrid suspend wouldn't get into a bad situation when having threads that are not fully compliant with the cooperative suspend policy.
What FFMPEG library are you using? I can see that the FFMPEG sources hijack one of the signals Mono uses, SIGXCPU, and if that code is executed after the runtime has been initialized, it will have an impact on our preemptive suspend mechanism. I also see that other implementations of FFMPEG, like https://github.com/arthenica/ffmpeg-kit, actually have documentation on how to disable that signal when running the library together with Mono or Unity. |
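One way to check from native code whether a signal handler has been replaced is to query the current disposition with POSIX `sigaction` and compare it with the handler that is expected to be there. The sketch below only illustrates that check: `runtime_suspend_handler` and `third_party_handler` are hypothetical stand-ins (neither is Mono's or FFMPEG's actual handler), and `install_handler` and `handler_still_installed` are illustrative helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical stand-in for a runtime's suspend signal handler. */
void runtime_suspend_handler(int signo, siginfo_t *info, void *ctx)
{
    (void)signo; (void)info; (void)ctx;
}

/* Hypothetical stand-in for a handler installed later by other native code. */
void third_party_handler(int signo, siginfo_t *info, void *ctx)
{
    (void)signo; (void)info; (void)ctx;
}

/* Install a three-argument (SA_SIGINFO) handler for `signo`.
 * Returns 0 on success, -1 on failure, like sigaction(). */
int install_handler(int signo, void (*handler)(int, siginfo_t *, void *))
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Returns 1 if `expected` is still the installed SA_SIGINFO handler for
 * `signo`, 0 if something else is installed, -1 if the query fails. */
int handler_still_installed(int signo,
                            void (*expected)(int, siginfo_t *, void *))
{
    struct sigaction cur;
    if (sigaction(signo, NULL, &cur) != 0)
        return -1;
    return (cur.sa_flags & SA_SIGINFO) && cur.sa_sigaction == expected;
}
```

After `install_handler(SIGXCPU, runtime_suspend_handler)`, a later `install_handler(SIGXCPU, third_party_handler)` makes `handler_still_installed(SIGXCPU, runtime_suspend_handler)` return 0, which is the kind of replacement described above. Signal dispositions are process-wide, so any library that installs its own SIGXCPU handler affects every thread.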
I looked a little deeper into the hybrid suspend policy together with our stop the world (STW) implementation in Mono, and its behavior in case a thread won't reply to our cooperative suspend request. Initially I believed that hybrid suspend would try to do a preemptive suspend if threads didn't respond to the cooperative suspend request, but it turns out that the first part of our STW implementation, runtime/src/mono/mono/metadata/sgen-stw.c Line 311 in 7be653f
runtime/src/mono/mono/metadata/sgen-stw.c Line 349 in 7be653f
|
…id. (#112373) Backport of #112358 to release/9.0 Co-authored-by: lateralusX <lateralusx.github@gmail.com>
…id. (#112374) Backport of #112358 to release/8.0 Co-authored-by: lateralusX <lateralusx.github@gmail.com>
@lateralusX Since you mentioned that we can test this by building the .NET runtime locally, could you please provide the steps to build the runtime? Additionally, what changes are needed to ensure that our .NET 9 project uses the locally built runtime during the build process? |
@lateralusX I have also included other stack traces of ANRs or crashes with a higher report count at https://drive.google.com/drive/folders/1NoteQYQPBRfCCFrmhuC3kzQ_q_y6MJp-?usp=sharing. Could you please check if these are also resolved by this fix or analyze them further? |
@lateralusX Could you find some time to review the additional stack trace? Also, do you have an estimated release date? |
Description
split_config.arm64_v8a.apk!libmonosgen-2.0.so
Reproduction Steps
The behaviour is random; it is observed frequently in release builds, with many issues reported in the Play Console.
Observed after migrating a Xamarin native application to .NET for Android.
Expected behavior
The behaviour of the application should remain the same as it was before the migration.
Actual behavior
Random behaviour observed. Sometimes the app crashes and sometimes an ANR is generated, although there were no changes in the code after the migration.
Regression?
Observed after migrating a Xamarin native application to .NET for Android.
We have tried both .NET 8 and .NET 9, and the issue is observed in both versions.
The Xamarin project does not have these issues.
Known Workarounds
No response
Configuration
No response
Other information
Here is the stack trace reported in play console