==2843526==ERROR: AddressSanitizer: Joining already joined thread, aborting. #18449

Closed
stellaraccident opened this issue Sep 6, 2024 · 0 comments · Fixed by #18605
@stellaraccident
Collaborator

Found in ASAN in a downstream project: "==2843526==ERROR: AddressSanitizer: Joining already joined thread, aborting."

Note that ASAN aborts instantly, which makes it hard to get a backtrace. Recommend `export ASAN_OPTIONS=sleep_before_dying=10` so there is time to break in with a debugger.

#0  0x00005555556350f2 in __sanitizer::internal_usleep(unsigned long long) ()
#1  0x000055555562563f in __asan::AsanDie() ()
#2  0x000055555563f8b0 in __sanitizer::Die() ()
#3  0x000055555563b4a7 in __sanitizer::ThreadArgRetval::BeforeJoin(unsigned long) const ()
#4  0x0000555555600e24 in __interceptor_pthread_join ()
#5  0x00007ffff124cb52 in iree_thread_delete (thread=0x50d00016acc0) at /home/stella/src/iree/runtime/src/iree/base/internal/threading_pthreads.c:202
#6  0x00007ffff124c764 in iree_thread_release (thread=0x50d00016acc0) at /home/stella/src/iree/runtime/src/iree/base/internal/threading_pthreads.c:219
#7  0x00007ffff1244159 in iree_hal_deferred_work_queue_destroy (work_queue=0x510000211940) at /home/stella/src/iree/runtime/src/iree/hal/utils/deferred_work_queue.c:602
#8  0x00007ffff11fcb19 in iree_hal_hip_device_destroy (base_device=0x51300000ec80) at /home/stella/src/iree/runtime/src/iree/hal/drivers/hip/hip_device.c:494
#9  0x00007ffff127c371 in iree_hal_device_destroy (device=0x51300000ec80) at /home/stella/src/iree/runtime/src/iree/hal/device.c:22
#10 0x00007ffff127c4b6 in iree_hal_device_release (device=0x51300000ec80) at /home/stella/src/iree/runtime/src/iree/hal/device.c:22
#11 0x00007ffff0e25f55 in shortfin::iree::detail::hal_device_ptr_helper::release (obj=0x51300000ec80) at /home/stella/src/sharktank/libshortfin/src/shortfin/support/iree_helpers.h:168
#12 0x00007ffff0e25ef6 in shortfin::iree::object_ptr<iree_hal_device_t, shortfin::iree::detail::hal_device_ptr_helper>::reset (this=0x50d00016ab98) at /home/stella/src/sharktank/libshortfin/src/shortfin/support/iree_helpers.h:125
#13 0x00007ffff0e22a05 in shortfin::iree::object_ptr<iree_hal_device_t, shortfin::iree::detail::hal_device_ptr_helper>::~object_ptr (this=0x50d00016ab98) at /home/stella/src/sharktank/libshortfin/src/shortfin/support/iree_helpers.h:81
#14 0x00007ffff0e22009 in shortfin::local::Device::~Device (this=0x50d00016ab20) at /home/stella/src/sharktank/libshortfin/src/shortfin/local/device.cc:64
#15 0x00007ffff0eb6915 in shortfin::local::systems::AMDGPUDevice::~AMDGPUDevice (this=0x50d00016ab20) at /home/stella/src/sharktank/libshortfin/src/shortfin/local/systems/amdgpu.h:21
#16 0x00007ffff0eb6939 in shortfin::local::systems::AMDGPUDevice::~AMDGPUDevice (this=0x50d00016ab20) at /home/stella/src/sharktank/libshortfin/src/shortfin/local/systems/amdgpu.h:21
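
For anyone reproducing this who cannot easily set environment variables (for example, when the process is launched by a test harness), the `sleep_before_dying` setting recommended above can also be baked into the binary. This is a minimal sketch, assuming the binary is built with AddressSanitizer: the sanitizer runtime looks for a user-defined `__asan_default_options` function at startup. It is not something the issue author used; the backtrace above was captured with the environment variable.

```c
// Default ASAN options compiled into the binary.
// Assumption: the program is built with -fsanitize=address; in a
// non-instrumented build this is just an ordinary, unused function.
// Values set through the ASAN_OPTIONS environment variable are expected to
// take precedence over these defaults.
const char* __asan_default_options(void) {
  // Keep the process alive for 10 seconds after an error is reported so a
  // debugger can attach or break in before the abort.
  return "sleep_before_dying=10";
}
```

With this in place, the process should pause before dying on an ASAN report, leaving time to attach with `gdb -p <pid>` and run `bt`.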
stellaraccident added a commit to nod-ai/shark-ai that referenced this issue Sep 6, 2024
This test is not particularly inspired (and the API needs to be simplified) but it represents the first full system test in the repo.

To run, the test downloads a MobileNet ONNX file from the zoo, upgrades it, and compiles it. In the future, I'd like to switch to a simpler model like MNIST for basic functionality, but I had some issues getting that to work via ONNX import and punted. While a bit inefficient (it fetches on each pytest run), this will keep things held together until we can do something more comprehensive. Note that my experience here prompted me to file iree-org/iree#18289, as compiling from ONNX takes way too much code and has too many sharp edges (but it does work). The test verifies numerics against a silly test image.

Includes some fixes:

* Reworked the system detect marker so that we only run system-specific tests (like amdgpu) on opt-in via a `--system amdgpu` pytest arg. This refinement was prompted by an ASAN violation in the HIP runtime code, which was tripping me up when the tests were enabled by default. Filed here: iree-org/iree#18449
* Fixed a bug, revealed while writing the test, where an exception thrown from main could trigger a use-after-free. We were clearing workers at shutdown rather than at destruction, but all objects owned at the system level must live at least as long as the system.
AWoloszyn added a commit that referenced this issue Sep 26, 2024
…18605)

This was causing ASAN errors because pthread_join was being called twice.
We don't need to join these threads explicitly, as
iree_thread_release will join the thread on our behalf anyway.

Fixes #18449

Signed-off-by: Andrew Woloszyn <andrew.woloszyn@gmail.com>
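
The failure mode described in this commit message can be illustrated with a small pthreads sketch. Everything named `demo_*` below is hypothetical and is not IREE's `iree_thread_t` API. It only shows the ownership rule: if releasing a thread handle already joins it, callers should not join it a second time. The `joined` flag is one possible way to make a redundant join harmless. The actual fix, per the message above, was simply to drop the explicit join.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical thread wrapper (an illustration, not IREE's implementation).
typedef struct {
  pthread_t handle;
  bool joined;  // true once pthread_join has succeeded on `handle`
} demo_thread_t;

// Thread entry point: increments the counter passed by the creator.
static void* demo_thread_main(void* arg) {
  int* counter = (int*)arg;
  *counter += 1;
  return NULL;
}

// Creates and starts the thread. Returns NULL on failure.
demo_thread_t* demo_thread_create(int* counter) {
  demo_thread_t* thread = calloc(1, sizeof(*thread));
  if (!thread) return NULL;
  if (pthread_create(&thread->handle, NULL, demo_thread_main, counter) != 0) {
    free(thread);
    return NULL;
  }
  return thread;
}

// Joins the thread at most once. Returns true only if this call performed
// the join. Joining the same pthread_t twice is undefined behavior, which is
// what ASAN reports as "Joining already joined thread".
// Note: the flag is not synchronized, so this assumes a single owner calls it.
bool demo_thread_join(demo_thread_t* thread) {
  assert(thread != NULL);
  if (thread->joined) return false;
  if (pthread_join(thread->handle, NULL) != 0) return false;
  thread->joined = true;
  return true;
}

// Releases the wrapper. Joins on the caller's behalf if that has not
// happened yet, so an explicit join before release is unnecessary.
void demo_thread_release(demo_thread_t* thread) {
  if (!thread) return;
  demo_thread_join(thread);
  free(thread);
}
```

A caller would typically just pair `demo_thread_create` with `demo_thread_release` and let the release perform the only join.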