Potential deadlock between zipkin-reporter and Brave when used with Virtual Threads #1440
Comments
The
This is coming from the previous point: the |
I don't quite follow? I didn't suggest Rather, |
The reasoning behind keeping |
Yes, people should not wholesale replace Here, the latter is true, this was a real-world application that deadlocked in production. Any We track loom nightly in fact, and of course, lightweight locks avoid the pinning, but that won't land until JDK 24 or later. |
This part certainly deserves a fix, I will try to take a look shortly |
@DanielThomas could you please share if you tune the [1] https://github.com/openzipkin/zipkin-reporter-java?tab=readme-ov-file#tuning |
I think so, I believe we're just getting the default from upstream. We're using Brave via Micrometer/Spring. If I remember correctly, every data fetcher in DGS is observed, so these calls are quite frequent. Incidentally, there's some more discussion on the expected effectiveness of lightweight monitors versus j.u.l here: https://mail.openjdk.org/pipermail/loom-dev/2024-July/006885.html |
Yeah, I am looking into a way to address the root cause. In the meantime (as a workaround, not a solution), could you please try to lower the
Thank you. |
We've moved those applications back off virtual threads and are avoiding further adoption for now, because avoiding pinning is the only way to avoid this deadlock. For this to occur, the lock only has to be held while callers are queued for it in the "wrong" order; it only needs to be held for a very short time.
I spent some time looking at this and coming up with a test that reproduces this using Brave and Zipkin Reporter. Unfortunately, due to the specific timing needed, I wasn't able to come up with a test that didn't require some changes to main code to coordinate timing of things, so it isn't really something we'd be able to commit and run as part of the build. I also tried changing the brave/brave/src/main/java/brave/Tag.java Line 169 in 67f563c
There may be other examples in code we don't control too. Given that, I'm rather wary of the side effects of changing this code to use a ReentrantLock.
That all said, there is some good news: JEP 491: Synchronize Virtual Threads without Pinning is now targeted for Java 24 (to be GA in March 2025). That should resolve this potential deadlock without changes to our code. I think it's the safer solution unless someone has a better proposal for code changes than what I described above. It's a bit unsatisfying, but perhaps the best solution for this is to say use JRE 24 or later. Thoughts? |
Thanks @shakuzen, yes, exactly, there are tons of places where synchronization happens on the mutable span instance.
Correct, it looks like the issue will be addressed for everyone
So I think unless someone has a proposal for concrete changes to consider to avoid the potential deadlock prior to Java 24 when using virtual threads, we can close this with adding a note somewhere about the potential for deadlock with virtual threads and Java 21-23. I think changes that solve this would end up being breaking in some sense, even though this synchronization of MutableSpan isn't part of the API/ABI. Given that, it would probably require significant design work and be something we'd want to do in a new major version. I personally don't think that's worth it given JEP 491, but if others feel differently, please comment. |
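For the note about affected Java versions, a small startup check along these lines could make the warning visible to users. This is only a sketch: the class and method names are hypothetical and not part of Brave or zipkin-reporter, and the 21 to 23 range is taken from this thread rather than from any official compatibility matrix.

```java
// Hypothetical helper, not part of Brave or zipkin-reporter.
class VirtualThreadPinningCheck {
    /**
     * True for the feature releases this thread names as affected (Java 21 through 23),
     * where a virtual thread inside a synchronized block is pinned to its carrier.
     */
    static boolean monitorsPinVirtualThreads(int featureVersion) {
        return featureVersion >= 21 && featureVersion <= 23;
    }

    public static void main(String[] args) {
        // Runtime.version().feature() is available since Java 10.
        int feature = Runtime.version().feature();
        if (monitorsPinVirtualThreads(feature)) {
            System.err.println("Java " + feature
                + ": synchronized blocks pin virtual threads; see JEP 491 (Java 24+).");
        }
    }
}
```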
Thanks @shakuzen, I would suggest waiting for 24-ea (with JEP 491 integrated) to make sure the issue is gone, and closing the issue right after. Sounds like a plan?
I verified that my reproducer with Brave and Zipkin Reporter is fixed with 24-ea build 24 which has JEP 491 integrated with openjdk/jdk@78b8015. @DanielThomas let us know if you find any remaining Brave/Zipkin issue in your testing. |
Describe the Bug
We observed a deadlock with Virtual Threads and Brave/Zipkin. In short, there are two paths to CountBoundedQueue.offer when finishing a span. RealSpan.finish has a synchronized block, where RealScopedSpan.finish does not.
If an unmounted virtual thread using RealScopedSpan is next in line for the lock, but all carriers are currently occupied by pinned VTs in RealSpan.finish, the application will deadlock: https://gist.github.com/DanielThomas/dddd850f7e491cac3a2dd734978f4267
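The shape of the two paths can be sketched roughly as below. This is not the actual Brave or zipkin-reporter code: TwoFinishPaths and BoundedQueueSketch are hypothetical stand-ins, and the sketch assumes the queue's offer is guarded by a java.util.concurrent lock. It runs on ordinary platform threads, so it only shows the structure and does not reproduce the deadlock itself (the gist linked under Steps to Reproduce does that).

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-ins; not the actual Brave or zipkin-reporter classes.
class TwoFinishPaths {
    /** Plays the role of the reporter queue: offer() is guarded by a ReentrantLock (assumption). */
    static final class BoundedQueueSketch {
        private final ReentrantLock lock = new ReentrantLock();
        private final Queue<String> items = new ArrayDeque<>();
        private final int maxSize;

        BoundedQueueSketch(int maxSize) {
            this.maxSize = maxSize;
        }

        boolean offer(String span) {
            lock.lock(); // a virtual thread waiting here parks; it unmounts only if it is not pinned
            try {
                if (items.size() >= maxSize) return false; // full: drop the span
                return items.add(span);
            } finally {
                lock.unlock();
            }
        }

        int size() {
            lock.lock();
            try {
                return items.size();
            } finally {
                lock.unlock();
            }
        }
    }

    private final Object state = new Object();
    private final BoundedQueueSketch queue;

    TwoFinishPaths(BoundedQueueSketch queue) {
        this.queue = queue;
    }

    /** Path 1: like RealSpan.finish, reaches offer() while holding a monitor. */
    boolean finishSynchronized(String span) {
        synchronized (state) { // on Java 21-23 a virtual thread stays pinned to its carrier in here
            return queue.offer(span);
        }
    }

    /** Path 2: like RealScopedSpan.finish, reaches offer() with no monitor held. */
    boolean finishUnsynchronized(String span) {
        return queue.offer(span);
    }

    public static void main(String[] args) {
        BoundedQueueSketch queue = new BoundedQueueSketch(2);
        TwoFinishPaths paths = new TwoFinishPaths(queue);
        paths.finishSynchronized("span-1");
        paths.finishUnsynchronized("span-2");
        System.out.println("queued spans: " + queue.size());
    }
}
```

With virtual threads, callers on path 1 that block in offer() keep their carriers, while a caller on path 2 gives its carrier up. If the lock is then handed to a path 2 caller and every carrier is held by a pinned path 1 caller, the new lock owner can never be scheduled to release it.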
Steps to Reproduce
See https://gist.github.com/DanielThomas/0b099c5f208d7deed8a83bf5fc03179e for a reduced example.
Expected Behaviour
While the monitor pinning limitation will be addressed in future OpenJDK releases, in the meantime there's a good case for switching this class to ReentrantLock to guard the MutableSpan to ensure compatibility with virtual threads.
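As a rough illustration of what that switch means, here is a minimal sketch of span state guarded by a ReentrantLock instead of a synchronized block. LockGuardedSpan and SpanSink are hypothetical names made up for this example; the real RealSpan and MutableSpan have a much larger surface, so this is not a proposed patch.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; LockGuardedSpan and SpanSink are illustrative, not Brave types.
class LockGuardedSpan {
    /** Stand-in for whatever receives the finished span (for example a reporter queue). */
    interface SpanSink {
        void accept(String name, long durationMicros);
    }

    private final ReentrantLock lock = new ReentrantLock();
    private final SpanSink sink;
    private final long startMicros;
    private String name = "unknown";
    private boolean finished;

    LockGuardedSpan(SpanSink sink, long startMicros) {
        this.sink = sink;
        this.startMicros = startMicros;
    }

    void name(String name) {
        lock.lock(); // a virtual thread blocked here can unmount instead of pinning its carrier
        try {
            if (!finished) this.name = name;
        } finally {
            lock.unlock();
        }
    }

    /** Returns true only for the first call; later calls are no-ops. */
    boolean finish(long endMicros) {
        lock.lock();
        try {
            if (finished) return false;
            finished = true;
            sink.accept(name, endMicros - startMicros); // handed off while holding the lock, not a monitor
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

A lock like this only coordinates with code that uses the same lock. Any caller that still synchronizes on the span instance would no longer be mutually exclusive with it, so every such site would have to change together.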