Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential deadlock between zipkin-reporter and Brave when used with Virtual Threads #1440

Closed
DanielThomas opened this issue Jun 27, 2024 · 14 comments · Fixed by #1448
Closed
Labels

Comments

@DanielThomas
Copy link

Describe the Bug

We observed a deadlock with Virtual Threads and Brave/Zipkin. In short, there are two paths to CountBoundedQueue.offer when finishing a span. RealSpan.finish has a synchronized block, where RealScopedSpan.finish does not.

If an unmounted virtual thread using RealScopedSpan is next line for the lock, but all carriers are currently occupied by pinned VTs in RealSpan.finish, the application will deadlock:

https://gist.github.com/DanielThomas/dddd850f7e491cac3a2dd734978f4267

Steps to Reproduce

See https://gist.github.com/DanielThomas/0b099c5f208d7deed8a83bf5fc03179e for a reduced example.

Expected Behaviour

While the monitor pinning limitation will be addressed in future OpenJDK releases, in the meantime there's a good case for switching this class to ReentrantLock to guard the MutableSpan to ensure compatibility with virtual threads.

@reta
Copy link
Contributor

reta commented Jun 27, 2024

If an unmounted virtual thread using RealScopedSpan is next line for the lock, but all carriers are currently occupied by pinned VTs in RealSpan.finish, the application will deadlock:

The RealScopedSpan does not need a synchronized block since it does not do anything with the Span directly, why it is needed?

While the monitor pinning limitation will be addressed in future OpenJDK releases, in the meantime there's a good case for switching this class to ReentrantLock to guard the MutableSpan to ensure compatibility with virtual threads.

This is coming from the previous point: the RealScopedSpan does not need synchronized block but the underlying span could be mutated later on, and in this case the synchronized (mutableSpan) block is going to be used by the component in question. This pattern does not fit ReentrantLock usage

@DanielThomas
Copy link
Author

I don't quite follow? I didn't suggest RealScopedSpan should be synchronized, just the fact that there's two callers for the CountBoundedQueue lock, one that will pin and one that will suspend, which can lead to this deadlock.

Rather, RealSpan should switch to ReentrantLock instead of synchronized to guard MutableSpan, because it will allow those threads to suspend, rather than pin. The switch to lightweight locking to address the monitor pinning limitation is still some time off, and this is almost certain to cause deadlocks when run on virtual threads currently.

@reta
Copy link
Contributor

reta commented Jun 28, 2024

The reasoning behind keeping synchronized are provided here #1426 (plus, the recent updates on progress of the Java monitors https://mail.openjdk.org/pipermail/loom-dev/2024-May/006632.html , you may be interested in trying your examples with Loom early access builds https://jdk.java.net/loom/).

@DanielThomas
Copy link
Author

Yes, people should not wholesale replace synchronized with ReentrantLock, pinning itself is not a concern, unless it surrounds other long running blocking operations or has the potential to deadlock.

Here, the latter is true, this was a real-world application that deadlocked in production. Any Lock provides the same memory visibility guarantees as synchronized, so can provide the same semantics, while ensuring that there isn't pinning around the lock in CountBoundedQueue.

We track loom nightly in fact, and of course, lightweight locks avoid the pinning, but that won't land until JDK 24 or later.

@reta
Copy link
Contributor

reta commented Jun 28, 2024

Here, the latter is true, this was a real-world application that deadlocked in production.

This part certainly deserves a fix, I will try to take a look shortly

@reta
Copy link
Contributor

reta commented Jul 7, 2024

@DanielThomas could you please share if you tune themessageTimeout setting in your application [1] or the default value (1s) is being used? Thank you.

[1] https://github.com/openzipkin/zipkin-reporter-java?tab=readme-ov-file#tuning

@DanielThomas
Copy link
Author

I think so, I believe we're just getting the default from upstream. We're using Brave via Micrometer/Spring. If I remember correctly, every data fetcher in DGS is observed, so these calls are quite frequent.

Incidentally, there's some more discussion on the expected effectiveness of lightweight monitors versus j.u.l here:

https://mail.openjdk.org/pipermail/loom-dev/2024-July/006885.html

@reta
Copy link
Contributor

reta commented Jul 9, 2024

Yeah, I am looking into the way to address the root cause, meantime (as I workaround, not a solution), could you try please to lower the messageTimeout so the lock held by AsyncReporter would be much smaller (and consequently, the synchronized may not be able to block all carrier threads):

@Bean
AsyncZipkinSpanHandler asyncZipkinSpanHandler(BytesMessageSender sender,
		BytesEncoder<MutableSpan> mutableSpanBytesEncoder) {
	return AsyncZipkinSpanHandler.newBuilder(sender).messageTimeout(100, TimeUnit.MILLIS).build(mutableSpanBytesEncoder);
}

Thank you.

@DanielThomas
Copy link
Author

We've moved those applications back off virtual threads and avoiding further adoption for now, because avoiding pinning is the only way to avoid this deadlock. The lock only has to be held and callers queued for the lock in the "wrong" order for this to occur, the lock only needs to be held for very short time.

@shakuzen
Copy link
Member

shakuzen commented Nov 7, 2024

I spent some time looking at this and coming up with a test that reproduces this using Brave and Zipkin Reporter. Unfortunately, due to the specific timing needed, I wasn't able to come up with a test that didn't require some changes to main code to coordinate timing of things, so it isn't really something we'd be able to commit and run as part of the build. I also tried changing the synchronized blocks to using a ReentrantLock, which then made the test pass, but unfortunately it isn't really equivalent. Now the MutableSpan itself is being synchronized on, and it is an external object passed to the RealSpan instance, meaning that code outside of RealSpan has access to it and might also synchronize on it. One such example in our own code is in Tag:

synchronized (mSpan) {

There may be other examples in code we don't control too. Given that, I'm rather weary of side effects of changing to using a ReentrantLock for this code.

That all said, there is some good news: JEP 491: Synchronize Virtual Threads without Pinning is now targeted for Java 24 (to be GA in March 2025). That should resolve this potential deadlock without changes to our code. I think it's the safer solution unless someone has a better proposal for code changes than what I described above. It's a bit unsatisfying, but perhaps the best solution for this is to say use JRE 24 or later. Thoughts?

@reta
Copy link
Contributor

reta commented Nov 7, 2024

There may be other examples in code we don't control too. Given that, I'm rather weary of side effects of changing to using a ReentrantLock for this code.

Thanks @shakuzen , yes, exactly, there are tons of places where synchronization happens on the mutable span instance.

That all said, there is some good news: JEP 491: Synchronize Virtual Threads without Pinning is now targeted for Java 24 (to be GA in March 2025).

Correct, it looks the issue will be addressed for everyone

@shakuzen
Copy link
Member

shakuzen commented Nov 8, 2024

So I think unless someone has a proposal for concrete changes to consider to avoid the potential deadlock prior to Java 24 when using virtual threads, we can close this with adding a note somewhere about the potential for deadlock with virtual threads and Java 21-23. I think changes that solve this would end up being breaking in some sense, even though this synchronization of MutableSpan isn't part of the API/ABI. Given that, it would probably require significant design work and be something we'd want to do in a new major version. I personally don't think that's worth it given JEP 491, but if others feel differently, please comment.

@reta
Copy link
Contributor

reta commented Nov 8, 2024

So I think unless someone has a proposal for concrete changes to consider to avoid the potential deadlock prior to Java 24 when using virtual threads, we can close this with adding a note somewhere about the potential for deadlock with virtual threads and Java 21-23.

Thanks @shakuzen , I would suggest to wait 24-ea (with JEP-491 integrated) to make sure the issue is gone, and close the issue right after, sounds like a plan?

@shakuzen
Copy link
Member

I verified that my reproducer with Brave and Zipkin Reporter is fixed with 24-ea build 24 which has JEP 491 integrated with openjdk/jdk@78b8015. @DanielThomas let us know if you find any remaining Brave/Zipkin issue in your testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants