Potential deadlock between zipkin-reporter and Brave when used with Virtual Threads #1440

DanielThomas · 2024-06-27T00:51:00Z

Describe the Bug

We observed a deadlock with Virtual Threads and Brave/Zipkin. In short, there are two paths to CountBoundedQueue.offer when finishing a span. RealSpan.finish has a synchronized block, where RealScopedSpan.finish does not.

If an unmounted virtual thread using RealScopedSpan is next line for the lock, but all carriers are currently occupied by pinned VTs in RealSpan.finish, the application will deadlock:

https://gist.github.com/DanielThomas/dddd850f7e491cac3a2dd734978f4267

Steps to Reproduce

See https://gist.github.com/DanielThomas/0b099c5f208d7deed8a83bf5fc03179e for a reduced example.

Expected Behaviour

While the monitor pinning limitation will be addressed in future OpenJDK releases, in the meantime there's a good case for switching this class to ReentrantLock to guard the MutableSpan to ensure compatibility with virtual threads.

The text was updated successfully, but these errors were encountered:

reta · 2024-06-27T14:48:48Z

If an unmounted virtual thread using RealScopedSpan is next line for the lock, but all carriers are currently occupied by pinned VTs in RealSpan.finish, the application will deadlock:

The RealScopedSpan does not need a synchronized block since it does not do anything with the Span directly, why it is needed?

While the monitor pinning limitation will be addressed in future OpenJDK releases, in the meantime there's a good case for switching this class to ReentrantLock to guard the MutableSpan to ensure compatibility with virtual threads.

This is coming from the previous point: the RealScopedSpan does not need synchronized block but the underlying span could be mutated later on, and in this case the synchronized (mutableSpan) block is going to be used by the component in question. This pattern does not fit ReentrantLock usage

DanielThomas · 2024-06-28T00:31:45Z

I don't quite follow? I didn't suggest RealScopedSpan should be synchronized, just the fact that there's two callers for the CountBoundedQueue lock, one that will pin and one that will suspend, which can lead to this deadlock.

Rather, RealSpan should switch to ReentrantLock instead of synchronized to guard MutableSpan, because it will allow those threads to suspend, rather than pin. The switch to lightweight locking to address the monitor pinning limitation is still some time off, and this is almost certain to cause deadlocks when run on virtual threads currently.

reta · 2024-06-28T01:29:01Z

The reasoning behind keeping synchronized are provided here #1426 (plus, the recent updates on progress of the Java monitors https://mail.openjdk.org/pipermail/loom-dev/2024-May/006632.html , you may be interested in trying your examples with Loom early access builds https://jdk.java.net/loom/).

DanielThomas · 2024-06-28T01:57:03Z

Yes, people should not wholesale replace synchronized with ReentrantLock, pinning itself is not a concern, unless it surrounds other long running blocking operations or has the potential to deadlock.

Here, the latter is true, this was a real-world application that deadlocked in production. Any Lock provides the same memory visibility guarantees as synchronized, so can provide the same semantics, while ensuring that there isn't pinning around the lock in CountBoundedQueue.

We track loom nightly in fact, and of course, lightweight locks avoid the pinning, but that won't land until JDK 24 or later.

reta · 2024-06-28T02:11:05Z

Here, the latter is true, this was a real-world application that deadlocked in production.

This part certainly deserves a fix, I will try to take a look shortly

reta · 2024-07-07T15:18:58Z

@DanielThomas could you please share if you tune themessageTimeout setting in your application [1] or the default value (1s) is being used? Thank you.

[1] https://github.com/openzipkin/zipkin-reporter-java?tab=readme-ov-file#tuning

DanielThomas · 2024-07-08T09:52:18Z

I think so, I believe we're just getting the default from upstream. We're using Brave via Micrometer/Spring. If I remember correctly, every data fetcher in DGS is observed, so these calls are quite frequent.

Incidentally, there's some more discussion on the expected effectiveness of lightweight monitors versus j.u.l here:

https://mail.openjdk.org/pipermail/loom-dev/2024-July/006885.html

reta · 2024-07-09T01:58:25Z

Yeah, I am looking into the way to address the root cause, meantime (as I workaround, not a solution), could you try please to lower the messageTimeout so the lock held by AsyncReporter would be much smaller (and consequently, the synchronized may not be able to block all carrier threads):

@Bean
AsyncZipkinSpanHandler asyncZipkinSpanHandler(BytesMessageSender sender,
		BytesEncoder<MutableSpan> mutableSpanBytesEncoder) {
	return AsyncZipkinSpanHandler.newBuilder(sender).messageTimeout(100, TimeUnit.MILLIS).build(mutableSpanBytesEncoder);
}

Thank you.

DanielThomas · 2024-07-12T01:20:15Z

We've moved those applications back off virtual threads and avoiding further adoption for now, because avoiding pinning is the only way to avoid this deadlock. The lock only has to be held and callers queued for the lock in the "wrong" order for this to occur, the lock only needs to be held for very short time.

shakuzen · 2024-11-07T11:01:39Z

I spent some time looking at this and coming up with a test that reproduces this using Brave and Zipkin Reporter. Unfortunately, due to the specific timing needed, I wasn't able to come up with a test that didn't require some changes to main code to coordinate timing of things, so it isn't really something we'd be able to commit and run as part of the build. I also tried changing the synchronized blocks to using a ReentrantLock, which then made the test pass, but unfortunately it isn't really equivalent. Now the MutableSpan itself is being synchronized on, and it is an external object passed to the RealSpan instance, meaning that code outside of RealSpan has access to it and might also synchronize on it. One such example in our own code is in Tag:

brave/brave/src/main/java/brave/Tag.java

Line 169 in 67f563c

synchronized (mSpan) {

There may be other examples in code we don't control too. Given that, I'm rather weary of side effects of changing to using a ReentrantLock for this code.

That all said, there is some good news: JEP 491: Synchronize Virtual Threads without Pinning is now targeted for Java 24 (to be GA in March 2025). That should resolve this potential deadlock without changes to our code. I think it's the safer solution unless someone has a better proposal for code changes than what I described above. It's a bit unsatisfying, but perhaps the best solution for this is to say use JRE 24 or later. Thoughts?

reta · 2024-11-07T12:33:15Z

There may be other examples in code we don't control too. Given that, I'm rather weary of side effects of changing to using a ReentrantLock for this code.

Thanks @shakuzen , yes, exactly, there are tons of places where synchronization happens on the mutable span instance.

That all said, there is some good news: JEP 491: Synchronize Virtual Threads without Pinning is now targeted for Java 24 (to be GA in March 2025).

Correct, it looks the issue will be addressed for everyone

shakuzen · 2024-11-08T04:36:12Z

So I think unless someone has a proposal for concrete changes to consider to avoid the potential deadlock prior to Java 24 when using virtual threads, we can close this with adding a note somewhere about the potential for deadlock with virtual threads and Java 21-23. I think changes that solve this would end up being breaking in some sense, even though this synchronization of MutableSpan isn't part of the API/ABI. Given that, it would probably require significant design work and be something we'd want to do in a new major version. I personally don't think that's worth it given JEP 491, but if others feel differently, please comment.

reta · 2024-11-08T14:49:34Z

So I think unless someone has a proposal for concrete changes to consider to avoid the potential deadlock prior to Java 24 when using virtual threads, we can close this with adding a note somewhere about the potential for deadlock with virtual threads and Java 21-23.

Thanks @shakuzen , I would suggest to wait 24-ea (with JEP-491 integrated) to make sure the issue is gone, and close the issue right after, sounds like a plan?

shakuzen · 2024-11-25T09:48:25Z

I verified that my reproducer with Brave and Zipkin Reporter is fixed with 24-ea build 24 which has JEP 491 integrated with openjdk/jdk@78b8015. @DanielThomas let us know if you find any remaining Brave/Zipkin issue in your testing.

See #1440

DanielThomas added the bug label Jun 27, 2024

shakuzen added a commit that referenced this issue Nov 27, 2024

Note about potential deadlock with virtual threads

11411d1

See #1440

shakuzen mentioned this issue Nov 27, 2024

Note about potential deadlock with virtual threads #1448

Merged

shakuzen closed this as completed in #1448 Nov 27, 2024

shakuzen closed this as completed in 06b47b1 Nov 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Potential deadlock between zipkin-reporter and Brave when used with Virtual Threads #1440

Potential deadlock between zipkin-reporter and Brave when used with Virtual Threads #1440

DanielThomas commented Jun 27, 2024

reta commented Jun 27, 2024

DanielThomas commented Jun 28, 2024

reta commented Jun 28, 2024

DanielThomas commented Jun 28, 2024

reta commented Jun 28, 2024

reta commented Jul 7, 2024

DanielThomas commented Jul 8, 2024

reta commented Jul 9, 2024

DanielThomas commented Jul 12, 2024

shakuzen commented Nov 7, 2024

reta commented Nov 7, 2024

shakuzen commented Nov 8, 2024

reta commented Nov 8, 2024

shakuzen commented Nov 25, 2024

Potential deadlock between zipkin-reporter and Brave when used with Virtual Threads #1440

Potential deadlock between zipkin-reporter and Brave when used with Virtual Threads #1440

Comments

DanielThomas commented Jun 27, 2024

Describe the Bug

Steps to Reproduce

Expected Behaviour

reta commented Jun 27, 2024

DanielThomas commented Jun 28, 2024

reta commented Jun 28, 2024

DanielThomas commented Jun 28, 2024

reta commented Jun 28, 2024

reta commented Jul 7, 2024

DanielThomas commented Jul 8, 2024

reta commented Jul 9, 2024

DanielThomas commented Jul 12, 2024

shakuzen commented Nov 7, 2024

reta commented Nov 7, 2024

shakuzen commented Nov 8, 2024

reta commented Nov 8, 2024

shakuzen commented Nov 25, 2024