
Remove use of lock in cancelRequests #26

Closed
wants to merge 1 commit into from

Conversation

@appilon commented Aug 13, 2020

Found via ad hoc testing/repro. We are able to trigger a race where the lock acquired in the cancelRequests handler deadlocks. The reason for the deadlock is not fully traced, but from what I can tell, s.mu.Lock is not needed when calling s.cancel. This PR may also need to make the lookup of the s.used map concurrency-safe with https://golang.org/pkg/sync/#Map.
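
For reference, a concurrency-safe lookup along those lines could look like the sketch below. This is a hypothetical illustration only; the type and field names are placeholders, not jrpc2's actual internals.

```go
// Hypothetical sketch of replacing a mutex-guarded map with sync.Map.
// The names here (server, used, register, cancel) are placeholders,
// not the real jrpc2 internals.
package sketch

import (
	"context"
	"sync"
)

type server struct {
	used sync.Map // request ID -> context.CancelFunc
}

// register records the cancel function for an in-flight request.
func (s *server) register(id string, cancel context.CancelFunc) {
	s.used.Store(id, cancel)
}

// cancel looks up and invokes the cancel function without taking s.mu.
func (s *server) cancel(id string) bool {
	v, ok := s.used.Load(id)
	if !ok {
		return false
	}
	s.used.Delete(id)
	v.(context.CancelFunc)()
	return true
}
```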

I am new to this codebase, so as I said, I'm unsure why the code deadlocks (I wish I could offer concrete repro steps, but it involves the Terraform language server/VS Code plugin and spamming format-on-save requests). It seems like the lock is acquired in nextRequest but not freed by the time it hits cancelRequests.

Perhaps something about these lines?
https://github.com/creachadair/jrpc2/blob/master/server.go#L218-L222

But I think it's possible that taking the lock in a request handler (given a lock is already acquired in nextRequest) is a mistake, which is why I removed it in this PR.

Relates to hashicorp/terraform-ls#258

@appilon (Author) commented Aug 13, 2020

CI is complaining about a data race; I will swap the map to be concurrency-safe.

@appilon changed the title from "Remove use of lock" to "Remove use of lock in cancelRequests" on Aug 13, 2020
@creachadair (Owner)

I would prefer not to introduce sync.Map. Before we go down that road, I'd like to better understand the workload where you observed the deadlock. Even if you can't share the details, can you talk a bit more about what it does? E.g., is this a case where you have a method handler that is attempting to cancel in-flight requests? Or, more generally, do you have a sense of where the cancellation is being triggered? Is it coming from the client side?

@creachadair (Owner)

Upon further reflection, you did say the race comes in via cancelRequests, which means it can only happen as a result of calling jrpc2.CancelRequest somewhere.

If that's correct, then I suspect there's an unforeseen interaction revealed by #24. If a handler responding to a notification attempts to cancel another request, it could deadlock with the next batch waiting for that same notification to complete (during which time the dispatcher holds the server lock). If that sounds related, I will see if I can put together a repro based on that suspicion.

@appilon (Author) commented Aug 13, 2020

@creachadair Yes, I thought I had traced the deadlock as we spoke, but I'm seeing our LSP is calling jrpc2.CancelRequest, as you mentioned. I will stay tuned for your more detailed repro.

@creachadair (Owner) commented Aug 13, 2020

> @creachadair Yes I thought I had traced the deadlock as we spoke, but I'm seeing it's our LSP calling jrpc2.CancelRequest as you mentioned. Apologies for the red herring but I will stay tuned for your more detailed repro

If that is the cause, I still agree that it's a bug in jrpc2. Even if we can find a workaround to get you unstuck, I'd like to find a more robust solution too, if it's practical.

@creachadair (Owner)

I was able to build a repro for a deadlock around this. Specifically, here's the problematic sequence:

  1. A notification (N) arrives and is dispatched to its handler.
  2. While N is busy doing other work, the dispatcher locks to wait for notifications to clear.
  3. N invokes jrpc2.CancelRequest.

Step (3) attempts to acquire the server lock, and deadlocks with the dispatcher. The crux is a notification handler that attempts to cancel other requests in flight.

This problem was made possible by #24. I believe the solution is for the dispatcher to yield the lock while it waits for previous notifications to settle. I will verify that this actually solves the problem and add some tests to defend against regression.
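
To make that sequence concrete, here is a minimal, self-contained sketch of the same locking shape. It is not the actual jrpc2 code; it just illustrates why holding the lock while waiting for notification handlers deadlocks, and how yielding the lock avoids it.

```go
// Minimal illustration of the deadlock pattern described above.
// This is not jrpc2's code; the names are invented for the sketch.
package sketch

import "sync"

type dispatcher struct {
	mu      sync.Mutex
	pending sync.WaitGroup // tracks in-flight notification handlers
}

// cancelRequest stands in for what a handler reaches via jrpc2.CancelRequest:
// it needs the dispatcher's lock to find and cancel the target request.
func (d *dispatcher) cancelRequest(id string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	// ... look up id and invoke its cancel function ...
	_ = id
}

// nextBatchDeadlock holds the lock while waiting for notifications to clear.
// A notification handler that calls cancelRequest blocks on d.mu and can
// never finish, so pending.Wait never returns: deadlock.
func (d *dispatcher) nextBatchDeadlock() {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.pending.Wait()
}

// nextBatchYield releases the lock before waiting, matching the proposed fix:
// handlers can acquire d.mu, cancel requests, and complete, after which the
// dispatcher re-acquires the lock and continues.
func (d *dispatcher) nextBatchYield() {
	d.mu.Lock()
	// ... capture whatever state is needed under the lock ...
	d.mu.Unlock()

	d.pending.Wait()

	d.mu.Lock()
	defer d.mu.Unlock()
	// ... dispatch the next batch ...
}
```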

@appilon (Author) commented Aug 13, 2020

I think I 80% follow, but I only started looking at all of this today (so I'm still getting up to speed on how this works). Any chance you could file a tracking issue with your fantastic contextual write-up? I will close this PR.

@creachadair (Owner)

Done: #27, and thank you for reporting it.

I will send a PR shortly with the fix and regression tests.

@appilon closed this on Aug 13, 2020
@creachadair (Owner)

@appilon When you have a chance, can you please patch #28 and see if it fixes the issue your customers reported?

@appilon deleted the fix-cancel-deadlock branch on August 13, 2020 at 21:22