
Remove use of lock in cancelRequests #26

Closed
wants to merge 1 commit into from

Conversation

@appilon commented Aug 13, 2020

Found via ad hoc testing/repro. We are able to trigger a race where the lock acquired in the cancelRequests handler deadlocks. The reason for the deadlock is not fully traced, but from what I can tell, s.mu.Lock is not needed when calling s.cancel. This PR may also need to make the lookup of the s.used map concurrency-safe with https://golang.org/pkg/sync/#Map.
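
For reference, a concurrency-safe lookup along those lines could look like the sketch below. This is a hypothetical illustration only; the type and field names are placeholders, not jrpc2's actual internals.

```go
// Hypothetical sketch of replacing a mutex-guarded map with sync.Map.
// The names here (server, used, register, cancel) are placeholders,
// not the real jrpc2 internals.
package sketch

import (
	"context"
	"sync"
)

type server struct {
	used sync.Map // request ID -> context.CancelFunc
}

// register records the cancel function for an in-flight request.
func (s *server) register(id string, cancel context.CancelFunc) {
	s.used.Store(id, cancel)
}

// cancel looks up and invokes the cancel function without taking s.mu.
func (s *server) cancel(id string) bool {
	v, ok := s.used.Load(id)
	if !ok {
		return false
	}
	s.used.Delete(id)
	v.(context.CancelFunc)()
	return true
}
```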

I am new to this codebase, so as I said, I'm unsure why the code deadlocks (I wish I could offer concrete repro steps, but it involves the Terraform language server/VS Code plugin and spamming format-on-save requests). It seems like the lock is acquired in nextRequest but not freed by the time it hits cancelRequests.

Perhaps something about these lines?
https://github.com/creachadair/jrpc2/blob/master/server.go#L218-L222

But I think it's possible that taking the lock in a request handler (given a lock is already acquired in nextRequest) is a mistake, which is why I removed it in this PR.

Relates to hashicorp/terraform-ls#258

@appilon (Author) commented Aug 13, 2020

CI is complaining about a data race; I will swap the map to be concurrency-safe.

@appilon changed the title from "Remove use of lock" to "Remove use of lock in cancelRequests" on Aug 13, 2020
@creachadair (Owner)

I would prefer not to introduce sync.Map. Before we go down that road, I'd like to better understand the workload where you observed the deadlock. Even if you can't share the details, can you talk a bit more about what it does? E.g., is this a case where you have a method handler that is attempting to cancel in-flight requests? Or, more generally, do you have a sense of where the cancellation is being triggered? Is it coming from the client side?

@creachadair (Owner)

Upon further reflection, you did say the race comes in via cancelRequests, which means it can only happen as a result of calling jrpc2.CancelRequest somewhere.

If that's correct, then I suspect there's an unforeseen interaction revealed by #24. If a handler responding to a notification attempts to cancel another request, it could deadlock with the next batch waiting for that same notification to complete (during which time the dispatcher holds the server lock). If that sounds related, I will see if I can put together a repro based on that suspicion.

@appilon (Author) commented Aug 13, 2020

@creachadair Yes, I thought I had traced the deadlock as we spoke, but I'm seeing our LSP is calling jrpc2.CancelRequest, as you mentioned. I will stay tuned for your more detailed repro.

@creachadair (Owner) commented Aug 13, 2020

> @creachadair Yes I thought I had traced the deadlock as we spoke, but I'm seeing it's our LSP calling jrpc2.CancelRequest as you mentioned. Apologies for the red herring but I will stay tuned for your more detailed repro

If that is the cause, I still agree that it's a bug in jrpc2. Even if we can find a workaround to get you unstuck, I'd like to find a more robust solution too, if it's practical.

@creachadair (Owner)

I was able to build a repro for a deadlock around this. Specifically, here's the problematic sequence:

  1. A notification (N) arrives and is dispatched to its handler.
  2. While N is busy doing other work, the dispatcher locks to wait for notifications to clear.
  3. N invokes jrpc2.CancelRequest.

Step (3) attempts to acquire the server lock, and deadlocks with the dispatcher. The crux is a notification handler that attempts to cancel other requests in flight.

This problem was made possible by #24. I believe the solution is for the dispatcher to yield the lock while it waits for previous notifications to settle. I will verify that this actually solves the problem and add some tests to defend against regression.
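
To make that sequence concrete, here is a minimal, self-contained sketch of the same locking shape. It is not the actual jrpc2 code; it just illustrates why holding the lock while waiting for notification handlers deadlocks, and how yielding the lock avoids it.

```go
// Minimal illustration of the deadlock pattern described above.
// This is not jrpc2's code; the names are invented for the sketch.
package sketch

import "sync"

type dispatcher struct {
	mu      sync.Mutex
	pending sync.WaitGroup // tracks in-flight notification handlers
}

// cancelRequest stands in for what a handler reaches via jrpc2.CancelRequest:
// it needs the dispatcher's lock to find and cancel the target request.
func (d *dispatcher) cancelRequest(id string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	// ... look up id and invoke its cancel function ...
	_ = id
}

// nextBatchDeadlock holds the lock while waiting for notifications to clear.
// A notification handler that calls cancelRequest blocks on d.mu and can
// never finish, so pending.Wait never returns: deadlock.
func (d *dispatcher) nextBatchDeadlock() {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.pending.Wait()
}

// nextBatchYield releases the lock before waiting, matching the proposed fix:
// handlers can acquire d.mu, cancel requests, and complete, after which the
// dispatcher re-acquires the lock and continues.
func (d *dispatcher) nextBatchYield() {
	d.mu.Lock()
	// ... capture whatever state is needed under the lock ...
	d.mu.Unlock()

	d.pending.Wait()

	d.mu.Lock()
	defer d.mu.Unlock()
	// ... dispatch the next batch ...
}
```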

@appilon (Author) commented Aug 13, 2020

I think I 80% follow, but I only started looking at all of this today (so I'm still getting up to speed on how this works). Any chance you could file a tracking issue with your fantastic contextual write-up? I will close this PR.

@creachadair (Owner)

Done: #27, and thank you for reporting it.

I will send a PR shortly with the fix and regression tests.

@appilon closed this on Aug 13, 2020
@creachadair (Owner)

@appilon When you have a chance, can you please patch #28 and see if it fixes the issue your customers reported?

@appilon deleted the fix-cancel-deadlock branch on August 13, 2020 at 21:22