Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clusterresolver: fix deadlock when dns resolver responds inline with update or error at build time #6563

Merged
merged 4 commits into from
Aug 23, 2023

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Aug 17, 2023

clusterresolver LB policy currently deadlocks if the dns resolver reports an update or error inline at build time. This is because the dns resolver is built while holding a lock, and the same lock needs to be grabbed to handle an error or update from the resolver.

This was not caught in our current tests because the dns resolver was being overridden with a fake one. I switched as many tests as possible to use the real dns resolver. I also ensure that the dns resolver pushes update inline at build time because I pass it the actual host:port and not a name to be resolved.

I ran into this issue when I fixing some tests in the cds LB policy.

Fixes #6562

RELEASE NOTES:

  • clusterresolver: fix deadlock when dns resolver responds inline with update or error at build time

@easwars easwars requested a review from zasweq August 17, 2023 23:59
@easwars easwars added this to the 1.58 Release milestone Aug 17, 2023
Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments.

Comment on lines 104 to 117
ret.serializer.Schedule(func(context.Context) {
r, err := newDNS(resolver.Target{URL: *u}, ret, resolver.BuildOptions{})
if err == nil {
ret.dnsR = r
return
}

if ret.logger.V(2) {
ret.logger.Infof("Failed to build DNS resolver for target %q: %v", target, err)
}
ret.mu.Lock()
ret.updateReceived = true
ret.mu.Unlock()
ret.topLevelResolver.onUpdate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems correct thinking about it. My musing is super minor here. Previously, when you called this function and an onUpdate() gets called, it's considered having "received" configuration for this particular discovery mechanism and will trigger fallback if DNS Resolver hasn't sent anything yet. Now, it happens async (and we wait for the dns resolver to build before to set the bool). Oh, I guess we can't consider this discovery mechanism having had a chance to received an update before the dns resolver has a chance to return results inline. I was mainly concerned now there's a new time frame between when we build this resource resolver and when this callback actually executes where previously before it would build the config and trigger fallback (in the case of no update sent so no addrs), and now it just waits until onUpdate() is called here. But this wait seems minor and ok. So seems correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this comment. I realized that I was not handling all cases correctly. Specifically, if a url parse failure happened, with my previous commit, things would still deadlock.

So, I thought about this for a while, and came to the conclusion that the correct place to handle this is in the resourceResolver component. So, my current commit handles this:

  • it makes the onUpdate non blocking, but pushing the call to generateLocked via a callback serializer
  • make it possible for the cluster_resolver LB policy to notify the resourceResolver whether it is being actually closed or not (in stop())

The use of the callback serializer is not absolutely required in the resourceResolver. I could as well have pushed a signal on an unbounded buffer, and have a goroutine read from it. But using a callback serializer just made is easier. Let me know if you have any questions/concerns on this approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I see now about url parse error triggering deadlock with onUpdate(). There was a discussion about adding an nack validation in the xDS Client in the xDS Chat, but the discussion got tabled. Then there would no longer need to be this error handling. Let me think some about what layer you put the buffer of callbacks at.

if dr.dnsR != nil {
dr.dnsR.Close()
}
dr.serializerCancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for the callback serializer makes it seem like once this context is cancelled - "It is guaranteed that no callbacks will be added once this context is canceled" it can't add callbacks (and reading the run() goroutine in that component backs it up: https://github.com/easwars/grpc-go/blob/02463732635a827362bcfc44c7169d1131336e85/internal/grpcsync/callback_serializer.go#L84. Should this come after the scheduling on line 150?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change has been deleted.

The callback serializer's Schedule method returns a value which indicates whether or not the callback was scheduled, and it guarantees that all scheduled callbacks are executed even if the provided context is cancelled. It also provides a Done() method which the caller can block on after cancelling the context to be 100% sure that the callback serializer is done with everything that it had to execute and that it has freed up all its resources.

Copy link
Contributor

@zasweq zasweq Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. But wouldn't this recv happen: https://github.com/easwars/grpc-go/blob/02463732635a827362bcfc44c7169d1131336e85/internal/grpcsync/callback_serializer.go#L84, which would trigger the possibility of grabbing this close mu and subsequent code block (before Schedule() runs): https://github.com/easwars/grpc-go/blob/02463732635a827362bcfc44c7169d1131336e85/internal/grpcsync/callback_serializer.go#L98 (setting closed to true), and then Schedule() fails here: https://github.com/easwars/grpc-go/blob/02463732635a827362bcfc44c7169d1131336e85/internal/grpcsync/callback_serializer.go#L71? Anyways, this is now no longer relevant, but I do think there was a misordering of operations here.

@zasweq zasweq assigned easwars and unassigned zasweq Aug 21, 2023
@easwars easwars assigned zasweq and unassigned easwars Aug 22, 2023
Comment on lines +241 to +244
if closing {
rr.serializerCancel()
<-rr.serializer.Done()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to close the serializer without causing any queued unran callbacks to execute? I don't think there is, but should we add that method to that type? It feels like a waste to execute all the potentially operations (i.e. generateLocked()) when we know the whole cluster resolver component is closing anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is currently no way to cause the serializer to shut down without running queued callbacks. In fact, the serializer in its first version used to do that, but when I was making changes for channel idleness, I quickly realized that the more common case was the one where it is guaranteed that all scheduled callbacks are run.

It should be trivial for the user of the serializer to have some logic to turn the callback into a no-op when it knows that the context passed to the serializer has been cancelled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. So you think that say if you have 5 generateLocked() calls queued, it's fine to just run them before closing, which would search through all discovery mechanisms. Eh, whatever, minor, in practice won't scale out of proportion.

@@ -280,7 +280,7 @@ func (b *clusterResolverBalancer) handleErrorFromUpdate(err error, fromParent bo
// EDS resource was removed. No action needs to be taken for this, and we
// should continue watching the same EDS resource.
if fromParent && xdsresource.ErrType(err) == xdsresource.ErrorTypeResourceNotFound {
b.resourceWatcher.stop()
b.resourceWatcher.stop(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a bit of trouble following this logic (false plumbed down). Is it to support the use case described in this comment:

// Save the previous childrenMap to stop the children outside the mutex,
(i.e. reusing the lb policy in the future, so the component callback serializer the resource_resolver holds onto is still active?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is basically to handle the comment here:

// stop() is called when the LB policy is closed or when the underlying

stop() is called when:

  • the CDS resource is deleted, or
  • the LB policy is being stopped

And in the first case, when the CDS resource is added back, the cluster_resolver will get a config update with new mechanisms and will need to process it.

I initially thought about cancelling the serializer in stop() unconditionally and recreate it in updateMechanisms(), instead of in newResourceResolver. But this was hard to do because then you need to protect access to the serializer with a mutex, and if we have to do that, then there is no way we can guarantee that onUpdate won't block on the mutex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sounds good, thanks for the explanation. It's keeping it's lifecycle coupled with this resource resolver type, which sounds fine since it's a buffer the resource resolver type uses.

@zasweq zasweq assigned easwars and unassigned zasweq Aug 23, 2023
@easwars easwars assigned zasweq and unassigned easwars Aug 23, 2023
Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@zasweq zasweq assigned easwars and unassigned zasweq Aug 23, 2023
@easwars easwars merged commit 4c9777c into grpc:master Aug 23, 2023
1 check passed
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 20, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clusterresolver: deadlock when updating dns discovery mechanism
2 participants