Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clusterresolver: handle EDS nacks and resource-not-found errors correctly #6436

Merged
merged 2 commits into from
Jul 11, 2023

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Jul 6, 2023

EDS nacks and resource-not-found errors are to be treated by the cluster_resolver LB policy as though it received an EDS update with no endpoints. This would mean that for aggregate clusters, it would result in fallback (if appropriate), and non-aggregate clusters, it would result in the channel moving to TRANSIENT_FAILURE (because the child policy, priority LB, will put the channel in TRANSIENT_FAILURE when there are no priorities in its configuration).

The only difference between how a NACK error and a resource-not-found error is handled by the cluster_resolver LB policy is that in the NACK case, the LB policy ignores the error if there was a previous result. The resource-not-found error is received by the cluster_resolver LB policy only when it does not have a previously received good update. This is because, the xds client will not propagate this error to its watchers (because EDS is not state-of-the-world) if it already has the resource in its cache.

Fixes #6217

RELEASE NOTES:

  • clusterresolver: handle EDS nacks and resource-not-found errors correctly

@easwars easwars requested a review from zasweq July 6, 2023 23:33
@easwars easwars added this to the 1.57 Release milestone Jul 6, 2023
}

// If a previously received good configuration exists, continue to use it.
// Else report an empty update that would result in the child policy
Copy link
Contributor

@zasweq zasweq Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"child priority of priority corresponding to EDS reporting Transient Failure due to the recipient of an empty update." also drop the no priorities or localities, there would still be a hierarchy of priorities, but one of them will report TF. It's actually technically weighted target that triggers the TF, so maybe say causing weighted_target balancer down the xDS tree for the EDS priority...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the other review comment, an empty update would result in that child not being created. I updated the comment anyways, making it reflect reality more than it was previously doing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ok I see, I just remember having to add TF back from weighted target if addr list is zero in UpdateClientConnState. That's prob why I was under the impression the system was two children with top priority child and second priority child, and first sends TF back. The first one just not being created makes sense too haha :).

Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall just minor nits :). Spurred some interesting discussion with Eric.

xdsClient, close, err := xdsclient.NewWithConfigForTesting(&bootstrap.Config{
XDSServer: xdstestutils.ServerConfigForAddress(t, mgmtServer.Address),
NodeProto: &v3corepb.Node{Id: nodeID},
}, defaultTestWatchExpiryTimeout, time.Duration(0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional, mainly musing. The test above polls for a second, and this requires 500 milliseconds to invoke the resource not found error. Is there a way to write this that doesn't introduce testing time as such? I don't see a way around it but curious your thoughts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test above polls for a second.

I don't understand what you are referring to here.

Is there a way to write this that doesn't introduce testing time as such?

For RDS/EDS, the only way to force a resource-not-found error is by the watch timer firing. And since this is an e2e test, there is no way for us to inject such an event other than actually waiting for it to happen. And the only thing we can do here is to pass a really low value for the watch timeout.

Also, the code towards the end of this test, where it verifies fallback does not actually depend on any timings. It simply verifies that an RPC gets routed to a backend in the DNS cluster before the test context expires.

I don't see a way around it but curious your thoughts.

I don't see a way either :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrt timer...could do something like: https://github.com/grpc/grpc-go/blob/master/xds/internal/balancer/outlierdetection/balancer.go#L50 for time.AfterFunc, or https://github.com/grpc/grpc-go/pull/4270/files#diff-cfb871b425a217deb8602c43a41b39ed1378776d5a40bebee381fec0f231a011R51 for a newTimer. But yeah, prob not worth it in this scenario. I also was thinking setting the 500 ms to something like 10ms since I don't think it matters the length.

Comment on lines +859 to +864
// TestEDS_BadUpdateWithoutPreviousGoodUpdate tests the case where the
// management server sends a bad update (one that is NACKed by the xDS client).
// Since the cluster_resolver LB policy does not have a previously received good
// update, it is expected to treat this bad update as though it received an
// update with no endpoints. Hence RPCs are expected to fail with "all
// priorities removed" error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having trouble following why we want this error to be "AllPrioritiesRemoved". The priority still exists right? I.e. in the failover case where you have two children top level priority reports TF causing failover. Thus, why is it when there's only one discovery mechanism it doesn't have the same behavior, but has specifically this error? Don't we want a TF from weighted target? Or does that cause this error? Or do we just not create a priority child in one discovery mechanism, but with two you create and send down empty update?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, the EDS discovery mechanism sends an empty xdsresource.EndpointsUpdate to the cluster_resolver LB policy, and this results in a configuration that does not contain priorities. When such a configuration is sent to the priority LB policy, this is error that we expect.

Do you see any downsides to doing this? Or do you see any advantages of sending a configuration with one priority (but no endpoints) to the priority LB policy and letting weighted_target set the channel to TRANSIENT_FAILURE. I don't see any benefits of doing so, but would like to hear your opinion on it.

Or do we just not create a priority child in one discovery mechanism, but with two you create and send down empty update?

That is not true. Even in the aggregate case, if the EDS discovery mechanism returns an emoty xdsresource.Endpoints update, the generated config will not have a priority child for the EDS discovery mechanism. (Note that , it will continue to have the EDS child if there was a previously received good response).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, right. As per my other reply, I was under the impression that the system was "Or do you see any advantages of sending a configuration with one priority (but no endpoints) to the priority LB policy and letting weighted_target set the channel to TRANSIENT_FAILURE". Thus, I made this comment as the error did not make sense to me. The explanation of how the system works sgtm and is cleaner, and the error makes sense. Curious though, is this scoped in a gRFC? Or just an implementation detail per language not explicitly defined. We should I think shoot for implementation consistency between other languages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can discuss this in the chat room to see what the other languages do and whether this is scoped out in any gRFC.

@zasweq zasweq assigned easwars and unassigned zasweq Jul 8, 2023
Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going to approve since overall this looked good to me, and I'm taking leave. Please get to the comments though :).

Copy link
Contributor Author

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review.

xdsClient, close, err := xdsclient.NewWithConfigForTesting(&bootstrap.Config{
XDSServer: xdstestutils.ServerConfigForAddress(t, mgmtServer.Address),
NodeProto: &v3corepb.Node{Id: nodeID},
}, defaultTestWatchExpiryTimeout, time.Duration(0))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test above polls for a second.

I don't understand what you are referring to here.

Is there a way to write this that doesn't introduce testing time as such?

For RDS/EDS, the only way to force a resource-not-found error is by the watch timer firing. And since this is an e2e test, there is no way for us to inject such an event other than actually waiting for it to happen. And the only thing we can do here is to pass a really low value for the watch timeout.

Also, the code towards the end of this test, where it verifies fallback does not actually depend on any timings. It simply verifies that an RPC gets routed to a backend in the DNS cluster before the test context expires.

I don't see a way around it but curious your thoughts.

I don't see a way either :)

Comment on lines +859 to +864
// TestEDS_BadUpdateWithoutPreviousGoodUpdate tests the case where the
// management server sends a bad update (one that is NACKed by the xDS client).
// Since the cluster_resolver LB policy does not have a previously received good
// update, it is expected to treat this bad update as though it received an
// update with no endpoints. Hence RPCs are expected to fail with "all
// priorities removed" error.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, the EDS discovery mechanism sends an empty xdsresource.EndpointsUpdate to the cluster_resolver LB policy, and this results in a configuration that does not contain priorities. When such a configuration is sent to the priority LB policy, this is error that we expect.

Do you see any downsides to doing this? Or do you see any advantages of sending a configuration with one priority (but no endpoints) to the priority LB policy and letting weighted_target set the channel to TRANSIENT_FAILURE. I don't see any benefits of doing so, but would like to hear your opinion on it.

Or do we just not create a priority child in one discovery mechanism, but with two you create and send down empty update?

That is not true. Even in the aggregate case, if the EDS discovery mechanism returns an emoty xdsresource.Endpoints update, the generated config will not have a priority child for the EDS discovery mechanism. (Note that , it will continue to have the EDS child if there was a previously received good response).

}

// If a previously received good configuration exists, continue to use it.
// Else report an empty update that would result in the child policy
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the other review comment, an empty update would result in that child not being created. I updated the comment anyways, making it reflect reality more than it was previously doing.

Copy link
Contributor Author

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review.

@easwars easwars merged commit bf5b7ae into grpc:master Jul 11, 2023
1 check passed
@easwars easwars deleted the clusterresolver_fix_6217 branch July 11, 2023 02:56
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 9, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

xds: fix cluster_resolver LB policy behavior upon receipt of NACKed EDS resource
2 participants