-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clusterresolver: handle EDS nacks and resource-not-found errors correctly #6436
Conversation
} | ||
|
||
// If a previously received good configuration exists, continue to use it. | ||
// Else report an empty update that would result in the child policy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"child priority of priority corresponding to EDS reporting Transient Failure due to the recipient of an empty update." also drop the no priorities or localities, there would still be a hierarchy of priorities, but one of them will report TF. It's actually technically weighted target that triggers the TF, so maybe say causing weighted_target balancer down the xDS tree for the EDS priority...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in the other review comment, an empty update would result in that child not being created. I updated the comment anyways, making it reflect reality more than it was previously doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok I see, I just remember having to add TF back from weighted target if addr list is zero in UpdateClientConnState. That's prob why I was under the impression the system was two children with top priority child and second priority child, and first sends TF back. The first one just not being created makes sense too haha :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall just minor nits :). Spurred some interesting discussion with Eric.
xds/internal/balancer/clusterresolver/e2e_test/eds_impl_test.go
Outdated
Show resolved
Hide resolved
xds/internal/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go
Show resolved
Hide resolved
xds/internal/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go
Outdated
Show resolved
Hide resolved
xdsClient, close, err := xdsclient.NewWithConfigForTesting(&bootstrap.Config{ | ||
XDSServer: xdstestutils.ServerConfigForAddress(t, mgmtServer.Address), | ||
NodeProto: &v3corepb.Node{Id: nodeID}, | ||
}, defaultTestWatchExpiryTimeout, time.Duration(0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional, mainly musing. The test above polls for a second, and this requires 500 milliseconds to invoke the resource not found error. Is there a way to write this that doesn't introduce testing time as such? I don't see a way around it but curious your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test above polls for a second.
I don't understand what you are referring to here.
Is there a way to write this that doesn't introduce testing time as such?
For RDS/EDS, the only way to force a resource-not-found error is by the watch timer firing. And since this is an e2e test, there is no way for us to inject such an event other than actually waiting for it to happen. And the only thing we can do here is to pass a really low value for the watch timeout.
Also, the code towards the end of this test, where it verifies fallback does not actually depend on any timings. It simply verifies that an RPC gets routed to a backend in the DNS cluster before the test context expires.
I don't see a way around it but curious your thoughts.
I don't see a way either :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrt timer...could do something like: https://github.com/grpc/grpc-go/blob/master/xds/internal/balancer/outlierdetection/balancer.go#L50 for time.AfterFunc, or https://github.com/grpc/grpc-go/pull/4270/files#diff-cfb871b425a217deb8602c43a41b39ed1378776d5a40bebee381fec0f231a011R51 for a newTimer. But yeah, prob not worth it in this scenario. I also was thinking setting the 500 ms to something like 10ms since I don't think it matters the length.
// TestEDS_BadUpdateWithoutPreviousGoodUpdate tests the case where the | ||
// management server sends a bad update (one that is NACKed by the xDS client). | ||
// Since the cluster_resolver LB policy does not have a previously received good | ||
// update, it is expected to treat this bad update as though it received an | ||
// update with no endpoints. Hence RPCs are expected to fail with "all | ||
// priorities removed" error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble following why we want this error to be "AllPrioritiesRemoved". The priority still exists right? I.e. in the failover case where you have two children top level priority reports TF causing failover. Thus, why is it when there's only one discovery mechanism it doesn't have the same behavior, but has specifically this error? Don't we want a TF from weighted target? Or does that cause this error? Or do we just not create a priority child in one discovery mechanism, but with two you create and send down empty update?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, the EDS discovery mechanism sends an empty xdsresource.EndpointsUpdate
to the cluster_resolver LB policy, and this results in a configuration that does not contain priorities. When such a configuration is sent to the priority LB policy, this is error that we expect.
Do you see any downsides to doing this? Or do you see any advantages of sending a configuration with one priority (but no endpoints) to the priority LB policy and letting weighted_target set the channel to TRANSIENT_FAILURE. I don't see any benefits of doing so, but would like to hear your opinion on it.
Or do we just not create a priority child in one discovery mechanism, but with two you create and send down empty update?
That is not true. Even in the aggregate case, if the EDS discovery mechanism returns an emoty xdsresource.Endpoints
update, the generated config will not have a priority child for the EDS discovery mechanism. (Note that , it will continue to have the EDS child if there was a previously received good response).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, right. As per my other reply, I was under the impression that the system was "Or do you see any advantages of sending a configuration with one priority (but no endpoints) to the priority LB policy and letting weighted_target set the channel to TRANSIENT_FAILURE". Thus, I made this comment as the error did not make sense to me. The explanation of how the system works sgtm and is cleaner, and the error makes sense. Curious though, is this scoped in a gRFC? Or just an implementation detail per language not explicitly defined. We should I think shoot for implementation consistency between other languages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can discuss this in the chat room to see what the other languages do and whether this is scoped out in any gRFC.
xds/internal/balancer/clusterresolver/e2e_test/eds_impl_test.go
Outdated
Show resolved
Hide resolved
xds/internal/balancer/clusterresolver/e2e_test/eds_impl_test.go
Outdated
Show resolved
Hide resolved
xds/internal/balancer/clusterresolver/e2e_test/eds_impl_test.go
Outdated
Show resolved
Hide resolved
xds/internal/balancer/clusterresolver/e2e_test/eds_impl_test.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to approve since overall this looked good to me, and I'm taking leave. Please get to the comments though :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review.
xds/internal/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go
Show resolved
Hide resolved
xdsClient, close, err := xdsclient.NewWithConfigForTesting(&bootstrap.Config{ | ||
XDSServer: xdstestutils.ServerConfigForAddress(t, mgmtServer.Address), | ||
NodeProto: &v3corepb.Node{Id: nodeID}, | ||
}, defaultTestWatchExpiryTimeout, time.Duration(0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test above polls for a second.
I don't understand what you are referring to here.
Is there a way to write this that doesn't introduce testing time as such?
For RDS/EDS, the only way to force a resource-not-found error is by the watch timer firing. And since this is an e2e test, there is no way for us to inject such an event other than actually waiting for it to happen. And the only thing we can do here is to pass a really low value for the watch timeout.
Also, the code towards the end of this test, where it verifies fallback does not actually depend on any timings. It simply verifies that an RPC gets routed to a backend in the DNS cluster before the test context expires.
I don't see a way around it but curious your thoughts.
I don't see a way either :)
xds/internal/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go
Outdated
Show resolved
Hide resolved
// TestEDS_BadUpdateWithoutPreviousGoodUpdate tests the case where the | ||
// management server sends a bad update (one that is NACKed by the xDS client). | ||
// Since the cluster_resolver LB policy does not have a previously received good | ||
// update, it is expected to treat this bad update as though it received an | ||
// update with no endpoints. Hence RPCs are expected to fail with "all | ||
// priorities removed" error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, the EDS discovery mechanism sends an empty xdsresource.EndpointsUpdate
to the cluster_resolver LB policy, and this results in a configuration that does not contain priorities. When such a configuration is sent to the priority LB policy, this is error that we expect.
Do you see any downsides to doing this? Or do you see any advantages of sending a configuration with one priority (but no endpoints) to the priority LB policy and letting weighted_target set the channel to TRANSIENT_FAILURE. I don't see any benefits of doing so, but would like to hear your opinion on it.
Or do we just not create a priority child in one discovery mechanism, but with two you create and send down empty update?
That is not true. Even in the aggregate case, if the EDS discovery mechanism returns an emoty xdsresource.Endpoints
update, the generated config will not have a priority child for the EDS discovery mechanism. (Note that , it will continue to have the EDS child if there was a previously received good response).
xds/internal/balancer/clusterresolver/e2e_test/eds_impl_test.go
Outdated
Show resolved
Hide resolved
xds/internal/balancer/clusterresolver/e2e_test/eds_impl_test.go
Outdated
Show resolved
Hide resolved
} | ||
|
||
// If a previously received good configuration exists, continue to use it. | ||
// Else report an empty update that would result in the child policy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in the other review comment, an empty update would result in that child not being created. I updated the comment anyways, making it reflect reality more than it was previously doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review.
EDS nacks and resource-not-found errors are to be treated by the cluster_resolver LB policy as though it received an EDS update with no endpoints. This would mean that for aggregate clusters, it would result in fallback (if appropriate), and non-aggregate clusters, it would result in the channel moving to
TRANSIENT_FAILURE
(because the child policy, priority LB, will put the channel inTRANSIENT_FAILURE
when there are no priorities in its configuration).The only difference between how a NACK error and a resource-not-found error is handled by the cluster_resolver LB policy is that in the NACK case, the LB policy ignores the error if there was a previous result. The resource-not-found error is received by the cluster_resolver LB policy only when it does not have a previously received good update. This is because, the xds client will not propagate this error to its watchers (because EDS is not state-of-the-world) if it already has the resource in its cache.
Fixes #6217
RELEASE NOTES: