Relay client should limit max concurrency #1140
base: master
Conversation
Signed-off-by: Cody Littley <cody@eigenlabs.org>
relay/cmd/flags/flags.go
@@ -199,7 +199,7 @@ var (
 	Usage: "Max number of concurrent GetChunk operations per client",
 	Required: false,
 	EnvVar: common.PrefixEnvVar(envVarPrefix, "MAX_CONCURRENT_GET_CHUNK_OPS_CLIENT"),
-	Value: 1,
+	Value: 2, // default value should stay in sync with the default value of node.Config.RelayConcurrency
Can we increase this further, like to 8 or 16? (Obviously this is just the default value and the real value will be configured in the cluster, but I'm wondering what the most accommodating value would be for clients calling the relay.)
increased to 8
api/clients/v2/relay_client.go
@@ -95,7 +111,17 @@ func (c *relayClient) GetBlob(ctx context.Context, relayKey corev2.RelayKey, blo
 		return nil, err
 	}

-	res, err := client.GetBlob(ctx, &relaygrpc.GetBlobRequest{
+	select {
Why don't we return the error immediately (in the case of rate limits at the relay, it will return a ResourceExhausted error) instead of blocking and waiting? This would allow the user of this client to handle the error however it wants.
Currently we don't have retry logic when fetching from a relay. The idea here was that it is better to slightly delay a request to a relay than to fail to fetch data (and potentially not sign a batch).
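(For reference, the blocking approach described above is commonly implemented in Go with a buffered channel used as a counting semaphore. The sketch below is illustrative only, not the PR's actual code: limitedClient, newLimitedClient, and getBlob are hypothetical names.)

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// limitedClient caps the number of concurrent outbound calls using a
// buffered channel as a counting semaphore: channel capacity == limit.
type limitedClient struct {
	sem chan struct{}
}

func newLimitedClient(maxConcurrent int) *limitedClient {
	return &limitedClient{sem: make(chan struct{}, maxConcurrent)}
}

// getBlob blocks until a slot is free (or the context is cancelled),
// mirroring the "delay rather than fail" behavior described above.
func (c *limitedClient) getBlob(ctx context.Context, key string) (string, error) {
	select {
	case c.sem <- struct{}{}: // acquire a slot
	case <-ctx.Done():
		return "", ctx.Err()
	}
	defer func() { <-c.sem }() // release the slot when the call returns

	time.Sleep(10 * time.Millisecond) // placeholder for the actual RPC
	return "blob:" + key, nil
}

func main() {
	c := newLimitedClient(8) // 8 matches the default discussed above
	res, err := c.getBlob(context.Background(), "example")
	fmt.Println(res, err)
}
```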
I think this choice should be made not inside the client but by the users of the client, with the client being a thin wrapper for interacting with the relay. If we really do want this kind of limiting logic on the client side, should we add it on the node side instead?
I'm fine with not putting this limiting mechanism into the client at all, in favor of the node potentially retrying with backoff in the case of a ResourceExhausted error. Because the users of this relay client don't know the rate limiting configuration set in the relay, it's highly likely that the client-side accounting will not be correct.
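(A minimal sketch of the retry-with-backoff alternative suggested here, assuming standard gRPC status codes; retryOnExhausted and its signature are hypothetical, not part of the actual client API.)

```go
package relayretry

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// retryOnExhausted retries a call with exponential backoff whenever the
// relay reports ResourceExhausted, and returns immediately on success or
// on any other error so the caller can handle it as it sees fit.
func retryOnExhausted(ctx context.Context, call func(context.Context) error) error {
	backoff := 50 * time.Millisecond
	const maxAttempts = 5
	for attempt := 1; ; attempt++ {
		err := call(ctx)
		// Return on success, on a non-retryable error, or after the last attempt.
		if status.Code(err) != codes.ResourceExhausted || attempt == maxAttempts {
			return err
		}
		select {
		case <-time.After(backoff):
			backoff *= 2 // double the wait between attempts
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```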
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Why are these changes needed?
In preprod, we observe relays rate limiting concurrent GetChunks() requests. This is not ideal in an environment where the relays are not under intentional attack or heavy load. The following screenshot is a histogram of the logs reporting concurrency throttling.
This PR introduces two primary changes: