Relay client should limit max concurrency #1140

Open · cody-littley wants to merge 10 commits into master

Conversation

cody-littley (Contributor)

Why are these changes needed?

In preprod, we observe relays rate limiting concurrent GetChunks() requests. This is undesirable in an environment where the relays are under neither intentional attack nor high load.

The following screenshot is a histogram of the logs reporting concurrency throttling.

[Screenshot (2025-01-23): histogram of log messages reporting concurrency throttling]

This PR introduces two primary changes:

  • it increases the default maximum permissible concurrency per relay client from 1 to 2
  • it adds logic to the relay client so that it does not exceed the maximum permissible concurrency (a minimal sketch of this approach follows below)

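The limiting behavior described above can be pictured as a per-relay counting semaphore that is acquired before each request and released afterwards. The sketch below is illustrative only and is not the PR's actual code; the names (concurrencyLimiter, newConcurrencyLimiter, acquire, release) are assumptions, and the gRPC call is replaced by a sleep.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// concurrencyLimiter caps the number of in-flight requests to a single relay
// using a buffered channel as a counting semaphore. Illustrative sketch only;
// the PR's actual implementation may differ.
type concurrencyLimiter struct {
	sem chan struct{}
}

func newConcurrencyLimiter(maxConcurrency int) *concurrencyLimiter {
	return &concurrencyLimiter{sem: make(chan struct{}, maxConcurrency)}
}

// acquire blocks until a slot is free or the context is cancelled, trading a
// small delay for not tripping the relay's concurrency rate limit.
func (l *concurrencyLimiter) acquire(ctx context.Context) error {
	select {
	case l.sem <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (l *concurrencyLimiter) release() {
	<-l.sem
}

func main() {
	limiter := newConcurrencyLimiter(2) // e.g. at most 2 concurrent GetChunks calls per relay

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), time.Second)
			defer cancel()
			if err := limiter.acquire(ctx); err != nil {
				fmt.Printf("request %d not started: %v\n", i, err)
				return
			}
			defer limiter.release()
			fmt.Printf("request %d running\n", i)
			time.Sleep(100 * time.Millisecond) // stand-in for the gRPC call
		}(i)
	}
	wg.Wait()
}
```

With a limit of 2, the four goroutines above run at most two at a time; the blocking-versus-failing trade-off this implies is the subject of the review thread further down.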
Signed-off-by: Cody Littley <cody@eigenlabs.org>
cody-littley self-assigned this on Jan 23, 2025
api/clients/v2/relay_client.go: two outdated review threads (resolved)
```diff
@@ -199,7 +199,7 @@ var (
 	Usage: "Max number of concurrent GetChunk operations per client",
 	Required: false,
 	EnvVar: common.PrefixEnvVar(envVarPrefix, "MAX_CONCURRENT_GET_CHUNK_OPS_CLIENT"),
-	Value: 1,
+	Value: 2, // default value should stay in sync with the default value of node.Config.RelayConcurrency
```
Contributor:

Can we increase this further, e.g. to 8 or 16? (Obviously this is just the default value and the real value will be configured in the cluster, but I'm wondering what the most accommodating value would be for clients calling the relay.)

Contributor Author:

Increased to 8.
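The inline comment in the hunk above asks for the client default to stay in sync with node.Config.RelayConcurrency. One way to express that, sketched below purely as an illustration, is to hang both defaults off a single shared constant; the package and the name DefaultRelayConcurrency are hypothetical and not part of this PR.

```go
// Hypothetical sketch, not part of this PR: a single shared constant that both
// the client flag default and node.Config.RelayConcurrency could reference,
// so the two defaults cannot drift apart.
package defaults

// DefaultRelayConcurrency is the default number of concurrent GetChunks
// requests allowed per relay.
const DefaultRelayConcurrency = 8
```

The flag definition above would then use Value: defaults.DefaultRelayConcurrency rather than repeating the literal in two places.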

```diff
@@ -95,7 +111,17 @@ func (c *relayClient) GetBlob(ctx context.Context, relayKey corev2.RelayKey, blo
 		return nil, err
 	}
 
-	res, err := client.GetBlob(ctx, &relaygrpc.GetBlobRequest{
+	select {
```
Contributor:

Why don't we return the error immediately (in the case of rate limiting at the relay, it will return a ResourceExhausted error) instead of blocking and waiting? This would allow the user of this client to handle the error however it wants.

Contributor Author:

Currently we don't have retry logic when fetching from a relay. The idea here was that it is better to slightly delay a request to a relay than to fail to fetch data (and potentially not sign a batch).

Contributor (@ian-shim, Jan 28, 2025):

I think this choice should be made not inside the client but by the users of the client, with the client being a thin wrapper for interacting with the relay.
Should we add this functionality on the node side if we really do want this kind of limiting logic on the client side?
I'm fine with not putting this limiting mechanism in the client at all, in favor of potentially retrying with backoff in case of a ResourceExhausted error. Because the users of this relay client don't know the rate limiting configuration set on the relay, it's highly likely that the client-side accounting will not be correct.
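The caller-side alternative described here, surfacing ResourceExhausted immediately and retrying with backoff, might look roughly like the sketch below. It is an illustration of the idea, not code from this PR; getChunksWithBackoff, fetchChunks, and the retry parameters are assumptions, and only the standard gRPC status and codes packages are used.

```go
package relayretry

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// getChunksWithBackoff retries fetchChunks only when the relay reports
// ResourceExhausted, doubling the wait between attempts. fetchChunks stands
// in for the real relay client call and is not part of the PR.
func getChunksWithBackoff(
	ctx context.Context,
	fetchChunks func(context.Context) ([][]byte, error),
	maxAttempts int,
	initialBackoff time.Duration,
) ([][]byte, error) {
	backoff := initialBackoff
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		chunks, err := fetchChunks(ctx)
		if err == nil {
			return chunks, nil
		}
		lastErr = err
		// Only rate-limit responses are worth retrying; other errors surface immediately.
		if status.Code(err) != codes.ResourceExhausted {
			return nil, err
		}
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, lastErr
}
```

Whether this lives in the node or in the client is exactly the design question raised in this thread; the sketch only shows what the retry side of that trade-off could look like.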

api/clients/v2/relay_client.go: outdated review thread (resolved)
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>