Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[internal] add network-level gRPC timeouts #16196

Merged
merged 8 commits into from
Dec 9, 2022

Conversation

tdyas
Copy link
Contributor

@tdyas tdyas commented Jul 15, 2022

Add network level gRPC timeouts starting with the check_action_cache function (which maps to the REAPI GetActionResult API).

Fixes #16188.

@tdyas
Copy link
Contributor Author

tdyas commented Jul 15, 2022

This PR currently vendors the Tower TimeoutLayer in order to allow the addition of a timeout metric. I am not enamored with this approach. I am going to experiment with writing a custom layer that can detect the timeout condition from Tower's TimeoutLayer (by sitting above it in the stack) and just emit the metric that way. Thoughts?

@tdyas tdyas added the category:internal CI, fixes for not-yet-released features, etc. label Jul 15, 2022
@tdyas tdyas force-pushed the network_level_grpc_timeouts branch 2 times, most recently from e2edebc to 775bc67 Compare July 21, 2022 00:32
@tdyas tdyas requested review from stuhood and Eric-Arellano July 21, 2022 00:49
@tdyas tdyas marked this pull request as ready for review July 21, 2022 00:49
@tdyas
Copy link
Contributor Author

tdyas commented Jul 21, 2022

(Forgot to mark this for review so I could ask the question in earlier comment. Doing so now. Doh.)

@Eric-Arellano
Copy link
Contributor

Thanks!

This PR currently vendors the Tower TimeoutLayer in order to allow the addition of a timeout metric.

I'm not following why we need to vendor, rather than importing the code?

Either way, should probably add comments that this is vendored, and what source it comes from.

@tdyas
Copy link
Contributor Author

tdyas commented Jul 21, 2022

I'm not following why we need to vendor, rather than importing the code?

Just look at the Pants-specific code in the file that accesses workunit store ...

@tdyas
Copy link
Contributor Author

tdyas commented Jul 21, 2022

... which relates to my comment at #16196 (comment) about finding a different way to be able to increment that counter.

@tdyas
Copy link
Contributor Author

tdyas commented Jul 28, 2022

ping

@tdyas tdyas force-pushed the network_level_grpc_timeouts branch from 775bc67 to ffc9296 Compare July 28, 2022 21:34
@tdyas
Copy link
Contributor Author

tdyas commented Jul 28, 2022

Latest commit ditches the vendored timeout layer and just uses a layer to count errors (which will be timeouts) coming from the Tower TimeoutLayer.

Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thank you.

@tdyas tdyas force-pushed the network_level_grpc_timeouts branch from ffc9296 to d1d5782 Compare November 10, 2022 14:12
@tdyas tdyas force-pushed the network_level_grpc_timeouts branch from d1d5782 to 46e7668 Compare December 8, 2022 11:37
Tom Dyas added 8 commits December 8, 2022 23:59
[ci skip-build-wheels]
[ci skip-build-wheels]
[ci skip-build-wheels]
[ci skip-build-wheels]
[ci skip-build-wheels]
[ci skip-build-wheels]
@tdyas tdyas force-pushed the network_level_grpc_timeouts branch from 46e7668 to f6b19a7 Compare December 8, 2022 22:59
@tdyas tdyas merged commit 9563aa6 into pantsbuild:main Dec 9, 2022
@tdyas tdyas deleted the network_level_grpc_timeouts branch December 9, 2022 03:45
@Eric-Arellano
Copy link
Contributor

Awesome, thank you!

stuhood added a commit that referenced this pull request Apr 6, 2023
…18695)

#16196 moved the cache read timeout down to the network layer, which
made it much more accurate. But cache lookups also involve a number of
calls to a remote `Store`, which did not have their own timeout.

This change adds an RPC timeout for `Store` accesses to allow for
retries of tar-pitted remote store RPCs, and adjusts the naming of the
`--remote-cache-rpc-timeout-millis` option to make it clear that it
applies to all cache operations (including writes).
stuhood added a commit to stuhood/pants that referenced this pull request Apr 6, 2023
…antsbuild#18695)

pantsbuild#16196 moved the cache read timeout down to the network layer, which
made it much more accurate. But cache lookups also involve a number of
calls to a remote `Store`, which did not have their own timeout.

This change adds an RPC timeout for `Store` accesses to allow for
retries of tar-pitted remote store RPCs, and adjusts the naming of the
`--remote-cache-rpc-timeout-millis` option to make it clear that it
applies to all cache operations (including writes).
stuhood added a commit that referenced this pull request Apr 7, 2023
…(Cherry-pick of #18695) (#18697)

#16196 moved the cache read timeout down to the network layer, which
made it much more accurate. But cache lookups also involve a number of
calls to a remote `Store`, which did not have their own timeout.

This change adds an RPC timeout for `Store` accesses to allow for
retries of tar-pitted remote store RPCs, and adjusts the naming of the
`--remote-cache-rpc-timeout-millis` option to make it clear that it
applies to all cache operations (including writes).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category:internal CI, fixes for not-yet-released features, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remote cache read timeout should be applied below the concurrency limit
3 participants