Remote-cached build fails with javax.net.ssl.SSLException: handshake timed out #10159
Comments
The HTTP client does not retry failed downloads or uploads (I only checked for downloads). If it prints a WARNING, it should fall back to local execution; if it prints an ERROR, it probably does not. The HTTP client supports only HTTP/1.1, with a maximum number of concurrent connections.
I have a fairly reliable repro of at least one subclass of issues against a local nginx over HTTP (not TLS). My BUILD file:
My nginx config:
My Bazel command line:
My hypothesis is that it does not handle 'connection:close' headers correctly, or at least that there's a race condition. I think it explains at least the following failure modes:
It may also explain some of the other problems reported here.
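To make the hypothesis concrete, here is a minimal sketch of the check a client has to make before it decides whether a connection may be reused. It is not Bazel's code; the function name `server_wants_close` and the list-of-pairs header representation are assumptions for illustration. It follows the HTTP/1.1 rule that a connection is persistent unless the Connection header carries the `close` token, and the HTTP/1.0 rule that it is not persistent unless `keep-alive` is present.

```python
def server_wants_close(headers, http_version="1.1"):
    """Return True if the connection must not be reused after this response.

    `headers` is a list of (name, value) pairs; this shape is an assumption
    made for the sketch, not the representation Bazel's client uses.
    """
    tokens = set()
    for name, value in headers:
        # Header names and connection tokens are case-insensitive, and the
        # Connection header may carry several comma-separated tokens.
        if name.lower() == "connection":
            tokens.update(t.strip().lower() for t in value.split(","))
    if "close" in tokens:
        return True
    if http_version == "1.0":
        # HTTP/1.0 connections are only persistent with an explicit keep-alive.
        return "keep-alive" not in tokens
    return False
```

A client that gets this answer right can still fail if it acts on it too late, which is the race condition suspected above.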
If it doesn't repro, increasing the number of genrules helps. Decreasing the number of jobs seems to make it less likely to trigger.
Here's a patch for
The user promise has a callback that returns the connection to the pool. If the server returns a 'connection: close' HTTP header, then this can currently happen before the connection is closed, in which case the client attempts to reuse the connection, which - of course - fails. This changes the ordering to close the connection *before* completing the user promise. This is at least a partial fix for the linked issue. It is unclear if this is the root cause for all the reported failure modes. Progress on bazelbuild#10159. Change-Id: I2897e55c6edda592a6fb5755ddcccd1a89cde528
The user promise has a callback that returns the connection to the pool. If the server returns a 'connection: close' HTTP header, then this can currently happen before the connection is closed, in which case the client attempts to reuse the connection, which - of course - fails. This changes the ordering to close the connection *before* completing the user promise. This is at least a partial fix for the linked issue. It is unclear if this is the root cause for all the reported failure modes. Progress on #10159. Change-Id: I2897e55c6edda592a6fb5755ddcccd1a89cde528 Closes #12055. PiperOrigin-RevId: 330496714
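The ordering problem described in the commit message can be illustrated with a small, deterministic toy model. This is only a sketch: the names `Connection`, `Pool`, `finish_response`, and `next_request` are hypothetical and do not correspond to Bazel's actual classes, and the real client is asynchronous, so the failure there is a race rather than a certainty.

```python
class Connection:
    """A toy connection that fails if it is used after being closed."""

    def __init__(self, name):
        self.name = name
        self.open = True

    def close(self):
        self.open = False

    def send(self):
        if not self.open:
            raise ConnectionError(f"{self.name} was reused after close")
        return "ok"


class Pool:
    """A toy pool that hands out idle connections before creating new ones."""

    def __init__(self):
        self.idle = []

    def release(self, conn):
        # Only connections that are still open are worth keeping.
        if conn.open:
            self.idle.append(conn)

    def acquire(self):
        return self.idle.pop() if self.idle else Connection("fresh")


def finish_response(conn, pool, server_sent_close, close_first):
    """Finish a response; `complete` stands in for completing the user promise."""

    def complete():
        # The promise callback returns the connection to the pool.
        pool.release(conn)

    if server_sent_close and close_first:
        conn.close()  # fixed ordering: close, then complete
        complete()
    else:
        complete()  # old ordering: complete, then close
        if server_sent_close:
            conn.close()


def next_request(close_first):
    """Handle one 'connection: close' response, then issue another request."""
    pool = Pool()
    finish_response(Connection("c1"), pool, server_sent_close=True,
                    close_first=close_first)
    try:
        return pool.acquire().send()
    except ConnectionError as e:
        return f"error: {e}"
```

With the old ordering, `next_request(False)` picks the already-closed connection out of the pool and fails; with the new ordering, `next_request(True)` never sees it and opens a fresh connection instead.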
I'm not sure if my change fixes the 'ssl handshake' issue. Please let me know if you still see issues with my commit (not released yet, should be in 3.6.0).
Thanks @ulfjack for the fixes!! Unfortunately it's not easy for me to verify the fix, because the issue was only reproducing in high-volume CI builds, and we are currently using grpc in CI. But if you believe the issue is fixed, or at least significantly improved, I'd be perfectly fine with closing this bug - we can always re-open or file a new ticket if we still see issues after 3.6.0. Thanks!!
Adding @coeuvre to decide whether to close this. I don't have the access to do it myself anyway.
Closing. Please feel free to re-open if this is still an issue.
Description of the problem / feature request:
We are running bazel with:
and the build/test parallelism is high: --jobs=auto (nproc == 56), --local_test_jobs=40.
We are seeing intermittent errors that seem to be related to the use of https as the remote cache protocol.
Example 1:
Example 2:
Example 3:
I have also seen a curious "null" error once, which might or might not be related:
Brainstorming:
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Unfortunately the errors are intermittent and only happen when there is a lot of traffic going in and out of the remote cache. So I don't have an easy reproducer.
What operating system are you running Bazel on?
What's the output of bazel info release?
If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.
We have applied two patches on top of the 1.1.0 release, which should be unrelated. We have built bazel from this commit: https://github.com/scele/bazel/commits/1.1.0-patches1 using
Have you found anything relevant by searching the web?
These issues seem related, but both should be fixed in 1.1.0 and neither of them mentions "handshake timed out" error: