Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote-cached build fails with javax.net.ssl.SSLException: handshake timed out #10159

Closed
scele opened this issue Nov 4, 2019 · 9 comments
Closed
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged

Comments

@scele
Copy link
Contributor

scele commented Nov 4, 2019

Description of the problem / feature request:

We are running bazel with:

--remote_cache=https://our-aws-cloudfront-distribution.com/backed/by/s3/bucket
--disk_cache=~/path/on/local/disk
--remote_download_minimal
--remote_timeout=300
--remote_retries=2

and the build/test parallelism is high: --jobs=auto (nproc == 56), --local_test_jobs=40.

We are seeing intermittent errors that seem to be related to the use of https as the remote cache protocol.

Example 1:

(06:43:31) [3,407 / 3,521] 19 / 872 tests; Testing //some:target; 73s local ... (87 actions running)
(06:43:35) ERROR: /ssd/home/gitlab-runner/builds/ZQ9e7vGx/0/some/path/BUILD:40:1:  failed due to unexpected I/O exception: handshake timed out
javax.net.ssl.SSLException: handshake timed out
	at io.netty.handler.ssl.SslHandler.handshake(...)(Unknown Source)
(06:43:35) INFO: Elapsed time: 82.648s, Critical Path: 77.62s
(06:43:35) INFO: 1313 processes: 1216 remote cache hit, 97 local.
(06:43:35) FAILED: Build did NOT complete successfully

Example 2:

(06:21:37) WARNING: Reading from Remote Cache:
Symlinks in action outputs are not yet supported by --experimental_remote_download_outputs=minimal
(06:21:52) [2,129 / 3,349] 3 / 759 tests; Executing genrule @local_config_cuda//cuda:cuda-include; 19s local, remote-cache ... (78 actions running)
(06:21:58) WARNING: Writing to Remote Cache:
ClosedChannelException
(06:22:16) [2,600 / 3,768] 72 / 759 tests; Executing genrule @local_config_cuda//cuda:cuda-include; 43s local, remote-cache ... (60 actions, 51 running)
(06:22:22) WARNING: Writing to Remote Cache:
ExtendedClosedChannelException
(06:22:44) [2,615 / 3,782] 77 / 759 tests; Executing genrule @local_config_cuda//cuda:cuda-include; 70s local, remote-cache ... (70 actions, 68 running)
(06:23:15) [2,920 / 4,087] 144 / 759 tests; Executing genrule @local_config_cuda//cuda:cuda-include; 101s local, remote-cache ... (85 actions running)
(06:23:51) [3,323 / 4,490] 244 / 759 tests; Executing genrule @local_config_cuda//cuda:cuda-include; 137s local, remote-cache ... (65 actions, 60 running)
(06:24:07) ERROR: /ssd/home/gitlab-runner/builds/VFqgTLVf/0/some/path/BUILD:7:1:  failed due to unexpected I/O exception: handshake timed out
javax.net.ssl.SSLException: handshake timed out
	at io.netty.handler.ssl.SslHandler.handshake(...)(Unknown Source)
(06:24:43) [3,519 / 4,662] 261 / 759 tests; Executing genrule @local_config_cuda//cuda:cuda-include; 190s local, remote-cache ... (6 actions running)
(06:25:08) WARNING: Writing to Remote Cache:
SSLEngine closed already
(06:25:45) INFO: Elapsed time: 391.995s, Critical Path: 250.27s
(06:25:45) INFO: 1689 processes: 1152 remote cache hit, 537 local.

Example 3:

(11:46:29) WARNING: Writing to Remote Cache:
400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>DC396580BA6E5186</RequestId><HostId>PFE6JZxpAXa2yYxxIbM/1KVHqHii3XPH67ql3wF3ZhUslJUQTw1OMfg7j8krnc8H9SEXS+VotiE=</HostId></Error>
(11:46:29) WARNING: Writing to Remote Cache:
400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>211CBD741B01AC68</RequestId><HostId>GZrAWg9+c8EZtaWCwq4/z3O9qNIPHc5wsT//ROFGeTNtGQo2NM/gu5zfR5rtexaWtQ+6Tr0ag8c=</HostId></Error>
(11:46:32) WARNING: Reading from Remote Cache:
Symlinks in action outputs are not yet supported by --experimental_remote_download_outputs=minimal
(11:46:44) WARNING: Writing to Remote Cache:
Broken pipe
(11:47:41) WARNING: Writing to Remote Cache:
SSLEngine closed already
(11:47:41) WARNING: Writing to Remote Cache:
ClosedChannelException
(11:51:19) ERROR: /ssd/home/gitlab-runner/builds/RBdZnUE1/0/some/path/BUILD:35:1: C++ compilation of rule '//some/target.so' failed due to unexpected I/O exception: handshake timed out
javax.net.ssl.SSLException: handshake timed out
	at io.netty.handler.ssl.SslHandler.handshake(...)(Unknown Source)
(11:51:21) INFO: Elapsed time: 596.349s, Critical Path: 366.45s
(11:51:21) INFO: 5841 processes: 5681 remote cache hit, 160 local.
(11:51:21) FAILED: Build did NOT complete successfully

I have also seen a curious "null" error once, which might or might not be related:

(06:27:30) ERROR: /ssd/home/gitlab-runner/builds/XgQhJyq7/0/some/path/BUILD:33:1:  failed due to unexpected I/O exception: null
io.netty.channel.ExtendedClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

Brainstorming:

  • Any remote cache connectivity issue should not result in failing the build, but falling back to local execution at worst, right?
  • Note that while Example 3 warns about the 400 error code returned by AWS CloudFront, the ERROR that causes the build to fail is the same handshake timeout.
  • Perhaps some TLS errors are not being correctly retried by bazel's http remote cache backend?
  • Does bazel support HTTP/2 and multiplexing multiple requests over a single connection? Perhaps bazel is creating too many connections to the remote cache which makes things unreliable.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Unfortunately the errors are intermittent and only happen when there is lot of traffic going in and out of the remote cache. So I don't have an easy reproducer.

What operating system are you running Bazel on?

Ubuntu 16.04.6 LTS
Linux lpeltonen-dtx 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

What's the output of bazel info release?

release 1.1.0-patches1

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

We have applied two patches on top of the 1.1.0 release, which should be unrelated. We have built bazel from this commit: https://github.com/scele/bazel/commits/1.1.0-patches1 using

bazel build --stamp --embed_label 1.1.0-patches1 //scripts/packages:with-jdk/install.sh

Have you found anything relevant by searching the web?

These issues seem related, but both should be fixed in 1.1.0 and neither of them mentions "handshake timed out" error:

@dslomov dslomov added team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged labels Nov 25, 2019
@ulfjack
Copy link
Contributor

ulfjack commented Sep 5, 2020

The HTTP client does not retry failed downloads or uploads (I only checked for downloads). If it prints WARNING, then it should fall back to local execution. ERROR probably not. The HTTP client is HTTP 1.1 only with a maximum number of concurrent connections.

@ulfjack
Copy link
Contributor

ulfjack commented Sep 5, 2020

I have a fairly reliable repro of at least one subclass of issues against a local nginx over HTTP (not TLS).

My BUILD file:

[genrule(
  name = "foo-%s" % i,
  outs = ["foo-%s.txt" % i],
  cmd = "echo 'hello %s' > $@" % i) for i in range(0, 100)]

My nginx config:

server {
  listen 80;
  server_name localhost;
  location /cache/ {
    # The path to the directory where nginx should store the cache contents.
    root /tmp/nginx-cache/;
    # Allow PUT
    dav_methods PUT;
    # Allow nginx to create the /ac and /cas subdirectories.
    create_full_put_path on;
    # The maximum size of a single file.
    client_max_body_size 1G;
    allow all;
    autoindex on;
    keepalive_requests 1;
  }
}

My Bazel command line:

bazel clean && bazel build --remote_cache=http://localhost/cache/ //foo:all --verbose_failures --jobs=200

@ulfjack
Copy link
Contributor

ulfjack commented Sep 5, 2020

My hypothesis is that it does not handle 'connection:close' headers correctly, or at least that there's a race condition. I think it explains at least the following failure modes:

(11:47:41) WARNING: Writing to Remote Cache:
SSLEngine closed already
(11:47:41) WARNING: Writing to Remote Cache:
ClosedChannelException

It may also explain some of the other problems reported here.

@ulfjack
Copy link
Contributor

ulfjack commented Sep 5, 2020

If it doesn't repro, increasing the number of genrules helps. Decreasing the number of jobs seems to make it less likely to trigger.

@ulfjack
Copy link
Contributor

ulfjack commented Sep 5, 2020

Here's a patch for HttpDownloadHandler.java that makes the problem go away in my testing:

  private void succeedAndReset(ChannelHandlerContext ctx) {
    try {
      reset(ctx);
      succeedAndResetUserPromise();
    } finally {
//      reset(ctx);
    }
  }

ulfjack added a commit to ulfjack/bazel that referenced this issue Sep 5, 2020
The user promise has a callback that returns the connection to the pool.
If the server returns a 'connection: close' HTTP header, then this can
currently happen before the connection is closed, in which case the client
attempts to reuse the connection, which - of course - fails.

This changes the ordering to close the connection *before* completing the
user promise.

This is at least a partial fix for the linked issue. It is unclear if this
is the root cause for all the reported failure modes.

Progress on bazelbuild#10159.

Change-Id: I2897e55c6edda592a6fb5755ddcccd1a89cde528
ulfjack added a commit to ulfjack/bazel that referenced this issue Sep 5, 2020
The user promise has a callback that returns the connection to the pool.
If the server returns a 'connection: close' HTTP header, then this can
currently happen before the connection is closed, in which case the client
attempts to reuse the connection, which - of course - fails.

This changes the ordering to close the connection *before* completing the
user promise.

This is at least a partial fix for the linked issue. It is unclear if this
is the root cause for all the reported failure modes.

Progress on bazelbuild#10159.

Change-Id: I2897e55c6edda592a6fb5755ddcccd1a89cde528
bazel-io pushed a commit that referenced this issue Sep 8, 2020
The user promise has a callback that returns the connection to the pool.
If the server returns a 'connection: close' HTTP header, then this can
currently happen before the connection is closed, in which case the client
attempts to reuse the connection, which - of course - fails.

This changes the ordering to close the connection *before* completing the
user promise.

This is at least a partial fix for the linked issue. It is unclear if this
is the root cause for all the reported failure modes.

Progress on #10159.

Change-Id: I2897e55c6edda592a6fb5755ddcccd1a89cde528

Closes #12055.

Change-Id: I2897e55c6edda592a6fb5755ddcccd1a89cde528
PiperOrigin-RevId: 330496714
@ulfjack
Copy link
Contributor

ulfjack commented Sep 25, 2020

I'm not sure if my change fixes the 'ssl handshake' issue. Please let me know if you still see issues with my commit (not released yet, should be in 3.6.0).

@scele
Copy link
Contributor Author

scele commented Sep 28, 2020

Thanks @ulfjack for the fixes!!

Unfortunately it's not easy for me to verify the fix, because the issue was only reproducing in high-volume CI builds, and we are currently using grpc in CI.

But if you believe the issue is fixed, or at least significantly improved, I'd be perfectly fine with closing this bug - we can always re-open or file a new ticket if we still see issues after 3.6.0.

Thanks!!

@ulfjack
Copy link
Contributor

ulfjack commented Oct 5, 2020

Adding @coeuvre to decide whether to close this. I don't have the access to do it myself anyway.

@coeuvre
Copy link
Member

coeuvre commented Oct 9, 2020

Closing. Please feel free to re-open if this is still an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged
Projects
None yet
Development

No branches or pull requests

4 participants