Bazel exit code 38 on long builds communicating with Build Event Service #3570

Closed
sethkoehler opened this issue Aug 16, 2017 · 6 comments

Comments

@sethkoehler

Description of the problem / feature request / question:

On long-running builds (we've seen this on builds as short as 45 minutes, though quite often on builds running 3+ hours), Bazel will exit with exit code 38. It appears that something is closing the streaming connection to the Build Event Service, which would make sense given how long that connection may stay open during these runs.

To avoid this behavior, I believe bazel may need some kind of keep-alive signal to avoid an idle timeout on the connection, and probably also needs proper recovery from the scenario where the TCP connection to BES dies (which seems inevitable given long enough builds).
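As a rough illustration of the recovery half of this suggestion, the sketch below resends from the first undelivered event when a send fails, rather than aborting. Everything in it is hypothetical: `publish_with_resume`, `make_flaky_send`, and the `send(seq, event)` callback are illustrative names, not Bazel or BES APIs, and a real client would also reopen the connection and back off between attempts.

```python
def publish_with_resume(events, send, max_attempts=5):
    """Deliver events in order, retrying the current event on connection errors.

    `send(seq, event)` is a hypothetical callback that raises ConnectionError
    when the underlying connection has died. Returns the number of events
    delivered.
    """
    delivered = 0   # index of the next event that still needs to be sent
    failures = 0    # consecutive failures for the current event
    while delivered < len(events):
        try:
            send(delivered, events[delivered])
        except ConnectionError:
            failures += 1
            if failures >= max_attempts:
                raise  # give up only after repeated failures
            continue   # a real client would reconnect and back off here
        delivered += 1
        failures = 0
    return delivered


def make_flaky_send(fail_on_calls):
    """Build a fake `send` that fails on the given 1-based call numbers."""
    state = {"calls": 0}
    received = []

    def send(seq, event):
        state["calls"] += 1
        if state["calls"] in fail_on_calls:
            raise ConnectionError("stream closed")
        received.append((seq, event))

    return send, received


send, received = make_flaky_send({2})
count = publish_with_resume(["started", "progress", "finished"], send)
```

Here the second call fails once, the same event is retried, and all three events still arrive in order.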

If possible, provide a minimal example to reproduce the problem:

I've tried a number of things to produce a more minimal example here, but the key appears to be a long-running build (by the time we hit the 5-6 hour mark, this almost always happens) communicating with BES, without much regard to what is building (I was able to get this to reproduce by building "..." on bazel's own source code).

Environment info

  • Operating System: Ubuntu 14.04.1

  • Bazel version (output of bazel info release): release 0.5.3

Have you found anything relevant by searching the web?

(e.g. StackOverflow answers,
GitHub issues,
email threads on the bazel-discuss Google group)

Nothing so far.

@michaeledgar
Contributor

To provide a bit more context, consider this issue filed against grpc-java. It reports that Google L3 Load Balancers (which support gRPC) will consistently reap idle TCP connections after 600 seconds. A compatible build event service served behind such a load balancer would fail for any build that goes 600 seconds with no events to report.

Even if the server is not behind a load balancer, clients can observe unavailability when TCP sessions are idle for long periods. Enterprise customers are often behind aggressive middleboxes, and mobile/roaming network providers can also drop idle TCP connections.
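To make the failure mode concrete, here is a minimal simulation, assuming the 600-second idle limit described above and assuming the connection is reaped as soon as it has been silent for that long. `first_idle_drop` and its parameters are hypothetical names for illustration only; this models timing, not real gRPC behavior, and it ignores any silence after the last event.

```python
def first_idle_drop(event_times, idle_timeout=600.0, keepalive_interval=None):
    """Return the time (seconds) at which an idle connection would be dropped.

    event_times: sorted timestamps, in seconds since the stream opened, at
    which the client sends a build event.
    keepalive_interval: if set, the client also sends a ping whenever this
    many seconds pass without traffic.
    Returns None if the connection survives until the last event.
    """
    last_traffic = 0.0
    for t in event_times:
        silence = t - last_traffic
        if keepalive_interval is not None:
            # Pings cap the longest silent stretch at the keepalive interval.
            silence = min(silence, keepalive_interval)
        if silence >= idle_timeout:
            return last_traffic + idle_timeout
        last_traffic = t
    return None


# An 800-second quiet period between events at 400 s and 1200 s.
events = [100, 400, 1200]
print(first_idle_drop(events))                          # 1000.0
print(first_idle_drop(events, keepalive_interval=300))  # None
```

A keepalive only helps when its interval is shorter than the idle limit; with an interval of 900 seconds the same build would still be dropped at 1000 seconds.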

aehlig assigned buchgr and unassigned aehlig on Aug 24, 2017
@aehlig
Contributor

aehlig commented Aug 24, 2017

Over to @buchgr who was also involved in the solution of the related issue on grpc-java.

@buchgr
Contributor

buchgr commented Aug 25, 2017

@sethkoehler

Do you have any error logs that you can share? There could be many reasons for this. Also, is there a way I can reproduce it?

A Channel in gRPC is not bound to a single TCP connection; it can use many. It will try to reconnect automatically if a connection is killed. If I had to guess, this could be due to the recently introduced RoundRobinLoadBalancer. See grpc/grpc-java#3297

@michaeledgar
Contributor

Bazel appears to be using gRPC-Java 1.3.0: https://github.com/bazelbuild/bazel/tree/master/third_party/grpc

Maybe we just need to update to a new version?

@buchgr
Contributor

buchgr commented Aug 31, 2017

@michaeledgar

It certainly doesn't hurt to update to a newer version. However, without a specific bug to point to, that's more wishful thinking. Maybe we will indeed get lucky.

How can I reproduce this?

@buchgr
Contributor

buchgr commented Mar 21, 2018

I believe this has been fixed.
