Bazel exit code 38 on long builds communicating with Build Event Service #3570
Comments
To provide a bit more context, consider this issue filed against grpc-java. It reports that Google L3 Load Balancers (which support gRPC) will consistently reap idle TCP connections after 600 seconds. A compatible build event service served behind such a load balancer would fail for any build that goes 600 seconds with no events to report. Even if the server is not behind a load balancer, clients can observe unavailability when TCP sessions are idle for long periods. Enterprise customers are often behind aggressive middleboxes, and mobile/roaming network providers can also drop idle TCP connections.
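To make the failure condition above concrete, here is a minimal sketch that checks whether a stream of build events would outlive an idle timeout. The function names and the `keepalive_interval` parameter are hypothetical and are not part of Bazel or gRPC. It assumes the middlebox reaps a connection once it has been silent for the full timeout, and that a keep-alive ping counts as traffic.

```python
from typing import List, Optional

# Idle limit reported for the load balancers in the grpc-java issue.
IDLE_TIMEOUT_S = 600.0


def longest_idle_gap(event_times: List[float]) -> float:
    """Return the longest gap, in seconds, between consecutive events."""
    ordered = sorted(event_times)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=0.0)


def connection_survives(
    event_times: List[float],
    idle_timeout: float = IDLE_TIMEOUT_S,
    keepalive_interval: Optional[float] = None,
) -> bool:
    """Return True if no silent period reaches the idle timeout.

    Assumption: a keep-alive sent every `keepalive_interval` seconds caps
    the longest silent period at that interval.
    """
    gap = longest_idle_gap(event_times)
    if keepalive_interval is not None:
        gap = min(gap, keepalive_interval)
    return gap < idle_timeout
```

Under these assumptions, a build that reports events at 0 s, 100 s and 900 s has an 800-second silent period and would lose its connection, while the same build with a keep-alive every 300 seconds would not.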
Over to @buchgr who was also involved in the solution of the related issue on grpc-java.
Do you have any error logs that you can share? It could be for many reasons. Also, is there a way I can reproduce this? A Channel in gRPC is not bound to one TCP connection, but it could be many. It will try to automatically reconnect if the connection is killed. If I had to guess, it could be due to the recently introduced
Bazel appears to be using gRPC-Java 1.3.0: https://github.com/bazelbuild/bazel/tree/master/third_party/grpc Maybe we just need to update to a new version?
It certainly doesn't hurt to update to a newer version. However, without a specific bug, that's more wishful thinking. Maybe we are indeed lucky. How can I reproduce this?
I believe this has been fixed.
Description of the problem / feature request / question:
On long-running builds, Bazel will exit with exit code 38 (we've seen this on builds as short as 45 minutes, though quite often on builds running 3+ hours). It appears that something is closing the streaming connection to the Build Event Service, which could make sense given how long that connection might be open during these long runs.
To avoid this behavior, I believe bazel may need some kind of keep-alive signal to avoid an idle timeout on the connection, and probably also needs proper recovery from the scenario where the TCP connection to BES dies (which seems inevitable given long enough builds).
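The recovery half of this suggestion can be sketched as a retry loop that resumes from the first event not yet delivered. This is only an illustration of the idea, not how Bazel or gRPC implement it. `StreamClosed`, `upload_with_retry`, and the `send` callback are all hypothetical names, and the sketch assumes that `send` either delivers an event or raises `StreamClosed`.

```python
import time
from typing import Callable, Sequence


class StreamClosed(Exception):
    """Raised by the hypothetical transport when the connection drops."""


def upload_with_retry(
    events: Sequence[str],
    send: Callable[[str], None],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send events in order, retrying the failed event after a dropped stream.

    Returns the number of events delivered. Re-raises StreamClosed after
    `max_attempts` consecutive failures on the same event.
    """
    next_index = 0  # index of the first event not yet delivered
    attempts = 0    # consecutive failures for the current event
    while next_index < len(events):
        try:
            send(events[next_index])
            next_index += 1
            attempts = 0
        except StreamClosed:
            attempts += 1
            if attempts >= max_attempts:
                raise
            # Exponential backoff before trying the same event again.
            sleep(base_delay_s * 2 ** (attempts - 1))
    return next_index
```

Tracking `next_index` means a reconnect neither skips nor repeats events that were already sent, as long as the assumption about `send` holds.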
If possible, provide a minimal example to reproduce the problem:
I've tried a number of things to produce a more minimal example here, but the key appears to be a long-running build (by the time we hit the 5-6 hour mark, this almost always happens) communicating with BES, without much regard to what is building (I was able to get this to reproduce by building "..." on bazel's own source code).
Environment info
Operating System: Ubuntu 14.04.1
Bazel version (output of bazel info release): release 0.5.3
Have you found anything relevant by searching the web? (e.g. StackOverflow answers, GitHub issues, email threads on the bazel-discuss Google group)
Nothing so far.