Bazel exit code 38 on long builds communicating with Build Event Service #3570

sethkoehler · 2017-08-16T13:45:12Z

Description of the problem / feature request / question:

On long-running builds (we've seen this on builds as short as 45 minutes, though quite often on builds running 3+ hours), Bazel will exit with exit code 38. It appears that something is closing the streaming connection to the Build Event Service, which could make sense given how long that connection might stay open during these runs.

To avoid this behavior, I believe Bazel may need some kind of keep-alive signal to avoid an idle timeout on the connection, and it probably also needs proper recovery from the scenario where the TCP connection to BES dies (which seems inevitable given long enough builds).
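To make the recovery idea concrete, here is a minimal sketch, assuming events carry increasing sequence numbers and the service acknowledges them in order. The class `PendingEvents` and its methods are hypothetical names for illustration, not Bazel's or the Build Event Service's API; the point is only that unacknowledged events could be replayed after a dropped connection.

```python
from collections import deque


class PendingEvents:
    """Hypothetical buffer of events sent but not yet acknowledged."""

    def __init__(self):
        self._next_seq = 1
        self._unacked = deque()  # (sequence_number, payload), oldest first

    def send(self, payload):
        """Record an event as sent and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._unacked.append((seq, payload))
        return seq

    def ack(self, seq):
        """Drop every buffered event up to and including `seq`."""
        while self._unacked and self._unacked[0][0] <= seq:
            self._unacked.popleft()

    def to_resend(self):
        """Events to replay on a fresh connection after a failure."""
        return list(self._unacked)


buf = PendingEvents()
for name in ("started", "progress", "finished"):
    buf.send(name)
buf.ack(1)  # assume only the first event was confirmed before the drop
print(buf.to_resend())  # events 2 and 3 would be replayed after reconnecting
```

With a buffer like this, a connection failure would not have to abort the build: the client could reconnect and replay whatever `to_resend()` returns.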

If possible, provide a minimal example to reproduce the problem:

I've tried a number of things to produce a more minimal example here, but the key appears to be a long-running build (by the time we hit the 5-6 hour mark, this almost always happens) communicating with BES, without much regard to what is building (I was able to get this to reproduce by building "..." on bazel's own source code).

Environment info

Operating System: Ubuntu 14.04.1
Bazel version (output of bazel info release): release 0.5.3

Have you found anything relevant by searching the web?

(e.g. StackOverflow answers,
GitHub issues,
email threads on the bazel-discuss Google group)

Nothing so far.


michaeledgar · 2017-08-20T18:36:09Z

To provide a bit more context, consider this issue filed against grpc-java. It reports that Google L3 Load Balancers (which support gRPC) will consistently reap idle TCP connections after 600 seconds. A compatible build event service served behind such a load balancer would fail for any build that goes 600 seconds without an event to report.

Even if the server configuration is not behind a load balancer, clients can observe unavailability when TCP sessions are idle for long periods. Enterprise customers are often behind aggressive middleboxes and mobile/roaming network providers can also drop idle TCP connections.
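The failure condition described in this comment can be sketched as a check over event timestamps. This is an illustration only: the 600-second limit is the figure quoted above, while the function names, the sample timeline, and the 300-second keep-alive interval are assumptions.

```python
IDLE_LIMIT_S = 600  # idle timeout quoted above for the load balancer


def longest_idle_gap(event_times):
    """Longest stretch, in seconds, between consecutive events."""
    return max(b - a for a, b in zip(event_times, event_times[1:]))


def keepalive_times(event_times, interval):
    """Times at which a keep-alive is needed so no gap exceeds `interval`."""
    pings = []
    last = event_times[0]
    for t in event_times[1:]:
        # Fill the silent stretch between two real events with pings.
        while t - last > interval:
            last += interval
            pings.append(last)
        last = t
    return pings


# Hypothetical timeline: a long action leaves the stream silent for 1400 s.
events = [0, 100, 1500]
print(longest_idle_gap(events) >= IDLE_LIMIT_S)  # True: connection at risk
print(keepalive_times(events, interval=300))     # [400, 700, 1000, 1300]
```

Choosing an interval well below the idle limit leaves headroom for delayed or lost keep-alives.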

aehlig · 2017-08-24T11:46:59Z

Over to @buchgr who was also involved in the solution of the related issue on grpc-java.

buchgr · 2017-08-25T09:47:55Z

@sethkoehler

Do you have any error logs that you can share? It could be for many reasons. Also, is there a way I can reproduce this?

A Channel in gRPC is not bound to a single TCP connection; it may use many. It will try to reconnect automatically if a connection is killed. If I had to guess, the cause could be the recently introduced RoundRobinLoadBalancer. See grpc/grpc-java#3297
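As a rough illustration of what automatic reconnection involves, the sketch below retries a connection attempt with capped exponential backoff. It is not grpc-java's implementation; `backoff_delays`, `connect_with_backoff`, the fake `connect` callable, and the delay parameters are all assumptions chosen for the example.

```python
def backoff_delays(initial, multiplier, max_delay, attempts):
    """Capped exponential delays, one per retry."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, max_delay))
        delay *= multiplier
    return delays


def connect_with_backoff(connect, delays, sleep):
    """Call `connect` until it succeeds.

    Returns the number of attempts used, or None if every attempt failed.
    """
    if connect():
        return 1
    for attempt, delay in enumerate(delays, start=2):
        sleep(delay)
        if connect():
            return attempt
    return None


# Fake connection that fails twice, then succeeds.
outcomes = iter([False, False, True])
waited = []
attempts = connect_with_backoff(
    lambda: next(outcomes),
    backoff_delays(initial=1, multiplier=2, max_delay=10, attempts=5),
    sleep=waited.append,  # record delays instead of actually sleeping
)
print(attempts, waited)  # 3 [1, 2]
```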

michaeledgar · 2017-08-25T21:48:11Z

Bazel appears to be using gRPC-Java 1.3.0: https://github.com/bazelbuild/bazel/tree/master/third_party/grpc

Maybe we just need to update to a new version?

buchgr · 2017-08-31T07:27:53Z

@michaeledgar

It certainly doesn't hurt to update to a newer version. However, without a specific bug in mind, that's more wishful thinking. Maybe we are indeed lucky.

How can I reproduce this?

buchgr · 2018-03-21T15:31:59Z

I believe this has been fixed.

iirina added the category: misc > misc and type: bug labels Aug 16, 2017

iirina assigned aehlig Aug 16, 2017

aehlig assigned buchgr and unassigned aehlig Aug 24, 2017

buchgr closed this as completed Mar 21, 2018

tomrenn mentioned this issue Apr 20, 2023

Add a keep-alive event for BES streams #18166

Closed
