Builds on buildkite are hanging on remote-cache after 0.23 #7555

Closed
hlopko opened this issue Feb 27, 2019 · 16 comments
Labels: breakage, P0 (This is an emergency and more important than other current work. Assignee required), team-Remote-Exec (Issues and PRs for the Execution (Remote) team)

Comments

@hlopko (Member) commented Feb 27, 2019

All the builds (e.g. presubmits) on Buildkite are hanging with 'remote-cache' status with Bazel 0.23 (example). I see hangs on Mac and Windows as well.

Builds with 0.22 are still passing (example).

Testing 0.23rc3 with the downstream pipeline (which includes Bazel and its tests) was green.

hlopko added the P1 (I'll work on this now. Assignee required) label on Feb 27, 2019
meisterT added the P0 (This is an emergency and more important than other current work. Assignee required) label and removed P1 on Feb 27, 2019
hlopko added the P1, category: misc > release / binary breakage, and team-Remote-Exec (Issues and PRs for the Execution (Remote) team) labels and removed P0 on Feb 27, 2019
meisterT added P0 and removed P1 on Feb 27, 2019
@hlopko (Member, Author) commented Feb 27, 2019

It seems like all the hangs are on the C++ actions, both compiling and linking. But it could be a red herring; maybe C++ actions are just the first to be executed.

hlopko pushed a commit to bazelbuild/continuous-integration that referenced this issue Feb 27, 2019
In an attempt to resume at least some operation with bazelbuild/bazel#7555, I'm disabling remote caching on all workers.
@hlopko (Member, Author) commented Feb 27, 2019

I'm checking whether disabling the remote cache (bazelbuild/continuous-integration@c6131a7) will help. I will now cancel some hanging builds to free up workers.
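
For context, disabling the cache amounts to dropping the remote-cache flag from the invocations. A minimal sketch of the two modes; the bucket URL and target pattern are placeholders, not the actual CI configuration:

    # With remote caching enabled (the setup that is hanging);
    # the cache URL here is illustrative only.
    bazel build --remote_http_cache=https://storage.googleapis.com/my-cache-bucket //...

    # With remote caching disabled: simply omit the flag.
    bazel build //...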

@meisterT (Member) commented

It seems that downloading from the remote cache never times out for the affected actions. Is 285c03e#diff-b1b932f73f8ce938510d06b752945cb1 related?

What's special about the C++ actions? Do they produce larger artifacts?

@hlopko (Member, Author) commented Feb 27, 2019

I disabled the remote cache on CI; queued builds are now running and presubmits should be working. I'm not sure Mac presubmits will finish in under 1 hour without the remote cache, though.

@hlopko (Member, Author) commented Feb 27, 2019

They don't produce large artifacts. They are not spawn actions, so maybe they behave in a non-standard way. Or it's a red herring.

@meteorcloudy (Member) commented

Hmm, I couldn't reproduce the failure on Ubuntu 14.04 in a Docker container running on a CI Linux worker.

@meteorcloudy (Member) commented

It is not always failing with 0.23.0; see, for example, this presubmit:
https://buildkite.com/bazel/bazel-bazel-github-presubmit/builds/1833#_

@hlopko (Member, Author) commented Feb 27, 2019

Hmm, that could mean it was a temporary GCS outage and all is fine now...

@meisterT (Member) commented

Well, it should still not hang forever but run into a timeout.
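
For reference, remote cache calls are supposed to be bounded by --remote_timeout (in seconds, 60 by default). A minimal sketch of an invocation that makes the bound explicit; the cache URL and target are placeholders:

    # Each remote cache read/write should fail after --remote_timeout seconds
    # instead of hanging; 60 is the default, spelled out here for clarity.
    bazel build \
        --remote_http_cache=https://storage.googleapis.com/my-cache-bucket \
        --remote_timeout=60 \
        //...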

@hlopko (Member, Author) commented Feb 27, 2019

Absolutely.

@meteorcloudy (Member) commented Feb 27, 2019

But this bug might also exist in previous versions, so maybe it's not a regression. @buchgr

@buchgr (Contributor) commented Feb 27, 2019

It seems to work fine again on the testing pipelines. I'll re-enable remote caching for presubmits.

> Hmm, that could mean it was a temporary GCS outage and all is fine now...

The reason it can't really be a GCS outage is that Mac presubmits don't use GCS.

@buchgr (Contributor) commented Feb 27, 2019

As of now, I see no reason for a patch release because I don't know exactly what the bug is. I was not able to reproduce it.

buchgr closed this as completed on Feb 27, 2019
@meteorcloudy (Member) commented

I found a way to reproduce the hanging symptom (probably it's the same bug):

  1. Build a project with the remote cache flag.
  2. Run bazel clean.
  3. Build again without the remote cache flag.

The actions will then still try to use remote caching and end up hanging; see the sketch below.
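
A minimal sketch of those steps on the command line, assuming an HTTP cache endpoint; the URL and target are placeholders:

    # 1. Build with the remote cache flag (placeholder URL).
    bazel build --remote_http_cache=https://storage.googleapis.com/my-cache-bucket //...

    # 2. Clean the output tree; the resident Bazel server keeps running.
    bazel clean

    # 3. Build WITHOUT the cache flag. The server still carries the stale
    #    remote-cache state from step 1, and actions hang on 'remote-cache'.
    bazel build //...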

meteorcloudy reopened this on Feb 27, 2019
@buchgr (Contributor) commented Feb 27, 2019

@laurentlb this will need a patch release. I am sending out a fix and will update this bug once it's submitted.

buchgr added a commit to buchgr/bazel that referenced this issue Feb 27, 2019
remote: properly reset state when using remote cache. Fixes bazelbuild#7555

When using --remote_(http)_cache, we wouldn't properly reset the state on
the Bazel server, so on subsequent command invocations the server would
still think it was using remote caching. This would cause Bazel to hang
indefinitely.
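
If this is the mechanism, a plausible stopgap until a patched release is to restart the resident server between invocations, since the stale cache state lives in that server process. This is inferred from the commit message above, not a confirmed workaround, and the target is a placeholder:

    # After dropping the cache flag, shut down the resident Bazel server
    # so it forgets the remote-cache state from the previous invocation.
    bazel shutdown
    bazel build //...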
fweikert added a commit to fweikert/continuous-integration that referenced this issue Feb 27, 2019
This commit unblocks CI by avoiding the bad release 0.23.0: bazelbuild/bazel#7555

fweikert added a commit to bazelbuild/continuous-integration that referenced this issue Feb 27, 2019
This commit unblocks CI by avoiding the bad release 0.23.0: bazelbuild/bazel#7555
laurentlb pushed a commit that referenced this issue Feb 28, 2019
When using --remote_(http)_cache, we wouldn't properly reset the state on
the Bazel server, so on subsequent command invocations the server would
still think it was using remote caching. This would cause Bazel to hang
indefinitely.

Closes #7562.

PiperOrigin-RevId: 235914044
bazel-io pushed a commit that referenced this issue Mar 4, 2019
Baseline: 441fd75

Cherry picks:

   + 6ca7763:
     Fix a typo
   + 2310b1c:
     Ignore SIGCHLD in test setup script
   + f9eb1b5:
     Complete channel initialization in the event loop
   + f0a1597:
     remote: properly reset state when using remote cache. Fixes #7555

Release 0.23.1rc1 (2019-02-28)
laurentlb pushed a commit that referenced this issue Mar 7, 2019
When using --remote_(http)_cache, we wouldn't properly reset the state on
the Bazel server, so on subsequent command invocations the server would
still think it was using remote caching. This would cause Bazel to hang
indefinitely.

Closes #7562.

PiperOrigin-RevId: 235914044
bazel-io pushed a commit that referenced this issue Mar 11, 2019
Baseline: 441fd75

Cherry picks:

   + 6ca7763:
     Fix a typo
   + 2310b1c:
     Ignore SIGCHLD in test setup script
   + f9eb1b5:
     Complete channel initialization in the event loop
   + f0a1597:
     remote: properly reset state when using remote cache. Fixes #7555
   + 56366ee:
     Set non-empty values for msvc_env_* when VC not installed

Release 0.23.2
joeleba pushed a commit to joeleba/continuous-integration that referenced this issue Jun 17, 2019
In an attempt to resume at least some operation with bazelbuild/bazel#7555, I'm disabling remote caching on all workers.
joeleba pushed a commit to joeleba/continuous-integration that referenced this issue Jun 17, 2019
This commit unblocks CI by avoiding the bad release 0.23.0: bazelbuild/bazel#7555
Labels: breakage, P0, team-Remote-Exec
Projects: none yet
Development: no branches or pull requests
5 participants