bazel_bootstrap_distfile_test failing in postsubmit on Windows #12578

Closed
philwo opened this issue Nov 27, 2020 · 6 comments
Labels
area-Windows, breakage, P0, team-OSS, type: bug

Comments

@philwo
Member

philwo commented Nov 27, 2020

The test //src/test/shell/bazel:bazel_bootstrap_distfile_test has started to fail in our postsubmit pipeline on Windows only.

Here's a list of broken jobs:

I have not seen it happen before job 14736, so this might be the culprit. 🤔
I'm not aware of any CI infrastructure or Windows image changes in the last few days.

Here's an example log: https://storage.googleapis.com/bazel-untrusted-buildkite-artifacts/745cad19-516a-479b-8084-9154bf14c893/src%5Ctest%5Cshell%5Cbazel%5Cbazel_bootstrap_distfile_test%5Cattempt_1.log

The relevant error message seems to be:

ERROR: Analysis of target '//src:bazel_nojdk.exe' failed; build aborted: invalid registered execution platform '//:default_host_platform': no such target '//:default_host_platform': target 'default_host_platform' not declared in package '' defined by C:/b/iwgf237n/execroot/io_bazel/_tmp/a6b4c571d5cd258ae95c5906c6c25060/bazelbootstrap.iIshY86j/BUILD

When the test fails during a job, it fails consistently on every attempt, always with the same error message. So this is not a typical flaky test that fails at random.

@philwo philwo added the breakage, P0, team-OSS, type: bug, and area-Windows labels Nov 27, 2020
@meteorcloudy
Member

It looks like before #14743 the job succeeded after retrying, but not after that.
This is strange, I will take a look.

bazel-io pushed a commit that referenced this issue Nov 30, 2020
The test is only failing in postsubmit, not in presubmit or the downstream pipeline. I cannot reproduce this locally or on a Windows VM.

The error message complains that //:default_host_platform doesn't exist, although it should be defined in ./BUILD. Let's print out the content of ./BUILD to find out what's happening.

Related: #12578

RELNOTES: None
PiperOrigin-RevId: 344809001
@meteorcloudy
Member

https://storage.googleapis.com/bazel-untrusted-buildkite-artifacts/5f2e15bb-e09e-4f65-ac7a-a4c6809fb9e4/src%5Ctest%5Cshell%5Cbazel%5Cbazel_bootstrap_distfile_test%5Cattempt_3.log

I printed out the content of ./BUILD: it was somehow overwritten with the content of https://github.com/bazelbuild/bazel/blob/master/tools/jdk/BUILD.java_tools. This is super weird. I suspect it is caused by another test running in parallel, which could also explain why it's not failing in presubmit (because of sharding).

bazel-io pushed a commit that referenced this issue Dec 2, 2020
Related: #12578

RELNOTES: None
PiperOrigin-RevId: 345191170
@meteorcloudy
Member

After hours of debugging.. I finally figured this out!

So the culprit is d10013d, which introduced a genrule:

genrule(
    name = "java_tools_build_zip",
    srcs = ["//tools/jdk:BUILD.java_tools"],
    outs = ["java_tools_build.zip"],
    cmd = "cat $(SRCS) > BUILD; zip -qjX $@ BUILD",
)

And this is what's happening in the postsubmit pipeline on CI:

  • The CI runs bazel test //src:all_windows_tests
  • Bazel plants the symlink tree for io_bazel at C:/b/iwgf237n/execroot/io_bazel. Because we are on Windows, ./BUILD is actually copied to C:/b/iwgf237n/execroot/io_bazel/BUILD
  • Bazel first tries to execute one of the bazel_java_tests which depends on the java_tools_build_zip target.
  • Bazel runs the genrule. The command is executed under C:/b/iwgf237n/execroot/io_bazel, so cat writes the content of BUILD.java_tools to C:/b/iwgf237n/execroot/io_bazel/BUILD.
  • Bazel tries to build the bootstrap test, which depends on bazel-distfile.zip. It packs all sources into the archive, but by this time the BUILD file has already been changed.
  • Bazel executes the bootstrap test with the wrong source archive and complains that a target isn't defined.

This doesn't happen on Linux or macOS because of sandboxing: there, the genrule cannot overwrite any file in the execroot.
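The overwrite step can be simulated in plain shell without Bazel. This is only a rough sketch: the temporary directory stands in for the execroot, the file contents are made-up placeholders, and the zip step of the genrule is left out because it plays no part in the overwrite.

```shell
#!/usr/bin/env bash
# Simulates a non-sandboxed genrule whose working directory is the execroot.
set -euo pipefail

# Hypothetical stand-in for C:/b/iwgf237n/execroot/io_bazel.
execroot="$(mktemp -d)"
trap 'rm -rf "$execroot"' EXIT
mkdir -p "$execroot/tools/jdk"

# Placeholder for the workspace's top-level BUILD file in the execroot.
echo 'platform(name = "default_host_platform")' > "$execroot/BUILD"
# Placeholder for the content of tools/jdk/BUILD.java_tools.
echo '# java_tools BUILD file' > "$execroot/tools/jdk/BUILD.java_tools"

# Same shape as the genrule cmd: the relative path "BUILD" resolves against
# the current directory, which is the execroot itself.
(cd "$execroot" && cat tools/jdk/BUILD.java_tools > BUILD)

clobbered="$(cat "$execroot/BUILD")"
echo "execroot BUILD now contains: $clobbered"
```

After the redirect, the top-level BUILD file no longer declares default_host_platform, which matches the error seen in the bootstrap test.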

To reproduce, simply run the following on Windows:

# //:srcs is needed so that Bazel plants symlinks for all files/dirs under the source root.
bazel build //src:java_tools_build_zip //:srcs
head bazel-bazel/BUILD

Lesson learned: we have to use genrule very carefully on Windows, because without a sandbox the entire main repo source tree under the execroot is exposed to the genrule command.
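One way to keep such a command from touching the execroot is to stage intermediate files in a private scratch directory. The sketch below illustrates that idea in plain shell under the same made-up setup as the failure scenario; it is an assumption about one possible approach, not the fix that was eventually submitted, and the scratch variable is a hypothetical name.

```shell
#!/usr/bin/env bash
# Stages the intermediate BUILD file outside the working directory.
set -euo pipefail

execroot="$(mktemp -d)"   # hypothetical stand-in for the execroot
scratch="$(mktemp -d)"    # private scratch directory for intermediate files
trap 'rm -rf "$execroot" "$scratch"' EXIT

mkdir -p "$execroot/tools/jdk"
echo 'platform(name = "default_host_platform")' > "$execroot/BUILD"
echo '# java_tools BUILD file' > "$execroot/tools/jdk/BUILD.java_tools"

# The intermediate file goes to the scratch directory, never to the cwd.
(cd "$execroot" && cat tools/jdk/BUILD.java_tools > "$scratch/BUILD")
# The archive step would follow here; zip's -j flag drops the directory part,
# so "$scratch/BUILD" would still be stored under the name BUILD.

workspace_build="$(cat "$execroot/BUILD")"
staged_build="$(cat "$scratch/BUILD")"
echo "execroot BUILD: $workspace_build"
echo "staged BUILD:   $staged_build"
```

The top-level BUILD file keeps its original content, while the staged copy carries the java_tools content that is meant to end up in the zip.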

/cc @comius

@philwo
Member Author

philwo commented Dec 2, 2020

OMG, Yun! Thank you so much for all the debugging and figuring this out. 🤯 Of course, now that I read your explanation, it all makes sense. Just from looking at the genrule, it's very hard to see that it might cause this kind of problem, especially because it just happened to work and only affects non-sandboxed execution.

@comius
Contributor

comius commented Dec 3, 2020

Thanks Yun for your effort. I didn't know I don't have a sandbox. Let me fix this issue properly, without putting some more perfume over a pig.

@comius comius assigned comius and unassigned meteorcloudy Dec 3, 2020
@meteorcloudy
Member

meteorcloudy commented Dec 3, 2020

Thank you, glad I can help! 😉

bazel-io pushed a commit that referenced this issue Dec 7, 2020
Related #12578

RELNOTES: None
PiperOrigin-RevId: 346080243

3 participants