Flaky conflict installing rust-src on GitHub actions runners #3530

Closed
davidhewitt opened this issue Nov 11, 2023 · 24 comments

@davidhewitt

On the PyO3 CI we're hitting a flaky issue installing rust-src as part of the setup steps on a GitHub Actions runner (via the dtolnay/rust-toolchain action). E.g. https://github.com/PyO3/pyo3/actions/runs/6829368772/job/18575387359#step:5:111

error: failed to install component: 'rust-src', detected conflict: 'lib/rustlib/src/rust/Cargo.lock'

I think we've been encountering this for a little while at a very low probability, but since yesterday this has been failing at a probability of maybe 2-3%. Still low, but because we want the whole build matrix to succeed, just one job failing will fail the merge. Restarting the CI doesn't help us much, because we get a different job failing with the same error.
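A rough back-of-the-envelope sketch of why a small per-job rate hurts: if each job fails independently with probability p, a matrix of n jobs sees at least one failure with probability 1 - (1 - p)^n. Both the independence and the job count of 50 used below are assumptions for illustration, not PyO3's actual matrix size.

```shell
# Probability that at least one of n independent jobs fails,
# given a per-job failure probability p: 1 - (1 - p)^n.
matrix_failure_probability() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 - (1 - p) ^ n }'
}

# Hypothetical example: 2.5% per-job failure rate, 50 jobs in the matrix.
matrix_failure_probability 0.025 50   # prints 0.72
```

Under those assumptions roughly 7 in 10 runs would contain at least one failed job, which matches the experience that re-running rarely produces a fully green matrix.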

See PyO3/pyo3#3570 for a repeated chain of failed merges hitting this.

Any insight you can offer to help resolve this would be greatly appreciated. Given the flakiness, it feels like a cache issue, but at the point of failing install I don't think we've restored anything from cache.

@rami3l
Member

rami3l commented Nov 11, 2023

@davidhewitt Thanks for filing this issue!

The current workaround seems to be uninstalling the toolchain and installing it again: rust-lang/rls#1587. I'm still not sure what caused it though.
(Maybe removing the minimal profile will also work since you're on the official ubuntu image with rust already installed and this workflow will do an upgrade-downgrade to a newer version with a smaller profile, not quite sure about this...)

@rbtcollins do you happen to have any more context on this one?

@davidhewitt
Author

(Maybe removing the minimal profile will also work since you're on the official ubuntu image with rust already installed and this workflow will do an upgrade-downgrade to a newer version with a smaller profile, not quite sure about this...)

Interesting, should this upgrade / downgrade be visible in the logs? I don't see any mention of "downgrade" in the logs from the failing CI job linked in the OP, for example. (At the time of writing this the ubuntu-latest image is up to date on Rust 1.73, so the toolchain version isn't changing.)

It looks like the GitHub Actions ubuntu-latest image is the minimal profile plus the rustfmt and clippy components, installed by rustup component add rustfmt clippy. Does that mean that if I were to add the rustfmt and clippy components I would be able to avoid the "downgrade"?

https://github.com/actions/runner-images/blob/e5b8919eebf4da1abfd013d080a4d75e6db21e34/images/linux/scripts/installers/rust.sh#L14

Thanks for the help 🙏

@rami3l
Member

rami3l commented Nov 12, 2023

@davidhewitt Thanks for following up! I was probably wrong about the downgrade thing [1]. It's just that I don't see why Cargo.lock is there already (I believe 'lib/rustlib/src/rust/Cargo.lock' is part of rust-src as the message suggests, which means we already have rust-src on that runner)... Normally this is what that conflict error message is for 🤔

Probably the status of the rust toolchains on the runner is not what it seems to be.

But anyway as a temporary workaround, you can uninstall stable before running dtolnay/rust-toolchain.
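A minimal sketch of that workaround as a shell function, assuming it runs as a CI step before the toolchain is set up. The function name is hypothetical; the rustup subcommands are the standard ones.

```shell
# Remove a toolchain completely, then install it again with rust-src.
# The toolchain name defaults to "stable".
reinstall_toolchain() {
  toolchain="${1:-stable}"
  rustup toolchain uninstall "$toolchain" || return 1
  rustup toolchain install "$toolchain" --profile minimal --component rust-src
}
```

Calling `reinstall_toolchain stable` performs both steps. When a later step such as dtolnay/rust-toolchain does the installation anyway, the uninstall line alone should be enough.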


Update: Oops, it's a known issue! Looks like #2601 is still not completely fixed. Claim.

Footnotes

  1. I'm quite new here so I'm also learning the different parts of the project on the fly, sorry about that 🙇

@rami3l rami3l self-assigned this Nov 12, 2023
@rami3l
Member

rami3l commented Nov 12, 2023

I can reproduce the same error message on my machine by:

  1. rustup toolchain install stable --component=rust-src --profile=minimal;
  2. Back up lib/rustlib/src to somewhere else (!!);
  3. rustup component remove rust-src;
  4. Restore the backup.

Now we have:

> rustup toolchain install stable --component=rust-src --profile=minimal
info: syncing channel updates for 'stable-aarch64-apple-darwin'
info: latest update on 2023-10-05, rust version 1.73.0 (cc66ad468 2023-10-03)
info: downloading component 'rust-src'
info: installing component 'rust-src'
info: rolling back changes
error: failed to install component: 'rust-src', detected conflict: 'lib/rustlib/src/rust/Cargo.lock'

But I'm not sure if this is the same scenario. We could skip this check, or possibly add something like --force, similar to what brew install does when it detects an existing file 🤔
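The four reproduction steps above can be sketched as a script. It deliberately leaves stale rust-src files in the stable toolchain, so it should only be run on a throwaway machine. Locating the toolchain directory via `rustc +stable --print sysroot` and keeping the backup in a `mktemp` directory are assumptions of this sketch, not something stated in the thread.

```shell
# WARNING: this corrupts the stable toolchain on purpose.
reproduce_conflict() {
  # Step 1: install stable with rust-src.
  rustup toolchain install stable --component=rust-src --profile=minimal || return 1
  sysroot="$(rustc +stable --print sysroot)" || return 1
  [ -n "$sysroot" ] || return 1
  backup="$(mktemp -d)" || return 1
  # Step 2: back up lib/rustlib/src.
  cp -R "$sysroot/lib/rustlib/src" "$backup/src" || return 1
  # Step 3: let rustup remove the component.
  rustup component remove rust-src --toolchain stable || return 1
  # Step 4: restore the files behind rustup's back.
  rm -rf "$sysroot/lib/rustlib/src"
  cp -R "$backup/src" "$sysroot/lib/rustlib/src" || return 1
  # This install is now expected to fail with "detected conflict".
  rustup toolchain install stable --component=rust-src --profile=minimal
}
```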

@rami3l
Member

rami3l commented Nov 12, 2023

@davidhewitt Would you mind helping me record the output of rustup component list --installed and the status of lib/rustlib/src before setting up Rust in your failed CI? I suspect there is an incoherence problem. Thanks!
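One way to capture that information is a small diagnostic step that runs before any Rust setup. This is a sketch: the function name is hypothetical, and it assumes toolchains live under `$RUSTUP_HOME/toolchains` (falling back to `~/.rustup`), which matches the paths seen on the runners.

```shell
# Print what rustup believes is installed, then what is actually on disk.
dump_rustup_state() {
  rustup_home="${RUSTUP_HOME:-$HOME/.rustup}"
  echo "== rustup component list --installed =="
  rustup component list --installed || echo "(command failed)"
  echo "== lib/rustlib/src per toolchain =="
  for toolchain_dir in "$rustup_home"/toolchains/*/; do
    [ -d "$toolchain_dir" ] || continue
    src_dir="${toolchain_dir}lib/rustlib/src"
    if [ -d "$src_dir" ]; then
      echo "present: $src_dir"
      ls -A "$src_dir"
    else
      echo "absent: $src_dir"
    fi
  done
}
```

Comparing the two sections of the output between a passing and a failing job should show whether the component list and the directory contents disagree.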

@davidhewitt
Author

davidhewitt commented Nov 14, 2023

It looks like lib/rustlib/src is indeed populated when the job fails, but not always, which is why we see the flaky behaviour.

https://github.com/PyO3/pyo3/actions/runs/6861561741/job/18676110914?pr=3571

@davidhewitt
Author

It also seems that our rust-toolchain.toml was part of the problem. It seemed to be triggering rustup to check if rust-src was installed multiple times over.

Removing it helped us get to a green CI run again: PyO3/pyo3#3575

@rami3l
Member

rami3l commented Nov 15, 2023

@davidhewitt Thanks! I'll try to reproduce this issue again with this new info when I have time :)

@messense

Looks like it's not just rust-src, other components can also fail with the same conflict error, see PyO3/maturin#1856

error: failed to install component: 'rustc-x86_64-unknown-linux-gnu', detected conflict: 'bin/rust-gdb'

and

error: failed to install component: 'clippy-preview-x86_64-unknown-linux-gnu', detected conflict: 'bin/cargo-clippy'

@rami3l
Member

rami3l commented Nov 17, 2023

Looks like it's not just rust-src, other components can also fail with the same conflict error, see PyO3/maturin#1856

error: failed to install component: 'rustc-x86_64-unknown-linux-gnu', detected conflict: 'bin/rust-gdb'

and

error: failed to install component: 'clippy-preview-x86_64-unknown-linux-gnu', detected conflict: 'bin/cargo-clippy'

@messense Thanks for reporting! In your logs I have spotted something like:

info: removing previous version of component 'clippy'
warning: during uninstall component clippy was not found

... so I'm even more convinced that this is an inconsistency problem. Could it be that somehow certain (but not all) GHA runners pretend to have clippy or rust-src or whatever installed?

I'm still not sure whether I have correctly reproduced this error locally... So far I haven't been able to trigger it on my machine without manually manipulating the contents of the installation folder.

But there is indeed a point of improvement anyway on our side:

Maybe rustup could have a toolchain verify subcommand that checks the health of and fixes corruption in the toolchain and components? Of course it'd be better if the corruption could be avoided in the first place but it might be good to have some obvious method of recovery for when the worst does happen.

Bonus points if it could diagnose the problem and ask the user to send a bug report with that information, but that's maybe a bit extra 🙂.

Originally posted by @ChrisDenton in #2704 (comment)
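No such verify subcommand exists in this thread's version of rustup, but a user-side approximation for a single component can be sketched. The function below is hypothetical and checks only rust-src: it compares whether rustup lists the component against whether the file named in the conflict error is on disk. It assumes `rustup component list --installed` prints one component name per line.

```shell
# Usage: check_rust_src_consistency <toolchain-dir> <toolchain-name>
# Returns non-zero when rustup's view and the file system disagree.
check_rust_src_consistency() {
  toolchain_dir="$1"
  toolchain_name="$2"
  marker="$toolchain_dir/lib/rustlib/src/rust/Cargo.lock"
  if rustup component list --installed --toolchain "$toolchain_name" | grep -q '^rust-src'; then
    listed=yes
  else
    listed=no
  fi
  if [ -e "$marker" ]; then on_disk=yes; else on_disk=no; fi
  echo "listed=$listed on_disk=$on_disk"
  [ "$listed" = "$on_disk" ]
}
```

A result of `listed=no on_disk=yes` would correspond to the state in which installing rust-src reports a conflict.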

@Xuanwo

Xuanwo commented Nov 17, 2023

Could it be that somehow certain (but not all) GHA runners pretend to have clippy or rust-src or whatever installed?

Yes.

The GitHub runner has its own Rust setup by default: https://github.com/actions/runner-images/blob/cd2cabc7ab676a9c4603ffd680963ddfc5220270/images/ubuntu/Ubuntu2204-Readme.md?plain=1#L142-L154

@rami3l
Member

rami3l commented Nov 17, 2023

Could it be that somehow certain (but not all) GHA runners pretend to have clippy or rust-src or whatever installed?

Yes.

The GitHub runner has its own Rust setup by default: https://github.com/actions/runner-images/blob/cd2cabc7ab676a9c4603ffd680963ddfc5220270/images/ubuntu/Ubuntu2204-Readme.md?plain=1#L142-L154

I'm aware of that, and it has already been mentioned in #3530 (comment). However, if we take either that Dockerfile or this .md file as the source of truth, then I expect all those machines to have exactly the same set of tools installed, so rustup should not find what shouldn't be installed at all, nor should it fail to find what is actually installed. That's what I meant by the word "pretend".

However, as we've seen in this thread, this issue only happens randomly.

That being said, this doesn't mean it's not a rustup issue at all. As I said I'm still trying to reproduce this locally (still no success so far). If anyone happens to have found a way that seems to reflect what has happened on the CI better than what I've done in #3530 (comment), please let me know!

@Xuanwo

Xuanwo commented Nov 19, 2023

Another point to note is that this error only occurs during workflows that call maturin build, and never in other workflows. (In OpenDAL's case, we maintain a rust-toolchain.toml which specifies stable.)

@davidhewitt
Author

Another point to note is that this error only occurs during workflows that call maturin build, and never in other workflows.

That's not the case for us, we had this problem on a job which just installed and ran rustfmt.

@Xuanwo

Xuanwo commented Nov 20, 2023

I found a new case here, not sure if it's valuable:

× Preparing editable metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [33 lines of output]
      info: syncing channel updates for 'stable-x86_64-unknown-linux-gnu'
      info: latest update on 2023-11-16, rust version 1.74.0 (79e9716c9 2023-11-13)
      info: downloading component 'clippy'
      info: downloading component 'rust-analyzer'
      info: downloading component 'rustfmt'
      info: downloading component 'cargo'
      info: downloading component 'rust-std'
      info: downloading component 'rustc'
      info: removing previous version of component 'clippy'
      warning: during uninstall component clippy was not found
      info: removing previous version of component 'rustfmt'
      warning: during uninstall component rustfmt was not found
      info: removing previous version of component 'cargo'
      warning: during uninstall component cargo was not found
      info: removing previous version of component 'rust-std'
      warning: during uninstall component rust-std was not found
      info: removing previous version of component 'rustc'
      info: installing component 'clippy'
      info: installing component 'rust-analyzer'
      info: installing component 'rustfmt'
      info: installing component 'cargo'
      info: installing component 'rust-std'
      info: rolling back changes
      error: could not rename component file from '/home/runner/.rustup/tmp/xsmd_g52euit203d_dir/bk' to '/home/runner/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/share'
      error: could not rename component file from '/home/runner/.rustup/tmp/n4ezxsonwowzdzeg_dir/bk' to '/home/runner/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/share/doc'
      error: could not rename component file from '/home/runner/.rustup/tmp/sysi9zrkbrqnm3bl_dir/bk' to '/home/runner/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/share/man'
      error: failed to install component: 'rust-std-x86_64-unknown-linux-gnu', detected conflict: 'lib/rustlib/x86_64-unknown-linux-gnu/lib/librustc-stable_rt.asan.a'
      
      Cargo, the Rust package manager, is not installed or is not on PATH.
      This package requires Rust and Cargo to compile extensions. Install it through
      the system's package manager or via https://rustup.rs/
      
      Checking for Rust toolchain....
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.

The full workflow: https://github.com/apache/incubator-opendal/actions/runs/6930546267/job/18850425527

Why would we touch /home/runner/.rustup/tmp/n4ezxsonwowzdzeg_dir/bk? It looks like something moves the current toolchain to xxxx/bk while its files are already open.

@messense

@Xuanwo Have you tried actually setting up the Rust toolchain in the Setup Rust toolchain step? Currently that step does not really install Rust, which defers the installation until pip install.

@Xuanwo

Xuanwo commented Nov 20, 2023

Have you tried actually setting up the Rust toolchain in the Setup Rust toolchain step?

The Setup Rust toolchain step in OpenDAL CI only does things like setting RUSTFLAGS, RUST_BACKTRACE and CARGO_REGISTRIES_CRATES_IO_PROTOCOL; no Rust toolchain is set up.

https://github.com/apache/incubator-opendal/blob/268d5b8784ce9f30e9c0b70dc6201107a0d5b8f1/.github/actions/setup/action.yaml#L36-L47

@messense

messense commented Nov 20, 2023

The Setup Rust toolchain step in OpenDAL CI only does things like setting RUSTFLAGS, RUST_BACKTRACE and CARGO_REGISTRIES_CRATES_IO_PROTOCOL; no Rust toolchain is set up.

Though off topic here, that means the step name is just confusing.

Anyway, you should definitely try actually setting up Rust before invoking pip/maturin.

@Xuanwo

Xuanwo commented Nov 20, 2023

Anyway, you should definitely try actually setting up Rust before invoking pip/maturin.

Thanks for the advice, I submitted a PR for this: apache/opendal#3633

I'll observe it in the coming days and report back here.

@Xuanwo

Xuanwo commented Nov 21, 2023

After PR apache/opendal#3633, we haven't hit this issue again. Thanks to @messense for the idea!

So the conclusion is:

  • rustup appears to have some concurrency issues. Perhaps we need a file lock when users invoke cargo on a rust-toolchain that has not been set up yet? cc @rami3l, please try calling cargo concurrently in this case.
  • Calling a simple cargo version before the workflow starts can prevent this from happening. cc @davidhewitt, please consider adding such a step as a pre-action.
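The second point can be sketched as a warm-up function for an early, single-process CI step. The idea, as described in this thread, is that invoking cargo through rustup triggers any pending toolchain installation once, before concurrent tools such as pip or maturin get a chance to trigger it in parallel. The function name is hypothetical, and the extra rustc call is an addition of this sketch.

```shell
# Run once, serially, before any step that may invoke cargo concurrently.
warm_up_toolchain() {
  # Any implicit toolchain installation happens here, in one process.
  cargo --version || return 1
  # Also confirm that the compiler resolves.
  rustc --version
}
```

This only works around the race; it does not address the missing locking inside rustup itself.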

@messense

  • rustup appears to have some concurrency issues.

Sounds like #988

@Xuanwo

Xuanwo commented Nov 21, 2023

Sounds like #988

Exactly! Perhaps we could label this as a duplicate and direct users to the original issue. I'm also not sure whether it's worth adding the cargo version trick to the docs as a workaround.

@rami3l
Member

rami3l commented Nov 21, 2023

  • rustup appears to have some concurrency issues.

Sounds like #988

Oh, it has been on my watch list for so long, but I tend to avoid declaring everything as a duplicate of #988 (^_^') before the evidence is found.

Now, however, I'm fully convinced that this is indeed another duplicate. Thanks for all your comments!

PS: #988 is actually the next target for me after #3483, however as you can see I'm a bit occupied by the graduation stuff right now. I'll definitely have a deeper look into it after finishing all that :)

@rami3l rami3l closed this as not planned Nov 21, 2023
@rami3l rami3l removed their assignment Nov 21, 2023
@rami3l
Member

rami3l commented Nov 21, 2023

Closing as a duplicate of #988.
