[RFC] Alternative approach to retries in CI #41164
Conversation
r? @brson (rust_highfive has picked a reviewer for you, use r? to override)
@@ -769,38 +769,7 @@ fn link_natively(sess: &Session,
// with some thread pool working in the background. It seems that no one
// currently knows a fix for this so in the meantime we're left with this...
info!("{:?}", &cmd);
let retry_on_segfault = env::var("RUSTC_RETRY_LINKER_ON_SEGFAULT").is_ok();
Should probably also remove where we pass this in the various containers/scripts.
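For reference, the flag is opted into from the CI environment itself, so removing it also means dropping the lines that set it there; a rough, hypothetical illustration of such a line:

```sh
# Hypothetical illustration only; the real containers/scripts may set this differently.
export RUSTC_RETRY_LINKER_ON_SEGFAULT=1
```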
Can you show how this actually gets integrated into the rest of the CI? It looks like this is only half of it so far.
The integration is prepending the directory with these shell wrappers to the PATH. I’m not sure if anything of the sort can be made to work on Windows, but I find this to be a significantly cleaner solution compared to adding retries all over the codebase :)
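As a minimal sketch of that hook-up, assuming the wrappers live in a hypothetical src/ci/retry-wrappers directory (the name is a placeholder, not necessarily what this PR uses):

```sh
# Prepend the wrapper directory so that plain invocations of e.g. `cc` or
# `sccache` resolve to the retry wrappers before the real binaries.
export PATH="$(pwd)/src/ci/retry-wrappers:$PATH"
```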
Yes, I'm primarily worried about Windows, where executing shell scripts will not work.
Some thoughts:

- Of the commands listed,
- Aside from this PR, I do think it's worth having a single location to track retry hacks - retries spread across the codebase is a reasonable concern.
- A note on Windows: it'll run this shell script anyway (as it should, since the CI shell script makes calls to some of the commands on this list), but yes, it will need an equivalent batch or PowerShell script (entirely possible) to set up the actual Windows environment.
That’s also something that could be improved eventually without much effort: adjust the loop to
I don't know much about OSX, but the retry logic you removed refers just to clang? Ah, but is cc just a symlink to clang? That would make sense.
I was thinking that CI logs probably aren't good enough (who looks at them if they pass!?), but yes, we can probably automate some collection solution from the logs if the info is in there.

I'm not clear if 'RFC' in the title of this PR means it's to start a discussion or if there's an intent to implement and merge. My comments are mostly 'review-like' and can be addressed with some effort, so I'm mainly interested in the fundamentals (are we ok with users not getting the benefit? How confusing is it for infra newcomers? Does it matter that we'll need specialist retries anyway, e.g. git submodules? Do we have sufficient retries to make it worth it?) - without this, implementing my suggestions may be pointless.

The confusion for infra newcomers (or people who forget) is a personal concern (I've been bitten before by wrappers... including rustup! I admit, this probably makes me a bit leery). I also wonder if we have enough retries that would fall under this umbrella to make it worth it - a list would make it easier to judge the pain reduction.

After a look down the spurious failures list (to be clear, I'm explicitly focusing on scenarios where we can filter retries, since blind command retries make me nervous):
Not included are: git (specialist retry), x.py/bootstrap.rs curl (directly helpful to users), the openssl failure on OSX due to an invalid archive (probably needs a specialist retry).

Given this list, my inclination would be to add the retries in the specific places (a sketch of what I mean is below) and document them.

Have others found wrapper approaches good in the past? As I note, I've had some bad experiences.
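As an illustration of a 'specialist' retry, a sketch only, under the assumption that the retry sits right next to the known-flaky step rather than behind a generic command wrapper:

```sh
#!/usr/bin/env bash
# Hypothetical targeted retry for a single known-flaky step (submodule
# checkout); not the actual CI code, just the shape of a specialist retry.
n=0
until git submodule update --init --recursive; do
    n=$((n + 1))
    if [ "$n" -ge 3 ]; then
        echo "git submodule update failed after $n attempts" >&2
        exit 1
    fi
    echo "git submodule update failed (attempt $n); retrying in 5s..." >&2
    sleep 5
done
```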
Yes, I think that’s true.
You certainly will not get any logging from the linker retries implementation that’s directly inside the compiler. It would be insane to have such logging code inside a compiler.

The RFC is a request for comments on an approach to the problem of retries, in the context of our current retry code crumbs all over the compiler. Note that this is not proposing a change in functionality (we already do retries, regardless of how confusing or not they are to newcomers, how worthwhile the retries are, etc.). All this does is centralise the retry code in a single file (well, two files, with a future Windows retry implementation).

Which users are you thinking of? The regular people who do the edit-compile-run cycle? They aren’t supposed to be able to enable the retry code we already have; it is internal. The people who use rustc in CI? They are more in the category of "regular people" rather than users who may use the internal implementation details of rustc. This is a good question, and the answer is that users shouldn’t even know about a "benefit" like this existing in the first place.
Anything that gets that mess out of librustc_back is fine with me :) I would love to discuss alternative approaches which do not involve adding that insta-stable (even if hidden from the public) mess into compiler backend code.
And so have I. That being said, I’d much rather have a wrapper than a loop in librustc_back :) I’m glad that all the special cases you’ve listed are something that can then be implemented outside of the compiler itself, as long as we’re not putting retry code directly inside the compiler.
I’ll try to prototype a Windows batch/PowerShell script eventually, but I’ve never written either of those, so no idea how it will go.
So, I’m way too busy with other stuff more pertinent to my T-compiler and IRL duties to push this any further for the time being. I’ll close this.
This introduces wrapper scripts which retry the binaries they wrap. This makes it trivial to add commands to retry in order to work around the spurious failures related to specific commands.

For example, failures related to sccache or artefact upload could be made less critical by retrying the sccache/s3 invocations with this approach.
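A minimal sketch of what one of these wrappers could look like; the retry count and the way the real binary is located are assumptions for illustration, not necessarily what the scripts in this PR do:

```sh
#!/usr/bin/env bash
# Hypothetical retry wrapper, installed under the wrapped binary's name
# (e.g. `sccache`) in a directory that CI prepends to PATH.
set -u

wrapper_dir="$(cd "$(dirname "$0")" && pwd)"
name="$(basename "$0")"

# Drop the wrapper directory from PATH so `command -v` finds the real binary.
PATH="$(printf '%s' "$PATH" | tr ':' '\n' | grep -vFx "$wrapper_dir" | paste -sd ':' -)"
real="$(command -v "$name")" || { echo "retry wrapper: $name not found" >&2; exit 127; }

attempt=0
until "$real" "$@"; do
    status=$?
    attempt=$((attempt + 1))
    if [ "$attempt" -ge 3 ]; then
        exit "$status"
    fi
    echo "retry wrapper: $name failed (status $status), retrying (attempt $attempt of 3)" >&2
done
```

Adding another flaky command to the retry list would then amount to dropping one more wrapper (or a symlink to a shared one) with that command's name into the directory.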
cc @alexcrichton @aturon