Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Alternative approach to retries in CI #41164

Closed
wants to merge 1 commit into from

Conversation

nagisa
Copy link
Member

@nagisa nagisa commented Apr 8, 2017

This introduces wrapper scripts which retry the binaries. This makes it trivial
to add commands to retry in order to work-around the spurious failures related
to commands.

For example, failures related to sccache or artefact upload could be made less critical by retrying
sccache/s3 stuff with this approach.

cc @alexcrichton @aturon

@rust-highfive
Copy link
Collaborator

r? @brson

(rust_highfive has picked a reviewer for you, use r? to override)

@@ -769,38 +769,7 @@ fn link_natively(sess: &Session,
// with some thread pool working in the background. It seems that no one
// currently knows a fix for this so in the meantime we're left with this...
info!("{:?}", &cmd);
let retry_on_segfault = env::var("RUSTC_RETRY_LINKER_ON_SEGFAULT").is_ok();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably also remove where we pass this in the various containers/scripts.

This introduces wrapper scripts which retry the binaries.  This makes it trivial
to add commands to retry in order to work-around the spurious failures related
to commands.

For example, failures related to sccache could be made less critical by retrying
sccache with this approach.
@alexcrichton
Copy link
Member

Can you show how this actually gets integrated into the rest of the CI? It looks like this is only half of it so far.

@nagisa
Copy link
Member Author

nagisa commented Apr 8, 2017

@alexcrichton

source "$ci_dir/retryify.sh"

is prepending the directory with these shell wrappers to the $PATH, so it becomes something like PATH=/tmp/generated/:/usr/bin/:.... This means that anything which calls anything in the array of EXECUTABLES will call into the wrapper script, which does the retries. Making a binary retry-able is then as simple as adding its name to this EXECUTABLES array.

I’m not sure if anything of the sort can be made to work on Windows, but I find this to be a significantly cleaner solution compared to adding retries all over the codebase :)

@alexcrichton
Copy link
Member

Yes I'm primarily worried about windows, where executing shell scripts will not work.

@aidanhs
Copy link
Member

aidanhs commented Apr 12, 2017

Some thoughts:

  • we end up with a central registry of retries due to spurious failures on CI, which is nice (assuming the large comment from src/librustc_trans/back/link.rs gets moved)
  • people who aren't using the rust CI scripts no longer get the benefit of any current or future retry logic. I'd draw a parallel to the spurious sccache failures due to the OSX bug, where there was a workaround for it...in the mio tests (IIRC). Currently this comment applies to the curl retry in bootstrap.py, but I'm now curious why the OSX linker retry isn't on by default.
  • the retries are much less targeted. For example, clang will now unconditionally retry on failure three times, rather than checking if it's a linking failure on OSX (or other platforms where clang is the linker?). This could be considered a positive in the case of curl
  • related to being less targeted, this retry would not help in the case of git submodule clone failures (see Retry-on-failure for submodule doesn't seem to work #39051) because the operation isn't atomic (I'm not sure what git failures it would help with - maybe clone, but that tends to happen before we get involved in the build process)
  • if we do do this in this way, we'll definitely need some way to monitor and collect the retries

Of the commands listed, git and clang I've mentioned above as ones I'm less convinced by. Do we have failures of cc? I guess any command in this list could probably do with a brief description of why it's there. I could be in favour of curl, as long as we track failures so if there's an external dependency that goes down often, we know it needs re-hosting (I seem to recall this happening before?).

Aside from this PR, I do think it's worth having a single location to track retry hacks - retries spread across the codebase is a reasonable concern.

A note on Windows: it'll run this shell script anyway (as it should, since the CI shell script makes calls to some of the commands on this list), but yes, it will need an equivalent batch or powershell script (entirely possible) to set up the actual Windows environment.

@nagisa
Copy link
Member Author

nagisa commented Apr 14, 2017

Of the commands listed, git and clang I've mentioned above as ones I'm less convinced by. Do we have failures of cc?

clang and cc are what could be used to link stuff on OS X. And apple does not know how to implement linkers (or anything really, but that’s besides the point), apparently :)

if we do do this in this way, we'll definitely need some way to monitor and collect the retries

as long as we track failures

That’s also something that could be improved eventually without much effort: adjust the loop to echo "stuff" > /tmp/retries which is then cat'ed after build ends.

@arielb1 arielb1 added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-tools labels Apr 15, 2017
@alexcrichton
Copy link
Member

@nagisa can you show (in code) how this would work on all platforms and how some of @aidanhs's concerns would be addressed?

@aidanhs
Copy link
Member

aidanhs commented Apr 19, 2017

clang and cc are what could be used to link stuff on OS X. And apple does not know how to implement linkers (or anything really, but that’s besides the point), apparently :)

I don't know much about OSX, but the retry logic you removed refers just to clang? Ah, but is cc just a symlink to clang? That would make sense.

That’s also something that could be improved eventually without much effort: adjust the loop to echo "stuff" > /tmp/retries which is then cat'ed after build ends.

I was thinking that CI logs probably aren't good enough (who looks at them if they pass!?) but yes, we can probably automate some collection solution from logs if the info is in there.


I'm not clear if 'RFC' in the title of this PR means it's to start a discussion or if there's an intent to implement and merge. My comments are mostly 'review-like' that can be addressed with some effort, so I'm mainly interested in the fundamentals (are we ok with users not getting the benefit? How confusing is it for infra newcomers? Does it matter that we'll need specialist retries anyway, e.g. git submodules? Do we have sufficient retries to make it worth it?) - without this, implementing my suggestions may pointless.

The confusion for infra newcomers (or people who forget) is a personal concern (I've been bitten before by wrappers...including rustup! I admit, this probably makes me a bit leery). I also wonder if we have enough retries that would fall under this umbrella to make it worth it - a list would make it easier to judge the pain reduction. After a look down the spurious failures list (to be clear, I'm explicitly focusing on scenarios where we can filter retries since blind command retries make me nervous):

  • clang (cc?) - osx, retry on segfault message
  • ar.exe - windows, maybe retries will help us get past the segfaults, needs batch/powershell magic
  • rust test harness segfault, linux musl only
  • windows access violations (maybe? Not even sure that we know the failing command)

Not included are: git (specialist retry), x.py/bootstrap.rs curl (directly helpful to users), openssl failure on osx due to invalid archive (probably needs a specialist retry). Given this list, my inclination would be to add the retries in the specific places and document in src/ci/README.md or similar - if there were twice as many I'd possibly think otherwise.

Have others found wrapper approaches good in the past? As I note, I've had some bad experiences.

@nagisa
Copy link
Member Author

nagisa commented Apr 19, 2017

I don't know much about OSX, but the retry logic you removed refers just to clang? Ah, but is cc just a symlink to clang? That would make sense.

Yes, I think that’s true.

I was thinking that CI logs probably aren't good enough (who looks at them if they pass!?) but yes, we can probably automate some collection solution from logs if the info is in there.

You’ll certainly will not get any logging from the linker retries implementation that’s directly inside the compiler. It would be insane to have such code inside a compiler.

I'm not clear if 'RFC' in the title of this PR means it's to start a discussion or if there's an intent to implement and merge.

The RFC is a request for comments on approach to the problem of retries in context of our current retry code crumbs all over the compiler. Note that this is not proposing a change in functionality (we already do retries, regardless of how confusing or not they are to newcomers, how worth the retries are etc). All this does is centralise the retry code in a single (well, two files, with future windows retry implementation).

are we ok with users not getting the benefit?

Which users are you thinking of? The regular people who do edit-compile-run cycle? They aren’t supposed to be able to enable the retry code we already have. It is internal. The people who use rustc in CI? They are more in the category of "regular people" rather than users who may use the internal implementation details of rustc. This is a good question, and the answer is that the users shouldn’t even know about the "benefit" like this existing in the first place.

I also wonder if we have enough retries that would fall under this umbrella to make it worth it - a list would make it easier to judge the pain reduction.

Anything that gets out that mess from librustc_back is fine with me :) I would love to discuss alternative approaches which do not involve adding the insta-stable, even if hidden from public, mess into compiler backend code.

The confusion for infra newcomers (or people who forget) is a personal concern (I've been bitten before by wrappers...including rustup!

And so have I. That being said, I’d much rather a wrapper than a loop in librustc_back :)


I’m glad that all the special cases you’ve listed is something that can then be implemented outside of the compiler itself. As long as we’re not putting retry code directly inside the compiler (for clang, ar or anything else), I’m fine with anything :)

@nagisa
Copy link
Member Author

nagisa commented Apr 19, 2017

I’ll try to prototype a windows batch/powershell script eventually, but I’ve never written either of those so no idea how it will go.

@arielb1 arielb1 removed T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-tools labels Apr 27, 2017
@nagisa nagisa added the T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. label Apr 27, 2017
@nagisa
Copy link
Member Author

nagisa commented Apr 28, 2017

So, I’m way too busy with other stuff more pertinent to my T-compiler and IRL duties to push this any further for the time being.

I’ll close this.

@nagisa nagisa closed this Apr 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants