[RFC] Alternative approach to retries in CI #41164

nagisa · 2017-04-08T17:55:31Z

This introduces wrapper scripts which retry the binaries. This makes it trivial
to add commands to retry in order to work-around the spurious failures related
to commands.

For example, failures related to sccache or artefact upload could be made less critical by retrying
sccache/s3 stuff with this approach.

cc @alexcrichton @aturon

rust-highfive · 2017-04-08T17:55:42Z

r? @brson

(rust_highfive has picked a reviewer for you, use r? to override)

Mark-Simulacrum · 2017-04-08T18:03:01Z

src/librustc_trans/back/link.rs

@@ -769,38 +769,7 @@ fn link_natively(sess: &Session,
    // with some thread pool working in the background. It seems that no one
    // currently knows a fix for this so in the meantime we're left with this...
    info!("{:?}", &cmd);
-    let retry_on_segfault = env::var("RUSTC_RETRY_LINKER_ON_SEGFAULT").is_ok();


Should probably also remove where we pass this in the various containers/scripts.

This introduces wrapper scripts which retry the binaries. This makes it trivial to add commands to retry in order to work-around the spurious failures related to commands. For example, failures related to sccache could be made less critical by retrying sccache with this approach.

alexcrichton · 2017-04-08T20:22:37Z

Can you show how this actually gets integrated into the rest of the CI? It looks like this is only half of it so far.

nagisa · 2017-04-08T22:21:55Z

@alexcrichton

source "$ci_dir/retryify.sh"

is prepending the directory with these shell wrappers to the $PATH, so it becomes something like PATH=/tmp/generated/:/usr/bin/:.... This means that anything which calls anything in the array of EXECUTABLES will call into the wrapper script, which does the retries. Making a binary retry-able is then as simple as adding its name to this EXECUTABLES array.

I’m not sure if anything of the sort can be made to work on Windows, but I find this to be a significantly cleaner solution compared to adding retries all over the codebase :)

alexcrichton · 2017-04-10T15:02:28Z

Yes I'm primarily worried about windows, where executing shell scripts will not work.

aidanhs · 2017-04-12T20:10:46Z

Some thoughts:

we end up with a central registry of retries due to spurious failures on CI, which is nice (assuming the large comment from src/librustc_trans/back/link.rs gets moved)
people who aren't using the rust CI scripts no longer get the benefit of any current or future retry logic. I'd draw a parallel to the spurious sccache failures due to the OSX bug, where there was a workaround for it...in the mio tests (IIRC). Currently this comment applies to the curl retry in bootstrap.py, but I'm now curious why the OSX linker retry isn't on by default.
the retries are much less targeted. For example, clang will now unconditionally retry on failure three times, rather than checking if it's a linking failure on OSX (or other platforms where clang is the linker?). This could be considered a positive in the case of curl
related to being less targeted, this retry would not help in the case of git submodule clone failures (see Retry-on-failure for submodule doesn't seem to work #39051) because the operation isn't atomic (I'm not sure what git failures it would help with - maybe clone, but that tends to happen before we get involved in the build process)
if we do do this in this way, we'll definitely need some way to monitor and collect the retries

Of the commands listed, git and clang I've mentioned above as ones I'm less convinced by. Do we have failures of cc? I guess any command in this list could probably do with a brief description of why it's there. I could be in favour of curl, as long as we track failures so if there's an external dependency that goes down often, we know it needs re-hosting (I seem to recall this happening before?).

Aside from this PR, I do think it's worth having a single location to track retry hacks - retries spread across the codebase is a reasonable concern.

A note on Windows: it'll run this shell script anyway (as it should, since the CI shell script makes calls to some of the commands on this list), but yes, it will need an equivalent batch or powershell script (entirely possible) to set up the actual Windows environment.

nagisa · 2017-04-14T14:54:31Z

Of the commands listed, git and clang I've mentioned above as ones I'm less convinced by. Do we have failures of cc?

clang and cc are what could be used to link stuff on OS X. And apple does not know how to implement linkers (or anything really, but that’s besides the point), apparently :)

if we do do this in this way, we'll definitely need some way to monitor and collect the retries

as long as we track failures

That’s also something that could be improved eventually without much effort: adjust the loop to echo "stuff" > /tmp/retries which is then cat'ed after build ends.

alexcrichton · 2017-04-19T01:37:44Z

@nagisa can you show (in code) how this would work on all platforms and how some of @aidanhs's concerns would be addressed?

aidanhs · 2017-04-19T04:14:35Z

clang and cc are what could be used to link stuff on OS X. And apple does not know how to implement linkers (or anything really, but that’s besides the point), apparently :)

I don't know much about OSX, but the retry logic you removed refers just to clang? Ah, but is cc just a symlink to clang? That would make sense.

That’s also something that could be improved eventually without much effort: adjust the loop to echo "stuff" > /tmp/retries which is then cat'ed after build ends.

I was thinking that CI logs probably aren't good enough (who looks at them if they pass!?) but yes, we can probably automate some collection solution from logs if the info is in there.

I'm not clear if 'RFC' in the title of this PR means it's to start a discussion or if there's an intent to implement and merge. My comments are mostly 'review-like' that can be addressed with some effort, so I'm mainly interested in the fundamentals (are we ok with users not getting the benefit? How confusing is it for infra newcomers? Does it matter that we'll need specialist retries anyway, e.g. git submodules? Do we have sufficient retries to make it worth it?) - without this, implementing my suggestions may pointless.

The confusion for infra newcomers (or people who forget) is a personal concern (I've been bitten before by wrappers...including rustup! I admit, this probably makes me a bit leery). I also wonder if we have enough retries that would fall under this umbrella to make it worth it - a list would make it easier to judge the pain reduction. After a look down the spurious failures list (to be clear, I'm explicitly focusing on scenarios where we can filter retries since blind command retries make me nervous):

clang (cc?) - osx, retry on segfault message
ar.exe - windows, maybe retries will help us get past the segfaults, needs batch/powershell magic
rust test harness segfault, linux musl only
windows access violations (maybe? Not even sure that we know the failing command)

Not included are: git (specialist retry), x.py/bootstrap.rs curl (directly helpful to users), openssl failure on osx due to invalid archive (probably needs a specialist retry). Given this list, my inclination would be to add the retries in the specific places and document in src/ci/README.md or similar - if there were twice as many I'd possibly think otherwise.

Have others found wrapper approaches good in the past? As I note, I've had some bad experiences.

nagisa · 2017-04-19T14:37:14Z

I don't know much about OSX, but the retry logic you removed refers just to clang? Ah, but is cc just a symlink to clang? That would make sense.

Yes, I think that’s true.

I was thinking that CI logs probably aren't good enough (who looks at them if they pass!?) but yes, we can probably automate some collection solution from logs if the info is in there.

You’ll certainly will not get any logging from the linker retries implementation that’s directly inside the compiler. It would be insane to have such code inside a compiler.

I'm not clear if 'RFC' in the title of this PR means it's to start a discussion or if there's an intent to implement and merge.

The RFC is a request for comments on approach to the problem of retries in context of our current retry code crumbs all over the compiler. Note that this is not proposing a change in functionality (we already do retries, regardless of how confusing or not they are to newcomers, how worth the retries are etc). All this does is centralise the retry code in a single (well, two files, with future windows retry implementation).

are we ok with users not getting the benefit?

Which users are you thinking of? The regular people who do edit-compile-run cycle? They aren’t supposed to be able to enable the retry code we already have. It is internal. The people who use rustc in CI? They are more in the category of "regular people" rather than users who may use the internal implementation details of rustc. This is a good question, and the answer is that the users shouldn’t even know about the "benefit" like this existing in the first place.

I also wonder if we have enough retries that would fall under this umbrella to make it worth it - a list would make it easier to judge the pain reduction.

Anything that gets out that mess from librustc_back is fine with me :) I would love to discuss alternative approaches which do not involve adding the insta-stable, even if hidden from public, mess into compiler backend code.

The confusion for infra newcomers (or people who forget) is a personal concern (I've been bitten before by wrappers...including rustup!

And so have I. That being said, I’d much rather a wrapper than a loop in librustc_back :)

I’m glad that all the special cases you’ve listed is something that can then be implemented outside of the compiler itself. As long as we’re not putting retry code directly inside the compiler (for clang, ar or anything else), I’m fine with anything :)

nagisa · 2017-04-19T14:37:51Z

I’ll try to prototype a windows batch/powershell script eventually, but I’ve never written either of those so no idea how it will go.

nagisa · 2017-04-28T21:44:45Z

So, I’m way too busy with other stuff more pertinent to my T-compiler and IRL duties to push this any further for the time being.

I’ll close this.

rust-highfive assigned brson Apr 8, 2017

nagisa force-pushed the retryify branch from 29b9730 to 7e30ded Compare April 8, 2017 17:56

Mark-Simulacrum reviewed Apr 8, 2017

View reviewed changes

nagisa force-pushed the retryify branch from 7e30ded to bd97114 Compare April 8, 2017 18:12

arielb1 added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-tools labels Apr 15, 2017

arielb1 removed T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-tools labels Apr 27, 2017

nagisa added the T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. label Apr 27, 2017

nagisa closed this Apr 28, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Alternative approach to retries in CI #41164

[RFC] Alternative approach to retries in CI #41164

nagisa commented Apr 8, 2017

rust-highfive commented Apr 8, 2017

Mark-Simulacrum Apr 8, 2017

alexcrichton commented Apr 8, 2017

nagisa commented Apr 8, 2017

alexcrichton commented Apr 10, 2017

aidanhs commented Apr 12, 2017

nagisa commented Apr 14, 2017

alexcrichton commented Apr 19, 2017

aidanhs commented Apr 19, 2017

nagisa commented Apr 19, 2017

nagisa commented Apr 19, 2017

nagisa commented Apr 28, 2017

[RFC] Alternative approach to retries in CI #41164

[RFC] Alternative approach to retries in CI #41164

Conversation

nagisa commented Apr 8, 2017

rust-highfive commented Apr 8, 2017

Mark-Simulacrum Apr 8, 2017

Choose a reason for hiding this comment

alexcrichton commented Apr 8, 2017

nagisa commented Apr 8, 2017

alexcrichton commented Apr 10, 2017

aidanhs commented Apr 12, 2017

nagisa commented Apr 14, 2017

alexcrichton commented Apr 19, 2017

aidanhs commented Apr 19, 2017

nagisa commented Apr 19, 2017

nagisa commented Apr 19, 2017

nagisa commented Apr 28, 2017