-
Notifications
You must be signed in to change notification settings - Fork 30.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
meta: PRs with infra failures can land #22439
Conversation
Amend the collaborator guide so that in case of severe infrastructure failures we could still land PRs.
cc @nodejs/tsc |
The current statement seems awfully vague. Does this mean every target platform/configuration crashed? Just one? Something else? If every one of them crashed, to me it doesn't make sense to land the PR since it wasn't successful at all... |
I think @nodejs/build and specially @Trott should review/sign off. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm against this. If the infrastructure is causing issues with landing PRs we should fix the infrastructure, not ignore it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last CI run should be green before landing PRs. If there's an infra issue it should be fixed before landing new PRs.
@richardlau @mmarchini I don't think @nodejs/build and @nodejs/tsc is willing to give all collaborators access to the machines, nor that all collaborators have the skills to do so. For example, as a @nodejs/tsc member I do not have access. So, I cannot fix this. Let's consider https://ci.nodejs.org/job/node-test-commit-linux/20945/nodes=centos7-64-gcc6/console and #22439. According to the current rules, I cannot land this. However, this is ok to land. The SLA that @nodejs/build (being a group of volunteers, like everybody else) can provide does not allow us to be this strict. @nodejs/build are our silent heroes, and this it's not their fault (they are awesome), but rather ours in asking such a high SLA.
@mscdex something between 1 and 5? Definitely not all. I'm happy to seek new formulations. |
in that case lets get more people on the build team strong -1 to encouraging landing on red in any case, even if the pr doesn't cause the red |
I'm also -1. I don't think velocity is a trump card. I would favor stability and quality. |
P.S. Will proactive notification of infra failures help? |
I think it would help to know whether CI is in a “usable” state or not? |
Yes!
Yes. |
shoutout to https://nodejs-ci-health.mmarchini.me/ |
@devsnek that’s great work (and we should link it somewhere), but it does not tell me if those 60% failures are because there are spurious tests failing, tests rightfully failing or because infra is having trouble. It does not answer the question “is CI usable now”. |
ATM we have a manually updated project board https://github.com/nodejs/build/projects/1 the first column lists thing the Build WG is aware of.
I'll prioritize that task. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sympathize with the frustration @mcollina is having. Still, I'm a -1 for all the same reasons others have expressed and for the reasons at https://medium.com/@Trott/on-landing-code-when-ci-fails-f3aa999cda3d. Any loophole to landing code without a full CI run isn't something I'm likely to sign off on. That said, here's an alternate solution for @mcollina and others:
One workaround that is similar to what is suggested here but has a slightly higher bar is to request that Build WG remove the broken host or even entire platform/OS from node-test-pull-request
. I don't want to document that as the recommended way to do things when you run into CI difficulties, but for the handful of highly-active folks landing multiple pull requests a day: When reasonably certain that a problem is unrelenting and unrelated, it's an option. I slightly prefer it to what is suggested in this PR because it improves things for everyone because CI will be green again. It helps build trust in CI results rather than perpetuating distrust in CI results. Downside is that it's on Build WG to not forget to fix the host/platform and get it back online, but they're pretty great about that as far as I know.
@mcollina (Shameless plug here)
The essence of the command is the pattern database in https://github.com/nodejs/node-core-utils/blob/master/lib/ci/ci_failure_parser.js which I try to improve based on errors I find every time I run the command. It guesses the root cause wrong sometimes but I think it's usually smarter than me when I look at the logs ;) And as far as aggregation goes it is good enough. We can improve it to associate issue links with failures and probably auto-update https://github.com/nodejs/build/projects/1 but it will take some work to make it smart enough to figure out what to do... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also -1, I understand the frustration of not being able to land a PR in the time slot you have available, but I feel it is more important to prioritize confidence in our what has landed through green builds as opposed to allowing things to land when things are broken.
Closing this for now, let's see how things go and reassess in the future. |
Amend the collaborator guide so that in case of severe infrastructure
failures we could still land PRs.
Checklist