staging workflow #26
@vcunat Before going into the specifics of how we can change things on the workflow side, which I agree are probably needed regardless, I do want to point out that there seems to be an issue with hydra that exacerbates this. For at least two 5+-day periods this month, there was a case where some build was blocked on "sending inputs" for many days and that for some reason resulted in the queue not being pulled from properly. It's possible with that issue fixed we won't have to worry about this quite yet. |
Yes, I know this. If one is unlucky and a "mass-rebuild step" gets stuck this way... |
So it seems like right now there are a bunch of different things we use staging for:
This is over-engineered, but just to share the thought: I think in principle we could achieve guaranteed* eventual merging of correct code with a series of queues, possibly prioritized. At each stage, jobs are processed in the order they are evaluated for that stage. Each jobset is only allowed one evaluation in a given queue at a time; if it needs a new commit to be tested, it goes to the back of the line for that queue. A jobset at stage N+1 must have as its base a commit that succeeded at stage N. If any commit at stage M causes a change to the output path of one of the builds of stage N < M, that change must go back to the earliest such N.

The last integration stage is special: merge commits that change the output paths of earlier steps but don't break them are allowed, but regular commits aren't. The expected workflow is that you'd start off some commit in master, iterate at each stage fixing the failures without changing what already worked, and only at the last stage merge in the latest and do fixups there.

This guaranteed eventual merging falls down if, every time we get to the "integration" stage, a new change has been merged that fundamentally breaks all earlier stages. But that seems very unlikely to happen by accident. Maybe we can simply allow only one change in the final step at a time, and allow arbitrary iteration within that step. |
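A rough sketch of how those queues could behave, just to make the idea concrete (the stage names, the queue structure, and the `earliest_affected_stage` check are all hypothetical, not an existing Hydra feature):

```python
from collections import deque

STAGES = ["release-small", "release", "integration"]   # hypothetical stage names
queues = [deque() for _ in STAGES]                      # one FIFO queue per stage

def earliest_affected_stage(change):
    """Earliest stage whose output paths `change` alters.
    A real implementation would diff the store paths of two evaluations."""
    return change.get("affects", len(STAGES) - 1)

def process(stage, change, build_ok):
    """Handle one finished evaluation of `change` in the queue for `stage`."""
    if not build_ok:
        queues[stage].append(change)       # needs a new commit: back of the line
        return
    n = earliest_affected_stage(change)
    if n < stage:
        queues[n].append(change)           # touched an earlier stage: redo from there
    elif stage + 1 < len(STAGES):
        queues[stage + 1].append(change)   # passed: advance to the next stage's queue
    else:
        print(f"{change['name']}: ready to merge")

process(2, {"name": "gcc-bump", "affects": 2}, build_ok=True)  # -> ready to merge
process(1, {"name": "libfoo", "affects": 0}, build_ok=True)    # -> back to stage 0
```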
Note that |
The regular jobsets certainly should have a very large advantage over staging, IMO, but on the whole we might agree on some system for setting the shares. It seems rather ad-hoc ATM. |
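For illustration only: Hydra's fair scheduler hands out build capacity roughly in proportion to each jobset's shares, so the "very large advantage" can be expressed numerically. The share values below are made up, not a proposal:

```python
# Hypothetical share values; only the ratios matter for scheduling.
shares = {
    "nixpkgs/trunk": 1000,      # regular jobset keeps a large advantage
    "nixos/release": 1000,
    "nixpkgs/staging": 100,     # staging gets a much smaller slice
}
total = sum(shares.values())
for jobset, s in shares.items():
    print(f"{jobset}: {s / total:.0%} of build capacity")
```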
@vcunat What do you think of this workflow: In the normal case, most things can just go to staging and big changes (like a gcc bump) have to have their own release-small jobset pass first. At any time, a maintainer can branch off |
@vcunat how about we get this RFC written? :) |
Well, I don't expect to really have time to "maintain" staging this year, with all other things I've "promised" to do, so I'm perhaps not a good person to act on this. I believe Shea's proposal or mine, or some hybrid, might work reasonably. Perhaps it's better to first start using something and see how it goes, and document the resulting workflow retrospectively? (Both of the changes really affect only Hydra's setup and the workflow of the "staging maintainer(s)", not regular merges to staging.) |
I also like the idea of two branches:
For 2) I would like to see a Hydra job that corresponds to a small set, say |
I like the idea of
This approach would make it much easier to create and stabilize a release. It could also evolve into a rolling-release model in the future. |
I'm not sure about having all PRs to |
@Ekleog Hm, good question how to have a fast track and a slow track without having fast track catch the slow track in eternal catching-up and re-testing… |
I think it's doable here. Master especially shouldn't directly get any "risky" changes; if breakage happens, it should be quickly fixed or reverted ("moved" to another branch). Mass-rebuild status is now auto-determined on PRs (reliably), and non-mass changes rarely break very many packages, so the workflow of fixing breakage and running NixOS tests can typically be iterated much faster than on staging. |
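A hedged sketch of the general technique behind auto-determining mass-rebuild status: evaluate the package set before and after the change and count how many output paths differ. The helper names and the `nix-env` invocation below are only illustrative; the PR automation mentioned above has its own tooling for this.

```python
import subprocess

def out_paths(nixpkgs_checkout):
    """Map attribute name -> output path for one nixpkgs checkout."""
    out = subprocess.run(
        ["nix-env", "-f", nixpkgs_checkout, "-qaP", "--out-path"],
        capture_output=True, text=True, check=True,
    ).stdout
    paths = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            paths[fields[0]] = fields[-1]   # attribute ... store path
    return paths

def rebuild_count(base_checkout, pr_checkout):
    """Number of packages whose output path changes between the two trees."""
    before, after = out_paths(base_checkout), out_paths(pr_checkout)
    return sum(1 for attr, path in after.items() if before.get(attr) != path)

# e.g. rebuild_count("./nixpkgs-master", "./nixpkgs-pr") above a few hundred
# changed paths is what reviewers would treat as a mass rebuild (threshold illustrative).
```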
Well, sometimes staging contains fixes for a package that has been updated on master while staging was building. |
I merged the second-order PR immediately to reduce the level of indirection, but:

**Hydra shares**

I think Hydra shares should be lower for |
rfcs/0026-staging-workflow.md
Outdated
# Alternatives
[alternatives]: #alternatives

What other designs have been considered? What is the impact of not doing this?
I think this would need to be removed / reworded before merging. Something like this?
## Keep the status quo
As explained above, this leads to issues with `staging` never managing
to stabilize due to the stream of mass-rebuilds.
## Merge all changes in `staging`
Also merging non-mass-rebuild changes in `staging` would mean a
slower delivery of non-mass-rebuilding changes to `master`. Given
there is now a way to know, before merging a PR, whether it is a mass
rebuild, the benefit of having non-mass-rebuilding changes go to
`staging` is limited.
The most important one is duplicate work in case a package doesn't
build on `staging` while it has already been fixed in `master`. This can
be handled by regularly merging `master` into `staging`, as `master`
rebuilds should be almost free compared to `staging` rebuilds.
Filled. I can't see any potential advantage of "Merge all changes in `staging`" compared to the current workflow or the one proposed in this RFC, but it seems to be essentially the "Single branch" alternative in the RFC draft now, if I understand it right.
rfcs/0026-staging-workflow.md
Outdated
# Future work
[future]: #future-work

-
Same here, maybe “No future work is expected from this RFC”? Or something like “Adjust Hydra intervals and build shares depending on the load that will be observed”?

Actually, maybe reevaluating the choice of “small fixes in `master`” vs. “small fixes in `staging-next` too” could make sense after some experimentation, depending on how frequently we manage to merge `staging` into `master` with this new scheme.
Now "future work" is filled, basically in sense "put this RFC into action".
rfcs/0026-staging-workflow.md
Outdated
start-date: (fill me in with today's date, YYYY-MM-DD)
author: (name of the main author)
co-authors: (find a buddy later to help out with the RFC)
related-issues: (will contain links to implementation PRs)
I think this should be filled in.
Done.
rfcs/0026-staging-workflow.md
Outdated
The check interval of `staging` is reduced from 24 hours to 12 hours. This can
be done because only stabilization fixes shall be submitted and thus fewer
This sentence seems to contradict the definition of the branches above. "only stabilization fixes" is true for `staging-next`, not `staging`.
I'm also slightly confused, after reading this, about which staging* branch is intended for pull requests vs. stabilisation.
Changed to `staging-next`.

Confusion: I believe that any fixes-only changes should go to `staging-next` (counting only problems present in the branch already). I would hope that the current text makes it clear:

> `staging-next` is branched from `staging` and only fixes to stabilize and security fixes shall be delivered to this branch. This branch shall be merged into `master` when deemed of sufficiently high quality.

but certainly feel free to suggest a clearer formulation.

The exact mechanism of how they get there doesn't seem too important to me. Ideally they get committed to `staging-next` directly, but they might also be in `staging` already (accidentally) and get cherry-picked to `staging-next`...
rfcs/0026-staging-workflow.md
Outdated
[drawbacks]: #drawbacks

A potential drawback of this new workflow is that the additional branch may be considered complicated and/or more difficult to work with.
However, most contributors would keep working as before: choose `master` or `staging` depending on the number of rebuilds.
Right, added.
rfcs/0026-staging-workflow.md
Outdated
- `master` is the main branch and all small deliveries shall go here;
- `staging` is branched from `master` and mass-rebuilds and other large deliveries go to this branch;
- `staging-next` is branched from `staging` and only fixes to stabilize and security fixes shall be delivered to this branch.
Clarify how changes end up in `master`. From the discussion, my understanding is that only `staging-next` (never `staging` directly) is eventually merged into `master`.

Do we want to set criteria for when `staging-next` is "good enough" for merging to `master`?
Clarified.
Reducing the size of the Hydra jobset would mean the iteration pace could be
higher, but has the downside of testing fewer packages, and having fewer binary
substitutes available.
Another option here is to reduce the channel-blocking Hydra jobset size, but have a 48-hourly cycle of trying to build the long tail. It still probably increases the amount of things broken in the channel, but might be a middle ground. (I am not saying I prefer that compromise to the current proposal.)
The channel-update criteria include waiting for all builds to finish, but this is doable by having two jobsets on the branch. Personally I think the channel-blocking set should grow and not shrink, especially for the release branches.
I agree, but I also think it is good to have options enumerated and explicitly dismissed with a reason given.
Added a sentence describing the approach.
I think you should remove the first line of the initial message now. |
# Drawbacks
[drawbacks]: #drawbacks
Looking at the proposal, I am not completely sure whether it will in fact reduce or increase the overall number of jobs for Hydra to build. It would increase useful information per unit of build time, but it might increase the load.
This risk should either be mentioned as a possible drawback, or there should be a discussion of why this risk is not likely.
I assume we will see how the load turns out in real life and adjust shares/intervals or even the whole workflow. People actively working on staging stabilization should also be able to cancel jobs and force evaluations.
In any case, I would like to push towards more power in Hydra over the long term, as I believe it will help with mass rebuilds and human time is more precious than machine time.
Good point about Hydra access (this is part of the proposed workflow, right?), and I agree in principle about expanding Hydra checks.
I just want things mentioned explicitly.
I think such explicit discussion could also be a way to better understand the trade-offs relative to complicated alternatives: `staging (release-small)` -> `staging-next (release-small)` -> `staging-next (release)` (manually triggered, unless Hydra can have jobsets with preconditions).
Well, I was always doing cancels and evals on Hydra when working with staging. The right to manage Hydra jobsets doesn't seem more security-sensitive than push access to nixpkgs. Overall, this text so far hasn't mentioned people – who will be doing this staging stabilization, etc.

Complicated alternatives: another thing to consider is that Hydra seems likely to be over-powered on x86 Linux (relative to other platforms), and even for other reasons it makes sense for most changes to stabilize them for `x86_64-linux` first/deeper before others.
Well, it doesn't mention people, but it describes what tools these people are expected to use (and how). You do discuss possible evaluation frequencies.
And yes, I would find it nice if you mentioned possibly using the build-power imbalance.
@7c6f434c: which part do you mean? |
The text is not empty anymore. |
I see, I was just looking at the RFC text :-) |
This seems ready for implementation.
Can't we rename `staging-next` to `testing`, `staging-stable` or whatever else after all? I'm sure everyone here, including me, got accustomed to it and stopped caring already, but I still think it is a very confusing naming scheme for new contributors.
|
I agree.
Why not simply call it |
Is there any agreement on how to proceed here? It's a bit unfortunate that we're stuck on a naming issue 😅 Let's roll a die and let the universe decide. |
This should just be merged. The proposal is in use and nitpicking about the name seems pointless to me. |
Merging as it's implemented |
Is the following diagram correct?

```dot
digraph {
  "small changes" [shape=none]
  "mass-rebuilds and other large changes" [shape=none]
  "critical security fixes" [shape=none]
  "broken staging-next fixes" [shape=none]
  "small changes" -> master
  "mass-rebuilds and other large changes" -> staging
  "critical security fixes" -> master
  "broken staging-next fixes" -> "staging-next"
  "staging-next" -> master [color="#E85EB0"] [label=stabilization] [fontcolor="#E85EB0"]
  "staging" -> "staging-next" [color="#E85EB0"] [label=stabilization] [fontcolor="#E85EB0"]
  master -> staging [color="#5F5EE8"] [label="any time"] [fontcolor="#5F5EE8"]
}
```
I think we can append this graph to the RFC. |
Has the RFC workflow been copied to nixpkgs yet? RFCs are just here to gain consensus; the live documentation should live somewhere else. |
@jtojnar I guess it's not really mentioned in the RFC but I think in general even critical security fixes should go to staging-next first. This obviously depends on the size of the rebuild and current state of staging-next, but in my experience having a large queue on master is detrimental for the stability there and often results in breakages/issues that hold back channel updates. |
It's actually in the RFC: (emphasis mine)
(though granted I can't find a place where it explicitly states that security mass-rebuilds don't go in |
I understood that as fixes to stabilize staging and fix vulnerabilities on staging. That interpretation is consistent with the doctrine that critical security fixes should go directly to the branch that is affected by the vulnerabilities. Not sure if this doctrine still applies with this RFC accepted. |
This pull request has been mentioned on Nix community. There might be relevant details there: https://discourse.nixos.org/t/release-process-staging-branches/2799/2 |
This pull request has been mentioned on Nix community. There might be relevant details there: https://discourse.nixos.org/t/release-process-staging-branches/2799/3 |
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/maintainers-needed-for-merging-staging-staging-next-branches/3833/3 |
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/nix-shell-on-a-python-package-gives-a-pip-error/4137/4 |
I bring you the latest diagram: now the reverse flow (master → staging-next → staging) happens automatically every six hours thanks to NixOS/nixpkgs#105153.

```dot
digraph {
  "small changes" [shape=none]
  "mass-rebuilds and other large changes" [shape=none]
  "critical security fixes" [shape=none]
  "broken staging-next fixes" [shape=none]
  "small changes" -> master
  "mass-rebuilds and other large changes" -> staging
  "critical security fixes" -> master
  "broken staging-next fixes" -> "staging-next"
  "staging-next" -> master [color="#E85EB0"] [label="stabilization ends"] [fontcolor="#E85EB0"]
  "staging" -> "staging-next" [color="#E85EB0"] [label="stabilization starts"] [fontcolor="#E85EB0"]
  master -> "staging-next" -> staging [color="#5F5EE8"] [label="every six hours/any time"] [fontcolor="#5F5EE8"]
}
```
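For readers curious what that automation amounts to: a minimal sketch of the merge sequence, assuming push access and a local clone. The real implementation is the GitHub Actions workflow added in NixOS/nixpkgs#105153; the Python/git sketch below is only illustrative.

```python
import subprocess

def git(*args):
    """Run a git command in the current checkout, failing loudly on errors."""
    subprocess.run(["git", *args], check=True)

def forward_merge(upstream, downstream):
    """Merge `upstream` into `downstream` and push, keeping downstream up to date."""
    git("checkout", downstream)
    git("merge", "--no-edit", f"origin/{upstream}")
    git("push", "origin", downstream)

# master -> staging-next -> staging, as in the diagram's blue edge
git("fetch", "origin")
forward_merge("master", "staging-next")
forward_merge("staging-next", "staging")
```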
^ This is great, should go in the Nixpkgs manual. |
See NixOS/nixpkgs#105986 😉 |
Commenting on @jtojnar NixOS/nixpkgs#106302

Yes, I think we need to revise/clarify what is written there. The RFC indeed says staging-next, however it lacks a motivation.

**Critical security issues**

I'd argue that critical issues should go to master; however, if staging-next is as good as ready, that should be merged right away as well, because that saves rebuilds. At that point it does not matter whether the fix goes to master or staging-next regarding the amount of rebuilds. It does matter regarding channel updates; during the rebuild there won't be any updates due to other smaller changes, but that is not very important. What is important is that

This I agree with. So this is a trade-off, and I think in case of critical security fixes the delay of a day or two due to the rebuilds is worth it. One should however also note that the security fix is typically also backported. A rebuild of the stable branch has a higher priority, and that causes the blocking of master for a period longer than a day or two.

**Non-critical security fixes**

These go through staging as usual, to reduce the amount of rebuilds necessary. |
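To summarize my reading of the above as decision logic (an illustration of the trade-off only, not official policy; the function and its inputs are made up):

```python
def target_branch(critical: bool, staging_next_ready: bool) -> str:
    """Where a security fix that causes a mass rebuild would land, per the reasoning above."""
    if not critical:
        return "staging"        # non-critical: batch with other mass rebuilds
    if staging_next_ready:
        return "staging-next"   # about to be merged anyway, so no extra rebuild
    return "master"             # critical: a day or two of blocked channels is worth it

print(target_branch(critical=True, staging_next_ready=False))   # -> master
```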
* staging workflow: init empty RFC
* 0026-staging-workflow: first draft
* 0026-staging-workflow: second draft (Close vcunat#2)
* 0026-staging-workflow alternatives: more discussion
Define a new workflow for the `staging` branch that can better accommodate the current and future influx of changes, in order to deliver mass-rebuilds faster to master. As part of this new workflow an additional branch, `staging-next`, shall be introduced.

More details in the RFC proposal itself now.