CI friendly fast mode for try state checks #13382
frame/executive/src/lib.rs
@@ -353,7 +353,7 @@ where
 let _guard = frame_support::StorageNoopGuard::default();
 <AllPalletsWithSystem as frame_support::traits::TryState<System::BlockNumber>>::try_state(
 frame_system::Pallet::<System>::block_number(),
-frame_try_runtime::TryStateSelect::All,
+frame_try_runtime::TryStateSelect::Fast,
It might complicate things too much, but how about:
pub enum UpgradeCheckSelect {
    /// Run no checks.
    None,
    /// Run the `try_state`, `pre_upgrade` and `post_upgrade` checks.
    All,
    /// Run the `pre_upgrade` and `post_upgrade` checks.
    PreAndPost,
    /// Run the `try_state` checks.
    TryState(TryStateSelect),
}
This will give us the ultimate power to customize this, and hopefully the default behavior remains the same.
@ggwpez wdyt?
So basically @Ank4n had asked:
Should we add another option in UpgradeCheckSelect to choose try-state-fast checks?
I'd say yes.
Selecting `TryState` won't run pre and post checks though.
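To make the point above concrete, here is a minimal sketch of how the proposed `UpgradeCheckSelect` could be interpreted. The enum variants come from the suggestion in this thread; the `Fast` variant, the helper methods `pre_and_post` and `try_state`, and the simplified stand-in for `TryStateSelect` are assumptions for illustration, not the actual FRAME API.

```rust
/// Simplified stand-in for the real `TryStateSelect` (assumption: only
/// the variants needed for this illustration are modelled).
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum TryStateSelect {
    None,
    All,
    /// The fast mode proposed in this PR.
    Fast,
}

/// The enum as proposed in the review comment.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum UpgradeCheckSelect {
    None,
    All,
    PreAndPost,
    TryState(TryStateSelect),
}

impl UpgradeCheckSelect {
    /// Hypothetical helper: should `pre_upgrade`/`post_upgrade` run?
    pub fn pre_and_post(&self) -> bool {
        matches!(self, Self::All | Self::PreAndPost)
    }

    /// Hypothetical helper: which try-state selection should run, if any?
    pub fn try_state(&self) -> Option<TryStateSelect> {
        match self {
            Self::All => Some(TryStateSelect::All),
            Self::TryState(select) => Some(select.clone()),
            Self::None | Self::PreAndPost => None,
        }
    }
}
```

Under these assumptions, `UpgradeCheckSelect::TryState(TryStateSelect::Fast)` runs the fast try-state checks but skips the pre- and post-upgrade checks, which is the gap pointed out above.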
@@ -34,6 +34,10 @@ pub enum Select {
 ///
 /// Pallet names are obtained from [`super::PalletInfoAccess`].
 Only(Vec<Vec<u8>>),
+/// Run only fast running tests.
This will run all pallets, but in fast mode, right? Otherwise I am not sure how it would work in that regard.
Yes, it is the same as `All`, but the pallets can choose to ignore some tests in the fast mode.
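As a rough sketch of what "ignoring some tests in fast mode" could look like inside a pallet: the `bool` flag mirrors the `do_try_state(block_number, false)` call that appears later in this diff, while the two check functions and the returned list of executed checks are hypothetical and exist only for illustration.

```rust
/// Runs the pallet's state checks and returns the names of the checks
/// that were executed. `fast == true` skips the expensive ones.
pub fn do_try_state(fast: bool) -> Result<Vec<&'static str>, &'static str> {
    let mut ran = Vec::new();

    // Cheap invariants always run, in both modes.
    check_cheap_invariants()?;
    ran.push("cheap_invariants");

    // Expensive checks only run when the caller did not ask for fast mode.
    if !fast {
        check_expensive_exposures()?;
        ran.push("expensive_exposures");
    }

    Ok(ran)
}

/// Hypothetical cheap check (placeholder body).
fn check_cheap_invariants() -> Result<(), &'static str> {
    Ok(())
}

/// Hypothetical slow check (placeholder body).
fn check_expensive_exposures() -> Result<(), &'static str> {
    Ok(())
}
```

Every pallet still runs in fast mode; only the checks a pallet explicitly marks as slow are skipped.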
Needs a bit more work, looking forward to it + finally enabling these tests in CI.
This is also relevant for #13013; decoding the entire state is probably not something that we want to do in CI. Not sure.
All in all, my priority is to make these tests and the CI work. @Ank4n fwiw I would happily approve a temp solution that simply reduces the `try_state` of all pallets, including staking, to something sensible (perhaps feature gated by `env!("CI_EXEC")` as a quick hack), and then we can incrementally introduce them back, or think about how to do it.
In other words, in the case of this PR, getting something out there fast is more important than perfection.
Please check with @ggwpez about enabling CI checks, as a lot of the recent CI checks have been his courtesy.
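For illustration only, the "quick hack" could be sketched as below. Note that `env!` fails the build when the variable is unset, so this sketch uses `option_env!` instead; the `CI_EXEC` name comes from the comment above, while the `reduced_checks` helper and its rule for what counts as "set" are assumptions.

```rust
/// Compile-time value of `CI_EXEC`; `None` when the variable was not set
/// while building. `option_env!` is used so local builds still compile.
pub const CI_EXEC: Option<&'static str> = option_env!("CI_EXEC");

/// Hypothetical helper: treat any non-empty value other than "0" as
/// "we are building for CI, so run the reduced set of checks".
pub fn reduced_checks(ci_exec: Option<&str>) -> bool {
    matches!(ci_exec, Some(v) if !v.is_empty() && v != "0")
}
```

A pallet could then guard its slow checks with `if !reduced_checks(CI_EXEC) { ... }` until a proper mechanism lands.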
Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.
Updates here?
Repeat 😁
 /// - `pre-and-post`: Perform pre- and post-upgrade checks.
 /// - `try-state`: Perform the try-state checks.
+/// - `fast-try-state`: Perform fast running state checks.
Maybe a better approach would be to add another option to the `OnRuntimeUpgrade` command, such as `mode: [fast | normal]`.
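A minimal sketch of that alternative, keeping "which checks" and "which mode" as two independent options. The `pre-and-post` and `try-state` values appear in this diff; the `none` and `all` values, the enum names and the error strings are assumptions, and this is plain `FromStr` parsing rather than the actual CLI definition.

```rust
use std::str::FromStr;

/// Which checks to run (assumed value names: none, all, pre-and-post, try-state).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Checks {
    None,
    All,
    PreAndPost,
    TryState,
}

/// How thorough the selected try-state checks should be.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Fast,
    Normal,
}

impl FromStr for Checks {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "none" => Ok(Self::None),
            "all" => Ok(Self::All),
            "pre-and-post" => Ok(Self::PreAndPost),
            "try-state" => Ok(Self::TryState),
            other => Err(format!("unknown checks value: {}", other)),
        }
    }
}

impl FromStr for Mode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "fast" => Ok(Self::Fast),
            "normal" => Ok(Self::Normal),
            other => Err(format!("unknown mode value: {}", other)),
        }
    }
}
```

With this split, no `fast-*` variants are needed in the checks option: "fast try-state" becomes the pair (`try-state`, `fast`).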
This is a cool idea, but I'd like to raise the possibility that it may be better to completely turn off try-state checks in the CI than to have them run partially.
- Run partially, the green CI check loses its meaning. It no longer would signify that try-state hooks are passing, rather just that they may be passing. To be sure they're actually passing, they would need to be run again somewhere else anyway, so there seems to be little value in running them in the CI.
- At most, they could alert the dev to some failing hooks. But, the dev would still need to run the full set of try-state hooks somewhere else anyway.
- The green check may provide a false sense of security to devs who're unaware that the full set of hooks are not run in the CI.
- The feature comes with a cost: it adds extra configuration to the cli, and introduces cognitive overhead for developers working on the try-state hooks needing to decide whether a check should be 'fast mode' or not.
I'm very open to being shown why the cost is justified here, just want to make sure this has been considered
@liamaharon Thanks for raising those points. It's still all open to discussion, but I will write down how I was thinking about this.
Currently the try-state checks are disabled on CI (it only runs pre- and post-upgrade checks). So in a way what you are suggesting is what is happening currently.
Another way to think about it: this would allow every pallet to write very thorough try-state tests. Some of these tests may take a really long time (probably hours). It would be great to run all of them on CI, but we also don't want CI to be stuck for hours. The pallet developer still believes other tests would ensure 99.9% of scenarios are covered, and this slow-running test is something they run occasionally outside CI to ensure all storage items are consistent with expectations.
To give a more concrete example (which is actually the reason why we wanted to introduce the fast mode), we can look at this try-state check in the staking pallet. It iterates over all nominators (~ 44k), and then for each nominator it fetches all active validators (~ 300), gets their exposure, iterates over all of its stakers (could be up to 512) and checks that the nominator (from the first loop) is only present once in a validator's exposure. This is cubic time complexity and currently takes more than 2 hours to run.
Also, most of the time, a failing try-state check may not be introduced by the current PR but by a sneaky bug that we only found out about later. I think we might even want to keep these tests to
To summarise, I think what this PR is trying to do is find a middle ground where we can run try-state checks in the CI that cover most of the inconsistent-state scenarios, but still enable pallet developers to write some more thorough but expensive tests that they can run outside CI if they want to. It's also important to emphasise that try-state checks should only be our 3rd or 4th line of defence.
I agree we should look to improve this, and make it more intuitive and better documented. By default every test should be marked as a fast test unless we notice it is taking really long to run in the CI. I do think, though, that running 99% of the try-state checks on CI and ignoring one is a better alternative than not running anything on CI. If there are complex changes to a pallet's logic, a developer should always run the full suite (maybe we can have a bot command that runs the full suite on demand).
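To put rough numbers on the staking example above, the sketch below multiplies out the approximate figures quoted in the comment and shows the shape of the triple loop. The function names, the `u32` account ids and the "at most once" reading of the check are assumptions; this is not the staking pallet's actual code.

```rust
// Approximate figures quoted in the comment above.
pub const NOMINATORS: u64 = 44_000;
pub const ACTIVE_VALIDATORS: u64 = 300;
pub const MAX_STAKERS_PER_EXPOSURE: u64 = 512;

/// Upper bound on the number of staker comparisons for the triple loop.
pub fn worst_case_comparisons(nominators: u64, validators: u64, stakers: u64) -> u64 {
    nominators * validators * stakers
}

/// Shape of the check: for every nominator, scan every validator's
/// exposure and make sure the nominator is listed at most once.
/// `exposures[v]` holds the staker ids backing validator `v`.
pub fn check_exposures(nominators: &[u32], exposures: &[Vec<u32>]) -> Result<(), String> {
    for nominator in nominators {
        for (validator, stakers) in exposures.iter().enumerate() {
            let count = stakers.iter().filter(|s| *s == nominator).count();
            if count > 1 {
                return Err(format!(
                    "nominator {} appears {} times in exposure of validator {}",
                    nominator, count, validator
                ));
            }
        }
    }
    Ok(())
}
```

With the quoted figures the bound is 44,000 × 300 × 512 = 6,758,400,000 comparisons, before counting any storage reads, which is consistent with a check that is far too slow for a regular CI job.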
Neat!
@@ -306,9 +308,12 @@ pub trait Hooks<BlockNumber> {
 /// It should focus on certain checks to ensure that the state is sensible. This is never
 /// executed in a consensus code-path, therefore it can consume as much weight as it needs.
 ///
+/// Takes the block number and `TryStateSelect`as a parameter. The `TryStateSelect` is used to
-/// Takes the block number and `TryStateSelect`as a parameter. The `TryStateSelect` is used to
+/// Takes the block number and `TryStateSelect` as a parameter. The `TryStateSelect` is used to
@@ -550,7 +550,7 @@ impl ExtBuilder {
 let mut ext = self.build();
 ext.execute_with(test);
 ext.execute_with(|| {
-Staking::do_try_state(System::block_number()).unwrap();
+Staking::do_try_state(System::block_number(), false).unwrap();
Maybe it would be clearer for the reader to pass one of [`TryStateSelect::Fast`, `TryStateSelect::All`] here instead of a bool.
I strongly think usage of `TryStateSelect` is not correct here.
-fn try_state(_n: BlockNumberFor<T>) -> Result<(), TryRuntimeError> {
+fn try_state(
+_n: BlockNumberFor<T>,
+_s: frame_support::traits::TryStateSelect,
This API is unfortunately not correct.
You are right to pass something into the hook that helps it understand whether it is fast or not, but `TryStateSelect` is not the right type here.
`TryStateSelect` is meant to identify which pallets to execute, not how much time they should each consume. It is only interpreted by the `Executive` and should not be exposed to the end user here at all.
What we instead want is a `TryStateSpeed`, which you can for now assume to be a `bool` or `enum TryStateSpeed { Slow, Mid, Fast }`.
Then, you are capable of selecting which pallets to run, and at what speed.
As it stands now, I see this flaw as well:
A pallet could possibly see `RoundRobin(7)` as its try-state select. How should it interpret this? Answer: it cannot, because it is not the right audience for it.
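The separation described above could be sketched roughly as follows. `TryStateSpeed` and its variants are taken from the comment; the `TryStatePallet` trait, `DemoPallet`, `run_try_state` and the simplified `RoundRobin` handling (the first `n` pallets rather than a rotation) are assumptions made for this illustration and do not reflect the real `Executive`.

```rust
/// Which pallets to run. Interpreted only by the executive.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum TryStateSelect {
    None,
    All,
    RoundRobin(u32),
    Only(Vec<Vec<u8>>),
}

/// How thorough each selected pallet should be. Passed to the pallet hook.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TryStateSpeed {
    Slow,
    Mid,
    Fast,
}

/// Hypothetical pallet-facing interface: pallets only ever see the speed.
pub trait TryStatePallet {
    fn name(&self) -> &'static str;
    fn try_state(&self, speed: TryStateSpeed) -> Result<(), String>;
}

/// Hypothetical pallet used to exercise the sketch.
pub struct DemoPallet {
    pub name: &'static str,
}

impl TryStatePallet for DemoPallet {
    fn name(&self) -> &'static str {
        self.name
    }

    fn try_state(&self, speed: TryStateSpeed) -> Result<(), String> {
        // A real pallet would skip its expensive checks when `speed` is `Fast`.
        let _ = speed;
        Ok(())
    }
}

/// Executive side: resolve the selection, then hand only the speed to
/// each selected pallet. Returns the names of the pallets that ran.
pub fn run_try_state(
    pallets: &[&dyn TryStatePallet],
    select: &TryStateSelect,
    speed: TryStateSpeed,
) -> Result<Vec<&'static str>, String> {
    let selected: Vec<&dyn TryStatePallet> = match select {
        TryStateSelect::None => Vec::new(),
        TryStateSelect::All => pallets.to_vec(),
        // Simplification for this sketch: take the first `n` pallets.
        TryStateSelect::RoundRobin(n) => pallets.iter().copied().take(*n as usize).collect(),
        TryStateSelect::Only(names) => pallets
            .iter()
            .copied()
            .filter(|p| names.iter().any(|n| n.as_slice() == p.name().as_bytes()))
            .collect(),
    };

    let mut ran = Vec::new();
    for pallet in selected {
        pallet.try_state(speed)?;
        ran.push(pallet.name());
    }
    Ok(ran)
}
```

In this shape a pallet never has to interpret something like `RoundRobin(7)`: selection stays with the caller, and the hook only receives the speed.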
bot rebase
Rebased
Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.
Yes, will be reworking on this.
Reviving #13286
Should resolve paritytech/polkadot-sdk#234
Adds a new fast mode to `TryStateSelect` which skips slow running state checks.
Adds two new options `[fast-all | fast-try-state]` to `on-runtime-upgrade --checks`.
Example usage: