Determine the sweet spot for num execute_workers_max_num & prepare_workers_max_num #4126
Comments
Thanks @alexggh and @s0me0ne-unkn0wn for raising the issue. Let's double the workers and go for Kusama first. Other things to consider:
|
Unfortunately, the number of workers is hardcoded in the node executable.
However, I do think increasing the number of execution workers from 2 to 4 should be very low risk. Here we can see that https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware, our recommended HW spec, is 4 hardware cores and 32 GiB of RAM, so we definitely should have space for |
We can do this for Kusama only; see `run_inner_node`. It is just a bit of plumbing work.
That is a concern for me: if we want to allow 4 executions in parallel, we need more resources for all the other stuff the node is doing, such as building/importing relay chain blocks, networking, parachain consensus, etc. |
Not sure how that correlates: increasing this to 4 PVF executions would speed up the time we take to approve and back candidates, which are things we want to do as fast as we can. Agreed that those 2 extra threads would increase the total consumption of resources in the system. On the CPU side, 2 threads should not be the tipping point, since we already have plenty of threads spawned. On the memory side, the PVF execution seems to be limited to Is that what you are referring to? |
However, with the blunder we had yesterday on Polkadot, I agree we should tread carefully here, so I will invest some time in the extra plumbing to enable it just on Kusama first. |
I agree that 4 pvfs executed in parallel would speed things up, but it would eat the CPU resources of the approval subsystems for example, so we need to think in terms of total resource consumption of the node and manage the load such that we don't get additional PVFs to execute when the system is loaded.
I am more concerned about the situation when we have longer PVFs execution times. Ideally at least 75% of the node CPU should be spent on executing PVFs, but as we've seen this is not the case. What I propose to do instead of determining the sweet spot is a dynamic way of allocating CPU resources to PVF compilation and execution. We reserve a pool of 4 workers (4 CPUs) dedicated for PVF. We then implement the priorities for dispatching work to this pool. The goal should be to prioritise finality above liveness of parachains.
We can cap ongoing PVF work to 1-2 at a time, but since PVF compilation takes a lot of time compared to execution we can choose to kill an ongoing PVF compilation if the CPU resources are required for disputes and there is no free worker. This should be a rare event. This however doesn't solve what we observed on Kusama/Polkadot, but it should reduce the amount of new work created when finality is lagging. |
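To make the priority-dispatch proposal above more concrete, here is a minimal sketch of a fixed worker pool fed by a priority queue. It is illustrative only and not the node's actual implementation: the `Priority` ordering (disputes, then approvals, then backing), the `PvfQueue` class and its method names are all assumptions chosen to reflect "finality above liveness".

```python
import heapq
import itertools
from enum import IntEnum


class Priority(IntEnum):
    # Lower value is dispatched first. This ordering is an assumption
    # reflecting "finality above liveness"; it is not taken from the node.
    DISPUTE = 0
    APPROVAL = 1
    BACKING = 2


class PvfQueue:
    """Hypothetical fixed-size worker pool fed by a priority queue."""

    def __init__(self, max_workers=4):
        self.max_workers = max_workers
        self.busy = 0
        self._heap = []
        # Monotonic counter keeps FIFO order within one priority level.
        self._seq = itertools.count()

    def submit(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._seq), job))

    def dispatch(self):
        """Return the next job if a worker is free, otherwise None."""
        if self.busy >= self.max_workers or not self._heap:
            return None
        _, _, job = heapq.heappop(self._heap)
        self.busy += 1
        return job

    def finish(self):
        """Mark one worker as free again."""
        self.busy -= 1
```

With a single worker, a dispute job submitted after a backing job is still dispatched first, and the backing job only runs once nothing more urgent is queued.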
IMO it was not really a blunder. The system worked as expected in the end, but we had different expectations on the duration and magnitude of the event. |
Compilation/preparation and execution pipelines are separate and use different workers. The preparation pipeline is prioritized, and the execution uses a best-effort approach (nearly FIFO in most situations, but that changes if executor parameters change on the session boundary or if a candidate from a previous session with different execution parameters has to be validated). It makes a lot of sense to work on execution queue prioritization, considering the prolonged execution times. But I'm not sure we'd benefit from killing preparation workers (more precisely, the preparation worker, as we only have one) to give resources to the execution workers. The preparation worker only occupies a single CPU core. To me, it makes more sense to bump hardware requirements than to develop non-trivial algorithms that try to manage node resources in software. |
Even if there are separate pipelines with different workers, they still use the same physical CPUs, so from that perspective I think it is a good idea to not start preparation/compilation if we have a large backlog of PVF executions due to high node/system load.
In the context of what happened yesterday it doesn't really make sense to kill it. Maybe we should still allow or even prioritise it if we need it as part of participating in a dispute or approving a candidate deep in the unfinalized chain.
Yes, bumping hardware requirements is needed, but we need to have some numbers that justify increasing. With more validators we should be doing less work in terms of PVF executions but likely more work in approval signature checking for example. |
This is not triggered by that.
That's what we actually have right now: 1 worker for compilation and 2 workers for execution, in disjoint worker pools.
Putting backing at the end would actually affect liveness of the parachains, because when you have several PVF executions taking around 2 seconds, you can easily build a backlog where candidates don't get backed in time. Now, the way I think we should approach this problem.
So, what do you think about me investing the time in safely rolling out the increase of PVF execution workers to 4 on Kusama and then on Polkadot, while in parallel we keep a backlog ticket to build a dynamic scheduler for this, which I think would take longer to implement and properly validate? Or do you think we should invest from the beginning in building a dynamic scheduler? |
Yes, we'd want to backpressure backing if there is a lot of load in approval voting. If we don't do it, it will soon be even more work for approval voting, which eventually leads to slow finality and slower block production if we have authoring backoff. If we don't have backoff, it could lead to OOM. I would frame it as a producer/consumer problem: we shouldn't produce more work if consumption doesn't keep up.
👍🏼 and at the same time we have to consider raising HW specs based on the data. We could run some gluttons and see what the impact of 100% blockspace utilization is with 2 workers vs 4.
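The producer/consumer framing can be sketched roughly as below. This is only an illustration of the idea: the `ExecutionBackpressure` class, its method names and the `max_pending_approvals` threshold are hypothetical, and a real node would derive the signal from its own queues and metrics.

```python
from collections import deque


class ExecutionBackpressure:
    """Hypothetical gate: stop producing backing work while approvals lag."""

    def __init__(self, max_pending_approvals):
        # Threshold is an assumed tuning knob, not a real node parameter.
        self.max_pending_approvals = max_pending_approvals
        self.pending_approvals = deque()

    def enqueue_approval(self, candidate):
        self.pending_approvals.append(candidate)

    def complete_approval(self):
        """Consume the oldest pending approval and return it."""
        return self.pending_approvals.popleft()

    def may_back_new_candidate(self):
        # Producer side: only create new work if the consumer keeps up.
        return len(self.pending_approvals) < self.max_pending_approvals
```

Backing is refused while the approval backlog is at or above the threshold and resumes as soon as the consumer catches up.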
This can happen later if we bump specs anyway and should be driven by higher block space utilization trends. |
…4172) Related to #4126 discussion Currently all preparations have same priority and this is not ideal in all cases. This change should improve the finality time in the context of on-demand parachains and when `ExecutorParams` are updated on-chain and a rebuild of all artifacts is required. The desired effect is to speed up approval and dispute PVF executions which require preparation and delay backing executions which require preparation. --------- Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Part of #4126: we want to safely increase execute_workers_max_num gradually from chain to chain and assess whether there are any negative impacts. This PR performs the necessary plumbing to be able to increase it based on the chain ID. It increases the number of execution workers from 2 to 4 on the test network but leaves Kusama and Polkadot unchanged until we gather more data. Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Plumbing PR #4252 makes it possible to increase the execution workers based on the chain ID. It will take a few releases until an increase reaches Polkadot, but I don't think we have any reason to rush this, so it should be fine to move slowly. |
Add a metric to understand how long jobs wait in the execution queue for an available worker. #4126 Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
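As a rough illustration of what selecting the worker count by chain ID could look like (a sketch only, not the code from the PR): the chain ID strings, the `EXECUTE_WORKERS_OVERRIDES` table and the function below are hypothetical, while the values 2 and 4 come from the PR description.

```python
# Default taken from the discussion: 2 execution workers everywhere.
DEFAULT_EXECUTE_WORKERS_MAX_NUM = 2

# Hypothetical override table; the chain ID string is an assumed label.
EXECUTE_WORKERS_OVERRIDES = {
    "test-network": 4,
}


def execute_workers_max_num(chain_id):
    """Return the execution worker cap for a chain, falling back to the default."""
    return EXECUTE_WORKERS_OVERRIDES.get(chain_id, DEFAULT_EXECUTE_WORKERS_MAX_NUM)
```

Rolling the increase out to another chain would then be a one-line change to the override table, which is what allows the chain-by-chain approach.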
Did some simulations to estimate the CPU HW needs for a network with 500 validators and 100 cores in normal conditions; the configuration is here and a run of it here.
Reference hardware: https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware
Estimated CPU usage with benchmarks
Adding all that up, it would consume ~4.7s of a single CPU core. These are the subsystems that we know consume the most CPU time, but we are still missing a lot of other subsystems. A safety margin would be to double that, so let's assume everything besides PVF execution consumes 9s of CPU time.
What average parachain execution time could we support with 2 execution threads?
At minimum, validators need to verify at least 7 candidates per block (6 random VRF assignments and 1 backing candidate). With 2 execution threads we have a maximum of 12s of CPU time per 6s block, so the maximum theoretical parachain execution time cannot exceed 1.7 seconds per candidate. The current average on Kusama is around 200ms, so it would take around a ~10x increase in the execution time of all parachains to reach the maximum throughput.
My conclusions
With these numbers, we can say there isn't much spare CPU time, so I would concur that 2 PVF execution threads is actually the safe choice for reference HW with 4 CPU cores, because that hard-caps PVF execution at a maximum of 50% of the available CPU time.
Going to 4 PVF execution threads would increase the available PVF execution time, but it has the downside that, at the theoretical limit, it could steal valuable CPU time from other mission-critical subsystems. With just 2 execution workers, if a lot of work gets queued for PVF execution, we will be slow on backing and approvals. But since no-show approvals are accepted late and backing needs to happen within a window of time, we actually end up in a situation where we don't back new candidates and give the network time to catch up with the approval work.
2 PVF execution threads do not properly take advantage of validators having way more than 4 HW cores, but building dynamic scheduling based on the HW the node is running on would introduce a source of non-determinism in the network, since there is no guarantee other nodes have the same HW underneath them. |
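The arithmetic behind the 1.7s bound can be reproduced with a few lines. This is a back-of-the-envelope sketch using only the numbers quoted above (6s relay chain block time, 7 candidates per block, 200ms current average, 4 reference cores); the function names are made up for illustration.

```python
def max_avg_execution_time(workers, block_time_s=6.0, candidates_per_block=7):
    """Upper bound on average per-candidate execution time, in seconds."""
    cpu_budget_s = workers * block_time_s  # e.g. 2 workers * 6s = 12s
    return cpu_budget_s / candidates_per_block


def pvf_cpu_share(workers, hw_cores=4):
    """Fraction of the reference hardware that PVF execution can occupy."""
    return workers / hw_cores


limit_2 = max_avg_execution_time(2)  # 12s / 7 candidates
headroom = limit_2 / 0.2             # relative to the ~200ms current average
print(f"2 workers: {limit_2:.2f}s per candidate, {headroom:.1f}x headroom")
print(f"PVF share of a 4-core machine: {pvf_cpu_share(2):.0%}")
```

Under these assumptions the exact ratio to the current average comes out a little below the rounded ~10x quoted above, which does not change the conclusion.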
Thanks @alexggh for coming up with this neat analysis. In order to get the full picture we also need to consider |
However, the new litep2p stack should alleviate the issue mentioned above. |
This issue has been mentioned on Polkadot Forum. There might be relevant details there: |
With the upcoming async backing changes, where we are increasing the
parachain authoring time to 2 seconds instead of the max 500ms, 2
execution workers might prove not to be enough. When multiple parachains
produce blocks that take 2 seconds to verify, we will very easily create a
backlog of candidates we need to verify.
In the best case scenario a validator has to verify at least 7
candidates: the 6 tranche0 assignments and the candidate it helps with
backing. If all of them take 2 seconds in the worst case, you
end up needing 14s of execution time each block; splitting that
between two workers, you would need 7s of execution every 6s. That gets us
into a situation where the PVF execution workers become the bottleneck of
the system.
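The worst-case estimate above can be checked with a small calculation. This is a sketch using the figures from the description (7 candidates, 2s each, 6s block time); the helper names are hypothetical.

```python
BLOCK_TIME_S = 6.0  # relay chain block time assumed in the description


def per_worker_load_s(candidates, exec_time_s, workers):
    """Execution time each worker must absorb per block, in seconds."""
    return candidates * exec_time_s / workers


def is_bottleneck(candidates, exec_time_s, workers, block_time_s=BLOCK_TIME_S):
    """True if the workers cannot finish one block's work within a block."""
    return per_worker_load_s(candidates, exec_time_s, workers) > block_time_s


for workers in (2, 4):
    load = per_worker_load_s(7, 2.0, workers)
    print(workers, "workers:", load, "s per block, bottleneck:",
          is_bottleneck(7, 2.0, workers))
```

With 2 workers the per-worker load exceeds the block time, so a backlog grows every block; with 4 workers the same worst case fits within a block, ignoring contention with the rest of the node.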
Old PR where we changed this in the past: paritytech/polkadot#4273
Remaining work