Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite approval voting and approval distribution as a single subsystem #1617

Closed
Tracked by #26
sandreim opened this issue Sep 18, 2023 · 21 comments
Closed
Tracked by #26
Assignees
Labels
I4-refactor Code needs refactoring. I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task.

Comments

@sandreim
Copy link
Contributor

sandreim commented Sep 18, 2023

The current implementation splits the node approval work in two distinct subsystems, one responsible for implementing the networking functionality: spam protection and gossip while the second one implements the logic of the approval voting protocol: assignment, approval vote checking and parachain block approval book keeping and parablock finality consensus. Both of these subsystems are designed as a single thread that manipulate a global state object.

The message processing pipeline is very simple:

  • a message is received by approval-distribution and goes through an initial spam filter that ensures that each assignment/approval passes only once
  • duplicates are dropped, and peer reputation changes are done as needed.
  • the message is then forwarded to approval-voting that checks and imports the assignment certificate or approval vote signatures
  • approval-distribution blocks and waits for the result of such checks.
  • If the import is successful, the message is distributed only to peers that have not seen it yet. The peer set is defined according to the gossip topology and distribution aggression level.

The problem

The ToF of each queued approval-distribution message is equal to the sum of the processing time of all the already queued messages at sending time.
In our case, this means that the maximum throughput of approval-distribution is a function of the time it takes to process a single message.In practice, the tranche0 VRF modulo assignments are configured to provide on average the required number of votes (needed_approvals) for approving a candidate. As a result, on each relay chain block which includes candidates the network generates an initial burst of assignments that scales with n_cores and n_validators. Approvals might be less bursty as the time to recover and check candidates depends on more variables. But, this is fine, as long as the average message processing rate is higher than the rate of assignment/approvals being sent by the network to a node.

Profiling - #732 (comment) shows where we spend most CPU time, and areas of optimisation that we already explored.

Failed Versi load testing

All attempts, including with a new pr for batching assignment checks have failed at 300 prachain validators due to approval-distribution/voting not keeping up with amount of messages.

For 300 validators and 50 parachains to run nicely with 6 second block times we have the requirement of producing and checking at least need_approvals=30 per candidate, leading to a total minimum number of 1500+1500 unique messages(assignments + approvals) to approve all parachain blocks included on a relay chain block. On average this means, approval voting needs to be able to check and import 250 assignments + 250 approvals per second. Approval distribution on the other hand needs to deal with around 2.5x times more messages, due to the gossip topology duplication of messages (assuming normal-operation conditions).

We expect that v2 assignments and approval coalescing will reduce the average cost of processing approvals and assignments by a factor of 3 at least and allow us to go beyod the current limit of 200 validators with and 50 approval cores. In not so perfect network conditions, for example when operators upgrade their nodes, the network can experience at times slower availability recovery and no-shows which will trigger more tranches of approval checkers. Approval vote coalescing will provide little benefits at that point leading to unstable block times and high finality lag.

Failure modes

  • (1) approval distribution back pressure at the block producer (network bridge blocks sending to approval-distribution), leading to availability stalling (bitfields missing) -> slower parachain block production
  • (2) High time of flight of approval messages > 10s, slow propagation and finality lag -> slow finality -> slow block production / finality stall
    • approval distribution aggression is designed to account for network propagation issues and backfires in overload situations, slowing finality instead of recovering from it.

Nodes of an overloaded network ping pong between (1) and (2) due to a feedback loop which is triggered at very high ToF (10s). Approvals are processed slowly and nodes detect no shows and trigger additional tranches of assignments increasing the load of the system until (1) happens which halves block production rate on each missed slot, leading to less work for approval voting, which breaks the feedback loop.

Future proof solution

A long term solution needs to solve the problem by delivering on the following:

  • The total message processing power can easily scale well beyond one CPU core, accommodating a much larger volume of messages and signature checks
  • (1) Provides a fine grained control on back pressure
    • We define a maximum ratio of approved candidates to backed candidates that can be processed at any given time
    • The node (in candidate-validation for example), can choose to not back a candidate due to this backpressure, reducing the load of the network.
  • (2) The approval-distribution subsystem maximum ToF is < 6s at maximum capacity.
  • Invariants like, deduplication, assignment/approval processing order, always forwrd assignments first and general spam logic remain the same, the new subsystem must pass all previous approval voting and approval distribution tests

Approval voting and distribution in a single subsystem

The new subsystem is broken down into one component per one responsibility:

  • Network worker (handles incoming and outgoing messages - most of the approval distribution code)
  • Main subsystem loop (handles all incoming messages, forwards to network worker)
  • Worker pool (assigns candidate for workers to work on)
    • workers handlle all incoming message book keeping, check the singatures and propagate the message according to topology
  • Database worker (acts as a sink for db changes)

These components will be run in separate async tasks:

  • main subsystem loop (spawn_blocking)
  • database worker (spawn_blocking)
  • worker pool: N approval workers (N could be equal to number of CPUs)

Main subsystem loop

The main loops should be roughly an equivalent to the loops we have now in both subsystems: handling of ApprovalDistributionMessage and ApprovalVotingMessage. It needs to forward assignments (including own assignments) and approvals to the worker pool and handles active leaves updates, finality and pruning of db.

It also needs a read only view of the DB, to answer GetApprovalSignaturesForCandidate or ApprovedAncestor messages.

The worker pool

The pool exposes an interface for assigning specific candidates, assignments and approvals to be processed by the approval voting workers.

We pick the simplest way to assign work, further improvements can be made later (multiple producers and consumer, work stealing). The pool will maintain a candidate affinity mapping for each candidate and assign candidates to workers in a round robin fashion.

/// Tasks that can be performed by the approval worker pool
pub enum ApprovalTask {
	// Imports our own assignment only, distribution happens when it is triggered.
	OwnAssignment(IndirectAssignmentCert, CandidateIndex),
	// Checks/imports and circulates an assignment certificate
	Assignment(IndirectAssignmentCert, CandidateIndex),
	// Checks/imports and circulates an approval vote
	Approval(IndirectSignedApprovalVote),
};

The API could look like this:

  • fn new_task(task: ApprovalTask)
  • fn prune(finalised, &[CandidateHash])

Approval tasks on a given candidate are sticky, meaning that once a worker has processed the first assignment for a candidate, it will process all the other messages. The pool guarantees that a candidate is always assigned once, to one single worker (1:1 mapping). This ensures the processing of assignments and approvals is done sequentially in the context of a given candidate wrt the order of receiving from the network stack.

Approval workers

Each worker can process up to a bounded number of candidates at any time via receiving new assignments and approvals over a bounded channel from the main loop worker pool instance. The exact number of candidates that are being worked on depends on backpressure on the backing new candidates across the network.

/// The context in which an approval worker processes assignments. 
/// There are different flavours of  `CandidateEntry` and `BlockEntry`  in both approval voting and distribution. 
/// These should be refactored to contain all the information required for book keeping on the approval state 
/// and distribution of messages to peers.

CandidateApprovalContext {
 block_entry: BlockEntry,
 candidate_entry: CandidateEntry,
};

Each approval worker has the following responsibilities:

  • Setup CandidateContext from any new candidate task received from the main loop. The context contains a snapshot of the global persisted state for the given candidate: BlockEntry, CandidateEntry, assignments, approvals - per candidate state.
  • Monitor per candidate assignments wrt ticks and trigger own assignment
  • Monitor approval checking lag and enable approval distribution aggression when needed
  • The context is kept in memory until the candidate is pruned (on finality)
  • Context is updated by processing assignments/approvals
  • Once a candidate has been approved, we still keep accepting more assignments and approvals.
  • When a candidate is pruned or is approved, the Context is sent as a Vec of WriteOps to the database worker

For each new message, the worker will follow exactly the same processing pipeline as we do in the present.

Database worker

Represents the only place where we write changes to the parachains DB. Runs in a blocking thread. We only allow readers in all other tasks of the subsystem.
Basically the worker just receives a stream of BackendWriteOp from approval workers that update the approval state of a specific candidate.

@sandreim sandreim added I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. I4-refactor Code needs refactoring. T8-parachains_engineering labels Sep 18, 2023
@sandreim sandreim changed the title Rewrite approval voting and approval distritbution as a single subsystem Rewrite approval voting and approval distribution as a single subsystem Sep 18, 2023
@BradleyOlson64
Copy link
Contributor

Out of curiosity, if a candidate fails to become finalized when would it get pruned? Session boundaries?

@sandreim
Copy link
Contributor Author

sandreim commented Sep 20, 2023

If a candidate fails to be finalized, the inclusion relay chain block will never get finalized.
This means the relay chain finality will stall. In that situation we trigger distribution aggression and resend all assignments/votes for candidates included under the oldest unapproved relay block with the hope to work around any potential network issues that prevent finality.

Generally the workers must import assignments and approvals for candidates included in unfinalized relay chain blocks. This means we only prune after finality. Even after finality of a block, the current approval-distribution implementation keeps track of recently finalized blocks in RecentlyOutdated, such that it won't punish peers for valid messages sent after a receiving node observes finality. We must also do the same.

@rphmeier
Copy link
Contributor

This approach seems really reasonable.

The approval-distribution / approval-voting split was likely wrong from the beginning and performance has suffered as a result. Splitting work by candidate makes sense - though will it be properly balanced even after assignments are batched?

long-term it'd be interesting to autoscale the number of workers according to something like a PID controller. Over time, the amount of CPU spent on validating PVFs would hopefully be the largest source of load on the node.

@sandreim
Copy link
Contributor Author

sandreim commented Sep 25, 2023

The approval-distribution / approval-voting split was likely wrong from the beginning and performance has suffered as a result. Splitting work by candidate makes sense - though will it be properly balanced even after assignments are batched?

Yes, I believe batching and bundling are orthogonal. We need to change the criteria for assigning work to ensure workers on average are processing similar amount of assignments/approvals. When batching we could look at current load of individual workers and aim to schedule the biggest batch on the least occupied worker.

long-term it'd be interesting to autoscale the number of workers according to something like a PID controller. Over time, the amount of CPU spent on validating PVFs would hopefully be the largest source of load on the node.

Yes, I expect autoscaling will be easy to implement. I am not really sure we'd want any tight coupling with OS features, but instead we could be implementing an internal node load monitor based on things we can measure like network tput and PVF executions per block. Once we include pov_size and execution time in candidate receipts, it should provide some good internal metrics that subsystem's can use to apply backpressure or scale up if needed.

@sandreim
Copy link
Contributor Author

As we want to scale up to higher number of cores, for example 200, we're likely to see similar issues with other single threaded subsystems like statement-distribution or bitfield-distribution.

It makes a lot of sense that an initial implementation of this worker based approach could bring also some major refactoring and maybe some orchestra support for workers making it easier to implement where needed later.

@Polkadot-Forum
Copy link

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/parachain-consensus-updates-coretime-asynchronous-backing-scalability/4396/1

@Polkadot-Forum
Copy link

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/update-validator-set-size-increase-on-kusama/8218/1

@alexggh
Copy link
Contributor

alexggh commented Jun 5, 2024

Spent some time think about this and I have a proposal that should provide us the end goal where processing of assignments and approvals can be process in parallel and will scale up with the available hardware capacity.

Current state

Currently, every time we receive a new message the approval distribution and approval voting processing are basically serialised and consists of the following important processing:

  1. approval-distribution - sanity check that we should have received this message from peer(assignment is not duplicate, assignment is for known block, approval is received only if we received assignment, etc.).
  2. approval-voting - sanity checks again the message and loads state from the database
  3. approval-voting - verify VRF assignment and approval signature.
  4. approval-voting - bookkeeping and database update
  5. approval-distribution - bookkeeping and storing of message in in-memory structs(GetApprovalSignatures answers uses this).
  6. approval-distribution - distribution of new messages according to topology and random sampling.

From these operations the most expensive ones according to pyroscope are the crypto operations at step 3) and the operations at step 1), the good part is that both 1) and 3) can actually be done in parallel.

Proposal

We keep both subsystems and we augment approval-distribution with a parallel worker pool(ValidationPool) where work is distributed based on the validator that created the message(not the one gossiping it), the algorithm of distributing work would be simple worker_index = (msg.validator + RANDOM_SALT) % WORKER_COUNT. This guarantees that messages coming from the same validator lands on the same worker, so the worker having just a context built out of each relay chain can perform the following validation work, which is currently split between the two subsystems.

  • Sanity check the message is not duplicate and that approvals are received after the assignment.
  • Check VRF assignment is correct.
  • Check we received all assignments for the approval vote.
  • Check the approval vote is correctly signed

Once the message has been properly validated by a ValidationPoolWorker it is forwarded in parallel to both approval-voting and approval-distribution, so that they both can perform the remaining operations in parallel with the assumption that the message is now valid approval-distribution will do 5 & 6 and approval-voting will do 2 & 4.

The proposal in a picture

Screenshot 2024-06-05 at 17 56 44

FAQ

How about view updates ?

View updates are forwarded to the ValidationPoolWorkers, so they can use it to decide if a message is correctly received.

The main approval-distribution loop will simply through the messages that have been already validated as it does it now without any changes. All messages that haven't been validated yet and are at the moment we receive the view update that are in the worker pools queue will be forwarded to the peer when approval-distribution main loop does step 6)

What about reputation updates ?

Worker pools send the reputation update to approval-distribution periodically and the really bad offences instantaneously.

What about GetApprovalSignatures ?

The main approval-distribution loops returns all the votes that have been already validated, this is functionally equivalent with the current setup since GetApprovalSignatures always returns only the messages that have been fully processed and since the new pipeline will process messages faster things should actually be better in stress conditions.

PoC

To prove this can be done with good results I created a small Poc with the above and ran subsystem benchmarks with 1000 validators, 100 occupied cores all tranches triggered for 10 relay blocks with both assignments and approvals. This is the worse case scenario when sharding breaks and all validators have to validate 100 parachains per block.

  • ToF for approval distribution and approval voting does not go above 500ms, in the current implementation this is around 10s of seconds.
Screenshot 2024-06-05 at 18 09 36
  • Another metric that looks really well is the time from the beginning of the block until we gathered enough assignments for approving the block, this is important because it guarantees that even in conditions of stress we won't trigger new assignments because we are late on processing new ones.
Screenshot 2024-06-05 at 17 40 56
  • Chain approving does not fall behind
Poc: Chain selection approved  after 4000 ms hash=0x0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a

 Master: Chain selection approved  after 71500 ms hash=0x0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a

Advantages of the proposal.

  • It does not require a full re-write it just rips the parts that are consuming most of the cpus in both subsystem and puts them in a worker pool that can do stuff in parallel, to prove that I built a small PoC with all the thing described above.
  • It scales up processing with the available hardware.
  • No database changes needed or migrations.

@sandreim @AndreiEres @eskimor @ordian Let me know what you think.

@sandreim
Copy link
Contributor Author

sandreim commented Jun 6, 2024

I'm still thinking about it, but here are some inline comments/questions:

  1. approval-distribution - sanity check that we should have received this message from peer(assignment is not duplicate, assignment is for known block, approval is received only if we received assignment, etc.).
  2. approval-voting - sanity checks again the message and loads state from the database

It would be great if you could detail what you mean. Is it just a candidate existing in the db ?

  1. approval-voting - verify VRF assignment and approval signature.
  2. approval-voting - bookkeeping and database update
  3. approval-distribution - bookkeeping and storing of message in in-memory structs(GetApprovalSignatures answers uses this).
  4. approval-distribution - distribution of new messages according to topology and random sampling.

From these operations the most expensive ones according to pyroscope are the crypto operations at step 3) and the operations at step 1), the good part is that both 1) and 3) can actually be done in parallel.

I am not sure how you can do 3 as you would need to have to read CandidateEntry that you write in approval voting.

Proposal

We keep both subsystems and we augment approval-distribution with a parallel worker pool(ValidationPool) where work is distributed based on the validator that created the message(not the one gossiping it), the algorithm of distributing work would be simple worker_index = (msg.validator + RANDOM_SALT) % WORKER_COUNT. This guarantees that messages coming from the same validator lands on the same worker, so the worker having just a context built out of each relay chain can perform the following validation work, which is currently split between the two subsystems.

Wouldn't this approach work better if we just had one single subsystem and this pool part of it. I mean, it's less messages to pass around and easier to debug, and enables very easy batching to further optimize signature checks.

  • Sanity check the message is not duplicate and that approvals are received after the assignment.
  • Check VRF assignment is correct.
  • Check we received all assignments for the approval vote.

Doesn't this introduce a race condition vs the approval-voting subsystem that writes DB ?

Once the message has been properly validated by a ValidationPoolWorker it is forwarded in parallel to both approval-voting and approval-distribution, so that they both can perform the remaining operations in parallel with the assumption that the message is now valid approval-distribution will do 5 & 6 and approval-voting will do 2 & 4.

I guess we could avoid doing 2 if we merge the 2 subsystems.

Advantages of the proposal.

  • It does not require a full re-write it just rips the parts that are consuming most of the cpus in both subsystem and puts them in a worker pool that can do stuff in parallel, to prove that I built a small PoC with all the thing described above.
  • It scales up processing with the available hardware.

This is indeed an interesting proposal and the above advantages sound reasonable

  • No database changes needed or migrations.

Why do you think no database changes are required ? IMO figuring out how to use the DB such that you can isolate the I/O per candidate in the worker is key to be able to fully parallelise the entirety of the the pipeline from 1 -> 6 .

@alexggh
Copy link
Contributor

alexggh commented Jun 6, 2024

approval-voting - verify VRF assignment and approval signature.

I am not sure how you can do 3 as you would need to have to read CandidateEntry that you write in approval voting.

The crypto checks need just data that is available at the beginning of the block and you could provide that easily to the workers, you don't need the full CandidateEntry just some data from it like the candidate_hash, core allocation and vrf story.

approval-distribution - sanity check that we should have received this message from peer(assignment is not duplicate, assignment is for known block, approval is received only if we received assignment, etc.).
approval-voting - sanity checks again the message and loads state from the database
It would be great if you could detail what you mean. Is it just a candidate existing in the db ?

It would be great if you could detail what you mean. Is it just a candidate existing in the db ?

Yes, exactly.

Wouldn't this approach work better if we just had one single subsystem and this pool part of it. I mean, it's less messages to pass around and easier to debug, and enables very easy batching to further optimize signature checks.

Yes that would work as well, I just think it is not necessary to extract the peformance, as far as I can tell our subsystems queues are pretty fast at moving things around and we would still need two workers for S2 and S3 streams(see bellow)

Sanity check the message is not duplicate and that approvals are received after the assignment.
Check VRF assignment is correct.
Check we received all assignments for the approval vote.

Doesn't this introduce a race condition vs the approval-voting subsystem that writes DB ?

It is not worse than the current setup, no approval vote gets passed to the approval-voting if approval-distribution does not have a record that it already received a valid assignment for it and since both vote and assignment lands on the same worker, they have everything in the worker context to validate it.

Once the message has been properly validated by a ValidationPoolWorker it is forwarded in parallel to both approval-voting and approval-distribution, so that they both can perform the remaining operations in parallel with the assumption that the message is now valid approval-distribution will do 5 & 6 and approval-voting will do 2 & 4.

I guess we could avoid doing 2 if we merge the 2 subsystems.

The operations done by approval-voting(database recording) and by approval-distribution(determine the peers that should receive the message) can be performed in parallel, so with 2 subsystem you already get the necessary parallelism if we merge, then we would have to have two separate workers, which is actually the subsystems gives us for free.

Why do you think no database changes are required ? IMO figuring out how to use the DB such that you can isolate the I/O per candidate in the worker is key to be able to fully parallelise the entirety of the the pipeline from 1 -> 6 .

With these proposal we don't need to isolate the IO because we get the necessary parallelisation and pipelining by having 3 working streams that can advance in parallel:

S1: ValidationPool workers validating in parallel that messages satisfy all crypto requirements.
S2: Approval-voting in a single main loop just takes the messages one by one and records them into the database and the rest of the work that it does currently to concur if it needs to trigger a new assignment or if a block is approved.
S3 Approval-distribution determining what peers need to receive the message and sending it to them.

Also, with this split it is actually S3 that consumes slightly more cpu than S2, because it turns out iterating for every message through 1000 validators adds up to a lot of usage.

@alexggh
Copy link
Contributor

alexggh commented Jun 7, 2024

Another picture showing how the processing would progress above.

Screenshot 2024-06-07 at 11 26 56

Thought a bit about it and I see no reason why we can do it even better, by allowing each worker to also run the logic for deciding which peers should receive a given message and queue it to the network bridge tx something like this.

Screenshot 2024-06-07 at 11 28 08

@eskimor
Copy link
Member

eskimor commented Jun 7, 2024

Impressive @alexggh !

So basically distribution is happening in the workers then (last picture)? (Also book keeping on who has which message and such?)

Do we understand where the speedup is coming from? We do no longer wait for e.g. crypto checks to report back before starting processing the next message anymore - correct? I guess this is the major reason for the speedup? Do we see improvements with only one worker?

@sandreim
Copy link
Contributor Author

sandreim commented Jun 7, 2024

We had an off-line discussion about this new proposal and the conclusion was to go ahead with full PoC and more testing followed up by production implementation.

Key takeaways:

  • there are no major differences in what performance gains we expect from the new proposal vs old one
  • there are no regressions in terms of functionality or spam protection
  • there will be additional memory required per workers as they track the relay blocks and candidate hashes themselves , but it is very little information
  • move everything into a single subsystem
  • extend the worker pipeline to also include distribution to peers
  • no DB changes
  • main loop of the subsystem would likely do only dispatching of messages to workers while the approval voting db updates and wakeup processing will be handled in a separate task

@alexggh
Copy link
Contributor

alexggh commented Jun 7, 2024

So basically distribution is happening in the workers then (last picture)? (Also book keeping on who has which message and such?)

Yes, that's what it should happen, I don't have PoC for the last picture yet, but it is coming.

Do we understand where the speedup is coming from? We do no longer wait for e.g. crypto checks to report back before starting processing the next message anymore - correct? I guess this is the major reason for the speedup?

The speedup comes from two directions:

  • we get pipelining where approval-distribution can move on do to work without waiting for approval-voting to report back.
  • we get parallelisation because with more than one worker we can utilise the available CPUs to do the heavy-lifting of checking in parallel that the messages are properly signed and valid.

Do we see improvements with only one worker?

Yes, we see improvements with only one worker, although it is not enough to not fall slightly behind when all assignments and all approvals gets triggered(130_000 unique messages per block), 2 workers seems to be fine on my machine.

@burdges
Copy link

burdges commented Jul 7, 2024

approval-distribution - sanity check that we should have received this message from peer(assignment is not duplicate, assignment is for known block, approval is received only if we received assignment, etc.).

In this, approval-distribution workers need more read access to the approval-voting database now, right?

there will be additional memory required per workers as they track the relay blocks and candidate hashes themselves , but it is very little information

As those tasks are just reading, you could prepare what all workers need once after the relay chain updates, wrap it inside an Arc<StuffForWorker>, and send this to the workers, who then drop their old one. That's probably a minor thing, dependupon upon how muc hthey have to sift through.

@alexggh
Copy link
Contributor

alexggh commented Jul 8, 2024

approval-distribution - sanity check that we should have received this message from peer(assignment is not duplicate, assignment is for known block, approval is received only if we received assignment, etc.).

In this, approval-distribution workers need more read access to the approval-voting database now, right?

No, approval-distribution does not have access to the approval-voting database, in both the current implementation and the parallelised approval-distribution just passes sanitized message to the approval voting via channels. The difference with this optimisations is that approval-distribution won't have to wait after the approval-voting anymore because it already confirmed the messages are valid.

This should give you a pretty good idea of all that is needed to make it work #484

@burdges
Copy link

burdges commented Jul 8, 2024

You maybe misstyped that issue number, but..

approval is received only if we received assignment

We'd need the assignments information for that to hold, no?

We do not gossip on approval votes messages for non-existant assigments, right?

Any idea what our current behavior assumes about assignments being recieved before approval votes? It's hopefully fine if they're dropped, but maybe problematic, so what we're doing now gives some evidence.

@alexggh
Copy link
Contributor

alexggh commented Jul 9, 2024

You maybe misstyped that issue number, but..

Sorry about that, this is the main issue: #4849

approval is received only if we received assignment

We'd need the assignments information for that to hold, no?

Yes, and the approval-distribution already has that and enforces it.

We do not gossip on approval votes messages for non-existant assigments, right?
Any idea what our current behavior assumes about assignments being recieved before approval votes? It's hopefully fine if they're dropped, but maybe problematic, so what we're doing now gives some evidence.

The current behaviour assumes that approvals follow the same path as assignments, because nodes send the approvals to all nodes that they send the assignments to, so if an approval is received before receiving the assignment for it it is considered miss-behaviour and we slightly reduce the reputation of the node https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/network/approval-distribution/src/lib.rs#L1540C58-L1540C81.

@burdges
Copy link

burdges commented Jul 9, 2024

nodes send the approvals to all nodes that they send the assignments to

This seems unlikely given our four random sends in the gossip protocol. It's possible we'd break this in future too. Anyways, I'm not sure this matters since the approval-vote should come well after the assignment.

if an approval is received before receiving the assignment for it it is considered miss-behaviour and we slightly reduce the reputation of the node

Alright. Also, approval-voting does not count the approval vote of course, and the vote message gets droped presumably.

approval is received only if we received assignment
We'd need the assignments information for that to hold, no?
Yes, and the approval-distribution already has that and enforces it.

Alright fine, so you track this in two places. That's maybe for the best: We likely need approval-distribution to keep running for a little while after approval-voting finishes because we do not know if all peers have marked the candidate approved yet. Ergo the extra information on approval-voting knows would be a distraction in approval-distribution.

@alexggh
Copy link
Contributor

alexggh commented Jul 9, 2024

nodes send the approvals to all nodes that they send the assignments to

This seems unlikely given our four random sends in the gossip protocol. It's possible we'd break this in future too. Anyways, I'm not sure this matters since the approval-vote should come well after the assignment

We do keep track of the four random we send the assignment to, and when we receive the vote we send it to them as well.

Alright. Also, approval-voting does not count the approval vote of course, and the vote message gets droped presumably.

Yes.

Alright fine, so you track this in two places. That's maybe for the best: We likely need approval-distribution to keep running for a little while after approval-voting finishes because we do not know if all peers have marked the candidate approved yet. Ergo the extra information on approval-voting knows would be a distraction in approval-distribution.

Yes, that's how it was done from the beginning the optimisation proposed here and reflected in the series #4849, just make better use of that, so we can better use the available hw capacity.

github-merge-queue bot pushed a commit that referenced this issue Sep 26, 2024
This is the implementation of the approach described here:
#1617 (comment)
&
#1617 (comment)
&
#1617 (comment).

## Description of changes

The end goal is to have an architecture where we have single
subsystem(`approval-voting-parallel`) and multiple worker types that
would full-fill the work that currently is fulfilled by the
`approval-distribution` and `approval-voting` subsystems. The main loop
of the new subsystem would do just the distribution of work to the
workers.

The new subsystem will have:
- N approval-distribution workers: This would do the work that is
currently being done by the approval-distribution subsystem and in
addition to that will also perform the crypto-checks that an assignment
is valid and that a vote is correctly signed. Work is assigned via the
following formula: `worker_index = msg.validator % WORKER_COUNT`, this
guarantees that all assignments and approvals from the same validator
reach the same worker.
- 1 approval-voting worker: This would receive an already valid message
and do everything the approval-voting currently does, except the
crypto-checking that has been moved already to the approval-distribution
worker.

On the hot path of processing messages **no** synchronisation and
waiting is needed between approval-distribution and approval-voting
workers.

<img width="1431" alt="Screenshot 2024-06-07 at 11 28 08"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/a196199b-b705-4140-87d4-c6900ba8595e">



## Guidelines for reading

The full implementation is broken in 5 PRs and all of them are
self-contained and improve things incrementally even without the
parallelisation being implemented/enabled, the reason this approach was
taken instead of a big-bang PR, is to make things easier to review and
reduced the risk of breaking this critical subsystems.

After reading the full description of this PR, the changes should be
read in the following order:
1. #4848, some other
micro-optimizations for networks with a high number of validators. This
change gives us a speed up by itself without any other changes.
2. #4845 , this contains
only interface changes to decouple the subsystem from the `Context` and
be able to run multiple instances of the subsystem on different threads.
**No functional changes**
3. #4928, moving of the
crypto checks from approval-voting in approval-distribution, so that the
approval-distribution has no reason to wait after approval-voting
anymore. This change gives us a speed up by itself without any other
changes.
4. #4846, interface
changes to make approval-voting runnable on a separate thread. **No
functional changes**
5. This PR, where we instantiate an `approval-voting-parallel` subsystem
that runs on different workers the logic currently in
`approval-distribution` and `approval-voting`.
6. The next step after this changes get merged and deploy would be to
bring all the files from approval-distribution, approval-voting,
approval-voting-parallel into a single rust crate, to make it easier to
maintain and understand the structure.

## Results
Running subsystem-benchmarks with 1000 validators 100 fully ocuppied
cores and triggering all assignments and approvals for all tranches

#### Approval does not lags behind. 
 Master
```
Chain selection approved  after 72500 ms hash=0x0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a
```
With this PoC
```
Chain selection approved  after 3500 ms hash=0x0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a
```

#### Gathering enough assignments
 
Enough assignments are gathered in less than 500ms, so that gives un a
guarantee that un-necessary work does not get triggered, on master on
the same benchmark because the subsystems fall behind on work, that
number goes above 32 seconds on master.
 
<img width="2240" alt="Screenshot 2024-06-20 at 15 48 22"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/d2f2b29c-5ff6-44b4-a245-5b37ab8e58bc">


#### Cpu usage:
Master
```
CPU usage, seconds                     total   per block
approval-distribution                96.9436      9.6944
approval-voting                     117.4676     11.7468
test-environment                     44.0092      4.4009
```
With this PoC
```
CPU usage, seconds                     total   per block
approval-distribution                 0.0014      0.0001 --- unused
approval-voting                       0.0437      0.0044.  --- unused
approval-voting-parallel              5.9560      0.5956
approval-voting-parallel-0           22.9073      2.2907
approval-voting-parallel-1           23.0417      2.3042
approval-voting-parallel-2           22.0445      2.2045
approval-voting-parallel-3           22.7234      2.2723
approval-voting-parallel-4           21.9788      2.1979
approval-voting-parallel-5           23.0601      2.3060
approval-voting-parallel-6           22.4805      2.2481
approval-voting-parallel-7           21.8330      2.1833
approval-voting-parallel-db          37.1954      3.7195.  --- the approval-voting thread.
```

# Enablement strategy

Because just some trivial plumbing is needed in approval-distribution
and approval-voting to be able to run things in parallel and because
this subsystems plays a critical part in the system this PR proposes
that we keep both ways of running the approval work, as separated
subsystems and just a single subsystem(`approval-voting-parallel`) which
has multiple workers for the distribution work and one worker for the
approval-voting work and switch between them with a comandline flag.

The benefits for this is twofold.
1. With the same polkadot binary we can easily switch just a few
validators to use the parallel approach and gradually make this the
default way of running, if now issues arise.
2. In the worst case scenario were it becomes the default way of running
things, but we discover there are critical issues with it we have the
path to quickly disable it by asking validators to adjust their command
line flags.


# Next steps
- [x] Make sure through various testing we are not missing anything 
- [x] Polish the implementations to make them production ready
- [x] Add Unittest Tests for approval-voting-parallel.
- [x] Define and implement the strategy for rolling this change, so that
the blast radius is minimal(single validator) in case there are problems
with the implementation.
- [x]  Versi long running tests.
- [x] Add relevant metrics.

@ordian @eskimor @sandreim @AndreiEres, let me know what you think.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh
Copy link
Contributor

alexggh commented Sep 27, 2024

Implementation done with: #4849

@alexggh alexggh closed this as completed Sep 27, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Completed in parachains team board Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I4-refactor Code needs refactoring. I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task.
Projects
Status: Completed
Development

No branches or pull requests

8 participants