Add new execution code-path for unified scheduler #31239
Conversation
Codecov Report
@@            Coverage Diff            @@
##            master   #31239    +/-  ##
=========================================
  Coverage     81.5%    81.5%
=========================================
  Files          729      732      +3
  Lines       206580   207349    +769
=========================================
+ Hits        168445   169141    +696
- Misses       38135    38208     +73
naming must firstly be settled on...
};

// Send + Sync is needed to be a field of BankForks
#[cfg_attr(any(test, feature = "test-in-workspace"), automock)]
here
let mocked_scheduler = setup_mocked_scheduler_with_extra(
    [WaitReason::TerminatedFromBankDrop].into_iter(),
    Some(|mocked: &mut MockInstalledScheduler| {
        mocked
            .expect_schedule_execution()
            .times(1)
            .returning(|_, _| ());
    }),
);
hope there's no anti-mock guys around.. lol
// Calling this is illegal as soon as schedule_termination is called on &self.
fn schedule_execution(&self, sanitized_tx: &SanitizedTransaction, index: usize);

// This optionally signals scheduling termination request to the scheduler.
// This is subtle but important, to break circular dependency of Arc<Bank> => Scheduler =>
// SchedulingContext => Arc<Bank> in the middle of the tear-down process, otherwise it would
// prevent Bank::drop()'s last resort scheduling termination attempt indefinitely
fn schedule_termination(&mut self);

#[must_use]
fn wait_for_termination(&mut self, reason: &WaitReason) -> Option<ResultWithTimings>;

// suppress false clippy complaints arising from mockall-derive:
// warning: the following explicit lifetimes could be elided: 'a
intentionally, these aren't doc-comments... xD not intending proper wordz
"process_batches()/schedule_batches_for_execution({} batches)", | ||
batches.len() | ||
); | ||
schedule_batches_for_execution(bank, batches) |
execution diversion starts from here (meaning all locking verification is done equally for both if branches).
i tried this style of writing for the first time. but i kinda like putting all relevant code by topic (scheduling in this case) into a single file, opening a bunch of layered structs, thanks to rust's open classes, at the cost of pub(crate) impurification.
    fn mode(&self) -> SchedulingMode;
}

// This file will be populated with actual implementation later.
stay tuned!
tagging as this is a monorepo-wide big change: @carllin (i trust your eyes for cutting-through review), @bw-solana (this should be a complement to your concurrent replay), @ilya-bobyr (saw your recent work at blockstore_processor; tldr: tx could be executed differently). happy to explain context/impl details if interested (i'll try very hard to put answers into code/comments for future posterity, lol).
Just some very initial comments from first scan through.
At a high-level, I'm still trying to understand these cyclic dependencies but will need another pass at least.
Should probably loop more than just me in on the review requests as well 😄
core/src/replay_stage.rs
Outdated
    ancestor_hashes_replay_update_sender,
    purge_repair_slot_counter,
);
// If the bank was corrupted, abort now to prevent further normal processing
What does a bank being corrupted mean? How does that happen?
corrupt here refers to any BlockstoreProcessorError error; this comment is copied from the existing one. anyway, i don't think corrupt here adds any value, so I removed it: 16a1e9c
local-cluster/src/local_cluster.rs
Outdated
@@ -126,23 +126,33 @@ pub struct LocalCluster {
}

impl LocalCluster {
    pub fn new_with_equal_stakes(
    pub fn config_with_equal_stakes(
Move to ClusterConfig impl?
done: eed7db0
also created a pr for this: #32330
runtime/src/bank.rs
Outdated
@@ -8282,6 +8291,8 @@ impl TotalAccountsStats {

impl Drop for Bank {
    fn drop(&mut self) {
        self.drop_scheduler();

nit: prefer not to have the empty line
done: 3ba582b
//!
//! Albeit being at this abstract interface level, it's generally assumed that each
//! `InstalledScheduler` is backed by multiple threads for performant transaction execution and
//! there're multiple independent schedulers inside a single instance of `InstalledSchedulerPool`.
Is this always the case, or is it only necessary if --replay-slots-concurrently is enabled?
in general, the current situation of the concurrency level is complicated.

direct to your question: firstly, it is always the case. as soon as a new fork is created by BankForks::insert(), a new PooledScheduler is spawned, which means a new independent group of threads from other ongoing forks (sharing no synchronization primitives with the others). This is true regardless of --replay-slots-concurrently or not. And multiple slots can be replayed concurrently as .schedule_execution() is assumed to be async/non-blocking, i.e. a single-threaded replay stage can still buffer txes into multiple unified schedulers.

So, when --replay-slots-concurrently is enabled, the unified scheduler can take advantage of it in that it can fully concurrently replay ledger entries, including the buffering part (note: in principle!). on the other hand, blockstore_processor uses a shared rayon thread group. While the thread group is populated with N (core count) threads, it's still possible for any blocking call there to starve the other fork. also, I've observed non-optimal rayon behavior when .install() is called from multiple threads against a shared thread group.

the caveat for the full concurrency of the unified scheduler is that it only holds while there's no is_complete() bank at a given moment among all the ongoing forks. that's because process_replay_results isn't concurrent regarding multiple forks even if --replay-slots-concurrently is enabled (still). so, as soon as all buffered schedule_execution()-ed transactions are executed, other !is_complete() forks can be starved sub-optimally.

Anyway, these are all edge-case runtime behaviors... I'm planning to revisit this in the distant future... or @bw-solana might be planning to further enhance the replay stage's concurrency. xD
(while writing the above, I noticed I'd better pull a small assert from the dirty topic branch: 9769288, pending ci for its correctness... lol)
oh, this concurrency bla bla is also affected by this pending feature activation: #31239 (comment)

#[must_use]
fn wait_for_scheduler(&self, reason: WaitReason) -> Option<ResultWithTimings> {
    debug!(
mix of inlined formatting and regular formatting
you meant like this?:
$ git diff HEAD~
diff --git a/runtime/src/installed_scheduler_pool.rs b/runtime/src/installed_scheduler_pool.rs
index 83408d1c861..c56c651f238 100644
--- a/runtime/src/installed_scheduler_pool.rs
+++ b/runtime/src/installed_scheduler_pool.rs
@@ -242,8 +242,7 @@ impl Bank {
#[must_use]
fn wait_for_scheduler(&self, reason: WaitReason) -> Option<ResultWithTimings> {
debug!(
- "wait_for_scheduler(slot: {}, reason: {reason:?}): started...",
- self.slot()
+ "wait_for_scheduler(slot: {self.slot()}, reason: {reason:?}): started...",
);
let mut scheduler_guard = self.scheduler.write().expect("not poisoned");
actually, this is a syntax error:
error: invalid format string: expected `'}'`, found `'.'`
--> runtime/src/installed_scheduler_pool.rs:245:44
|
245 | "wait_for_scheduler(slot: {self.slot()}, reason: {reason:?}): started...",
| - ^ expected `}` in format string
| |
| because of this opening brace
|
= note: if you intended to print `{`, you can escape it using `{{`
error: could not compile `solana-runtime` due to previous error
ha I was unclear, but actually meant the opposite direction
debug!(
"wait_for_scheduler(slot: {}, reason: {:?}): started...",
self.slot(),
reason,
);
It's fine to leave it, I always just find it harder to read if some params are inlined and others are not, and either choose to go all inlined or all not.
I was also under the impression that it is recommended to use either all inlined, or all non-inlined arguments, but not a mix.
But I can not find any recommendations on this point right now.
One reason might be that there is a clippy rule that insists that all arguments should be inlined, when they all can be inlined.
But as soon as there is at least one non-inlinable argument, it stops complaining.
This rule has been downgraded to "pedantic" due to rust-analyzer not being robust enough just yet.
done: 1dc0c2c
@@ -0,0 +1,22 @@
[package]
name = "solana-scheduler-pool"
description = "The solana scheduler pool"
description = "The solana scheduler pool" | |
description = "The Solana scheduler pool" |
@@ -0,0 +1,10 @@
[package]
name = "solana-scheduler"
Not sure if we want to be so generic with the name if we've got other scheduler impls in the codebase
hehe, how about solana-unified-scheduler-logic / solana-unified-scheduler-pool then?
I like those better 😄
scheduler-pool/src/lib.rs
Outdated
// this will be replaced with more proper implementation...
// not usable at all, especially for mainnet-beta
#[derive(Debug)]
struct Scheduler {
Confused by Scheduler being here, rather than in the scheduler crate, even if this is temporary.
good call. well, maybe this struct should be named something like PooledScheduler. as you pointed out, this is an unintended name conflict... and this isn't temporary. the struct is something which will be gluing/managing scheduling logic and threads together (in the future). solana-scheduler is meant to hold pure logic not depending on the mundane world of solana-ledger (and even solana-runtime!)... so, i wonder if i should also rename the crate to solana-unified-scheduler-logic or something else to avoid the ambiguity coming from calling everything a scheduler... xD
If you've got the separation in a branch, could you reference that to me? or just explain a bit more what you mean in terms of separating "pure logic" from threads. I suppose I'm lacking imagination, and am not yet sold on the idea that they need/should be separated.
If we do go down the route that they should indeed be separate, then I do like the name solana-unified-scheduler-logic.
If you've got the separation in a branch, could you reference that to me?

sure!: https://github.com/ryoqun/solana/tree/unified-scheduler-for-block-verification-wip-pre-strip
that branch is full of dirty code, though. however, it's runnable against mainnet-beta. (fully-performant code)

just explain a bit more what you mean in terms of separating "pure logic" from threads.

sure, again! some more reasoning:
I think crate-level separation provides good discipline over time for designing things, assuming creating crates is cheap in terms of human time. as you know, solana-core/solana-runtime/solana-ledger is too tangled up right now (ref: https://discord.com/channels/428295358100013066/439194979856809985/1097774577817505842). so, I'm afraid to add another lump of code there, and to the brand-new solana-unified-scheduler-pool as well.
discipline here means a well-cut and well-defined interface between, and an explicit and discrete set of dependencies for, each respective crate.
in a way, this is a kind of crate-safety, as in type-safety.
also, that strict discipline makes it easy to bench the scheduling logic on its own. i'll put a bunch of benches in solana-unified-scheduler-logic, let alone unit-testing the scheduler logic.
lastly, i think scheduler logic should be agnostic to the actual execution details up to the information communicated across the interface, but not more. in this way, logic eval and reasoning and reviewing should be easy. that will eventually lead to a performant impl (imo).
as for the interface, we just need a bunch of crossbeam_channel::Channels to define such an api between crates (without any extra runtime overhead compared to having the code in a single crate)
callee:
caller:
another datapoint is that solana-unified-scheduler-logic's Cargo.toml is so small, like this:
and the other from solana-unified-scheduler-pool:
if these explanations do make sense, i'll transcribe them as source comments in this pr or a coming pr.

If we do go down the route that they should indeed be separate, then I do like the name solana-unified-scheduler-logic.

gotcha
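For illustration, here's a rough, self-contained sketch of the kind of channel-based boundary described above (the ToScheduler/FromScheduler names are made up for this example and are not the pr's actual interface): the logic side only sees plain messages over crossbeam channels, so it needs no solana-ledger/solana-runtime dependencies at all.

use crossbeam_channel::{unbounded, Receiver, Sender};

// hypothetical messages crossing the crate boundary; the real ones would carry
// transactions/indexes one way and execution results the other way.
enum ToScheduler {
    NewTask(usize),
    End,
}

struct FromScheduler {
    task_id: usize,
}

// the "logic" side: no solana types at all; it just turns incoming tasks into
// decisions and reports them back over the other channel.
fn run_scheduler_logic(rx: Receiver<ToScheduler>, tx: Sender<FromScheduler>) {
    while let Ok(msg) = rx.recv() {
        match msg {
            ToScheduler::NewTask(task_id) => tx.send(FromScheduler { task_id }).unwrap(),
            ToScheduler::End => break,
        }
    }
}

fn main() {
    let (to_sched_tx, to_sched_rx) = unbounded();
    let (from_sched_tx, from_sched_rx) = unbounded();
    let logic = std::thread::spawn(move || run_scheduler_logic(to_sched_rx, from_sched_tx));
    to_sched_tx.send(ToScheduler::NewTask(0)).unwrap();
    to_sched_tx.send(ToScheduler::End).unwrap();
    logic.join().unwrap();
    assert_eq!(from_sched_rx.recv().unwrap().task_id, 0);
}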
(this separate-or-not discussion is currently blocking all the renames and crate template creation according to #31239 (comment)).
I think after examining the dependency graph the crate separation makes sense. Appreciate the detailed comments for justification.
Tried to map out some of my thoughts this morning about dependencies, which I always find benefit from some visualization:
graph TD
Bank["Arc<Bank>"]
subgraph solana-runtime
BankForks;
subgraph cyclic-ref
Bank;
SchedulingContext;
InstalledScheduler{{InstalledScheduler}};
end
InstalledSchedulerPool{{InstalledSchedulerPool}};
end
subgraph solana-ledger
ExecuteBatch(["execute_batch()"]);
end
subgraph solana-scheduler-pool
SchedulerPool;
SchedulerPoolScheduler[Scheduler];
end
subgraph solana-scheduler
WithSchedulingMode{{WithSchedulingMode}};
end
SchedulingContext -- refs --> Bank;
BankForks -- owns --> Bank;
BankForks -- owns --> InstalledSchedulerPool;
Bank -- refs --> InstalledScheduler;
SchedulerPool -. impls .-> InstalledSchedulerPool;
SchedulerPoolScheduler -. impls .-> InstalledScheduler;
InstalledScheduler -- refs --> SchedulingContext;
SchedulingContext -. impls .-> WithSchedulingMode;
SchedulerPoolScheduler -- refs --> SchedulerPool;
SchedulerPool -- owns --> SchedulerPoolScheduler;
SchedulerPoolScheduler -. calls .-> ExecuteBatch;
solana-scheduler-pool -- deps --> solana-scheduler;
solana-scheduler-pool -- deps --> solana-ledger;
solana-scheduler-pool -- deps --> solana-runtime;
solana-ledger -- deps --> solana-runtime;
solana-runtime -- deps --> solana-scheduler;
As mentioned in my review comment, I'm trying to wrap my head around the cyclical-ness of the traits & structs. From this diagram (I may have missed something), it seems like the main reason we need to define these new crates for implementing the scheduler(pool) is the dependency on execute_batch() (in solana-ledger).
I'm wondering if moving execute_batch and TransactionStatusSender types into runtime would significantly simplify this? Do we need dyn then?
edit: @ryoqun thanks for the edits marking ref/own/dep on each connection!
core/src/replay_stage.rs
Outdated
@@ -2640,14 +2640,44 @@ impl ReplayStage {
    .expect("Bank fork progress entry missing for completed bank");

let replay_stats = bank_progress.replay_stats.clone();
let r_replay_stats = replay_stats.read().unwrap();
let mut replay_stats = replay_stats.write().unwrap();
Is this only mutable for the accumulate on line 2651? Maybe we can keep the write lock nested down below and then grab a read lock after
yep. thanks for the nice suggestion! well, i overlooked proper locking-granularity thinking...: 3a7810c

super appreciate that diagram. yeah, you depicted it correctly. I further improved it.
Unfortunately, it won't be that easy to remove dyn in that way.
pub type SchedulerPoolArc = Arc<dyn InstalledSchedulerPool>;
pub(crate) type InstalledSchedulerPoolArc = Option<SchedulerPoolArc>;
It seems a bit confusing that a SchedulerPoolArc is an Arc<InstalledSchedulerPool>, and at the same time InstalledSchedulerPoolArc is an Option<SchedulerPoolArc>.
I wonder if it could be because the Installed part in the InstalledScheduler name is somewhat "artificial".
It seems like it is used to introduce the Scheduler API and nothing else. Scheduler always implements InstalledScheduler:
impl InstalledScheduler for Scheduler
So there is no actual action that makes a certain scheduler "installed" as opposed to some other state, at the type level.
The InstalledSchedulerPool API deals with SchedulerBoxes, which are internally InstalledSchedulers. But it is a bit strange that this API does not rather operate on InstalledSchedulerBoxes.
pub trait InstalledSchedulerPool: Send + Sync + Debug {
    fn take_from_pool(&self, context: SchedulingContext) -> SchedulerBox;
    fn return_to_pool(&self, scheduler: SchedulerBox);
}
pub type SchedulerBox = Box<dyn InstalledScheduler>;
pub(crate) struct InstalledSchedulerBox(Option<SchedulerBox>);
Maybe adding a new name to use for the Scheduler API could remove some of this back and forth?
For example, call the API TransactionExecutor or TransactionScheduler, and then the implementation can be called Scheduler.
Or call the API Scheduler, and give a different name to the implementation: SchedulerImpl, or SimpleScheduler, or PooledScheduler as you suggested.
really appreciate this in-depth feedback from someone with the least context. i think it makes the code more readable/accessible in the end.
yeah, the Installed prefix was chosen artificially just for the sake of trait/impl naming. Other options were: Injected/Dyn/Like/Pluggable. However, I've settled on Installed after long thought (not bothered to create a wall of text... lol).
Also, I'm reluctant to use different nouns like Executor (this is taken by ExecutorCache) and Transaction (the scheduler will work on something called Tasks; generic to both plain Transactions and/or Bundles (common word for grouped tx in mev jargon)).
anyway, as you pointed out, there was still inconsistency in the usage of Installed... how about this?: a8d66fa
I might be missing some context.
It still feels strange to have the word Installed here, even when the same object is used while both being installed into a Bank and not.
In other words, an InstalledSchedulerPoolArc, on first glance, would then need to refer to a pool of schedulers that are actually "installed". But, IIUC, they are not. It is holding schedulers that are not installed into any bank, and are waiting to be installed.
When I was trying to build a mental model of the abstractions interacting here, also reading your comments on the Bank/Scheduler circular references affecting their lifetimes, I wondered if choosing an ownership model could simplify both references and names.
Consider this:
- Rename the InstalledScheduler trait to be just Scheduler.
- Remove the reference to the Bank from the Scheduler.
- Only Scheduler::execute requires a Bank, so just pass it in as an argument.
- When a Bank will be using the scheduler it currently has installed, it will just pass self as the first argument to execute().
I wonder if it would also remove the need for tracking the shutdown process, simplifying the Scheduler trait to this:
pub trait Scheduler: Send + Sync + Debug {
    fn id(&self) -> SchedulerId;
    fn pool(&self) -> InstalledSchedulerPoolArc;
    fn execute(&self, bank: &Bank, sanitized_tx: &SanitizedTransaction, index: usize);
    // suppress false clippy complaints arising from mockall-derive:
    // warning: the following explicit lifetimes could be elided: 'a
    #[allow(clippy::needless_lifetimes)]
    fn context<'a>(&'a self) -> Option<&'a SchedulingContext>;
    fn replace_context(&mut self, context: SchedulingContext);
}
Bank now holds Box<dyn Scheduler> and returns it into a pool when it is frozen.
In other words, an InstalledSchedulerPoolArc, on first glance, would then need to refer to a pool of schedulers that are actually "installed". But, IIUC, they are not. It is holding schedulers that are not installed into any bank, and are waiting to be installed.

how about Installable?

When I was trying to build a mental model of the abstractions interacting here, also reading your comments on the Bank/Scheduler circular references affecting their lifetimes, I wondered if choosing an ownership model could simplify both references and names.

really appreciate going so deep, and the suggestion. I think i misled you. I think i can improve the docs with this feedback from someone with non-perfect context.

Only Scheduler::execute requires a Bank, so just pass it in as an argument.
...fn execute(&self, bank: &Bank, sanitized_tx: &SanitizedTransaction, index: usize);

admittedly, this isn't documented anywhere in this pr. i thought about this approach, but we can't take it.
we need to make {schedule_execution/execute}() non-blocking without executing txes. soon, it will just crossbeam::Channel::send() SanitizedTransactions. Also, Schedulers are expected to be inherited across banks in the case of the block_generation scheduling_mode. so, we can't naively tie any SanitizedTransaction to a particular bank at the time of scheduling. Also, if we were to tie them, we would be forced to use sync::Weak to avoid bank leaks just because of buffered sanitized txes, which in turn incurs a perf. penalty for each tx, which isn't acceptable.
all in all, SchedulingContext and its complicated dance were born.
That said, i'm still open to better trait design. :)
again, I'll write a proper in-source doc explaining this.
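To make the buffering point concrete, here's a minimal sketch (my own illustration with simplified stand-in types, not the pr's code) of why a per-call &Bank argument doesn't survive a non-blocking schedule_execution(): the worker thread consumes the task later, so the scheduler has to own an Arc<Bank> (what SchedulingContext carries) for as long as work stays buffered.

use crossbeam_channel::{unbounded, Sender};
use std::{sync::Arc, thread};

struct Bank; // stand-in for the real Bank
struct SanitizedTransaction; // stand-in for the real SanitizedTransaction

struct BufferingScheduler {
    task_sender: Sender<(SanitizedTransaction, usize)>,
    _worker: thread::JoinHandle<()>,
}

impl BufferingScheduler {
    fn spawn(bank: Arc<Bank>) -> Self {
        let (task_sender, task_receiver) = unbounded();
        // the worker captures its own Arc<Bank>; a borrowed &Bank passed into
        // schedule_execution() would already be gone by the time this runs.
        let _worker = thread::spawn(move || {
            for (_tx, _index) in task_receiver {
                let _bank: &Bank = &bank; // execute against the owned bank here
            }
        });
        Self { task_sender, _worker }
    }

    // non-blocking: only enqueues the transaction for later execution.
    fn schedule_execution(&self, tx: SanitizedTransaction, index: usize) {
        self.task_sender.send((tx, index)).expect("worker alive");
    }
}

fn main() {
    let scheduler = BufferingScheduler::spawn(Arc::new(Bank));
    scheduler.schedule_execution(SanitizedTransaction, 0);
}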
[...] I think i misled you. I think i can improve the docs with this feedback from someone with non-perfect context.

Glad my attempts to understand the PR brought at least some value :))

Only Scheduler::execute requires a Bank, so just pass it in as an argument.
...fn execute(&self, bank: &Bank, sanitized_tx: &SanitizedTransaction, index: usize);

admittedly, this isn't documented anywhere in this pr. i thought about this approach, but we can't take it.
we need to make {schedule_execution/execute}() non-blocking without executing txes. soon, it will just crossbeam::Channel::send() SanitizedTransactions. Also, Schedulers are expected to be inherited across banks in the case of the block_generation scheduling_mode. so, we can't naively tie any SanitizedTransaction to a particular bank at the time of scheduling. Also, if we were to tie them, we would be forced to use sync::Weak to avoid bank leaks just because of buffered sanitized txes, which in turn incurs a perf. penalty for each tx, which isn't acceptable.
all in all, SchedulingContext and its complicated dance were born.
That said, i'm still open to better trait design. :)
again, I'll write a proper in-source doc explaining this.
I should have checked. So the dependency is not via a Bank directly, but via a SchedulingContext. I am still missing the complete picture, but just to be clear, you are saying that this:
fn execute(&self, context: &SchedulingContext, sanitized_tx: &SanitizedTransaction, index: usize);
is not going to work, correct?
Is it because the caller of execute() does not have a SchedulingContext at hand, and you do not want to add it into the Bank?
Currently, SchedulingContext just wraps a reference to a Bank, adding a mode to it:
#[derive(Clone, Debug)]
pub struct SchedulingContext {
    mode: SchedulingMode,
    // Intentionally not using Weak<Bank> for performance reasons
    bank: Arc<Bank>,
}
Maybe if you put mode into the Scheduler, but keep the Bank as the execute() argument, it would work, ownership wise?
And if, in the future, the scheduling machinery is going to change, you can replace mode with a channel or anything else, still living inside a Scheduler, but the Bank is kept outside.
Bank and any other data are only combined when execute() is invoked, and the reference to Bank is unnecessary for other method calls.
This way, there is no circular dependency between the Bank that (temporarily) owns the Scheduler and the Scheduler itself.
Something like this:
pub trait Scheduler: Send + Sync + Debug {
    fn id(&self) -> SchedulerId;
    fn pool(&self) -> InstalledSchedulerPoolArc;
    fn mode(&self) -> SchedulingMode;
    fn set_mode(&mut self, v: SchedulingMode);
    fn execute(&self, bank: &Bank, sanitized_tx: &SanitizedTransaction, index: usize);
}
Depending on how mode is used in a specific implementation, mode()/set_mode() might be unnecessary in this trait. If, for example, the setter and getter of the mode both know the concrete type of the Scheduler they work with.
also, the above benches also demonstrate why std::sync::Weak can't be used due to the perf. concern, justifying another related design decision: the inevitable introduction of the cyclic ref to Bank.
❗ arc downgrade + upgrade takes 1us? Did I do that math right? ~100_000_000 ns overhead per iter with 10_000 txs, 10 upgrade/downgrade per tx = 100_000 upgrade/downgrade per iter.
~4GHz machine => ~4000 cycles per upgrade downgrade?
That seems really high to me, maybe my understanding of how Arc works is not matched with reality.
❗ arc downgrade + upgrade takes 1us?

well, you did that math wrong... it's actually 100ns (for the contended testcase), thanks to the super busy cpu cache-line for the Arc<Bank>:
100_000_000 ns overhead per iter with 10_000 txs, 100 (see my out-of-tree patch comment) upgrade/downgrade per tx = 1_000_000 upgrade/downgrade per iter.
that said, 100ns is still too costly, considering this perf. penalty increases as the core count goes up (I have 100 cores in mind).
as shown by blocking::bench_{with,without}_arc_mutation and nonblocking::bench_with_01_thread_{with,without}_arc_mutation, the uncontended-case overhead is ~14_000_000 ns, so 14ns per upgrade/downgrade (that's reasonable as a baseline number for a few CAS ops). that's a 6x slowdown, which can be completely avoided with a proper impl like this pr.
(hi, really thanks for another dense code-review round. I'll address the comments in order.)
So, is the InstalledScheduler trait naming and api okay to your eyes by now?
I'm planning to start an intense coding/writing session once it's okay, as laid out at: #31239 (comment)
btw, I'm clinging to AttachedScheduler now that we're settling on BankWithScheduler, while InstalledSchedulerPool is remaining as-is... lol
impl Scheduler {
    fn spawn(pool: Arc<SchedulerPool>, initial_context: SchedulingContext) -> Self {
        Self {
            id: thread_rng().gen::<SchedulerId>(),
Is there a particular reason these need to be random, rather than just sequentially numbered?
For example, using a shared AtomicU64.
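For comparison, a minimal sketch (not from the pr; SchedulerId is assumed to be a u64 here) of the sequential alternative suggested above, using a shared AtomicU64:

use std::sync::atomic::{AtomicU64, Ordering};

type SchedulerId = u64;

static NEXT_SCHEDULER_ID: AtomicU64 = AtomicU64::new(0);

fn next_scheduler_id() -> SchedulerId {
    // Relaxed is enough: only uniqueness matters, not ordering against other memory.
    NEXT_SCHEDULER_ID.fetch_add(1, Ordering::Relaxed)
}

fn main() {
    assert_eq!(next_scheduler_id(), 0);
    assert_eq!(next_scheduler_id(), 1);
}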
scheduler-pool/src/lib.rs
Outdated
}

fn schedule_termination(&mut self) {
    drop::<Option<SchedulingContext>>(self.context.take());
I wonder why an explicit type here is necessary.
Or was it added for documentation purposes?
originally i thought this was a nice idea; now, not... lol: 9f82921
scheduler-pool/src/lib.rs
Outdated
WaitReason::ReinitializedForRecentBlockhash => {
    // rustfmt...
    true
}
minor: Why not remove the comment and allow rustfmt to turn the block into an expression?
i occasionally prevent rustfmt from formatting so that each match arm is styled in the same manner.
but the code is gone as part of 81075ac
scheduler-pool/src/lib.rs
Outdated
fn scheduling_context(&self) -> Option<&SchedulingContext> {
    self.context.as_ref()
}

fn replace_scheduler_context(&mut self, context: SchedulingContext) {
    self.context = Some(context);
    *self.result_with_timings.lock().expect("not poisoned") = None;
}
Naming between these two functions is a bit inconsistent. They both operate on the same field: context.
The first function talks about "scheduling context", while the second talks about "scheduler context".
oops, this is a leftover from the rename... fn replace_scheduling_context is correct.
btw, Scheduler <=> Scheduling is so nuanced. i named the context SchedulingContext to indicate its lifecycle irrelevance to a particular Scheduler (i.e. reinitialize carries over the same SchedulingContext). i'll comment on this aspect as well later. thanks for pointing it out!
i'll comment on this aspect as well later. thanks for pointing it out!

done: b39e2c9
scheduler-pool/src/lib.rs
Outdated
fn scheduler_id(&self) -> SchedulerId {
    self.id
}

fn scheduler_pool(&self) -> SchedulerPoolArc {
    self.pool.clone()
}

fn schedule_execution(&self, transaction: &SanitizedTransaction, index: usize) {
minor: Are these methods expected to be part of an API that has multiple other purposes?
If not, their names could probably be shortened a bit to id(), pool() and execute() respectively.
i named it schedule_execution to indicate the non-blocking/async nature, like its counterpart: schedule_termination. (termination will be non-blocking as well in the future).
as for id() and pool(), i thought they were too short at first. but i'll rethink it later. (i really like shorter names)
as for id() and pool(), i thought they were too short at first. but i'll rethink it later. (i really like shorter names)

i changed my mind: 21b7250
scheduler-pool/src/lib.rs
Outdated
// so, we're NOT scheduling at all; rather, just execute tx straight off. we don't need
// to solve inter-tx locking deps only in the case of single-thread fifo like this....
if result.is_ok() || !fail_fast {
    *result = execute_batch(
As a person lacking some context, I must admit that the logic here is still not completely clear to me, even after reading the comment.
Is the idea that a certain scheduler may want to execute transactions even if one of the previous transactions in the block has failed?
I think adding a comment to the
let fail_fast = match context.mode() {
line might clarify this. You clearly already made an effort there when you said
// this should be false, for (upcoming) BlockGeneration variant.
but it might still be lacking a higher-level explanation as to the whole reasoning for fail_fast's existence in the first place.
under block generation mode, it's normal for some txes to fail. I'll write a comment. :)
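As a hedged sketch of that higher-level reasoning (the BlockGeneration variant is only assumed from the discussion above; it isn't in this pr yet): verification wants to stop replaying a block at the first failed transaction, while generation keeps going because individual transaction failures are normal when producing a block.

#[derive(Debug, Clone, Copy)]
enum SchedulingMode {
    BlockVerification,
    BlockGeneration, // upcoming variant per the discussion above
}

fn fail_fast(mode: SchedulingMode) -> bool {
    match mode {
        SchedulingMode::BlockVerification => true,
        SchedulingMode::BlockGeneration => false,
    }
}

fn main() {
    assert!(fail_fast(SchedulingMode::BlockVerification));
    assert!(!fail_fast(SchedulingMode::BlockGeneration));
}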
let debug = format!("{pool:#?}");
assert!(!debug.is_empty());
Wouldn't this always be true?
pool is an Arc<SchedulerPool>.
SchedulerPool is a struct with a default implementation for Debug.
And Arc forwards its Debug implementation to the contained type.
So debug will contain a debug output for a SchedulerPool instance, which should never be an empty string.
yep, this is done as a sanity check and for coverage of the Debug impl. i did a similar thing for Bank. maybe i need to comment on both as such
Are you talking about the "Codecov Report", or by "coverage" you are saying you want to actually exercise the Debug
code?
For SchedulerPool
Debug
is written by a macro. So there seems to be no need for checking it.
runtime/src/bank.rs
Outdated
// This is needed until we activate fix_recent_blockhashes because intra-slot
// recent_blockhash updates necessitates synchronization for consistent tx check_age
// handling.
self.wait_for_reusable_scheduler();
here
scheduler-pool/benches/scheduler.rs
Outdated
let mut processing_states: Vec<State> = vec![State::Blocked; dependency_graph.len()];
let mut signature_indices: HashMap<&Signature, usize> =
    HashMap::with_capacity(dependency_graph.len());
signature_indices.extend(
    pending_transactions
        .iter()
        .enumerate()
        .map(|(idx, tx)| (tx.signature(), idx)),
);

let mut is_done = false;
while !is_done {
    is_done = true;
    for idx in 0..processing_states.len() {
        match processing_states[idx] {
            State::Blocked => {
                is_done = false;

                // if all the dependent txs are executed, this transaction can be
                // scheduled for execution.
                if dependency_graph[idx]
                    .iter()
                    .all(|idx| matches!(processing_states[*idx], State::Done))
                {
                    self.inner_scheduelr.schedule_execution(Arc::new((
                        pending_transactions[idx].clone(),
                        idx,
                    )));
                    // this idx can be scheduled and moved to processing
                    processing_states[idx] = State::Processing;
                }
            }
            State::Processing => {
                is_done = false;
            }
            State::Done => {}
        }
    }

    if is_done {
        break;
    }

    let mut executor_responses: Vec<_> = vec![receiver.recv().unwrap()];
    executor_responses.extend(receiver.try_iter());
    for r in &executor_responses {
        processing_states[*signature_indices.get(r).unwrap()] = State::Done;
    }
}
Ok(())
<bike-shedding>
I wonder if, instead of going through all dependency_graph.len() entries of the vec every time and doing a search for the signature via a hash map, it would make sense to just have a few mutating vectors.
Something like
#[derive(Copy, Clone, PartialEq, Eq)]
enum State {
    Blocked,
    Processing,
    Done,
}

let mut state = vec![State::Blocked; dependency_graph.len()];
let mut blocked = pending_transactions
    .iter()
    .enumerate()
    .map(|(idx, tx)| (idx, tx.signature()))
    .collect::<Vec<_>>();
let mut pending = Vec::with_capacity(dependency_graph.len());

while !(blocked.is_empty() && pending.is_empty()) {
    for blocked_i in 0..blocked.len() {
        if blocked_i >= blocked.len() {
            break;
        }
        let (tx_idx, _signature) = blocked[blocked_i];
        // if all the dependent txs are executed, this transaction can be
        // scheduled for execution.
        if dependency_graph[tx_idx]
            .iter()
            .all(|dependency_idx| state[*dependency_idx] == State::Done)
        {
            self.inner_scheduelr.schedule_execution(Arc::new((
                pending_transactions[tx_idx].clone(),
                tx_idx,
            )));
            pending.push(blocked.swap_remove(blocked_i));
            state[tx_idx] = State::Processing;
        }
    }
    assert!(
        !pending.is_empty(),
        "If there is no pending work but `blocked` is not empty we \
         are in a deadlock"
    );
    let done_tx_sig = receiver.recv()
        .expect("`pending` is non-empty");
    let pending_i = pending
        .iter()
        .position(|(_idx, signature)| *signature == done_tx_sig)
        .expect("`pending` must contain the completed transaction");
    let done_tx_idx = pending.swap_remove(pending_i).0;
    state[done_tx_idx] = State::Done;
}
Ok(())
}
Considering it is testing code, not sure if it really matters.
I wonder if instead of going through all dependency_graph.len() entries of the vec every time, and doing a search for the signature via a hash map, it would make sense to just have a few mutating vectors. Something like ...

yeah, it looks competitive with the current algo, at least. cc: @buffalu — possible algo improvement for fast replay.
the .position part is a linear scan but should fare well compared with the HashMap, as pending_transactions shouldn't be that large to begin with.
also, i think reviving receiver.try_iter() might be worth it.

... Considering it is testing code, not sure if it really matters.

np. thanks for the suggestion. I'll leave this code as-is since, as you pointed out, it is bench code (the bench isn't trying to eval this code; it just wants a small scheduler logic).
but i just cc-ed people who care about it. :)
scheduler-pool/src/lib.rs
Outdated
    pub fn self_arc(&self) -> Arc<Self> {
        self.weak_self
            .upgrade()
            .expect("self-referencing Arc-ed pool")
    }
}

impl InstalledSchedulerPool for SchedulerPool {
    fn take_from_pool(&self, context: SchedulingContext) -> InstalledSchedulerBox {
        let mut schedulers = self.schedulers.lock().expect("not poisoned");
        // pop is intentional for filo, expecting relatively warmed-up scheduler due to having been
        // returned recently
        let maybe_scheduler = schedulers.pop();
        if let Some(mut scheduler) = maybe_scheduler {
            scheduler.replace_context(context);
            scheduler
        } else {
            Box::new(PooledScheduler::<_, DefaultScheduleExecutionArg>::spawn(
                self.self_arc(),
                context,
                DefaultTransactionHandler,
            ))
        }
    }
Arc<Self> is object safe: https://doc.rust-lang.org/reference/items/traits.html#object-safety
This is one of the things that Rust can do that C++ can not.

These all seem overly generic. It looks to me like you are trying to create a pretty specific API.

You can use it if you like it :P

Looking at just the diff, it seems nice that
pub fn clone_with_scheduler(&self) -> BankWithScheduler {
    BankWithScheduler {
        inner: self.inner.clone(),
    }
}
Should this be impl Clone for BankWithScheduler?
That would shadow <Arc<Bank> as Clone>::clone(), given the existence of impl Deref for BankWithScheduler... so i can't do it. maybe i need to comment about this deliberate subtlety.
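A simplified sketch of that shadowing concern (types reduced to the bare minimum; not the pr's code): with Deref<Target = Arc<Bank>> and no Clone on the wrapper, .clone() autoderefs and resolves to <Arc<Bank> as Clone>::clone; adding impl Clone for BankWithScheduler would silently change what .clone() means at existing call sites.

use std::ops::Deref;
use std::sync::Arc;

struct Bank;

struct BankWithScheduler {
    bank: Arc<Bank>,
}

impl Deref for BankWithScheduler {
    type Target = Arc<Bank>;
    fn deref(&self) -> &Arc<Bank> {
        &self.bank
    }
}

fn demo(x: BankWithScheduler) -> Arc<Bank> {
    // resolves through Deref to <Arc<Bank> as Clone>::clone today; it would stop
    // compiling (or change meaning) if BankWithScheduler itself implemented Clone.
    x.clone()
}

fn main() {
    let with_scheduler = BankWithScheduler { bank: Arc::new(Bank) };
    let _bank: Arc<Bank> = demo(with_scheduler);
}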
I had a suspicion that Deref abuse would bring all the bad interactions that C++ implicit conversions bring.
Here is one indication of that :P
Maybe remove the Deref? I think Rust style is to avoid implicit conversions except for a very few clearly defined cases that are almost part of the language in the first place.
Sorry if this suggestion goes completely against what you were doing.
Just making sure you consider it as a possibility :)
I had a suspicion that Deref abuse would bring all the bad interactions that C++ implicit conversions bring.
Here is one indication of that :P

yeah, i was reluctant as well about the use of Deref because of these unwanted effects...

Maybe remove the Deref? I think Rust style is to avoid implicit conversions except for a very few clearly defined cases that are almost part of the language in the first place.

... that said, i still think removing Deref isn't desired, considering the yet another rather large code churn from doing so..
also, it seems there are other funny Derefs already (just run $ git grep -A5 "impl.*Deref"). so, I'm not the first.. xD

Sorry if this suggestion goes completely against what you were doing.
Just making sure you consider it as a possibility :)

thanks for raising your voice. I'll go as-is with some good (hopefully) doc: 3843793
I've settled on
Problem

both current tx execution mechanisms for block verification (ReplayStage/blockstore_processor) and generation (BankingStage) aren't ideal for the upcoming unified scheduler. see #30746 for this terminology.

Summary of Changes

refer to #31239 (comment) to untangle this complex object graph....
introduce a third tx execution code path and plumb all the things, which is planned to be used for both block verification and generation by the unified scheduler.

Primary design choices are (not all things are coded in this pr):
- ... Arc::upgrade() per task execution.

Also provide a minimally-viable impl just to make the plumbing actually workable for demonstration purposes. That will be replaced with the actual full-fledged impl in coming prs.
This pr touches various parts but i tried very hard to make these code changes understandable. For details, see doc comments (rare from me, lol!) in the pr changes.

todos
- solana-scheduler with actual scheduling code (along with proper docs, explaining its algo.)
- solana-scheduler-pool with actual multi-threaded code.