Skip to content
This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Add new execution code-path for unified scheduler #31239

Closed
wants to merge 56 commits into from

Conversation

ryoqun
Copy link
Contributor

@ryoqun ryoqun commented Apr 18, 2023

Problem

both current tx execution mechanisms for block verification (ReplayStage/blockstore_processor) and generation (BankingStage) aren't ideal for upcoming unified scheduler. see #30746 for these terminology.

Summary of Changes

refer to #31239 (comment) to untangle this complex object graph....

introduce a third tx execution code path and plumb all the things, which is planned to be used for both block verification and generation by unified scheduler.

Primary design choices are (not all things are coded in this pr):

  • this new execution mechanism allows async (non-blocking) tx buffering for execution (this is very desirable for optimal scheduling across lock-conflicting ledger entries, etc)
  • concurrent and isolated worker thread pool for each active forks.
  • minimize overhead extremely for non-batched individual-tx-level task queuing. (to the extent where i went into greater deal just to remove single Arc::upgrade() per task execution.)

Also provide minimally-viable impl just to make the plumbing actually workable for demonstration purpose. That will be replaced with actual full-fledged impl in coming prs.

This pr touches various parts but i tried very hard to make these code changes understandable. For details, see doc comments (rare from me, lol!) in pr changes.

todos

  • 1/3: this pr
  • 2/3: flesh out solana-scheduler with actual scheduling code (along with proper docs, explaning its algo.)
  • 3/3: flesh out solana-scheduler-pool with actual multi-threaded code.

@ryoqun ryoqun requested a review from apfitzge April 18, 2023 05:21
@codecov
Copy link

codecov bot commented Apr 18, 2023

Codecov Report

Merging #31239 (465205e) into master (e74bc4e) will increase coverage by 0.0%.
The diff coverage is 91.7%.

@@           Coverage Diff            @@
##           master   #31239    +/-   ##
========================================
  Coverage    81.5%    81.5%            
========================================
  Files         729      732     +3     
  Lines      206580   207349   +769     
========================================
+ Hits       168445   169141   +696     
- Misses      38135    38208    +73     

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming is must firstly be settled on...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming is must firstly be settled on...

};

// Send + Sync is needed to be a field of BankForks
#[cfg_attr(any(test, feature = "test-in-workspace"), automock)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

Comment on lines 433 to 441
let mocked_scheduler = setup_mocked_scheduler_with_extra(
[WaitReason::TerminatedFromBankDrop].into_iter(),
Some(|mocked: &mut MockInstalledScheduler| {
mocked
.expect_schedule_execution()
.times(1)
.returning(|_, _| ());
}),
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hope there's no anti-mock guys around.. lol

Comment on lines 54 to 67
// Calling this is illegal as soon as schedule_termiantion is called on &self.
fn schedule_execution(&self, sanitized_tx: &SanitizedTransaction, index: usize);

// This optionally signals scheduling termination request to the scheduler.
// This is subtle but important, to break circular dependency of Arc<Bank> => Scheduler =>
// SchedulingContext => Arc<Bank> in the middle of the tear-down process, otherwise it would
// prevent Bank::drop()'s last resort scheduling termination attempt indefinitely
fn schedule_termination(&mut self);

#[must_use]
fn wait_for_termination(&mut self, reason: &WaitReason) -> Option<ResultWithTimings>;

// suppress false clippy complaints arising from mockall-derive:
// warning: the following explicit lifetimes could be elided: 'a
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intentionally, these aren't doc-comments... xD not intending proper wordz

"process_batches()/schedule_batches_for_execution({} batches)",
batches.len()
);
schedule_batches_for_execution(bank, batches)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

execution diversion starts from here. (meaning all locking verification are equally done for both if branches)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i tried this writing first time. but i kinda like putting all relevant code by topic (scheduling in this case) into a single file, opening bunch of layered structs, thanks for rust's open classes at the cost of pub(crate) impurification.

fn mode(&self) -> SchedulingMode;
}

// This file will be populated with actual implementation later.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stay tuned!

@ryoqun
Copy link
Contributor Author

ryoqun commented Apr 18, 2023

tagging as this is monorepo-wide big change: @carllin (i trust your eyes for cutting-through review), @bw-solana (this should be complement to your concurrent replay), @ilya-bobyr (saw your recent work at blockstore_processor; tldr tx could be executed differently)

happy to explain context/impl details if interested (i'll try very hard to put answers into code/comments for future posterity, lol).

Copy link
Contributor

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some very initial comments from first scan through.

At a high-level, I'm still trying to understand these cyclic dependencies but will need another pass at least.

Should probably loop more than just me in on the review requests as well 😄

ancestor_hashes_replay_update_sender,
purge_repair_slot_counter,
);
// If the bank was corrupted, abort now to prevent further normal processing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does a bank being corrupted mean? How does that happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrupt here refers to any BlockstoreProcessorError error; this comment is copied from the existing one. anyway, i don't think corrupt here adding any value; so removed it...: 16a1e9c

ledger-tool/src/main.rs Show resolved Hide resolved
@@ -126,23 +126,33 @@ pub struct LocalCluster {
}

impl LocalCluster {
pub fn new_with_equal_stakes(
pub fn config_with_equal_stakes(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to ClusterConfig impl?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done: eed7db0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also created a pr for this: #32330

@@ -8282,6 +8291,8 @@ impl TotalAccountsStats {

impl Drop for Bank {
fn drop(&mut self) {
self.drop_scheduler();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefer not to have the empty line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done: 3ba582b

//!
//! Albeit being at this abstract interface level, it's generally assumed that each
//! `InstalledScheduler` is backed by multiple threads for performant transaction execution and
//! there're multiple independent schedulers inside a single instance of `InstalledSchedulerPool`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this always the case, or is it only necessary if --replay-slots-concurrently is enabled?

Copy link
Contributor Author

@ryoqun ryoqun Apr 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, the current situation of concurrency level is complicated.

direct to your question, firstly, it is always the case. as soon as a new fork is created by BankForks::insert(), a new PooledScheduler is spawn, which means a new independent group of threads from other ongoing forks (sharing no synchronization primitives among others). This is true, regardless --replay-slots-concurrently or not. And multiple slots can be replayed concurrently as .schdule_execution() is assumed be async/non-blocking. i.e. single-threaded replay stage can still buffer txes into multiple unified schedulers.

So, when --replay-slots-concurrently is enabled, unified scheduler can take advantage of it in that it can fully concurrently replay ledger entries, including the buffering part (note: by principle!). on the other hand, blockstore_processor uses a shared rayon thread group. While the thread group is populated with N (core count) threads, it's still possible for any blocking call there to starve the other fork. also, I've observed non-optimal rayon behavior when .install() is called from multiple threads against a shared thread group.

the caveat for the full concurrency of unified scheduler is that it's true if there's no single is_complete() bank at given moment among all of ongoing forks. that's because process_replay_results isn't concurrent regarding multiple forks even if --replay-slots-concurrently is enabled (still). so, as soon as any buffered schedule_execution()-ed transactions are all executed, other !is_complete() forks can be starved sub-optimally.

Anyway, these are all edge case runtime behavior... I'll planning to revisit this in distant future... or @bw-solana might be planning to further enhance the replay stage's concurrency. xD

Copy link
Contributor Author

@ryoqun ryoqun Apr 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(while writing the above, I noticed i better off pulling a small assert from the dirty topic branch: 9769288 , pending ci for its correctness... lol)

Copy link
Contributor Author

@ryoqun ryoqun Apr 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, this concurrency bla bla also affected by this pending feature activation: #31239 (comment)


#[must_use]
fn wait_for_scheduler(&self, reason: WaitReason) -> Option<ResultWithTimings> {
debug!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mix of inlined formatting and regular formatting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you meant like this?:

$ git diff HEAD~
diff --git a/runtime/src/installed_scheduler_pool.rs b/runtime/src/installed_scheduler_pool.rs
index 83408d1c861..c56c651f238 100644
--- a/runtime/src/installed_scheduler_pool.rs
+++ b/runtime/src/installed_scheduler_pool.rs
@@ -242,8 +242,7 @@ impl Bank {
     #[must_use]
     fn wait_for_scheduler(&self, reason: WaitReason) -> Option<ResultWithTimings> {
         debug!(
-            "wait_for_scheduler(slot: {}, reason: {reason:?}): started...",
-            self.slot()
+            "wait_for_scheduler(slot: {self.slot()}, reason: {reason:?}): started...",
         );
 
         let mut scheduler_guard = self.scheduler.write().expect("not poisoned");

actually, this is syntax error:

error: invalid format string: expected `'}'`, found `'.'`
   --> runtime/src/installed_scheduler_pool.rs:245:44
    |
245 |             "wait_for_scheduler(slot: {self.slot()}, reason: {reason:?}): started...",
    |                                       -    ^ expected `}` in format string
    |                                       |
    |                                       because of this opening brace
    |
    = note: if you intended to print `{`, you can escape it using `{{`

error: could not compile `solana-runtime` due to previous error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha I was unclear, but actually meant the opposite direction

         debug!(
            "wait_for_scheduler(slot: {}, reason: {:?}): started...",
            self.slot(),
            reason,
         );

It's fine to leave it, I always just find it harder to read if some params are inlined and others are not, and either choose to go all inlined or all not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also under the impression that it is recommended to use either all inlined, or all non-inlined arguments, but not a mix.
But I can not find any recommendations on this point right now.
One reason might be that there is a clippy rule that insists that all arguments should be inlined, when they all can be inlined.
But as soon as there is at least one non-inlinable argument, it stops complaining.
This rule has been downgraded to "pedantic" due to rust-analyzer not been robust enough just yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done: 1dc0c2c

@@ -0,0 +1,22 @@
[package]
name = "solana-scheduler-pool"
description = "The solana scheduler pool"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
description = "The solana scheduler pool"
description = "The Solana scheduler pool"

@@ -0,0 +1,10 @@
[package]
name = "solana-scheduler"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we want to be so generic with the name if we've got other scheduler impls in the codebase

Copy link
Contributor Author

@ryoqun ryoqun Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hehe, how about solana-unified-scheduler-logic/solana-unified-scheduler-pool then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like those better 😄

scheduler-pool/src/lib.rs Show resolved Hide resolved
// this will be replaced with more proper implementation...
// not usable at all, especially for mainnet-beta
#[derive(Debug)]
struct Scheduler {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confused by Scheduler being here, rather than in the scheduler crate, even if this is a temporary

Copy link
Contributor Author

@ryoqun ryoqun Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call. well, maybe this struct should be named like PooledScheduler. as you pointed out, this is unintended name conflict... and this isn't temporary. the struct is something which will be glueing/managing scheduling logic and threads together (in the future). solana-scheduler is meant to hold pure logic not depending on mundane world of solana-ledger (and even solana-runtime!)... so, i wonder i should also name the crate to solana-unified-scheduler-logic or something else to avoid ambiguity coming from calling things as scheduler altogether... xD

Copy link
Contributor

@apfitzge apfitzge Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you've got the separation in a branch, could you reference that to me? or just explain a bit more what you mean in terms of separting "pure logic" from threads. I suppose I'm lacking imagination, and am not yet sold on the idea that they need/should be separated.

If we do go down the route that they should indeed be separate, then I do like the name solana-unified-scheduler-logic.

Copy link
Contributor Author

@ryoqun ryoqun Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you've got the separation in a branch, could you reference that to me?

sure!: https://github.com/ryoqun/solana/tree/unified-scheduler-for-block-verification-wip-pre-strip

that branch is full of dirty code, though. however, it's runnable against mainnet-beta. (fully-performant code)

just explain a bit more what you mean in terms of separting "pure logic" from threads.

sure, again! some more reasoning:

I think crate-level separation provides good discipline over time for designing things, assuming creating crates is cheap in terms of human time. as you know, solana-core/solana-runtime/solana-ledger is too tangled up right now (ref: https://discord.com/channels/428295358100013066/439194979856809985/1097774577817505842). so, I'm afraid to add another lump of code to there. and to the brand-new solana-unified-scheduler-pool as well.

discipline here means well-cut and well-defined interface between and explicit and discrete set of dependencies for each respective crates.

in a say, this is kind of crate-safety as in type-safety.

also, that strict discipline make it easy to bench the scheduling logic by its own. i'll put bunch of benches in solana-unified-scheduler-logic, let alone unit-testing the scheduler logics.

lastly, i think scheduler logic shold be agnostic to the actual execution details up to the communicated information across the interface, but not more. in this way, logic eval and reasoning and reviewing should be easy. that will eventually leads to performant impl. (Imo).

as for the interface, we just need bunch of crossbeam_channel::Channelss to define such an api between crates (without any extra runtime overhead if those code would be in a single crate)

callee:

https://github.com/ryoqun/solana/blob/96013125e3c6f34d093447b5fe260fa08b34435a/scheduler/src/lib.rs#L2139-L2175

caller:

https://github.com/ryoqun/solana/blob/96013125e3c6f34d093447b5fe260fa08b34435a/scheduler-pool/src/lib.rs#L738-L755

another datapoint is that solana-unified-scheduler-logic's Cargo.toml is so small like this:

https://github.com/ryoqun/solana/blob/unified-scheduler-for-block-verification-wip-pre-strip/scheduler/Cargo.toml

and the other from solana-unified-scheduler-pool:

https://github.com/ryoqun/solana/blob/unified-scheduler-for-block-verification-wip-pre-strip/scheduler-pool/Cargo.toml

if these explanation does make sense, i'll transcribe these as source comments in this pr or coming pr.

If we do go down the route that they should indeed be separate, then I do like the name solana-unified-scheduler-logic.

gocha

Copy link
Contributor Author

@ryoqun ryoqun Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this separate-or-not discussion is currently blocking all the renames and crate template creation according to #31239 (comment)).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think after examining the dependency graph the crate separation makes sense. Appreciate the detailed comments for justification.

@apfitzge apfitzge self-requested a review April 18, 2023 18:54
@apfitzge
Copy link
Contributor

apfitzge commented Apr 18, 2023

Tried to map out some of my thoughts this morning about dependencies, which I always find benefit from some visualization:

graph TD
    Bank["Arc&lt;Bank&gt;"]

    subgraph solana-runtime
        BankForks;
        subgraph cyclic-ref
        Bank;
        SchedulingContext;
        InstalledScheduler{{InstalledScheduler}};
        end
        InstalledSchedulerPool{{InstalledSchedulerPool}};
    end

    subgraph solana-ledger
        ExecuteBatch(["execute_batch()"]);
    end


    subgraph solana-scheduler-pool
        SchedulerPool;
        SchedulerPoolScheduler[Scheduler];
    end

    subgraph solana-scheduler
        WithSchedulingMode{{WithSchedulingMode}};
    end

    SchedulingContext -- refs --> Bank;
    BankForks -- owns --> Bank;
    BankForks -- owns --> InstalledSchedulerPool;
    Bank -- refs --> InstalledScheduler;

    SchedulerPool -. impls .-> InstalledSchedulerPool;
    SchedulerPoolScheduler -. impls .-> InstalledScheduler;
    InstalledScheduler -- refs --> SchedulingContext;
    SchedulingContext -. impls .-> WithSchedulingMode;

    SchedulerPoolScheduler -- refs --> SchedulerPool;
    SchedulerPool -- owns --> SchedulerPoolScheduler;
    SchedulerPoolScheduler -. calls .-> ExecuteBatch;

    solana-scheduler-pool -- deps --> solana-scheduler;
    solana-scheduler-pool -- deps --> solana-ledger;
    solana-scheduler-pool -- deps --> solana-runtime;
    solana-ledger -- deps --> solana-runtime;
    solana-runtime -- deps --> solana-scheduler;
Loading

As mentioned in my review comment, I'm trying to wrap my head around the cyclical-ness of the traits & structs.

From this diagram (I may have missed something), it seems like the main reason we need to define these new crates for implementing the scheduler(pool) is the dependency on execute_batch and TransactionStatusSender (not shown in the diagram).

I'm wondering if moving execute_batch and TransactionStatusSender types into runtime would significantly simplify this? Do we need dyn then?

edit: @ryoqun thanks for the edits marking ref/own/dep on each connection!

@@ -2640,14 +2640,44 @@ impl ReplayStage {
.expect("Bank fork progress entry missing for completed bank");

let replay_stats = bank_progress.replay_stats.clone();
let r_replay_stats = replay_stats.read().unwrap();
let mut replay_stats = replay_stats.write().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only mutable for the accumulate on line 2651? Maybe we can keep the write lock nested down below and then grab a read lock after

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep. thanks for nice suggestion! well, i overlooked proper locking granularity thinking...: 3a7810c

@ryoqun
Copy link
Contributor Author

ryoqun commented Apr 19, 2023

Tried to map out some of my thoughts this morning about dependencies, which I always find benefit from some visualization:
...

super appreciate that diagram. yeah, you depicted it correctly. I further improved it.

I'm wondering if moving execute_batch and TransactionStatusSender types into runtime would significantly simplify this? Do we need dyn then?

Unfortunately, this won't be that easy to remove dyn in that way. execute_batch is such a beast, it would pull in even spl-token crates. so, i resorted to dyn. Also, solana-scheduler-pool will need to depend on PohRecorder in the future. That's another reason for this complex arrangement... ;)

Comment on lines 73 to 74
pub type SchedulerPoolArc = Arc<dyn InstalledSchedulerPool>;
pub(crate) type InstalledSchedulerPoolArc = Option<SchedulerPoolArc>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a bit confusing that a SchedulerPoolArc is an Arc<InstalledSchedulerPool>, and at the same time InstalledSchedulerPoolArc is an Option<SchedulerPoolArc>.
I wonder if it could be because the Installed part in the InstalledScheduler name is somewhat "artificial".
It seems like it is used to introduce the Scheduler API and nothing else. Scheduler always implements InstalledScheduler:

impl InstalledScheduler for Scheduler

So there is no actual action that makes a certain scheduler "installed" as opposed to some other state, at the type level.

The fact that InstalledSchedulerPool API deals with SchedulerBoxes, which are internally InstalledScheduler. But it is a bit strange that this API does not operate rather on InstalledSchedulerBoxes.

pub trait InstalledSchedulerPool: Send + Sync + Debug {
    fn take_from_pool(&self, context: SchedulingContext) -> SchedulerBox;
    fn return_to_pool(&self, scheduler: SchedulerBox);
}

pub type SchedulerBox = Box<dyn InstalledScheduler>;
pub(crate) struct InstalledSchedulerBox(Option<SchedulerBox>);

Maybe adding a new name to use for the Scheduler API could remove some of this back and forth?
For example call the API TransactionExecutor or TransactionScheduler.
And then the implementation can be called Scheduler.

Or call the API Scheduler, and give a different name to the implementation.
SchedulerImpl, or SimpleScheduler, or PooledScheduler as you suggested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really appreciate these in-depth feedback from someone with fewest context. i think these makes code more readable/accessible in the end.

yeah, the Installed prefix was chosen artificially just for the sake of trait/impl naming. Other options were: Injected/Dyn/Like/Pluggable. However, I've settled onInstalled after long thought (not bothered to create wall of text... lol).

Also, I'm reluctant to use different nouns like Executor (this is taken by ExecutorCache) and Transaction (scheduler will work on something called Tasks; generic to both plain Transactions and/or Bundle (common word for grouped tx in mev jargon)).

anyway, as you pointed out there was still inconsistency for the usage of Installed.. how about this?: a8d66fa

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing some context.
It still feels strange to have the word Installed here, even when the same object is used while both being installed into a Bank and not.
In other words, an InstalledSchedulerPoolArc, on the first glance, would then need to refer to a pool of schedulers that are actually "installed". But, IIUC, they are not. It is holding schedulers that are not installed into any bank, and are waiting to be installed.

When I was trying to build a mental model of the abstractions interacting here, also reading your comments on the Bank/Scheduler circular references affecting their lifetimes, I wondered if choosing an ownership model could simplify both references and names.

Consider this:

  1. Rename the InstalledScheduler trait to be just Scheduler.
  2. Remove reference to the Bank from the Scheduler.
  3. Only Scheduler::execute requires a Bank, so just pass it in as an argument.
  4. When Bank will be using the scheduler it currently has installed, it will just pass self as the first argument to execute().

I wonder if it would also remove the need for tracking the shutdown process, simplifying the Scheduler trait to this:

pub trait Scheduler: Send + Sync + Debug {
    fn id(&self) -> SchedulerId;
    fn pool(&self) -> InstalledSchedulerPoolArc;

    fn execute(&self, bank: &Bank, sanitized_tx: &SanitizedTransaction, index: usize);

    // suppress false clippy complaints arising from mockall-derive:
    //   warning: the following explicit lifetimes could be elided: 'a
    #[allow(clippy::needless_lifetimes)]
    fn context<'a>(&'a self) -> Option<&'a SchedulingContext>;

    fn replace_context(&mut self, context: SchedulingContext);
}

Bank now holds Box<dyn Scheduler> and returns it into a pool when it is frozen.

Copy link
Contributor Author

@ryoqun ryoqun Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words, an InstalledSchedulerPoolArc, on the first glance, would then need to refer to a pool of schedulers that are actually "installed". But, IIUC, they are not. It is holding schedulers that are not installed into any bank, and are waiting to be installed.

how about Installable?

When I was trying to build a mental model of the abstractions interacting here, also reading your comments on the Bank/Scheduler circular references affecting their lifetimes, I wondered if choosing an ownership model could simplify both references and names.

really, appreciate going so deep and suggestion. I think i mislead you. I think i can improve docs with this feedback from someone with non-perfect context

Only Scheduler::execute requires a Bank, so just pass it in as an argument.
...

    fn execute(&self, bank: &Bank, sanitized_tx: &SanitizedTransaction, index: usize);

admittedly, this isn't documented anywhere in this pr, i thought this approach; but we can't take it.

we need to make {schedule_execution/execute}() non-blocking without executing txes. soon, it just crossbeam::Channel::send()s SanitizedTransactions. Also, Schedulers are expected to be inherited across banks in the case of block_generation scheduling_mode. so, we can't naitvely tie any SanitizedTransaction with particular bank at the time of scheduling. Also, if we were to tie, it will be forced to use sync::Weak to avoid bank leaks just because of buffered sanitized txes, which in turn incurs perf. penalty for each tx, which isn't acceptable.

all in all, SchedulingContext and complicated dance stage is born.

That said, i'm still open to better trait design. :)

again, I'll write a proper in-source doc explaining this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[... ] I think i mislead you. I think i can improve docs with this feedback from someone with non-perfect context

Glad my attempts to understand the PR brought at least some value :))

Only Scheduler::execute requires a Bank, so just pass it in as an argument.
...

    fn execute(&self, bank: &Bank, sanitized_tx: &SanitizedTransaction, index: usize);

admittedly, this isn't documented anywhere in this pr, i thought this approach; but we can't take it.

we need to make {schedule_execution/execute}() non-blocking without executing txes. soon, it just crossbeam::Channel::send()s SanitizedTransactions. Also, Schedulers are expected to be inherited across banks in the case of block_generation scheduling_mode. so, we can't naitvely tie any SanitizedTransaction with particular bank at the time of scheduling. Also, if we were to tie, it will be forced to use sync::Weak to avoid bank leaks just because of buffered sanitized txes, which in turn incurs perf. penalty for each tx, which isn't acceptable.

all in all, SchedulingContext and complicated dance stage is born.

That said, i'm still open to better trait design. :)

again, I'll write a proper in-source doc explaining this.

I should have checked. So the dependency is not via a Bank directly, but via a SchedulingContext. I am still missing a complete picture, but just to be clear, you are saying that this:

    fn execute(&self, context: &SchedulingContext, sanitized_tx: &SanitizedTransaction, index: usize);

Is not going to work, correct?
Is it because, the caller of the execute() does not have a SchedulingContext at hand, and you do not want to add it into the Bank?

Currently, SchedulingContext just wraps a reference to a Bank, adding a mode to it:

#[derive(Clone, Debug)]
pub struct SchedulingContext {
    mode: SchedulingMode,
    // Intentionally not using Weak<Bank> for performance reasons
    bank: Arc<Bank>,
}

Maybe if you put mode into the Scheduler, but keep the Bank as the execute() argument, it would work, ownership wise?
And if, in the future, the scheduling machinery is going to change, you can replace mode with a channel or anything else, still living inside a Scheduler, but the Bank is kept outside.
Bank and any other data is the only combined when execute() is invoked, and the reference to Bank is unnecessary for other method calls.
This way, there is no circular dependency between the Bank that (temporarily) owns the Scheduler and the Scheduler itself.

Something like this:

pub trait Scheduler: Send + Sync + Debug {
    fn id(&self) -> SchedulerId;
    fn pool(&self) -> InstalledSchedulerPoolArc;
    
    fn mode(&self) -> SchedulingMode;
    fn set_mode(&mut self, v: SchedulingMode);

    fn execute(&self, bank: &Bank, sanitized_tx: &SanitizedTransaction, index: usize);
}

Depending on how mode is used in a specific implementation, mode()/set_mode() might be unnecessary in this trait. If, for example, setter and getter of the mode both know the concrete type of the Scheduler they work with.

Copy link
Contributor Author

@ryoqun ryoqun May 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, the above benches are also demonstrating why std::sync::Weak can't be used due to perf. concern, justifying another related design decision of inevitable introduction of cyclic ref of Bank.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❗ arc downgrade + upgrade takes 1us? Did I do that math right? ~100_000_000 ns overhead per iter with 10_000 txs, 10 upgrade/downgrade per tx = 100_000 upgrade/downgrade per iter.

~4GHz machine => ~4000 cycles per upgrade downgrade?

That seems really high to me, maybe my understanding of how Arc works is not matched with reality.

Copy link
Contributor Author

@ryoqun ryoqun May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exclamation arc downgrade + upgrade takes 1us?

well, you did that math wrong... it's actually 100ns (for the contended testcase) , thanks to the super busy cpu cache-line for the Arc<Bank>:

100_000_000 ns overhead per iter with 10_000 txs, 100 (see my out-of-tree patch comment) upgrade/downgrade per tx = 1_000_000 upgrade/downgrade per iter.

that said, 100ns is still too costly, considering this perf. penalty increases as the count count goes up (I'm targeting 100 cores in mind).

as shown by blocking::bench_{with,without}_arc_mutation and nonblocking::bench_with_01_thread_{with,without}_arc_mutation, the uncontended-case of overhead is ~14_000_000 ns, so 14ns per upgrade/downgrade`. (that's reasonable as a baseline number for a few of cas) that's 6x slowdown, which can be completely avoidable with proper impl like this pr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ilya-bobyr

(hi, really thanks for another dense code-review round. I'll address them in order.)

So, the InstalledScheduler trait namaing and api is okay for you eyes by now?

I'm planning to start intense coding/writing session once it's okay as laid out at : #31239 (comment)

Copy link
Contributor Author

@ryoqun ryoqun Jun 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, I'm clinging to AttachedScheduler now that we're settling on BankWithScheduler while InstalledSchedulerPool is remaining as-is... lol

impl Scheduler {
fn spawn(pool: Arc<SchedulerPool>, initial_context: SchedulingContext) -> Self {
Self {
id: thread_rng().gen::<SchedulerId>(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason these need to be random, rather than just sequentially numbered?
For example, using a shared AtomicU64.

}

fn schedule_termination(&mut self) {
drop::<Option<SchedulingContext>>(self.context.take());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why an explicit type here is necessary.
Or was it added for documentation purposes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

originally i thought this was nice idea; now, not... lol: 9f82921

Comment on lines 163 to 166
WaitReason::ReinitializedForRecentBlockhash => {
// rustfmt...
true
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor
Why not remove the comment and allow rustfmt to turn block in to an expression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i occasionally prevent rustfmt from formatting to see each match arms be styled in the same manner.

but, the code is gone as part of 81075ac

Comment on lines 188 to 195
fn scheduling_context(&self) -> Option<&SchedulingContext> {
self.context.as_ref()
}

fn replace_scheduler_context(&mut self, context: SchedulingContext) {
self.context = Some(context);
*self.result_with_timings.lock().expect("not poisoned") = None;
}
Copy link
Contributor

@ilya-bobyr ilya-bobyr Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming between these two functions is a bit inconsistent.
They both operate on the same field: context.
The first function talks about "scheduling context", while the second talks about "scheduler context".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, this is leftover from rename... fn replace_scheduling_context is correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, Scheduler <=> Scheduling is so nuanced. i named context as SchedulingContext to indicate its lifecycle irrelevance to particular Scheduler (i.e. reinitialize carrys over the same SchedulingContext) . i'll comment this aspect as well later. thanks for pointing out!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll comment this aspect as well later. thanks for pointing out!

done: b39e2c9

Comment on lines 114 to 122
fn scheduler_id(&self) -> SchedulerId {
self.id
}

fn scheduler_pool(&self) -> SchedulerPoolArc {
self.pool.clone()
}

fn schedule_execution(&self, transaction: &SanitizedTransaction, index: usize) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor
Are these methods expected to be part of an API that has multiple other purposes?
If not, their names could probably be shortened a bit to id(), pool() and execute() respectively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i named schedule_execution to indicate non-blocking/async nature like the counterpart: schedule_termination. (termination will be non-blocking as well in the future).

Copy link
Contributor Author

@ryoqun ryoqun Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as for id() and pool(), i thought it's too short at first. but rethink it later. (i really like shorter name)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as for id() and pool(), i thought it's too short at first. but rethink it later. (i really like shorter name)

i changed my mind: 21b7250

Comment on lines 142 to 145
// so, we're NOT scheduling at all; rather, just execute tx straight off. we doesn't need
// to solve inter-tx locking deps only in the case of single-thread fifo like this....
if result.is_ok() || !fail_fast {
*result = execute_batch(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a person lacking some context, I must admit that the logic here is still not completely clear to me, even after reading the comment.
Is the idea that certain scheduler may want to execute transactions even if one of the previous transactions in the block have failed?

I think adding a comment to the

let fail_fast = match context.mode() {

line might clarify this. You clearly already made an effort there when you said

// this should be false, for (upcoming) BlockGeneration variant.

but it might still be lacking a higher level explanation as to the whole reasoning for the fail_fast existance in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

under block generation mode, it's normal for some txes to fail. I'll write a comment. :)

Comment on lines +225 to +226
let debug = format!("{pool:#?}");
assert!(!debug.is_empty());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this always be true?
pool is an Arc<SchedulerPool>.
SchedulerPool is a struct with a default implementation for Debug.
And Arc forwards its Debug implementation to the contained type.

So debug will contain a debug output for a SchedulerPool instance. Which should never be an empty string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, this is done for sanity check and coverage of Debug impl. i did similar thing Bank. maybe i need to comment both as such

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you talking about the "Codecov Report", or by "coverage" you are saying you want to actually exercise the Debug code?
For SchedulerPool Debug is written by a macro. So there seems to be no need for checking it.

@ryoqun ryoqun force-pushed the new-exec-code-path branch from 0ad961f to 3a7810c Compare April 23, 2023 05:25
@ryoqun ryoqun force-pushed the new-exec-code-path branch from 4dbbcbd to 9769288 Compare April 23, 2023 07:45
Comment on lines 3686 to 3689
// This is needed until we activate fix_recent_blockhashes because intra-slot
// recent_blockhash updates necessitates synchronization for consistent tx check_age
// handling.
self.wait_for_reusable_scheduler();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

Comment on lines 765 to 814
let mut processing_states: Vec<State> = vec![State::Blocked; dependency_graph.len()];
let mut signature_indices: HashMap<&Signature, usize> =
HashMap::with_capacity(dependency_graph.len());
signature_indices.extend(
pending_transactions
.iter()
.enumerate()
.map(|(idx, tx)| (tx.signature(), idx)),
);

let mut is_done = false;
while !is_done {
is_done = true;
for idx in 0..processing_states.len() {
match processing_states[idx] {
State::Blocked => {
is_done = false;

// if all the dependent txs are executed, this transaction can be
// scheduled for execution.
if dependency_graph[idx]
.iter()
.all(|idx| matches!(processing_states[*idx], State::Done))
{
self.inner_scheduelr.schedule_execution(Arc::new((
pending_transactions[idx].clone(),
idx,
)));
// this idx can be scheduled and moved to processing
processing_states[idx] = State::Processing;
}
}
State::Processing => {
is_done = false;
}
State::Done => {}
}
}

if is_done {
break;
}

let mut executor_responses: Vec<_> = vec![receiver.recv().unwrap()];
executor_responses.extend(receiver.try_iter());
for r in &executor_responses {
processing_states[*signature_indices.get(r).unwrap()] = State::Done;
}
}
Ok(())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<bike-shedding>

I wonder if instead of going through all dependency_graph.len() entries of the vec every time, and doing a search for the signature via a hash map it would make sense to just have a few mutating vectors.

Something like

            #[derive(Copy, PartialEq, Eq)]
            enum State {
                Blocked,
                Processing,
                Done,
            }

            let mut state = vec![State::Blocked; dependency_graph.len()];
            let mut blocked = pending_transactions
                .iter()
                .enumerate()
                .map(|(idx, tx)| (idx, tx.signature()))
                .collect<Vec<_>>();

            let mut pending = Vec::with_capacity(dependency_graph.len());

            while !(blocked.is_empty() && pending.is_empty()) {
                for blocked_i in 0..blocked.len() {
                    if blocked_i >= blocked.len() {
                        break;
                    }

                    let (tx_idx, _signature) = blocked[blocked_i];
                    // if all the dependent txs are executed, this transaction can be
                    // scheduled for execution.
                    if dependency_graph[tx_idx]
                        .iter()
                        .all(|dependency_idx| state[*dependency_idx] == State::Done)
                    {
                        self.inner_scheduelr.schedule_execution(Arc::new((
                            pending_transactions[tx_idx].clone(),
                            tx_idx,
                        )));

                        pending.push(blocked.swap_remove(blocked_i));
                        state[tx_idx] = State::Processing;
                    }
                }

                assert!(
                    !pending.is_empty(),
                    "If there is no pending work but `blocked` is not empty we \
                     are in a deadlock"
                );

                let done_tx_sig = receiver.recv()
                    .expect("`pending` is non-empty");
                let pending_i = pending
                    .position(|(_idx, signature) signature == done_tx_sig)
                    .expect("`pending` must contain the completed transaction");

                let done_tx_idx = pending.swap_remove(pending_i).0;
                state[done_tx_idx] = State::Done;
            }
            Ok(())
        }

Considering it is testing code, not sure if it really matters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if instead of going through all dependency_graph.len() entries of the vec every time, and doing a search for the signature via a hash map it would make sense to just have a few mutating vectors.

Something like ...

yeah, it looks competitive with current algo, at least. cc: @buffalu possible algo improve for fast replay.

the .position part is linear scan but should fare well with HashMap as pending_transactions shouldn't be that large to begin with.

also, i think reviving receiver.try_iter() might worth.

... Considering it is testing code, not sure if it really matters.

np. thanks for suggestion. I leave these code as-is as you pointed out that it is bench code (the bench isn't trying to eval this code; it just want small scheduler logic).

but i just cc-ed people who care about it. :)

Comment on lines 80 to 103
pub fn self_arc(&self) -> Arc<Self> {
self.weak_self
.upgrade()
.expect("self-referencing Arc-ed pool")
}
}

impl InstalledSchedulerPool for SchedulerPool {
fn take_from_pool(&self, context: SchedulingContext) -> InstalledSchedulerBox {
let mut schedulers = self.schedulers.lock().expect("not poisoned");
// pop is intentional for filo, expecting relatively warmed-up scheduler due to having been
// returned recently
let maybe_scheduler = schedulers.pop();
if let Some(mut scheduler) = maybe_scheduler {
scheduler.replace_context(context);
scheduler
} else {
Box::new(PooledScheduler::<_, DefaultScheduleExecutionArg>::spawn(
self.self_arc(),
context,
DefaultTransactionHandler,
))
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arc<Self> is object safe: https://doc.rust-lang.org/reference/items/traits.html#object-safety

This is one of the things that Rust can do that C++ can not.

@ilya-bobyr
Copy link
Contributor

Maybe there is a way to name BankWithScheduler using something other than the actual list of the contained items? TxReadyBank, ReadyBank, SchedulableBank, just Scheduler(?), though perhaps a better name can be found. In which case, those bank.bank_cloned() calls might change into scheduler.bank_cloned(), or something of that sort.[1]

Yeah, i know i named BankWithScheduler poorly. ;) After all, I was kinda tired of all the code changes to move scheduer out of bank at that time... I'll think about naming a bit now.

hi, I'm back! sadly, i couldn't come up with definitive better name. ExtendedBank, BankContext, BankContainer, BankWrapper or WrappedBank..

These all seem overly generic. It looks to me, you are trying to create a pretty specific API.
One that allows transaction scheduling and that is it.
All these names look like a God pattern in disguise, especially the ExtendedBank.
They say nothing about the intent of the API - and that is bad.

Also, i like SchedulableBank among your suggestions. but none of these are rings my bell..

You can use it if you like it :P
I do not have any other suggestions at the moment.
Ultimately, it is your choice, of course.

@ilya-bobyr at the same time, i really dislike bank.bank_cloned() and bank.bank(). so tried to remove it entirely: 74b9439 maybe a bit controversial. what do you think?

Looking at just the diff, it seems nice that bank.bank_clone() is gone.
I would need to re-read the whole change once again to give a more holistic opinion.
It is not a big deal either way. And you like it more, I assume.
So, I call it a win :)

Comment on lines +251 to +255
pub fn clone_with_scheduler(&self) -> BankWithScheduler {
BankWithScheduler {
inner: self.inner.clone(),
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be impl Clone for BankWithScheduler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would shadow <Arc<Bank> as Clone>::clone(), given the existence of impl Deref for BankWithScheduler... so i can't do it. maybe i need to comment about this deliberate subtlety.

Copy link
Contributor

@ilya-bobyr ilya-bobyr Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a suspicious that Deref abuse will bring all that bad interactions that the C++ implicit conversions are bringing.
Here is a one indication of that :P

Maybe remove the Deref? I think Rust style is to avoid implicit conversions except for a very few clearly defined cases that are almost part of the language in the first place.
Sorry if this suggestion goes completely against what you were doing.
Just making sure you consider it as a possibility :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a suspicious that Deref abuse will bring all that bad interactions that the C++ implicit conversions are bringing.
Here is a one indication of that :P

yeah, i was reluctant as well for the use of Deref because of these unwanted effects...

Maybe remove the Deref? I think Rust style is to avoid implicit conversions except for a very few clearly defined cases that are almost part of the language in the first place.

... that said, i still think removing Deref isn't desired, considering yet another rather large code churns by doing so..

also, it seems there are other funny Derefs already (just run $ git grep -A5 "impl.*Deref"). so, I'm not the first.. xD

Sorry if this suggestion goes completely against what you were doing.
Just making sure you consider it as a possibility :)

thanks for raising voices. I'll go as-is with some good (hopefully) doc: 3843793

@ryoqun
Copy link
Contributor Author

ryoqun commented Jun 29, 2023

Also, i like SchedulableBank among your suggestions. but none of these are rings my bell..

You can use it if you like it :P I do not have any other suggestions at the moment. Ultimately, it is your choice, of course.

I've settled on BankWithScheduler, seems this name resonates well with ::clone_{with,without}_scheduler()s. ref: 3843793

@ryoqun ryoqun force-pushed the new-exec-code-path branch from bb37d91 to 60edf6b Compare June 29, 2023 14:30
@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jul 14, 2023
@github-actions github-actions bot closed this Jul 24, 2023
@ryoqun ryoqun mentioned this pull request Oct 29, 2023
8 tasks
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
stale [bot only] Added to stale content; results in auto-close after a week.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants