Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelize Read-only Transaction Execution Design Document #130

Merged
merged 14 commits into from
Jan 28, 2023
215 changes: 215 additions & 0 deletions transactions/read-only/parallel.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,215 @@
# Parallelize Read-only Transaction Execution

Continuing PR (https://github.com/AntelopeIO/leap/pull/558) (on branch https://github.com/AntelopeIO/leap/tree/send_read_only_trx), this document describes an approach to parallelize read-only transaction execution.

## Main Ideas
The node toggles between `write` and `read` windows. In `write` window, the node operates normally, except queuing read-only transactions for later parallel execution in `read` window. In `read` window, the node runs queued read-only transactions in a dedicated thread pool, while in the main and other threads runs operations which are safe to read-only transaction execution.

## Existing Operation Analysis
This section analyzes existing operations' thread safety to read-only transaction execution.

### Chain APIs
Chain APIs can be classified into reads and writes.
- Reads are those whose names start with `get_`, like `get_info`, `get_activated_protocol_features`, `get_block`, `get_block_info`. They do not modify states.
- Writes are the rest of requests: `compute_transaction`, `push_transaction`, `push_transactions`, `send_transaction`, `send_transaction2`, and `push_block`. They may modify states.

Chain APIs are received on the HTTP thread, processed on the main thread (and producer thread for non-get requests), and responses are sent on the HTTP thread.

| API | Global data modified | Safe to read-only trx? |
|---------------------------------|-------------------------------|------------------------|
| get_info | none | yes |
| get_activated_protocol_features | none | yes |
| get_block | none | yes |
| get_block_info | none | yes |
| get_block_header_state | none | yes |
| get_account | none | yes |
| get_code | none | yes |
| get_code_hash | none | yes |
| get_abi | none | yes |
| get_raw_code_and_abi | none | yes |
| get_raw_abi | none | yes |
| get_table_rows | none | yes |
| get_table_by_scope | none | yes |
| get_currency_balance | none | yes |
| get_currency_stats | none | yes |
| get_producers | none | yes |
| get_producer_schedule | none | yes |
| get_scheduled_transactions | none | yes |
| abi_json_to_bin | none | yes |
| abi_bin_to_json | none | yes |
| get_required_keys | none | yes |
| get_transaction_id | none | yes |
| get_consensus_parameters | none | yes |
| get_accounts_by_authorizers | none | yes |
| get_transaction_status | none | yes |
| send_read_only_transaction | none | yes |
| compute_transaction | main thread: temporally change chainbase | no |
| push_block | main thread: chainbase, forkdb, producer, controller | no |
| push_transaction | main thread: chainbase, forkdb, producer, controller | no |
| push_transactions | main thread: chainbase, forkdb, producer, controller | no |
| send_transaction | main thread: chainbase, forkdb, producer, controller | no |
| send_transaction2 | main thread: chainbase, forkdb, producer, controller | no |


### Producer APIs
They are received on the HTTP thread, processed on the main thread, and responses are sent on the HTTP thread.

| API | Global data modified | Safe to read-only trx? |
|-----------------------------|-------------------------------|------------------------|
| pause | main thread: producer's \_pause_production | yes |
| resume | main thread: producer's \_pause_production | yes |
| paused | | yes |
| get_runtime_options | | yes |
| update_runtime_options | main thread: producer plugin and controller configs | no (configs used by trx processing|
| add_greylist_accounts | main thread: controller resource_greylist | yes (not used in read-only trx handling. only used by max_bandwidth_billed_accounts_can_pay) |
| remove_greylist_accounts | main thread: controller resource_greylist | yes |
| get_greylist | | yes |
| get_whitelist_blacklist | | yes |
| set_whitelist_blacklist | main thread:controller blacklist, whitelist| no |
| get_integrity_hash | | yes |
| create_snapshot | | yes |
| get_scheduled_protocol_feature_activations | | yes |
| schedule_protocol_feature_activations | | no |
| get_supported_protocol_features | | yes |
| get_account_ram_corrections | | yes |
| get_unapplied_transactions | | yes |

### Net APIs
Net APIs do not mutate states. They are received on the HTTP thread, processed on the main thread, and responses are sent on the HTTP thread.
arhag marked this conversation as resolved.
Show resolved Hide resolved

| API | Global data modified | Safe to read-only trx? |
|-----------------------------|---------------------------------|------------------------|
| connect | main thread: net::connections | yes |
| disconnect | main thread: net::connections | yes |
| status | none | yes |
| connections | none | yes |


### Trace APIs
| API | Global data modified | Safe to read-only trx? |
|-----------------------------|---------------------------------|------------------------|
| get_block | none | yes |
| get_transaction_trace | none | yes |

### DB Size API
| API | Global data modified | read-only thread safe |
|-----------------------------|---------------------------------|------------------------|
| get | none | yes | |

### Net Messages
Net messages can be classified into sync and non-sync:
- Non-sync messages do not modify states.
- Sync messages are signed_block, packed_transaction. They may modify states.

| Message | Global data modified | threads involved | Safe to read-only trx?|
|-----------------------------|---------------------------------|-----------------------------|-----------------------|
| handshake | none | net, main (handshake check) | yes |
| go_away | none | net | yes |
| time | none | net | yes |
| notice | none | net | yes |
| request | none | net, main (controller::fetch_block_by_id()) | yes |
| sync_request | none | net, main (controller::fetch_block_by_number()) | yes |
| packed_transaction | chainbase, forkdb, producer, net, controller | net, producer, main | no |
| signed_block | chainbase, forkdb, producer, net, controller | net, producer, main | no |

### SHiP
SHiP receives blockchain state data, saves it to files, and sends data to nodes who request it.

| Request | Data modified | Safe to read-only trx? |
|-----------------------|---------------------------------|------------------------|
| get_status | internal | yes |
| get_blocks | internal | yes |
| get_blocks_ack | internal | yes |

## Design Decisions

### Window Toggling
To support toggling between `read` and `write` windows, configurable options `read-window-time`, `write-window-time`, and `read-only-max-queued-time` are provided.
- From `write` to `read`: When a new read-only transaction is queued, if the time the earliest transaction has been queued exceeds `read-only-max-queued-time`; or at the end of the `write` window and read-only transaction queue not empty.
- From `read` to `write`: when read-only transaction queue is emptied by the read-only threads; or at the end of `read` window.

In the following state diagram, `longest_queued_time = now - the time when the oldest trx was queued`, `write_window_deadline = time when write window starts + write-window-time`, and `read_window_deadline = time when read window starts + read-window-time`

```mermaid
flowchart TD
A(((write window))) -->|push a new trx| B[longest_queued_time < read-only-max-queued-time?]
greg7mdp marked this conversation as resolved.
Show resolved Hide resolved
greg7mdp marked this conversation as resolved.
Show resolved Hide resolved
B -->|yes| R(((read window)))
B -->|no| A
A --> D[write_window_deadline passed?]
D -->|yes| F[read-only trx queue empty?]
D -->|no| A
F -->|yes| A
F -->|no| R

R --> S[read-only trx queue empty?]
S -->|yes| A
S -->|no| R
R --> T[read_window_deadline passed?]
T -->|yes| A
T -->|no| R
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the flowchart could be simplified to:

flowchart TD
    A(((write window))) -->|push a new ro trx| B[longest_queued_time > read-only-max-queued-time]
    B -->|yes| R(((read window)))
    B -->|no| A
    A --> D[write_window_deadline passed <b>AND</b> read-only trx queue not empty?]
    D -->|yes| R
    D -->|no| A
    
    R --> S[read-only trx queue empty <b>OR</b> read_window_deadline passed?]
    S -->|yes| A
    S -->|no| R
Loading

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checks for read-only transaction queue and read-only window time are done independently. That's why I separate them.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why reflect such details in a state diagram, if it doesn't add any useful information?


### Handling main thread functions that are not safe to read-only transaction execution in `read` Window
Several options are considered to handle
- Drop the functions. This in not acceptable as it changes the behavior.
- Re-post to the main thread. In `read` window, re-post a not-read-only-safe function back to the main thread. This is simple to implement but breaks the order of functions and waste time in popping and pushing.
- Modify appbase `execution_priority_queue` by adding a function type with value `read-only-safe` or `not-read-only-safe`. In `write` window, everything works as it is now. In `read` window, only `thread-safe` requests are dequeued and processed. This avoids introducing new queues and keeps the order of the request. But as the queue is a priority queue based on function priority, it is infeasible to incorporate the additional type into the priority in a single queue.
- Separate appbase queue into two queues: `read-only-safe` or `not-read-only-safe`. The function type is added when a function is posted.
- In `read` window, only functions in `read-only-safe` queue are executed.
- In `write` window, compare the priorities of the top functions in `read-only-safe` and `not-read-only-safe`. The one with higher priority is executed. If tied, three options are considered:
- `not-read-only-safe` function is favored
- randomly pick one
- add a time attribute to the functions and the older one is picked. This keeps the original behavior. Even though at a cost of the extra time field and an extra comparison, this option seems best.
heifner marked this conversation as resolved.
Show resolved Hide resolved
Copy link

@greg7mdp greg7mdp Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? My understanding was that during the write window, we are not supposed to process readonly transactions (instead of processing them, we queue them for later execution in parallel). So we should process only from the write queue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In write window, we handle both read and write operations, but not read-only transactions. Will define the terms at the beginning of the document.

Copy link

@greg7mdp greg7mdp Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should we have 3 queues?

  1. write transactions and write operations (only in write window)
  2. readonly transactions (only in read window)
  3. readonly operations (both read and write window)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do plan to have 3 queues:

  1. write operation queue (in appbase)
  2. read-only operation queue (in appbase)
  3. read-only transaction queue

Copy link

@greg7mdp greg7mdp Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do the write transactions go?
Also shouldn't they all be in appbase since we typically want to process items from multiple queues?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Write transactions go to write operation queue. I will define the terms to make them less confusing.


### Read-only Transaction Priority Queue
The queue is used to store read-only transactions during `write` window. To maintain the time order when a transaction is put back into the queue for the next round due to `red` window deadline, a priority queue based on the first time when a transaction is queued is used.
```c++
struct read_only_trx {
fc::time_point initial_queued_time;
packed_transaction_ptr trx;
next_func_t next;
};

struct read_only_trx_less {
bool operator() (const read_only_trx& a, const read_only_trx& b) {
// earlier queued time has higher priority
return a.initial_queued_time > b.initial_queued_time;
}
};

std::priority_queue<read_only_trx, std::deque<read_only_trx>, read_only_trx_less> read_only_trx_queue;
heifner marked this conversation as resolved.
Show resolved Hide resolved
```

### Configuration Options

- `read-only-num-threads`: the number of threads in read-only transaction execution thread pool. Default to `0`. If it is `0`, read-only transactions are executed on the main thread sequentially as they arrive
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to limit this according to how much virtual memory we have? AntelopeIO/leap#645

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this is one of those knobs we need to set a limit on so we can grok what the maximal resource usage is. For example with read-only-num-threads = 512 & max_sync_call_depth = 4 the maximum number of simultaneous wasm executions is 2048 (or maybe 2560 depending how you define sync call depth). .512 or 1024 seem like good numbers to start with for now, imo

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default to 0. If it is 0, read-only transactions are executed on the main thread sequentially as they arrive

Does this mean that, by default, this new functionality is disabled? Is it really what we want?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Not sure what's a good number.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this feature should normally not run on producer nodes, disabling it by default seems better.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this feature should normally not run on producer nodes

So if not run on producer nodes, it will not improve the chain max TPS, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It helps indirectly by speeding up API (and PTP if servicing RPCs) nodes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this would change the max chain TPS, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not the point of these changes. If anything, this will decrease max TPS of a single producer node. That is one thing we will need to test. We should verify that making the priority queue thread safe and other changes do not greatly decrease max TPS for a single producer node.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. Is there a document that describes the point of this change then? If not, maybe this document should start with a rationale for the changes. I originally assumed (apparently incorrectly) that the intent was to increase the chain TPS number.

heifner marked this conversation as resolved.
Show resolved Hide resolved
- `write-window-time`: time in milliseconds the `write` window lasts. Default to 500 milliseconds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add info on how this interacts with cpu-effort-percent and produce-time-offset-us and the last block versions of these.

heifner marked this conversation as resolved.
Show resolved Hide resolved
- `read-window-time`: time in milliseconds the `read` window lasts. Must be equal to or greater than `max-read-only-transaction-time`. Default to 200 milliseconds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need two config options? Why is read or write not just the remaining time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which remaining time? The cycle period doesn't necessarily have anything to do with block time intervals.

Copy link
Member

@heifner heifner Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Current

idle = get_info and other http requests are processed, push_trx are queued

BP Current
[ start-block, (cpu-effort-percent | max_block_cpu_usage), idle ]
[ process-block, cpu-effort-precent, idle ]

Validation Node Current
[ process-block, (process-trxs | http_requests) ]

  • Proposed in this design:

BP Current
[ start-block, (cpu-effort-percent | max_block_cpu_usage | write-window-time - start-block-time), read-window-time ]
[ process-block, (cpu-effort-precent | write-window-time - process-block-time), read-window-time ]

Validation Node Current
[ process-block, (write-window-time - process-block-time), read-window-time, idle ]

Correct?

- `read-only-max-queued-time`: time in milliseconds when is exceeded by the time the earliest transaction, node switches to `read` window, even it is before the end of `write` window.
heifner marked this conversation as resolved.
Show resolved Hide resolved
- `read-window-min-time`: time in milliseconds which must be remained in the `read` window when new transactions are scheduled for execution. Default to 5 milliseconds. This is to avoid unnecessary incomplete transaction execution.
heifner marked this conversation as resolved.
Show resolved Hide resolved
- `max-read-only-transaction-time`: time in milliseconds a read-only transaction can execute before being considered invalid. Default to 150 milliseconds. This option has already been implemented by #558
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the default, and widely used, max-transaction-time is 30ms. We recently merged AntelopeIO/leap#649 which interrupts start_block when a block is received. We also have AntelopeIO/leap#590 that allows a block to be propagated as long as the previous one has been consumed which I assume will make it in soon. Currently on-chain max_transaction_cpu_usage for EOS is 150ms.

As part of this effort, should read-only transactions (or all transactions) be killed when a block is received to be processed? With the read window and max-read-only-transaction-time being 150ms it does seem like we are potentially delaying block consumption longer than today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be case for leaving this as milliseconds to match max-transaction-time. If so, I think it should be called read-only-max-transaction-time-ms.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. We should break the read window when a new block is received. We could lower the default values for read and write windows for rapidly toggling. between writes and reads.

Will change to read-only-max-transaction-time-ms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't think more than one cycle of write-window read-window would happen over a block interval. With having to fit transactions in those windows that doesn't seem reasonable to have more than one of each per block interval.


heifner marked this conversation as resolved.
Show resolved Hide resolved

## Thread Safety

- Safety between read-only transaction threads and other `nodeos` threads
- _main_ thread: The `main` thread only performs functions safe to read-only transaction execution.
heifner marked this conversation as resolved.
Show resolved Hide resolved
- _chain_ thread: `chain` threads are used in `apply_block`, `log_irreversible`, `finalize_block`, `create_block_state_future`. Those do not run while in `read` window.
- _net_ thread: It is used for low-level networking. No conflicts with read-only transaction execution.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With AntelopeIO/leap#590 net threads can process block header validation. But this should be fine and will not conflict with read-only transaction execution.

- _http_ thread: It is used to receive requests and send back responses. No conflicts with read-only transaction execution.
- _prod_ thread: It is used in `on_incoming_transaction_async`, which is not running in `read` window
heifner marked this conversation as resolved.
Show resolved Hide resolved
- _resource monitor_ thread: Resource monitor does not have any conflicts with any transaction execution.
- Safety between read-only transaction threads: no writes are made into `chainbase` and global states when a read-only transaction is executed. This is achieved by PR #558

## Tests

- number of read-only threads is 0
- number of read-only threads is 1
- number of read-only transactions greater than number of threads
- `write-window-time` test
- `read-window-time` test
- `read-only-max-queued-time` test
- `read-window-min-time` test
- read-only transactions are processed within one read window
- read-only transactions are processed in multiple read windows
- initiate RPC requests while read-only transactions are executed