From 44b93a36ef88a576d14dcfba568046512b0f94d9 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Tue, 30 Mar 2021 18:36:27 +1000
Subject: [PATCH 01/34] Initial async RFC version

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 271 ++++++++++++++++++++
 1 file changed, 271 insertions(+)
 create mode 100644 book/src/dev/rfcs/drafts/xxxx-async-rust.md

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
new file mode 100644
index 00000000000..1a94fc2f59d
--- /dev/null
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -0,0 +1,271 @@
+- Feature Name: async_rust_constraints
+- Start Date: 2021-03-30
+- Design PR: [ZcashFoundation/zebra#0000](https://github.com/ZcashFoundation/zebra/pull/0000)
+- Zebra Issue: [ZcashFoundation/zebra#1593](https://github.com/ZcashFoundation/zebra/issues/1593)
+
+# Summary
+[summary]: #summary
+
+Zebra executes concurrent tasks using [async Rust](https://rust-lang.github.io/async-book/),
+with the [tokio](https://docs.rs/tokio/0.3.6/tokio/index.html) executor.
+At a higher level, Zebra also uses [`tower::Service`s](https://docs.rs/tower/0.4.1/tower/trait.Service.html),
+[`tower::Buffer`s](https://docs.rs/tower/0.4.1/tower/buffer/struct.Buffer.html),
+and our own [`tower-batch`](https://github.com/ZcashFoundation/zebra/tree/main/tower-batch) implementation.
+
+Zebra programmers need to carefully write async code so it doesn't deadlock or hang.
+This is particularly important for `poll`, `select`, `Buffer`, and `Batch`.
+
+# Motivation
+[motivation]: #motivation
+
+Like all concurrent programming, Zebra's code needs to obey certain constraints
+to avoid deadlocks and hangs. Unfortunately, Rust's tooling in these areas is
+still developing, so we need to check these constraints during development,
+reviews, and testing.
+
+# Definitions
+[definitions]: #definitions
+
+<!--
+Lay out explicit definitions of any terms that are newly introduced or which cause confusion during the RFC design process.
+-->
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+<!--
+Explain the proposal as if it was already included in the project and you were teaching it to another Zebra programmer. That generally means:
+
+- Introducing new named concepts.
+- Explaining the feature largely in terms of examples.
+- Explaining how Zebra programmers should *think* about the feature, and how it should impact the way they use Zebra. It should explain the impact as concretely as possible.
+- If applicable, provide sample error messages, deprecation warnings, migration guidance, or test strategies.
+- If applicable, describe the differences between teaching this to existing Zebra programmers and new Zebra programmers.
+
+For implementation-oriented RFCs (e.g. for Zebra internals), this section should focus on how Zebra contributors should think about the change, and give examples of its concrete impact. For policy RFCs, this section should provide an example-driven introduction to the policy, and explain its impact in concrete terms.
+-->
+
+TODO: complete the examples in this section
+
+## `Poll::Pending` Deadlock Avoidance
+[poll-pending-deadlock-avoidance]: #poll-pending-deadlock-avoidance
+
+Here's a deadlock avoidance example from #1735:
+```rust
+        // We acquire checkpoint readiness before block readiness, to avoid an unlikely
+        // hang during the checkpoint to block verifier transition. If the checkpoint and
+        // block verifiers are contending for the same buffer/batch, we want the checkpoint
+        // verifier to win, so that checkpoint verification completes, and block verification
+        // can start. (Buffers and batches have multiple slots, so this contention is unlikely.)
+        use futures::ready;
+        // The chain verifier holds one slot in each verifier, for each concurrent task.
+        // Therefore, any shared buffers or batches polled by these verifiers should double
+        // their bounds. (For example, the state service buffer.)
+        ready!(self
+            .checkpoint
+            .poll_ready(cx)
+            .map_err(VerifyChainError::Checkpoint))?;
+        ready!(self.block.poll_ready(cx).map_err(VerifyChainError::Block))?;
+        Poll::Ready(Ok(()))
+```
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Acquiring Buffer Slots or Mutexes
+[acquiring-buffer-slots-or-mutexes]: #acquiring-buffer-slots-or-mutexes
+
+Ideally, buffer slots or mutexes should be acquired one at a time, and held for as short a time
+as possible.
+
+If that's not possible, acquire slots and mutexs in the same order across cooperating tasks.
+If tasks acquire the same slots and mutexes in different orders, they can deadlock, each holding
+a lock that the other needs.
+
+Note: do not call `poll_ready` on multiple tasks, then match against the results. Use the `ready!`
+macro instead.
+
+If a task is waiting for service readiness, and that service depends on other futures to become ready,
+make sure those futures are scheduled in separate tasks using `tokio::spawn`.
+
+## `Poll::Pending` and Wakeups
+[poll-pending-and-wakeups]: #poll-pending-and-wakeups
+
+When returning `Poll::Pending`, `poll` functions must ensure that the task will be woken up when it is ready to make progress.
+
+In most cases, the `poll` function calls another `poll` function that schedules the task for wakeup.
+
+Any code that generates a new `Poll::Pending` should either have:
+* a `CORRECTNESS` comment explaining how the task is scheduled for wakeup, or
+* a wakeup implementation, with tests to ensure that the wakeup functions as expected.
+
+Note: `poll` functions often have a qualifier, like `poll_ready` or `poll_next`.
+
+## Buffer and Batch
+[buffer-and-batch]: #buffer-and-batch
+
+The constraints imposed by the `tower::Buffer` and `tower::Batch` implementations are:
+1. `poll_ready` must be called **at least once** for each `call` 
+2. Once we've reserved a buffer slot, we always get `Poll::Ready` from a buffer, regardless of the
+   current readiness of the buffer or its underlying service
+3. The `Buffer`/`Batch` capacity limits the number of concurrently waiting tasks. Once this limit
+   is reached, further tasks will block, awaiting a free reservation.
+4. Some tasks can depend on other tasks before they resolve. (For example: block validation.)
+   If there are task dependencies, **the `Buffer`/`Batch` capacity must be larger than the
+   maximum number of concurrently waiting tasks**, or Zebra could deadlock (hang).
+
+These constraints are mitigated by:
+- the timeouts on network messages, block downloads and block verification, which restart verification if it hangs
+- the `Buffer`/`Batch` reservation release when response future is returned by the buffer/batch, even if the future doesn't complete
+  - in general, we should move as much work into futures as possible, unless the design requires sequential `call`s
+- larger `Buffer`/`Batch` bounds
+
+### Buffered Services
+[buffered-services]: #buffered-services
+
+A service should be provided wrapped in a `Buffer` if:
+* it is a complex service
+* it has multiple callers, or
+* it has a single caller than calls it multiple times concurrently.
+
+Services might also have other reasons for using a `Buffer`. These reasons should be documented.
+
+#### Choosing Buffer Bounds
+
+Zebra's `Buffer` bounds should be set to the maximum number of concurrent requests, plus 1:
+> it's advisable to set bound to be at least the maximum number of concurrent requests the `Buffer` will see
+https://docs.rs/tower/0.4.3/tower/buffer/struct.Buffer.html#method.new
+
+The extra slot protects us from future changes that add an extra caller, or extra concurrency.
+
+As a general rule, Zebra `Buffer`s should all have at least 3 slots (2 + 1), because most Zebra services can
+be called concurrently by:
+* the sync service, and
+* the inbound service.
+
+Services might also have other reasons for a larger bound. These reasons should be documented.
+
+We should limit `Buffer` lengths for services whose requests or responses contain `Block`s (or other large
+data items, such as `Transaction` vectors). A long `Buffer` full of `Block`s can significantly increase memory
+usage.
+
+Long `Buffer`s can also increase request latency. Latency isn't a concern for Zebra's core use case as a node
+software, but it might be an issue if wallets, exchanges, or block explorers want to use Zebra.
+
+## Awaiting Multiple Futures
+[awaiting-multiple-futures]: #awaiting-multiple-futures
+
+### Unbiased Selection
+[unbiased-selection]: #unbiased-selection
+
+The [`futures::select!`](https://docs.rs/futures/0.3.13/futures/macro.select.html) and
+[`tokio::select!`](https://docs.rs/tokio/0.3.6/tokio/macro.select.html) macros select
+ready arguments at random. 
+
+Also consdier the `FuturesUnordered` stream for unbiased selection of a large number
+of futures. However, this macro and stream require mapping all arguments to the same
+type.
+
+Consider mapping the returned type to a custom enum with module-specific names.
+
+### Biased Selection
+[biased-selection]: #biased-selection
+
+The [`futures::select`](https://docs.rs/futures/0.3.13/futures/future/fn.select.html)
+is biased towards its first argument. If the first argument is always ready, the second
+argument will never be returned. (This behavior is not documented or guaranteed.) This
+bias can cause starvation or hangs. Consider edge cases where queues are full, or there
+are a lot of messages. If in doubt:
+- put cancel oneshots first, then timers, then other futures
+- use the `select!` macro to ensure fairness
+
+Select's bias can be useful to ensure that cancel oneshots and timers are always
+executed first. Consider the `select_biased!` macro and `FuturesOrdered` stream
+for guaranteed ordered selection of futures. (However, this macro and stream require
+mapping all arguments to the same type.)
+
+The `futures::select` `Either` return type is complex, particularly when nested. This
+makes code hard to read and maintain. Map the `Either` to a custom enum.
+
+## Test Plan
+[test-plan]: #test-plan
+
+Zebra's existing acceptance and integration tests will catch most hangs and deadlocks.
+
+Some tests are only run after merging to `main`. If a recently merged PR fails on `main`,
+we revert the PR, and fix the failure.
+
+Some concurrency bugs only happen intermittently. Zebra developers should run regular
+full syncs to ensure that their code doesn't cause intermittent hangs. This is
+particularly important for code that modifies Zebra's highly concurrent crates:
+* `zebrad`
+* `zebra-network`
+* `zebra-state`
+* `zebra-consensus`
+* `tower-batch`
+* `tower-fallback`
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+Implementing and reviewing these constraints creates extra work for developers.
+But concurrency bugs slow down every developer, and impact users. And diagnosing
+those bugs can take a lot of developer effort.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+<!--
+- What makes this design a good design?
+- Is this design a good basis for later designs or implementations?
+- What other designs have been considered and what is the rationale for not choosing them?
+- What is the impact of not doing this?
+-->
+
+# Prior art
+[prior-art]: #prior-art
+
+<!--
+Discuss prior art, both the good and the bad, in relation to this proposal.
+A few examples of what this can include are:
+
+- For community proposals: Is this done by some other community and what were their experiences with it?
+- For other teams: What lessons can we learn from what other communities have done here?
+- Papers: Are there any published papers or great posts that discuss this? If you have some relevant papers to refer to, this can serve as a more detailed theoretical background.
+
+This section is intended to encourage you as an author to think about the lessons from other projects, to provide readers of your RFC with a fuller picture.
+If there is no prior art, that is fine - your ideas are interesting to us whether they are brand new or if they are an adaptation from other projects.
+
+Note that while precedent set by other projects is some motivation, it does not on its own motivate an RFC.
+Please also take into consideration that Zebra sometimes intentionally diverges from common Zcash features and designs.
+-->
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+Can we catch these bugs using automated tests?
+
+How can we diagnose these kinds of issues faster and more reliably?
+
+# Future possibilities
+[future-possibilities]: #future-possibilities
+
+<!--
+Think about what the natural extension and evolution of your proposal would
+be and how it would affect Zebra and Zcash as a whole. Try to use this
+section as a tool to more fully consider all possible
+interactions with the project and cryptocurrency ecosystem in your proposal.
+Also consider how the this all fits into the roadmap for the project
+and of the relevant sub-team.
+
+This is also a good place to "dump ideas", if they are out of scope for the
+RFC you are writing but otherwise related.
+
+If you have tried and cannot think of any future possibilities,
+you may simply state that you cannot think of anything.
+
+Note that having something written down in the future-possibilities section
+is not a reason to accept the current or a future RFC; such notes should be
+in the section on motivation or rationale in this or subsequent RFCs.
+The section merely provides additional information.
+-->

From dc6bb1a37a8a94f1b2d1e900eb59743db3158c5f Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Tue, 6 Apr 2021 08:43:03 +1000
Subject: [PATCH 02/34] Add ticket hyperlinks

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 1a94fc2f59d..81cdada3fc4 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -1,6 +1,6 @@
 - Feature Name: async_rust_constraints
 - Start Date: 2021-03-30
-- Design PR: [ZcashFoundation/zebra#0000](https://github.com/ZcashFoundation/zebra/pull/0000)
+- Design PR: [ZcashFoundation/zebra#1965](https://github.com/ZcashFoundation/zebra/pull/1965)
 - Zebra Issue: [ZcashFoundation/zebra#1593](https://github.com/ZcashFoundation/zebra/issues/1593)
 
 # Summary
@@ -50,7 +50,7 @@ TODO: complete the examples in this section
 ## `Poll::Pending` Deadlock Avoidance
 [poll-pending-deadlock-avoidance]: #poll-pending-deadlock-avoidance
 
-Here's a deadlock avoidance example from #1735:
+Here's a deadlock avoidance example from [#1735](https://github.com/ZcashFoundation/zebra/pull/1735):
 ```rust
         // We acquire checkpoint readiness before block readiness, to avoid an unlikely
         // hang during the checkpoint to block verifier transition. If the checkpoint and

From 88ac8387d0f0d2f255f8e95e82b512ecbac88e99 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Wed, 7 Apr 2021 18:19:26 +1000
Subject: [PATCH 03/34] Add a table of contents

Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 81cdada3fc4..b009ce34642 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -72,6 +72,18 @@ Here's a deadlock avoidance example from [#1735](https://github.com/ZcashFoundat
 # Reference-level explanation
 [reference-level-explanation]: #reference-level-explanation
 
+When dealing with async code in Zebra make sure the code implements what is suggested for each case:
+
+- [Acquiring Buffer Slots or Mutexes](#acquiring-buffer-slots-or-mutexes)
+- [`Poll::Pending` and Wakeups](#poll-pending-and-wakeups)
+- [Buffer and Batch](#buffer-and-batch)
+  - [Buffered Services](#buffered-services)
+    - [Choosing Buffer Bounds](#choosing-buffer-bounds)
+- [Awaiting Multiple Futures](#awaiting-multiple-futures)
+  - [Unbiased Selection](#unbiased-selection)
+  - [Biased Selection](#biased-selection)
+- [Testing Async Code](#testing-async-code)
+
 ## Acquiring Buffer Slots or Mutexes
 [acquiring-buffer-slots-or-mutexes]: #acquiring-buffer-slots-or-mutexes
 

From 6ba7058c8e4d55e8b360e1d933d86d2b3bdfaf39 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Wed, 7 Apr 2021 18:19:49 +1000
Subject: [PATCH 04/34] Reword the testing section

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index b009ce34642..6b2954a8df6 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -199,8 +199,8 @@ mapping all arguments to the same type.)
 The `futures::select` `Either` return type is complex, particularly when nested. This
 makes code hard to read and maintain. Map the `Either` to a custom enum.
 
-## Test Plan
-[test-plan]: #test-plan
+## Testing Async Code
+[testing-async-code]: #testing-async-code
 
 Zebra's existing acceptance and integration tests will catch most hangs and deadlocks.
 

From 40dd4a891054eb4ca703dd0a55cb01a692f0bf06 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Wed, 7 Apr 2021 18:20:10 +1000
Subject: [PATCH 05/34] Add a toc anchor

Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 6b2954a8df6..f91e505c587 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -143,6 +143,7 @@ A service should be provided wrapped in a `Buffer` if:
 Services might also have other reasons for using a `Buffer`. These reasons should be documented.
 
 #### Choosing Buffer Bounds
+[choosing-buffer-bounds]: #choosing-buffer-bounds
 
 Zebra's `Buffer` bounds should be set to the maximum number of concurrent requests, plus 1:
 > it's advisable to set bound to be at least the maximum number of concurrent requests the `Buffer` will see

From 185df6c1eb06f0b3c20e0833f266b3679dd40d19 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Wed, 7 Apr 2021 18:21:30 +1000
Subject: [PATCH 06/34] Add some words that need definitions

Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index f91e505c587..b12b60145e5 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -26,9 +26,11 @@ reviews, and testing.
 # Definitions
 [definitions]: #definitions
 
-<!--
-Lay out explicit definitions of any terms that are newly introduced or which cause confusion during the RFC design process.
--->
+- `constraint`
+- `CORRECTNESS comment`
+- `deadlock`
+- `starvation` or `livelock`
+- `hang`
 
 # Guide-level explanation
 [guide-level-explanation]: #guide-level-explanation

From 65a92eb60ddb4079ec75bf781012c0021785a238 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Wed, 7 Apr 2021 19:31:09 +1000
Subject: [PATCH 07/34] Revise introductory sections

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index b12b60145e5..0d4a35b37f4 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -6,22 +6,24 @@
 # Summary
 [summary]: #summary
 
+Zebra programmers need to carefully write async code so it doesn't deadlock or hang.
+This is particularly important for `poll`, `select`, `Buffer`, `Batch`, and `Mutex`.
+
 Zebra executes concurrent tasks using [async Rust](https://rust-lang.github.io/async-book/),
 with the [tokio](https://docs.rs/tokio/0.3.6/tokio/index.html) executor.
+
 At a higher level, Zebra also uses [`tower::Service`s](https://docs.rs/tower/0.4.1/tower/trait.Service.html),
 [`tower::Buffer`s](https://docs.rs/tower/0.4.1/tower/buffer/struct.Buffer.html),
-and our own [`tower-batch`](https://github.com/ZcashFoundation/zebra/tree/main/tower-batch) implementation.
-
-Zebra programmers need to carefully write async code so it doesn't deadlock or hang.
-This is particularly important for `poll`, `select`, `Buffer`, and `Batch`.
+and our own [`tower-batch`](https://github.com/ZcashFoundation/zebra/tree/main/tower-batch)
+implementation.
 
 # Motivation
 [motivation]: #motivation
 
-Like all concurrent programming, Zebra's code needs to obey certain constraints
-to avoid deadlocks and hangs. Unfortunately, Rust's tooling in these areas is
-still developing, so we need to check these constraints during development,
-reviews, and testing.
+Like all concurrent codebases, Zebra needs to obey certain constraints to avoid
+hangs. Unfortunately, Rust's tooling in these areas is still developing. So
+Zebra developers need to manually check these constraints during design,
+development, reviews, and testing.
 
 # Definitions
 [definitions]: #definitions

From e0c792ad6d465e4bea0b01f219ed5e841e98b63e Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Wed, 7 Apr 2021 19:31:27 +1000
Subject: [PATCH 08/34] Add definitions

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 0d4a35b37f4..18d0ed4e2bd 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -28,11 +28,15 @@ development, reviews, and testing.
 # Definitions
 [definitions]: #definitions
 
-- `constraint`
-- `CORRECTNESS comment`
-- `deadlock`
-- `starvation` or `livelock`
-- `hang`
+- `hang`: a Zebra component stops making progress.
+- `constraint`: a rule that Zebra must follow to prevent `hang`s.
+- `CORRECTNESS comment`: the documentation for a `constraint` in Zebra's code.
+- `task`: an async task can execute code independently of other tasks, using
+        cooperative multitasking.
+- `deadlock`: a `hang` that stops an async task executing code.
+        For example: a task is never woken up.
+- `starvation` or `livelock`: a `hang` that executes code, but doesn't do
+        anything useful. For example: a loop never terminates.
 
 # Guide-level explanation
 [guide-level-explanation]: #guide-level-explanation

From f380b2c7d5625289a322681a7df5aa74a4ee22f4 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Wed, 7 Apr 2021 19:31:44 +1000
Subject: [PATCH 09/34] Write guide intro based on feedback

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 18d0ed4e2bd..450b759b0f3 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -41,14 +41,13 @@ development, reviews, and testing.
 # Guide-level explanation
 [guide-level-explanation]: #guide-level-explanation
 
-<!--
-Explain the proposal as if it was already included in the project and you were teaching it to another Zebra programmer. That generally means:
+If you are designing, developing, or testing concurrent Zebra code, follow the
+patterns in these examples to avoid hangs.
 
-- Introducing new named concepts.
-- Explaining the feature largely in terms of examples.
-- Explaining how Zebra programmers should *think* about the feature, and how it should impact the way they use Zebra. It should explain the impact as concretely as possible.
-- If applicable, provide sample error messages, deprecation warnings, migration guidance, or test strategies.
-- If applicable, describe the differences between teaching this to existing Zebra programmers and new Zebra programmers.
+If you are reviewing concurrent Zebra designs or code, make sure that:
+- it is clear how the design or code avoids hangs
+- the design or code follows the patterns in these examples (as much as possible)
+- the concurrency constraints and risks are documented
 
 For implementation-oriented RFCs (e.g. for Zebra internals), this section should focus on how Zebra contributors should think about the change, and give examples of its concrete impact. For policy RFCs, this section should provide an example-driven introduction to the policy, and explain its impact in concrete terms.
 -->

From 05778ccd4482a9c622c95bea55117f629abc75ce Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Wed, 7 Apr 2021 19:32:43 +1000
Subject: [PATCH 10/34] Add a code example for each reference section

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 183 +++++++++++++++++---
 1 file changed, 162 insertions(+), 21 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 450b759b0f3..49c0e4fbb62 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -49,31 +49,172 @@ If you are reviewing concurrent Zebra designs or code, make sure that:
 - the design or code follows the patterns in these examples (as much as possible)
 - the concurrency constraints and risks are documented
 
-For implementation-oriented RFCs (e.g. for Zebra internals), this section should focus on how Zebra contributors should think about the change, and give examples of its concrete impact. For policy RFCs, this section should provide an example-driven introduction to the policy, and explain its impact in concrete terms.
--->
+The [Reference](#reference-level-explanation) section contains in-depth
+background information about Rust async concurrency in Zebra.
+
+Here are some examples of concurrent designs and documentation in Zebra:
+
+## Registering Wakeups Before Returning Poll::Pending
+[wakeups-poll-pending]: #wakeups-poll-pending
+
+Here's a wakeup correctness example from
+[unready_service.rs](https://github.com/ZcashFoundation/zebra/blob/main/zebra-network/src/peer_set/unready_service.rs#L43):
+
+<!-- copied from commit de6d1c93f3e4f9f4fd849176bea6b39ffc5b260f on 2020-04-07 -->
+```rust
+// CORRECTNESS
+//
+// The current task must be scheduled for wakeup every time we return
+// `Poll::Pending`.
+//
+//`ready!` returns `Poll::Pending` when the service is unready, and
+// the inner `poll_ready` schedules this task for wakeup.
+//
+// `cancel.poll` also schedules this task for wakeup if it is canceled.
+let res = ready!(this
+    .service
+    .as_mut()
+    .expect("poll after ready")
+    .poll_ready(cx));
+```
+
+## Avoiding Deadlocks when Aquiring Buffer or Service Readiness
+[readiness-deadlock-avoidance]: #readiness-deadlock-avoidance
+
+Here's an example from [#1735](https://github.com/ZcashFoundation/zebra/pull/1735)
+which covers:
+- calling `poll_ready` before each `call`
+- deadlock avoidance when acquiring buffer slots
+- buffer bounds
+
+<!-- copied from the main branch on 2020-04-07 -->
+```rust
+// We acquire checkpoint readiness before block readiness, to avoid an unlikely
+// hang during the checkpoint to block verifier transition. If the checkpoint and
+// block verifiers are contending for the same buffer/batch, we want the checkpoint
+// verifier to win, so that checkpoint verification completes, and block verification
+// can start. (Buffers and batches have multiple slots, so this contention is unlikely.)
+use futures::ready;
+// The chain verifier holds one slot in each verifier, for each concurrent task.
+// Therefore, any shared buffers or batches polled by these verifiers should double
+// their bounds. (For example, the state service buffer.)
+ready!(self
+    .checkpoint
+    .poll_ready(cx)
+    .map_err(VerifyChainError::Checkpoint))?;
+ready!(self.block.poll_ready(cx).map_err(VerifyChainError::Block))?;
+Poll::Ready(Ok(()))
+```
+
+## Prioritising Cancellation Futures
+[prioritising-cancellation-futures]: #prioritising-cancellation-futures
+
+Here's an example from [#1950](https://github.com/ZcashFoundation/zebra/pull/1950)
+which shows biased future selection using the `select` function:
+
+<!-- copied from commit b51070ed323de7a980cd89043947a2b0828e8bf8 on 2020-04-07 -->
+```rust
+// CORRECTNESS
+//
+// Currently, select prefers the first future if multiple
+// futures are ready.
+//
+// If multiple futures are ready, we want the cancellation
+// to take priority, then the timeout, then peer responses.
+let cancel = future::select(tx.cancellation(), timer_ref);
+match future::select(cancel, peer_rx.next()) {
+    ...
+}
+```
+
+## Sharing Progress between Multiple Futures
+[progress-multiple-futures]: #progress-multiple-futures
+
+Here's an example from [#1950](https://github.com/ZcashFoundation/zebra/pull/1950)
+which shows:
+- biased future selection using the `select!` macro
+- deadlock avoidance with `Mutex` locks
+- spawning independent tasks to avoid hangs
+- using timeouts to avoid hangs
+
+<!-- edited from commit b51070ed323de7a980cd89043947a2b0828e8bf8 on 2020-04-07 -->
+```rust
+// CORRECTNESS
+//
+// To avoid hangs and starvation, the crawler must:
+// - spawn a separate task for each handshake, so they can make progress
+//   independently (and avoid deadlocking each other)
+// - use the `select!` macro for all actions, because the `select` function
+//   is biased towards the first ready future
+
+loop {
+    let crawler_action = tokio::select! {
+        a = handshakes.next() => a,
+        a = crawl_timer.next() => a,
+        _ = demand_rx.next() => {
+            if let Some(candidate) = candidates.next().await {
+                // candidates.next has a short delay, and briefly holds the address
+                // book lock, so it shouldn't hang
+                DemandHandshake { candidate }
+            } else {
+                DemandCrawl
+            }
+        }
+    };
+
+    match crawler_action {
+        DemandHandshake { candidate } => {
+            // spawn each handshake into an independent task, so it can make
+            // progress independently of the crawls
+            let hs_join =
+                tokio::spawn(dial(candidate, connector.clone()));
+            handshakes.push(Box::pin(hs_join));
+        }
+        DemandCrawl => {
+            // update has timeouts, and briefly holds the address book
+            // lock, so it shouldn't hang
+            candidates.update().await?;
+        }
+        // handle handshake responses and the crawl timer
+    }
+}
+```
 
-TODO: complete the examples in this section
+## Integration Testing Async Code
+[integration-testing]: #integration-testing
 
-## `Poll::Pending` Deadlock Avoidance
-[poll-pending-deadlock-avoidance]: #poll-pending-deadlock-avoidance
+Here's an example from [`zebrad/tests/acceptance.rs`](https://github.com/ZcashFoundation/zebra/blob/main/zebrad/tests/acceptance.rs#L699)
+which shows:
+- tests for the async Zebra block download and verification pipeline
+- cancellation tests
+- reload tests after a restart
 
-Here's a deadlock avoidance example from [#1735](https://github.com/ZcashFoundation/zebra/pull/1735):
+<!-- edited from commit 5bf0a2954e9df3fad53ad57f6b3a673d9df47b9a on 2020-04-07 -->
 ```rust
-        // We acquire checkpoint readiness before block readiness, to avoid an unlikely
-        // hang during the checkpoint to block verifier transition. If the checkpoint and
-        // block verifiers are contending for the same buffer/batch, we want the checkpoint
-        // verifier to win, so that checkpoint verification completes, and block verification
-        // can start. (Buffers and batches have multiple slots, so this contention is unlikely.)
-        use futures::ready;
-        // The chain verifier holds one slot in each verifier, for each concurrent task.
-        // Therefore, any shared buffers or batches polled by these verifiers should double
-        // their bounds. (For example, the state service buffer.)
-        ready!(self
-            .checkpoint
-            .poll_ready(cx)
-            .map_err(VerifyChainError::Checkpoint))?;
-        ready!(self.block.poll_ready(cx).map_err(VerifyChainError::Block))?;
-        Poll::Ready(Ok(()))
+/// Test if `zebrad` can sync some larger checkpoints on mainnet.
+#[test]
+fn sync_large_checkpoints_mainnet() -> Result<()> {
+    let reuse_tempdir = sync_until(
+        LARGE_CHECKPOINT_TEST_HEIGHT,
+        Mainnet,
+        STOP_AT_HEIGHT_REGEX,
+        LARGE_CHECKPOINT_TIMEOUT,
+        None,
+    )?;
+
+    // if stopping corrupts the rocksdb database, zebrad might hang or crash here
+    // if stopping does not write the rocksdb database to disk, Zebra will
+    // sync, rather than stopping immediately at the configured height
+    sync_until(
+        (LARGE_CHECKPOINT_TEST_HEIGHT - 1).unwrap(),
+        Mainnet,
+        "previous state height is greater than the stop height",
+        STOP_ON_LOAD_TIMEOUT,
+        Some(reuse_tempdir),
+    )?;
+
+    Ok(())
+}
 ```
 
 # Reference-level explanation

From c90389ce7ed8a3ef48afec03014b35629518ddcb Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Wed, 7 Apr 2021 19:33:29 +1000
Subject: [PATCH 11/34] Reorder sections and update TOC

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 33 +++++++++++----------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 49c0e4fbb62..6f4d87bf7e2 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -220,10 +220,9 @@ fn sync_large_checkpoints_mainnet() -> Result<()> {
 # Reference-level explanation
 [reference-level-explanation]: #reference-level-explanation
 
-When dealing with async code in Zebra make sure the code implements what is suggested for each case:
-
-- [Acquiring Buffer Slots or Mutexes](#acquiring-buffer-slots-or-mutexes)
+The reference section contains in-depth information about concurrency in Zebra:
 - [`Poll::Pending` and Wakeups](#poll-pending-and-wakeups)
+- [Acquiring Buffer Slots or Mutexes](#acquiring-buffer-slots-or-mutexes)
 - [Buffer and Batch](#buffer-and-batch)
   - [Buffered Services](#buffered-services)
     - [Choosing Buffer Bounds](#choosing-buffer-bounds)
@@ -232,6 +231,21 @@ When dealing with async code in Zebra make sure the code implements what is sugg
   - [Biased Selection](#biased-selection)
 - [Testing Async Code](#testing-async-code)
 
+Most Zebra designs or code changes will only touch on one or two of these areas.
+
+## `Poll::Pending` and Wakeups
+[poll-pending-and-wakeups]: #poll-pending-and-wakeups
+
+When returning `Poll::Pending`, `poll` functions must ensure that the task will be woken up when it is ready to make progress.
+
+In most cases, the `poll` function calls another `poll` function that schedules the task for wakeup.
+
+Any code that generates a new `Poll::Pending` should either have:
+* a `CORRECTNESS` comment explaining how the task is scheduled for wakeup, or
+* a wakeup implementation, with tests to ensure that the wakeup functions as expected.
+
+Note: `poll` functions often have a qualifier, like `poll_ready` or `poll_next`.
+
 ## Acquiring Buffer Slots or Mutexes
 [acquiring-buffer-slots-or-mutexes]: #acquiring-buffer-slots-or-mutexes
 
@@ -248,19 +262,6 @@ macro instead.
 If a task is waiting for service readiness, and that service depends on other futures to become ready,
 make sure those futures are scheduled in separate tasks using `tokio::spawn`.
 
-## `Poll::Pending` and Wakeups
-[poll-pending-and-wakeups]: #poll-pending-and-wakeups
-
-When returning `Poll::Pending`, `poll` functions must ensure that the task will be woken up when it is ready to make progress.
-
-In most cases, the `poll` function calls another `poll` function that schedules the task for wakeup.
-
-Any code that generates a new `Poll::Pending` should either have:
-* a `CORRECTNESS` comment explaining how the task is scheduled for wakeup, or
-* a wakeup implementation, with tests to ensure that the wakeup functions as expected.
-
-Note: `poll` functions often have a qualifier, like `poll_ready` or `poll_next`.
-
 ## Buffer and Batch
 [buffer-and-batch]: #buffer-and-batch
 

From 37affdfda7b0936f73898de22a6a72c45c83f6be Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Wed, 7 Apr 2021 19:33:50 +1000
Subject: [PATCH 12/34] Minor reference updates

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 6f4d87bf7e2..1801ad0b417 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -266,7 +266,7 @@ make sure those futures are scheduled in separate tasks using `tokio::spawn`.
 [buffer-and-batch]: #buffer-and-batch
 
 The constraints imposed by the `tower::Buffer` and `tower::Batch` implementations are:
-1. `poll_ready` must be called **at least once** for each `call` 
+1. `poll_ready` must be called **at least once** for each `call`
 2. Once we've reserved a buffer slot, we always get `Poll::Ready` from a buffer, regardless of the
    current readiness of the buffer or its underlying service
 3. The `Buffer`/`Batch` capacity limits the number of concurrently waiting tasks. Once this limit
@@ -317,12 +317,18 @@ software, but it might be an issue if wallets, exchanges, or block explorers wan
 ## Awaiting Multiple Futures
 [awaiting-multiple-futures]: #awaiting-multiple-futures
 
+When awaiting multiple futures, Zebra can use biased or unbiased selection.
+
+Typically, we prefer unbiased selection, so that if multiple futures are ready,
+they each have a chance of completing. But if one of the futures needs to take
+priority (for example, cancellation), you might want to use biased selection.
+
 ### Unbiased Selection
 [unbiased-selection]: #unbiased-selection
 
 The [`futures::select!`](https://docs.rs/futures/0.3.13/futures/macro.select.html) and
 [`tokio::select!`](https://docs.rs/tokio/0.3.6/tokio/macro.select.html) macros select
-ready arguments at random. 
+ready arguments at random.
 
 Also consdier the `FuturesUnordered` stream for unbiased selection of a large number
 of futures. However, this macro and stream require mapping all arguments to the same
@@ -338,7 +344,7 @@ is biased towards its first argument. If the first argument is always ready, the
 argument will never be returned. (This behavior is not documented or guaranteed.) This
 bias can cause starvation or hangs. Consider edge cases where queues are full, or there
 are a lot of messages. If in doubt:
-- put cancel oneshots first, then timers, then other futures
+- put shutdown or cancel oneshots first, then timers, then other futures
 - use the `select!` macro to ensure fairness
 
 Select's bias can be useful to ensure that cancel oneshots and timers are always

From 40525143aba5250485771a0d5523fe9f66a40573 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Wed, 7 Apr 2021 19:35:30 +1000
Subject: [PATCH 13/34] Delete irrelevant code from an example

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 1801ad0b417..810db7f9669 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -94,7 +94,6 @@ which covers:
 // block verifiers are contending for the same buffer/batch, we want the checkpoint
 // verifier to win, so that checkpoint verification completes, and block verification
 // can start. (Buffers and batches have multiple slots, so this contention is unlikely.)
-use futures::ready;
 // The chain verifier holds one slot in each verifier, for each concurrent task.
 // Therefore, any shared buffers or batches polled by these verifiers should double
 // their bounds. (For example, the state service buffer.)

From f2f5fc361818e32bbc87df3e3cb6c193d82fa578 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 08:45:47 +1000
Subject: [PATCH 14/34] Link to code examples using commit hashes

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 810db7f9669..27b58655462 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -58,7 +58,7 @@ Here are some examples of concurrent designs and documentation in Zebra:
 [wakeups-poll-pending]: #wakeups-poll-pending
 
 Here's a wakeup correctness example from
-[unready_service.rs](https://github.com/ZcashFoundation/zebra/blob/main/zebra-network/src/peer_set/unready_service.rs#L43):
+[unready_service.rs](https://github.com/ZcashFoundation/zebra/blob/de6d1c93f3e4f9f4fd849176bea6b39ffc5b260f/zebra-network/src/peer_set/unready_service.rs#L43):
 
 <!-- copied from commit de6d1c93f3e4f9f4fd849176bea6b39ffc5b260f on 2020-04-07 -->
 ```rust
@@ -182,7 +182,7 @@ loop {
 ## Integration Testing Async Code
 [integration-testing]: #integration-testing
 
-Here's an example from [`zebrad/tests/acceptance.rs`](https://github.com/ZcashFoundation/zebra/blob/main/zebrad/tests/acceptance.rs#L699)
+Here's an example from [`zebrad/tests/acceptance.rs`](https://github.com/ZcashFoundation/zebra/blob/5bf0a2954e9df3fad53ad57f6b3a673d9df47b9a/zebrad/tests/acceptance.rs#L699)
 which shows:
 - tests for the async Zebra block download and verification pipeline
 - cancellation tests

From 34c5163ea4ec786a29c4b1e87a4d8663c6d08fa0 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 08:59:05 +1000
Subject: [PATCH 15/34] Link to PR and commit for each code example

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 27b58655462..a36e3ca6dc8 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -58,7 +58,8 @@ Here are some examples of concurrent designs and documentation in Zebra:
 [wakeups-poll-pending]: #wakeups-poll-pending
 
 Here's a wakeup correctness example from
-[unready_service.rs](https://github.com/ZcashFoundation/zebra/blob/de6d1c93f3e4f9f4fd849176bea6b39ffc5b260f/zebra-network/src/peer_set/unready_service.rs#L43):
+[unready_service.rs](https://github.com/ZcashFoundation/zebra/blob/de6d1c93f3e4f9f4fd849176bea6b39ffc5b260f/zebra-network/src/peer_set/unready_service.rs#L43)
+in [#1954](https://github.com/ZcashFoundation/zebra/pull/1954):
 
 <!-- copied from commit de6d1c93f3e4f9f4fd849176bea6b39ffc5b260f on 2020-04-07 -->
 ```rust
@@ -81,13 +82,14 @@ let res = ready!(this
 ## Avoiding Deadlocks when Aquiring Buffer or Service Readiness
 [readiness-deadlock-avoidance]: #readiness-deadlock-avoidance
 
-Here's an example from [#1735](https://github.com/ZcashFoundation/zebra/pull/1735)
+Here's an example from [zebra-consensus/src/chain.rs](https://github.com/ZcashFoundation/zebra/blob/3af57ece7ae5d43cfbcb6a9215433705aad70b80/zebra-consensus/src/chain.rs#L73)
+in [#1735](https://github.com/ZcashFoundation/zebra/pull/1735)
 which covers:
 - calling `poll_ready` before each `call`
 - deadlock avoidance when acquiring buffer slots
 - buffer bounds
 
-<!-- copied from the main branch on 2020-04-07 -->
+<!-- copied from commit 306fa882148382299c8c31768d5360c0fa23c4d0 on 2020-04-07 -->
 ```rust
 // We acquire checkpoint readiness before block readiness, to avoid an unlikely
 // hang during the checkpoint to block verifier transition. If the checkpoint and
@@ -108,10 +110,11 @@ Poll::Ready(Ok(()))
 ## Prioritising Cancellation Futures
 [prioritising-cancellation-futures]: #prioritising-cancellation-futures
 
-Here's an example from [#1950](https://github.com/ZcashFoundation/zebra/pull/1950)
+Here's an example from [connection.rs](https://github.com/ZcashFoundation/zebra/blob/375c8d8700764534871f02d2d44f847526179dab/zebra-network/src/peer/connection.rs#L423)
+in [#1950](https://github.com/ZcashFoundation/zebra/pull/1950)
 which shows biased future selection using the `select` function:
 
-<!-- copied from commit b51070ed323de7a980cd89043947a2b0828e8bf8 on 2020-04-07 -->
+<!-- copied from commit 375c8d8700764534871f02d2d44f847526179dab on 2020-04-08 -->
 ```rust
 // CORRECTNESS
 //
@@ -129,14 +132,15 @@ match future::select(cancel, peer_rx.next()) {
 ## Sharing Progress between Multiple Futures
 [progress-multiple-futures]: #progress-multiple-futures
 
-Here's an example from [#1950](https://github.com/ZcashFoundation/zebra/pull/1950)
+Here's an example from [peer_set/initialize.rs](https://github.com/ZcashFoundation/zebra/blob/375c8d8700764534871f02d2d44f847526179dab/zebra-network/src/peer_set/initialize.rs#L326)
+in [#1950](https://github.com/ZcashFoundation/zebra/pull/1950)
 which shows:
 - biased future selection using the `select!` macro
 - deadlock avoidance with `Mutex` locks
 - spawning independent tasks to avoid hangs
 - using timeouts to avoid hangs
 
-<!-- edited from commit b51070ed323de7a980cd89043947a2b0828e8bf8 on 2020-04-07 -->
+<!-- edited from commit 375c8d8700764534871f02d2d44f847526179dab on 2020-04-08 -->
 ```rust
 // CORRECTNESS
 //
@@ -183,6 +187,7 @@ loop {
 [integration-testing]: #integration-testing
 
 Here's an example from [`zebrad/tests/acceptance.rs`](https://github.com/ZcashFoundation/zebra/blob/5bf0a2954e9df3fad53ad57f6b3a673d9df47b9a/zebrad/tests/acceptance.rs#L699)
+in [#1193](https://github.com/ZcashFoundation/zebra/pull/1193)
 which shows:
 - tests for the async Zebra block download and verification pipeline
 - cancellation tests

From b85a837b8ff2bcf755edcdba714ada5455221851 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 09:05:40 +1000
Subject: [PATCH 16/34] Fix typos

Co-authored-by: Deirdre Connolly <deirdre@zfnd.org>
---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index a36e3ca6dc8..2ff358cce1a 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -291,7 +291,7 @@ These constraints are mitigated by:
 A service should be provided wrapped in a `Buffer` if:
 * it is a complex service
 * it has multiple callers, or
-* it has a single caller than calls it multiple times concurrently.
+* it has a single caller that calls it multiple times concurrently.
 
 Services might also have other reasons for using a `Buffer`. These reasons should be documented.
 
@@ -334,7 +334,7 @@ The [`futures::select!`](https://docs.rs/futures/0.3.13/futures/macro.select.htm
 [`tokio::select!`](https://docs.rs/tokio/0.3.6/tokio/macro.select.html) macros select
 ready arguments at random.
 
-Also consdier the `FuturesUnordered` stream for unbiased selection of a large number
+Also consider the `FuturesUnordered` stream for unbiased selection of a large number
 of futures. However, this macro and stream require mapping all arguments to the same
 type.
 

From 63cef5708b074728ad41fa3c33c7201d0b895a12 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 09:07:34 +1000
Subject: [PATCH 17/34] Delete empty sections

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 51 ---------------------
 1 file changed, 51 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 2ff358cce1a..659a318d34b 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -384,60 +384,9 @@ Implementing and reviewing these constraints creates extra work for developers.
 But concurrency bugs slow down every developer, and impact users. And diagnosing
 those bugs can take a lot of developer effort.
 
-# Rationale and alternatives
-[rationale-and-alternatives]: #rationale-and-alternatives
-
-<!--
-- What makes this design a good design?
-- Is this design a good basis for later designs or implementations?
-- What other designs have been considered and what is the rationale for not choosing them?
-- What is the impact of not doing this?
--->
-
-# Prior art
-[prior-art]: #prior-art
-
-<!--
-Discuss prior art, both the good and the bad, in relation to this proposal.
-A few examples of what this can include are:
-
-- For community proposals: Is this done by some other community and what were their experiences with it?
-- For other teams: What lessons can we learn from what other communities have done here?
-- Papers: Are there any published papers or great posts that discuss this? If you have some relevant papers to refer to, this can serve as a more detailed theoretical background.
-
-This section is intended to encourage you as an author to think about the lessons from other projects, to provide readers of your RFC with a fuller picture.
-If there is no prior art, that is fine - your ideas are interesting to us whether they are brand new or if they are an adaptation from other projects.
-
-Note that while precedent set by other projects is some motivation, it does not on its own motivate an RFC.
-Please also take into consideration that Zebra sometimes intentionally diverges from common Zcash features and designs.
--->
-
 # Unresolved questions
 [unresolved-questions]: #unresolved-questions
 
 Can we catch these bugs using automated tests?
 
 How can we diagnose these kinds of issues faster and more reliably?
-
-# Future possibilities
-[future-possibilities]: #future-possibilities
-
-<!--
-Think about what the natural extension and evolution of your proposal would
-be and how it would affect Zebra and Zcash as a whole. Try to use this
-section as a tool to more fully consider all possible
-interactions with the project and cryptocurrency ecosystem in your proposal.
-Also consider how the this all fits into the roadmap for the project
-and of the relevant sub-team.
-
-This is also a good place to "dump ideas", if they are out of scope for the
-RFC you are writing but otherwise related.
-
-If you have tried and cannot think of any future possibilities,
-you may simply state that you cannot think of anything.
-
-Note that having something written down in the future-possibilities section
-is not a reason to accept the current or a future RFC; such notes should be
-in the section on motivation or rationale in this or subsequent RFCs.
-The section merely provides additional information.
--->

From 061e9de5a46f53c03f03604ac49a7ea8a04f0d29 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 09:15:54 +1000
Subject: [PATCH 18/34] Add tracing and metrics in a monitoring section

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 659a318d34b..b3e7d0057ee 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -377,6 +377,16 @@ particularly important for code that modifies Zebra's highly concurrent crates:
 * `tower-batch`
 * `tower-fallback`
 
+## Monitoring Async Code
+[monitoring-async-code]: #monitoring-async-code
+
+Zebra uses the following crates for monitoring and diagnostics:
+* [tracing](https://tracing.rs/tracing/): tracing events, spans, logging
+* [tracing-futures](https://docs.rs/tracing-futures/): future and async function instrumentation
+* [metrics](https://docs.rs/metrics-core/) with a [prometheus exporter](https://docs.rs/metrics-exporter-prometheus/)
+
+These introspection tools are also useful during testing.
+
 # Drawbacks
 [drawbacks]: #drawbacks
 

From 654801e25a867dffc6272a1f2fd370a678bc64a3 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 09:16:44 +1000
Subject: [PATCH 19/34] Update TOC

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index b3e7d0057ee..d3a19a51056 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -234,6 +234,7 @@ The reference section contains in-depth information about concurrency in Zebra:
   - [Unbiased Selection](#unbiased-selection)
   - [Biased Selection](#biased-selection)
 - [Testing Async Code](#testing-async-code)
+- [Monitoring Async Code](#monitoring-async-code)
 
 Most Zebra designs or code changes will only touch on one or two of these areas.
 

From 84be5b6f0daad28cc214175fe64f2b1223333952 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 09:17:44 +1000
Subject: [PATCH 20/34] Remove redundant version in docs.rs link

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index d3a19a51056..b5c664ab04d 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -10,7 +10,7 @@ Zebra programmers need to carefully write async code so it doesn't deadlock or h
 This is particularly important for `poll`, `select`, `Buffer`, `Batch`, and `Mutex`.
 
 Zebra executes concurrent tasks using [async Rust](https://rust-lang.github.io/async-book/),
-with the [tokio](https://docs.rs/tokio/0.3.6/tokio/index.html) executor.
+with the [tokio](https://docs.rs/tokio/) executor.
 
 At a higher level, Zebra also uses [`tower::Service`s](https://docs.rs/tower/0.4.1/tower/trait.Service.html),
 [`tower::Buffer`s](https://docs.rs/tower/0.4.1/tower/buffer/struct.Buffer.html),

From bf975680bdc8f923d6f036d83ab29db2d8f31ec3 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 10:06:49 +1000
Subject: [PATCH 21/34] Add zebra-client buffer bound changes

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index b5c664ab04d..7ccdf8a7dfe 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -305,10 +305,11 @@ https://docs.rs/tower/0.4.3/tower/buffer/struct.Buffer.html#method.new
 
 The extra slot protects us from future changes that add an extra caller, or extra concurrency.
 
-As a general rule, Zebra `Buffer`s should all have at least 3 slots (2 + 1), because most Zebra services can
+As a general rule, Zebra `Buffer`s should all have at least 5 slots, because most Zebra services can
 be called concurrently by:
-* the sync service, and
-* the inbound service.
+* the sync service,
+* the inbound service, and
+* multiple concurrent `zebra-client` blockchain scanning tasks.
 
 Services might also have other reasons for a larger bound. These reasons should be documented.
 

From 480c420d203c4804c194d37c1eb9792d7a2ca527 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 10:07:10 +1000
Subject: [PATCH 22/34] Add memory usage calculations in buffer

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 7ccdf8a7dfe..c5d95bb91cb 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -317,6 +317,9 @@ We should limit `Buffer` lengths for services whose requests or responses contai
 data items, such as `Transaction` vectors). A long `Buffer` full of `Block`s can significantly increase memory
 usage.
 
+For example, parsing a malicious 2 MB block can take up to 12 MB of RAM. So a 5 slot buffer can use 60 MB
+of RAM.
+
 Long `Buffer`s can also increase request latency. Latency isn't a concern for Zebra's core use case as a node
 software, but it might be an issue if wallets, exchanges, or block explorers want to use Zebra.
 

From 71a994aa16399e9ab22a4d7e55a6dbae5fd45947 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 10:13:25 +1000
Subject: [PATCH 23/34] Clarify buffer/batch wording

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index c5d95bb91cb..2dcc7c1a1d9 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -280,9 +280,9 @@ The constraints imposed by the `tower::Buffer` and `tower::Batch` implementation
    If there are task dependencies, **the `Buffer`/`Batch` capacity must be larger than the
    maximum number of concurrently waiting tasks**, or Zebra could deadlock (hang).
 
-These constraints are mitigated by:
-- the timeouts on network messages, block downloads and block verification, which restart verification if it hangs
-- the `Buffer`/`Batch` reservation release when response future is returned by the buffer/batch, even if the future doesn't complete
+We also avoid hangs because:
+- the timeouts on network messages, block downloads, and block verification will restart verification if it hangs
+- `Buffer` and `Batch` release their reservations when response future is returned by the buffered/batched service, even if the returned future hangs
   - in general, we should move as much work into futures as possible, unless the design requires sequential `call`s
 - larger `Buffer`/`Batch` bounds
 

From df9ae30099bcf62e9a96af5d2d3bb506c0729e50 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 12:31:36 +1000
Subject: [PATCH 24/34] Add tracing and metrics examples

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 50 +++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 2dcc7c1a1d9..a0ed9670d9a 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -221,6 +221,56 @@ fn sync_large_checkpoints_mainnet() -> Result<()> {
 }
 ```
 
+## Instrumenting Async Functions
+[instrumenting-async-functions]: #instrumenting-async-functions
+
+Here's an example of instrumenting an async function using `tracing`
+from [`sync/downloads.rs`](https://github.com/ZcashFoundation/zebra/blob/306fa882148382299c8c31768d5360c0fa23c4d0/zebrad/src/components/sync/downloads.rs#L128):
+<!-- there is no original PR for this code, it has been changed a lot -->
+
+<!-- copied from commit 306fa882148382299c8c31768d5360c0fa23c4d0 on 2020-04-08 -->
+```rust
+/// Queue a block for download and verification.
+///
+/// This method waits for the network to become ready, and returns an error
+/// only if the network service fails. It returns immediately after queuing
+/// the request.
+#[instrument(level = "debug", skip(self), fields(%hash))]
+pub async fn download_and_verify(&mut self, hash: block::Hash) -> Result<(), Report> {
+    ...
+}
+```
+
+## Tracing and Metrics in Async Functions
+[tracing-metrics-async-functions]: #tracing-metrics-async-functions
+
+Here's an example from [`connection.rs`](https://github.com/ZcashFoundation/zebra/blob/375c8d8700764534871f02d2d44f847526179dab/zebra-network/src/peer/connection.rs#L585)
+<!-- there is no original PR for this code, it has been changed a lot -->
+which shows:
+- trace and debug logs using the `tracing` crate
+- spans using the `tracing` crate
+- counters using the `metrics` crate
+
+<!-- copied from commit 375c8d8700764534871f02d2d44f847526179dab on 2020-04-08 -->
+```rust
+/// Handle an incoming client request, possibly generating outgoing messages to the
+/// remote peer.
+///
+/// NOTE: the caller should use .instrument(msg.span) to instrument the function.
+async fn handle_client_request(&mut self, req: InProgressClientRequest) {
+    trace!(?req.request);
+
+    let InProgressClientRequest { request, tx, span } = req;
+
+    if tx.is_canceled() {
+        metrics::counter!("peer.canceled", 1);
+        tracing::debug!("ignoring canceled request");
+        return;
+    }
+    ...
+}
+```
+
 # Reference-level explanation
 [reference-level-explanation]: #reference-level-explanation
 

From 22f0d74296e6d60850537f47fab052be69350dee Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 14:10:44 +1000
Subject: [PATCH 25/34] Link the guide to the reference

And expand the guide descriptions
---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 150 +++++++++++++-------
 1 file changed, 98 insertions(+), 52 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index a0ed9670d9a..0448db62a70 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -33,8 +33,13 @@ development, reviews, and testing.
 - `CORRECTNESS comment`: the documentation for a `constraint` in Zebra's code.
 - `task`: an async task can execute code independently of other tasks, using
         cooperative multitasking.
-- `deadlock`: a `hang` that stops an async task executing code.
-        For example: a task is never woken up.
+- `contention`: slower execution because multiple tasks are waiting to
+        acquire a lock, buffer/batch slot, or readiness.
+- `missed wakeup`: a task `hang`s because it is never scheduled for wakeup. 
+- `deadlock`: a `hang` that stops an async task executing code, because it
+        is waiting for a lock, slot, or task readiness.
+        For example: a task is waiting for a service to be ready, but the
+        service readiness depends on that task making progress.
 - `starvation` or `livelock`: a `hang` that executes code, but doesn't do
         anything useful. For example: a loop never terminates.
 
@@ -57,9 +62,15 @@ Here are some examples of concurrent designs and documentation in Zebra:
 ## Registering Wakeups Before Returning Poll::Pending
 [wakeups-poll-pending]: #wakeups-poll-pending
 
-Here's a wakeup correctness example from
-[unready_service.rs](https://github.com/ZcashFoundation/zebra/blob/de6d1c93f3e4f9f4fd849176bea6b39ffc5b260f/zebra-network/src/peer_set/unready_service.rs#L43)
-in [#1954](https://github.com/ZcashFoundation/zebra/pull/1954):
+To avoid missed wakeups, futures must schedule a wakeup before they return
+`Poll::Pending`. For more details, see the [`Poll::Pending` and Wakeups](#poll-pending-and-wakeups)
+section.
+
+Zebra's [unready_service.rs](https://github.com/ZcashFoundation/zebra/blob/de6d1c93f3e4f9f4fd849176bea6b39ffc5b260f/zebra-network/src/peer_set/unready_service.rs#L43) uses the `ready!` macro to correctly handle
+`Poll::Pending` from the inner service.
+
+You can see some similar constraints in
+[pull request #1954](https://github.com/ZcashFoundation/zebra/pull/1954).
 
 <!-- copied from commit de6d1c93f3e4f9f4fd849176bea6b39ffc5b260f on 2020-04-07 -->
 ```rust
@@ -82,20 +93,27 @@ let res = ready!(this
 ## Avoiding Deadlocks when Aquiring Buffer or Service Readiness
 [readiness-deadlock-avoidance]: #readiness-deadlock-avoidance
 
-Here's an example from [zebra-consensus/src/chain.rs](https://github.com/ZcashFoundation/zebra/blob/3af57ece7ae5d43cfbcb6a9215433705aad70b80/zebra-consensus/src/chain.rs#L73)
-in [#1735](https://github.com/ZcashFoundation/zebra/pull/1735)
-which covers:
+To avoid deadlocks, readiness and locks must be acquired in a consistent order.
+For more details, see the [Acquiring Buffer Slots or Mutexes](#acquiring-buffer-slots-or-mutexes)
+section.
+
+Zebra's [`ChainVerifier`](https://github.com/ZcashFoundation/zebra/blob/3af57ece7ae5d43cfbcb6a9215433705aad70b80/zebra-consensus/src/chain.rs#L73)
+avoids deadlocks, contention, and errors by:
 - calling `poll_ready` before each `call`
-- deadlock avoidance when acquiring buffer slots
-- buffer bounds
+- acquiring buffer slots for the earlier verifier first (based on blockchain order)
+- ensuring that buffers are large enough for concurrent tasks
+
+<!-- This fix was in https://github.com/ZcashFoundation/zebra/pull/1735 , but it's
+a partial revert, so the PR is a bit confusing. -->
 
-<!-- copied from commit 306fa882148382299c8c31768d5360c0fa23c4d0 on 2020-04-07 -->
+<!-- edited from commit 306fa882148382299c8c31768d5360c0fa23c4d0 on 2020-04-07 -->
 ```rust
 // We acquire checkpoint readiness before block readiness, to avoid an unlikely
 // hang during the checkpoint to block verifier transition. If the checkpoint and
 // block verifiers are contending for the same buffer/batch, we want the checkpoint
 // verifier to win, so that checkpoint verification completes, and block verification
 // can start. (Buffers and batches have multiple slots, so this contention is unlikely.)
+//
 // The chain verifier holds one slot in each verifier, for each concurrent task.
 // Therefore, any shared buffers or batches polled by these verifiers should double
 // their bounds. (For example, the state service buffer.)
@@ -107,39 +125,23 @@ ready!(self.block.poll_ready(cx).map_err(VerifyChainError::Block))?;
 Poll::Ready(Ok(()))
 ```
 
-## Prioritising Cancellation Futures
-[prioritising-cancellation-futures]: #prioritising-cancellation-futures
-
-Here's an example from [connection.rs](https://github.com/ZcashFoundation/zebra/blob/375c8d8700764534871f02d2d44f847526179dab/zebra-network/src/peer/connection.rs#L423)
-in [#1950](https://github.com/ZcashFoundation/zebra/pull/1950)
-which shows biased future selection using the `select` function:
-
-<!-- copied from commit 375c8d8700764534871f02d2d44f847526179dab on 2020-04-08 -->
-```rust
-// CORRECTNESS
-//
-// Currently, select prefers the first future if multiple
-// futures are ready.
-//
-// If multiple futures are ready, we want the cancellation
-// to take priority, then the timeout, then peer responses.
-let cancel = future::select(tx.cancellation(), timer_ref);
-match future::select(cancel, peer_rx.next()) {
-    ...
-}
-```
-
 ## Sharing Progress between Multiple Futures
 [progress-multiple-futures]: #progress-multiple-futures
 
-Here's an example from [peer_set/initialize.rs](https://github.com/ZcashFoundation/zebra/blob/375c8d8700764534871f02d2d44f847526179dab/zebra-network/src/peer_set/initialize.rs#L326)
-in [#1950](https://github.com/ZcashFoundation/zebra/pull/1950)
-which shows:
-- biased future selection using the `select!` macro
-- deadlock avoidance with `Mutex` locks
-- spawning independent tasks to avoid hangs
+To avoid starvation and deadlocks, tasks that depend on multiple futures
+should make progress on all of those futures. This is particularly
+important for tasks that depend on their own outputs. For more details,
+see the [Unbiased Selection](#unbiased-selection)
+section.
+
+Zebra's [peer crawler task](https://github.com/ZcashFoundation/zebra/blob/375c8d8700764534871f02d2d44f847526179dab/zebra-network/src/peer_set/initialize.rs#L326)
+avoids starvation and deadlocks by:
+- sharing progress between any ready futures using the `select!` macro
+- spawning independent tasks to avoid hangs (see [Acquiring Buffer Slots or Mutexes](#acquiring-buffer-slots-or-mutexes))
 - using timeouts to avoid hangs
 
+You can see a range of hang fixes in [pull request #1950](https://github.com/ZcashFoundation/zebra/pull/1950).
+
 <!-- edited from commit 375c8d8700764534871f02d2d44f847526179dab on 2020-04-08 -->
 ```rust
 // CORRECTNESS
@@ -183,15 +185,50 @@ loop {
 }
 ```
 
+## Prioritising Cancellation Futures
+[prioritising-cancellation-futures]: #prioritising-cancellation-futures
+
+To avoid starvation, cancellation futures must take priority over other futures,
+if multiple futures are ready. For more details, see the [Biased Selection](#biased-selection)
+section.
+
+Zebra's [connection.rs](https://github.com/ZcashFoundation/zebra/blob/375c8d8700764534871f02d2d44f847526179dab/zebra-network/src/peer/connection.rs#L423)
+avoids hangs by prioritising the cancel and timer futures over the peer
+receiver future. Under heavy load, the peer receiver future could always
+be ready with a new message, starving the cancel or timer futures.
+
+You can see a range of hang fixes in [pull request #1950](https://github.com/ZcashFoundation/zebra/pull/1950).
+
+<!-- copied from commit 375c8d8700764534871f02d2d44f847526179dab on 2020-04-08 -->
+```rust
+// CORRECTNESS
+//
+// Currently, select prefers the first future if multiple
+// futures are ready.
+//
+// If multiple futures are ready, we want the cancellation
+// to take priority, then the timeout, then peer responses.
+let cancel = future::select(tx.cancellation(), timer_ref);
+match future::select(cancel, peer_rx.next()) {
+    ...
+}
+```
+
 ## Integration Testing Async Code
 [integration-testing]: #integration-testing
 
-Here's an example from [`zebrad/tests/acceptance.rs`](https://github.com/ZcashFoundation/zebra/blob/5bf0a2954e9df3fad53ad57f6b3a673d9df47b9a/zebrad/tests/acceptance.rs#L699)
-in [#1193](https://github.com/ZcashFoundation/zebra/pull/1193)
-which shows:
-- tests for the async Zebra block download and verification pipeline
-- cancellation tests
-- reload tests after a restart
+Sometimes, it is difficult to unit test async code, because it has complex
+dependencies. For more details, see the [Testing Async Code](#testing-async-code)
+section.
+
+[`zebrad`'s acceptance tests](https://github.com/ZcashFoundation/zebra/blob/5bf0a2954e9df3fad53ad57f6b3a673d9df47b9a/zebrad/tests/acceptance.rs#L699)
+run short Zebra syncs on the Zcash mainnet or testnet. These acceptance tests
+make sure that `zebrad` can:
+- sync blocks using its async block download and verification pipeline
+- cancel a sync
+- reload disk state after a restart
+
+These tests were introduced in [pull request #1193](https://github.com/ZcashFoundation/zebra/pull/1193).
 
 <!-- edited from commit 5bf0a2954e9df3fad53ad57f6b3a673d9df47b9a on 2020-04-07 -->
 ```rust
@@ -224,8 +261,12 @@ fn sync_large_checkpoints_mainnet() -> Result<()> {
 ## Instrumenting Async Functions
 [instrumenting-async-functions]: #instrumenting-async-functions
 
-Here's an example of instrumenting an async function using `tracing`
-from [`sync/downloads.rs`](https://github.com/ZcashFoundation/zebra/blob/306fa882148382299c8c31768d5360c0fa23c4d0/zebrad/src/components/sync/downloads.rs#L128):
+Sometimes, it is difficult to debug async code, because there are many tasks
+running concurrently. For more details, see the [Monitoring Async Code](#monitoring-async-code)
+section.
+
+Zebra runs instrumentation on some of its async function using `tracing`.
+Here's an instrumentation example from Zebra's [sync block downloader](https://github.com/ZcashFoundation/zebra/blob/306fa882148382299c8c31768d5360c0fa23c4d0/zebrad/src/components/sync/downloads.rs#L128):
 <!-- there is no original PR for this code, it has been changed a lot -->
 
 <!-- copied from commit 306fa882148382299c8c31768d5360c0fa23c4d0 on 2020-04-08 -->
@@ -244,13 +285,18 @@ pub async fn download_and_verify(&mut self, hash: block::Hash) -> Result<(), Rep
 ## Tracing and Metrics in Async Functions
 [tracing-metrics-async-functions]: #tracing-metrics-async-functions
 
-Here's an example from [`connection.rs`](https://github.com/ZcashFoundation/zebra/blob/375c8d8700764534871f02d2d44f847526179dab/zebra-network/src/peer/connection.rs#L585)
-<!-- there is no original PR for this code, it has been changed a lot -->
-which shows:
-- trace and debug logs using the `tracing` crate
-- spans using the `tracing` crate
+Sometimes, it is difficult to monitor async code, because there are many tasks
+running concurrently. For more details, see the [Monitoring Async Code](#monitoring-async-code)
+section.
+
+Zebra's [client requests](https://github.com/ZcashFoundation/zebra/blob/375c8d8700764534871f02d2d44f847526179dab/zebra-network/src/peer/connection.rs#L585)
+are monitored via:
+- trace and debug logs using `tracing` crate
+- related work spans using the `tracing` crate
 - counters using the `metrics` crate
 
+<!-- there is no original PR for this code, it has been changed a lot -->
+
 <!-- copied from commit 375c8d8700764534871f02d2d44f847526179dab on 2020-04-08 -->
 ```rust
 /// Handle an incoming client request, possibly generating outgoing messages to the

From ae18b8e974aa2c244b6478cd229d01549800a647 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 14:16:07 +1000
Subject: [PATCH 26/34] Expand on introspection

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 0448db62a70..cdb9155900b 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -486,7 +486,11 @@ Zebra uses the following crates for monitoring and diagnostics:
 * [tracing-futures](https://docs.rs/tracing-futures/): future and async function instrumentation
 * [metrics](https://docs.rs/metrics-core/) with a [prometheus exporter](https://docs.rs/metrics-exporter-prometheus/)
 
-These introspection tools are also useful during testing.
+These introspection tools are also useful during testing:
+- `tracing` logs individual events 
+  - spans track related work through the download and verification pipeline
+- `metrics` monitors overall progress and error rates
+  - labels split counters or gauges into different categories (for example, by peer address)
 
 # Drawbacks
 [drawbacks]: #drawbacks

From 2ce751c9dbd7f3dcc3742e6b812fb22b4924e6d5 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 8 Apr 2021 14:16:56 +1000
Subject: [PATCH 27/34] Make unordered lists consistent

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index cdb9155900b..985f337616c 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -471,20 +471,20 @@ we revert the PR, and fix the failure.
 Some concurrency bugs only happen intermittently. Zebra developers should run regular
 full syncs to ensure that their code doesn't cause intermittent hangs. This is
 particularly important for code that modifies Zebra's highly concurrent crates:
-* `zebrad`
-* `zebra-network`
-* `zebra-state`
-* `zebra-consensus`
-* `tower-batch`
-* `tower-fallback`
+- `zebrad`
+- `zebra-network`
+- `zebra-state`
+- `zebra-consensus`
+- `tower-batch`
+- `tower-fallback`
 
 ## Monitoring Async Code
 [monitoring-async-code]: #monitoring-async-code
 
 Zebra uses the following crates for monitoring and diagnostics:
-* [tracing](https://tracing.rs/tracing/): tracing events, spans, logging
-* [tracing-futures](https://docs.rs/tracing-futures/): future and async function instrumentation
-* [metrics](https://docs.rs/metrics-core/) with a [prometheus exporter](https://docs.rs/metrics-exporter-prometheus/)
+- [tracing](https://tracing.rs/tracing/): tracing events, spans, logging
+- [tracing-futures](https://docs.rs/tracing-futures/): future and async function instrumentation
+- [metrics](https://docs.rs/metrics-core/) with a [prometheus exporter](https://docs.rs/metrics-exporter-prometheus/)
 
 These introspection tools are also useful during testing:
 - `tracing` logs individual events 

From 5140a5b8614ee4d759ae688024cb60f9689f2d10 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 22 Apr 2021 17:31:34 +1000
Subject: [PATCH 28/34] Add atomics rules and example

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 53 +++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 985f337616c..c67716dfb92 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -214,6 +214,42 @@ match future::select(cancel, peer_rx.next()) {
 }
 ```
 
+## Atomic Shutdown Flag
+[atomic-shutdown-flag]: #atomic-shutdown-flag
+
+As of April 2021, Zebra implements some shutdown checks using an atomic `bool`.
+
+Zebra's [shutdown.rs](https://github.com/ZcashFoundation/zebra/blob/24bf952e982bde28eb384b211659159d46150f63/zebra-chain/src/shutdown.rs)
+avoids data races and missed updates by using the strongest memory
+ordering (`SeqCst`).
+
+We plan to replace this raw atomic code with a channel, see [#1678](https://github.com/ZcashFoundation/zebra/issues/1678).
+
+```rust
+/// A flag to indicate if Zebra is shutting down.
+///
+/// Initialized to `false` at startup.
+pub static IS_SHUTTING_DOWN: AtomicBool = AtomicBool::new(false);
+
+/// Returns true if the application is shutting down.
+///
+/// Returns false otherwise.
+pub fn is_shutting_down() -> bool {
+    // ## Correctness:
+    //
+    // Since we're shutting down, and this is a one-time operation,
+    // performance is not important. So we use the strongest memory
+    // ordering.
+    // https://doc.rust-lang.org/nomicon/atomics.html#sequentially-consistent
+    IS_SHUTTING_DOWN.load(Ordering::SeqCst)
+}
+
+/// Sets the Zebra shutdown flag to `true`.
+pub fn set_shutting_down() {
+    IS_SHUTTING_DOWN.store(true, Ordering::SeqCst);
+}
+```
+
 ## Integration Testing Async Code
 [integration-testing]: #integration-testing
 
@@ -329,6 +365,7 @@ The reference section contains in-depth information about concurrency in Zebra:
 - [Awaiting Multiple Futures](#awaiting-multiple-futures)
   - [Unbiased Selection](#unbiased-selection)
   - [Biased Selection](#biased-selection)
+- [Using Atomics](#using-atomics)
 - [Testing Async Code](#testing-async-code)
 - [Monitoring Async Code](#monitoring-async-code)
 
@@ -460,6 +497,22 @@ mapping all arguments to the same type.)
 The `futures::select` `Either` return type is complex, particularly when nested. This
 makes code hard to read and maintain. Map the `Either` to a custom enum.
 
+## Using Atomics
+[using-atomics]: #using-atomics
+
+If you're considering using atomics, prefer:
+1. a safe, tested abstraction
+2. using the strongest memory ordering (`Ordering::SeqCst`)
+3. using a weaker memory ordering, with:
+  - a correctness comment,
+  - multithreaded tests with a concurrency permutation harness like [loom](https://github.com/tokio-rs/loom), and
+  - benchmarks to prove that the low-level code is faster.
+
+In Zebra, we try to use safe abstractions, and write obviously correct code. It takes
+a lot of effort to write, test, and maintain low-level code. Almost all of our
+performance-critical code is in cryptographic libraries. And our biggest performance
+gains from those libraries come from async batch cryptography.
+
 ## Testing Async Code
 [testing-async-code]: #testing-async-code
 

From 46d4c5e821fe9b6e1a655353117c33c4e7652d3f Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Thu, 22 Apr 2021 18:01:28 +1000
Subject: [PATCH 29/34] Tidy atomic example and link to reference

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index c67716dfb92..323579dd6e2 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -221,10 +221,11 @@ As of April 2021, Zebra implements some shutdown checks using an atomic `bool`.
 
 Zebra's [shutdown.rs](https://github.com/ZcashFoundation/zebra/blob/24bf952e982bde28eb384b211659159d46150f63/zebra-chain/src/shutdown.rs)
 avoids data races and missed updates by using the strongest memory
-ordering (`SeqCst`).
+ordering ([`SeqCst`](https://doc.rust-lang.org/nomicon/atomics.html#sequentially-consistent)).
 
 We plan to replace this raw atomic code with a channel, see [#1678](https://github.com/ZcashFoundation/zebra/issues/1678).
 
+<!-- edited from commit 24bf952e982bde28eb384b211659159d46150f63 on 2020-04-22 -->
 ```rust
 /// A flag to indicate if Zebra is shutting down.
 ///
@@ -232,8 +233,6 @@ We plan to replace this raw atomic code with a channel, see [#1678](https://gith
 pub static IS_SHUTTING_DOWN: AtomicBool = AtomicBool::new(false);
 
 /// Returns true if the application is shutting down.
-///
-/// Returns false otherwise.
 pub fn is_shutting_down() -> bool {
     // ## Correctness:
     //
@@ -502,7 +501,7 @@ makes code hard to read and maintain. Map the `Either` to a custom enum.
 
 If you're considering using atomics, prefer:
 1. a safe, tested abstraction
-2. using the strongest memory ordering (`Ordering::SeqCst`)
+2. using the strongest memory ordering ([`SeqCst`](https://doc.rust-lang.org/nomicon/atomics.html#sequentially-consistent))
 3. using a weaker memory ordering, with:
   - a correctness comment,
   - multithreaded tests with a concurrency permutation harness like [loom](https://github.com/tokio-rs/loom), and

From 8bc7123eda80d76cd2ba0e63f32a5e9d938bf672 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Fri, 30 Apr 2021 11:07:27 +1000
Subject: [PATCH 30/34] Mention TurboWish as a future diagnostic tool

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 323579dd6e2..852b359279e 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -539,7 +539,7 @@ Zebra uses the following crates for monitoring and diagnostics:
 - [metrics](https://docs.rs/metrics-core/) with a [prometheus exporter](https://docs.rs/metrics-exporter-prometheus/)
 
 These introspection tools are also useful during testing:
-- `tracing` logs individual events 
+- `tracing` logs individual events
   - spans track related work through the download and verification pipeline
 - `metrics` monitors overall progress and error rates
   - labels split counters or gauges into different categories (for example, by peer address)
@@ -557,3 +557,6 @@ those bugs can take a lot of developer effort.
 Can we catch these bugs using automated tests?
 
 How can we diagnose these kinds of issues faster and more reliably?
+  - [TurboWish](https://blog.pnkfx.org/blog/2021/04/26/road-to-turbowish-part-1-goals/)
+    looks really promising for task, channel, and future introspection. But as of April
+    2021, it's still in early development.

From b71809f834ead4384e501367f08775569fceb440 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Fri, 30 Apr 2021 11:31:35 +1000
Subject: [PATCH 31/34] Recommend using futures-aware types

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 67 +++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 852b359279e..3013d0c1273 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -90,6 +90,55 @@ let res = ready!(this
     .poll_ready(cx));
 ```
 
+## Futures-Aware Mutexes
+[futures-aware-mutexes]: #futures-aware-mutexes
+
+To avoid hangs or slowdowns, use futures-aware types. For more details, see the
+[Futures-Aware Types](#futures-aware-types) section.
+
+Zebra's [`Handshake`](https://github.com/ZcashFoundation/zebra/blob/a63c2e8c40fa847a86d00c754fb10a4729ba34e5/zebra-network/src/peer/handshake.rs#L204)
+won't block other tasks on its thread, because it uses `futures::lock::Mutex`:
+
+<!-- copied from commit a63c2e8c40fa847a86d00c754fb10a4729ba34e5 on 2020-04-30 -->
+```rust
+pub async fn negotiate_version(
+    peer_conn: &mut Framed<TcpStream, Codec>,
+    addr: &SocketAddr,
+    config: Config,
+    nonces: Arc<futures::lock::Mutex<HashSet<Nonce>>>,
+    user_agent: String,
+    our_services: PeerServices,
+    relay: bool,
+) -> Result<(Version, PeerServices), HandshakeError> {
+    // Create a random nonce for this connection
+    let local_nonce = Nonce::default();
+    // # Correctness
+    //
+    // It is ok to wait for the lock here, because handshakes have a short
+    // timeout, and the async mutex will be released when the task times
+    // out.
+    nonces.lock().await.insert(local_nonce);
+
+    ...
+}
+```
+
+Zebra's [`Inbound service`](https://github.com/ZcashFoundation/zebra/blob/0203d1475a95e90eb6fd7c4101caa26aeddece5b/zebrad/src/components/inbound.rs#L238)
+can't use an async-aware mutex for its `AddressBook`, because the mutex is shared
+with non-async code. It only holds the mutex to clone the address book, reducing
+the amount of time that other tasks on its thread are blocked:
+
+<!-- copied from commit 0203d1475a95e90eb6fd7c4101caa26aeddece5b on 2020-04-30 -->
+```rust
+// # Correctness
+//
+// Briefly hold the address book threaded mutex while
+// cloning the address book. Then sanitize after releasing
+// the lock.
+let peers = address_book.lock().unwrap().clone();
+let mut peers = peers.sanitized();
+```
+
 ## Avoiding Deadlocks when Aquiring Buffer or Service Readiness
 [readiness-deadlock-avoidance]: #readiness-deadlock-avoidance
 
@@ -357,6 +406,7 @@ async fn handle_client_request(&mut self, req: InProgressClientRequest) {
 
 The reference section contains in-depth information about concurrency in Zebra:
 - [`Poll::Pending` and Wakeups](#poll-pending-and-wakeups)
+- [Futures-Aware Types](#futures-aware-types)
 - [Acquiring Buffer Slots or Mutexes](#acquiring-buffer-slots-or-mutexes)
 - [Buffer and Batch](#buffer-and-batch)
   - [Buffered Services](#buffered-services)
@@ -383,6 +433,23 @@ Any code that generates a new `Poll::Pending` should either have:
 
 Note: `poll` functions often have a qualifier, like `poll_ready` or `poll_next`.
 
+## Futures-Aware Types
+[futures-aware-types]: #futures-aware-types
+
+Use futures-aware types, rather than types which will block the current thread.
+
+For example:
+- Use `futures::lock::Mutex` rather than `std::sync::Mutex`
+- Use `tokio::time::{sleep, timeout}` rather than `std::thread::sleep`
+
+Always qualify ambiguous names like `Mutex` and `sleep`, so that it is obvious
+when a call will block.
+
+If you are unable to use futures-aware types:
+- block the thread for as short a time as possible
+- document the correctness of each blocking call
+- consider re-designing the code to use `tower::Services`, or other futures-aware types
+
 ## Acquiring Buffer Slots or Mutexes
 [acquiring-buffer-slots-or-mutexes]: #acquiring-buffer-slots-or-mutexes
 

From 045967d612a8a18bdff16cf0a8ca3583504315e9 Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Fri, 30 Apr 2021 12:06:19 +1000
Subject: [PATCH 32/34] Improve the acquiring locks section

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 42 +++++++++++++--------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 3013d0c1273..75ac7f4d0b9 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -35,7 +35,11 @@ development, reviews, and testing.
         cooperative multitasking.
 - `contention`: slower execution because multiple tasks are waiting to
         acquire a lock, buffer/batch slot, or readiness.
-- `missed wakeup`: a task `hang`s because it is never scheduled for wakeup. 
+- `missed wakeup`: a task `hang`s because it is never scheduled for wakeup.
+- `lock`: exclusive access to a shared resource. Locks stop other code from
+          running until they are released. For example, a mutex, buffer slot,
+          or service readiness.
+- `critical section`: code that is executed while holding a lock.
 - `deadlock`: a `hang` that stops an async task executing code, because it
         is waiting for a lock, slot, or task readiness.
         For example: a task is waiting for a service to be ready, but the
@@ -143,7 +147,7 @@ let mut peers = peers.sanitized();
 [readiness-deadlock-avoidance]: #readiness-deadlock-avoidance
 
 To avoid deadlocks, readiness and locks must be acquired in a consistent order.
-For more details, see the [Acquiring Buffer Slots or Mutexes](#acquiring-buffer-slots-or-mutexes)
+For more details, see the [Acquiring Buffer Slots, Mutexes, or Readiness](#acquiring-buffer-slots-mutexes-readiness)
 section.
 
 Zebra's [`ChainVerifier`](https://github.com/ZcashFoundation/zebra/blob/3af57ece7ae5d43cfbcb6a9215433705aad70b80/zebra-consensus/src/chain.rs#L73)
@@ -186,7 +190,7 @@ section.
 Zebra's [peer crawler task](https://github.com/ZcashFoundation/zebra/blob/375c8d8700764534871f02d2d44f847526179dab/zebra-network/src/peer_set/initialize.rs#L326)
 avoids starvation and deadlocks by:
 - sharing progress between any ready futures using the `select!` macro
-- spawning independent tasks to avoid hangs (see [Acquiring Buffer Slots or Mutexes](#acquiring-buffer-slots-or-mutexes))
+- spawning independent tasks to avoid hangs (see [Acquiring Buffer Slots, Mutexes, or Readiness](#acquiring-buffer-slots-mutexes-readiness))
 - using timeouts to avoid hangs
 
 You can see a range of hang fixes in [pull request #1950](https://github.com/ZcashFoundation/zebra/pull/1950).
@@ -407,7 +411,7 @@ async fn handle_client_request(&mut self, req: InProgressClientRequest) {
 The reference section contains in-depth information about concurrency in Zebra:
 - [`Poll::Pending` and Wakeups](#poll-pending-and-wakeups)
 - [Futures-Aware Types](#futures-aware-types)
-- [Acquiring Buffer Slots or Mutexes](#acquiring-buffer-slots-or-mutexes)
+- [Acquiring Buffer Slots, Mutexes, or Readiness](#acquiring-buffer-slots-mutexes-readiness)
 - [Buffer and Batch](#buffer-and-batch)
   - [Buffered Services](#buffered-services)
     - [Choosing Buffer Bounds](#choosing-buffer-bounds)
@@ -450,21 +454,29 @@ If you are unable to use futures-aware types:
 - document the correctness of each blocking call
 - consider re-designing the code to use `tower::Services`, or other futures-aware types
 
-## Acquiring Buffer Slots or Mutexes
-[acquiring-buffer-slots-or-mutexes]: #acquiring-buffer-slots-or-mutexes
+## Acquiring Buffer Slots, Mutexes, or Readiness
+[acquiring-buffer-slots-mutexes-readiness]: #acquiring-buffer-slots-mutexes-readiness
 
-Ideally, buffer slots or mutexes should be acquired one at a time, and held for as short a time
-as possible.
+Ideally, buffer slots, mutexes, or readiness should be:
+- acquired with one lock per critical section, and
+- held for as short a time as possible.
 
-If that's not possible, acquire slots and mutexs in the same order across cooperating tasks.
-If tasks acquire the same slots and mutexes in different orders, they can deadlock, each holding
-a lock that the other needs.
+If multiple locks are required for a critical section, acquire them in the same
+order any time those locks are used. If tasks acquire multiple locks in different
+orders, they can deadlock, each holding a lock that the other needs.
 
-Note: do not call `poll_ready` on multiple tasks, then match against the results. Use the `ready!`
-macro instead.
+If a buffer, mutex, future or service has complex readiness dependencies,
+schedule those dependencies separate tasks using `tokio::spawn`. Otherwise,
+it might deadlock due to a dependency loop within a single executor task.
+
+In all of these cases:
+- make critical sections as short as possible, and
+- do not depend on other tasks or locks inside the critical section.
 
-If a task is waiting for service readiness, and that service depends on other futures to become ready,
-make sure those futures are scheduled in separate tasks using `tokio::spawn`.
+### Acquiring Service Readiness
+
+Note: do not call `poll_ready` on multiple tasks, then match against the results. Use the `ready!`
+macro instead, to acquire service readiness in a consistent order.
 
 ## Buffer and Batch
 [buffer-and-batch]: #buffer-and-batch

From 154966e5071ad98c10481b69df6c532f14adbdac Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Fri, 30 Apr 2021 12:06:41 +1000
Subject: [PATCH 33/34] Add an example of a compiler error that prevents
 deadlock

---
 book/src/dev/rfcs/drafts/xxxx-async-rust.md | 43 +++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
index 75ac7f4d0b9..f440f6aa8ea 100644
--- a/book/src/dev/rfcs/drafts/xxxx-async-rust.md
+++ b/book/src/dev/rfcs/drafts/xxxx-async-rust.md
@@ -178,6 +178,49 @@ ready!(self.block.poll_ready(cx).map_err(VerifyChainError::Block))?;
 Poll::Ready(Ok(()))
 ```
 
+## Critical Section Compiler Errors
+[critical-section-compiler-errors]: #critical-section-compiler-errors
+
+To avoid deadlocks or slowdowns, critical sections should be as short as
+possible, and they should not depend on any other tasks.
+For more details, see the [Acquiring Buffer Slots, Mutexes, or Readiness](#acquiring-buffer-slots-mutexes-readiness)
+section.
+
+Zebra's [`CandidateSet`](https://github.com/ZcashFoundation/zebra/blob/0203d1475a95e90eb6fd7c4101caa26aeddece5b/zebra-network/src/peer_set/candidate_set.rs#L257)
+must release a `std::sync::Mutex` lock before awaiting a `tokio::time::Sleep`
+future. This ensures that the threaded mutex lock isn't held over the `await`
+point.
+
+If the lock isn't dropped, compilation fails, because the mutex lock can't be
+sent between threads.
+
+<!-- edited from commit 0203d1475a95e90eb6fd7c4101caa26aeddece5b on 2020-04-30 -->
+```rust
+// # Correctness
+//
+// In this critical section, we hold the address mutex, blocking the
+// current thread, and all async tasks scheduled on that thread.
+//
+// To avoid deadlocks, the critical section:
+// - must not acquire any other locks
+// - must not await any futures
+//
+// To avoid hangs, any computation in the critical section should
+// be kept to a minimum.
+let reconnect = {
+    let mut guard = self.address_book.lock().unwrap();
+    ...
+    let reconnect = guard.reconnection_peers().next()?;
+
+    let reconnect = MetaAddr::new_reconnect(&reconnect.addr, &reconnect.services);
+    guard.update(reconnect);
+    reconnect
+};
+
+// SECURITY: rate-limit new candidate connections
+sleep.await;
+```
+
 ## Sharing Progress between Multiple Futures
 [progress-multiple-futures]: #progress-multiple-futures
 

From ecae07ff3af70711b09090e8f66fc382a18c7bfc Mon Sep 17 00:00:00 2001
From: teor <teor@riseup.net>
Date: Tue, 4 May 2021 09:04:40 +1000
Subject: [PATCH 34/34] Rename async Rust RFC and move it out of drafts

---
 .../{drafts/xxxx-async-rust.md => 0011-async-rust-in-zebra.md}    | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename book/src/dev/rfcs/{drafts/xxxx-async-rust.md => 0011-async-rust-in-zebra.md} (100%)

diff --git a/book/src/dev/rfcs/drafts/xxxx-async-rust.md b/book/src/dev/rfcs/0011-async-rust-in-zebra.md
similarity index 100%
rename from book/src/dev/rfcs/drafts/xxxx-async-rust.md
rename to book/src/dev/rfcs/0011-async-rust-in-zebra.md