Treat TryRecvError::Closed in Inbound::poll_ready as a fatal error #1655

Closed · 3 tasks · Tracked by #2310
teor2345 opened this issue Jan 29, 2021 · 1 comment
Labels
A-rust Area: Updates to Rust code C-bug Category: This is a bug

Comments

teor2345 (Contributor) commented Jan 29, 2021

Is your feature request related to a problem? Please describe.

In #1620, we added logging for TryRecvError::Closed in the Inbound service.

But we should really return the error to drive_peer_request, and handle TryRecvError::Closed as a fatal error (or a shutdown), rather than as an overloaded connection.

Describe the solution you'd like

  • Remove the error log for TryRecvError::Closed in Inbound::poll_ready
  • In drive_peer_request, pass through Overloaded errors from the load_shed layer as PeerError::Overloaded
  • In drive_peer_request, handle TryRecvError::Closed from the Inbound service by shutting down or panicking, because the network is no longer available (see the sketch after this list)
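A minimal sketch, not Zebra's actual code, of the last two bullets: drive_peer_request classifies the boxed error it gets back from the inbound service, passing tower's Overloaded through and treating a closed setup channel as fatal. The PeerError enum and the function name here are illustrative stand-ins.

    use tokio::sync::oneshot::error::TryRecvError;
    use tower::load_shed::error::Overloaded;

    type BoxError = Box<dyn std::error::Error + Send + Sync>;

    /// Illustrative stand-in for zebra-network's PeerError.
    #[derive(Debug)]
    enum PeerError {
        Overloaded,
    }

    /// Classify an error returned by the buffered, load-shed inbound service.
    fn handle_inbound_error(e: BoxError) -> PeerError {
        if matches!(e.downcast_ref::<TryRecvError>(), Some(TryRecvError::Closed)) {
            // The inbound service can never become ready again, so shut down
            // (here, panic) instead of telling the peer we're merely overloaded.
            panic!("inbound service setup channel closed: {}", e);
        }

        // Per this issue, the only other error type is tower's Overloaded,
        // which we report to the peer as a temporary overload.
        debug_assert!(e.is::<Overloaded>());
        PeerError::Overloaded
    }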

Describe alternatives you've considered

Do nothing: the logs from #1620 will show us if these errors actually happen in practice.

Additional Context

Currently, there are only two error types returned by the inbound service's poll_ready:

  • TryRecvError::Closed from Inbound::poll_ready
  • Overloaded from the load_shed service layer:

        let inbound = ServiceBuilder::new()
            // fail new requests with Overloaded while the buffer is at capacity
            .load_shed()
            // queue up to 20 in-flight requests for the inbound service
            .buffer(20)
            .service(Inbound::new(setup_rx, state.clone(), verifier.clone()));
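
As a companion to the first solution bullet, a minimal sketch, not Zebra's actual Inbound implementation, of a poll_ready that returns TryRecvError::Closed instead of logging it. The Setup type and field names are illustrative.

    use std::task::{Context, Poll};
    use tokio::sync::oneshot::{self, error::TryRecvError};

    type BoxError = Box<dyn std::error::Error + Send + Sync>;

    struct Setup; // illustrative stand-in for the real setup data

    struct Inbound {
        setup_rx: Option<oneshot::Receiver<Setup>>,
        setup: Option<Setup>,
    }

    impl Inbound {
        fn poll_ready(&mut self, _cx: &mut Context<'_>) -> Poll<Result<(), BoxError>> {
            if let Some(rx) = self.setup_rx.as_mut() {
                match rx.try_recv() {
                    Ok(setup) => {
                        self.setup = Some(setup);
                        self.setup_rx = None;
                    }
                    // Setup is still in flight: this sketch reports ready and
                    // lets call() queue or reject early requests.
                    Err(TryRecvError::Empty) => {}
                    // The sender was dropped: the network will never become
                    // available, so return the error instead of logging it, and
                    // the buffer and load_shed layers propagate it to callers.
                    Err(e @ TryRecvError::Closed) => return Poll::Ready(Err(e.into())),
                }
            }
            Poll::Ready(Ok(()))
        }
    }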


teor2345 added the C-bug (Category: This is a bug), A-rust (Area: Updates to Rust code), S-needs-triage (Status: A bug report needs triage), and P-Low labels Jan 29, 2021
mpguerra removed the S-needs-triage (Status: A bug report needs triage) label Feb 18, 2021
teor2345 added a commit that referenced this issue May 24, 2021
Treat `TryRecvError::Closed` in `Inbound::poll_ready` as a fatal error

Also:
- handle errors in service readiness the same as errors in requests

Closes #1655
teor2345 added a commit that referenced this issue May 25, 2021
Security: Spawn a separate task for each initial handshake

This fix prevents hangs and deadlocks during initialization, particularly
when there are a small number of valid peers in the initial peer config
(or from the DNS seeders).

Security: Correctly handle the minimum peer connection interval

Previously, if we hadn't had a connection for a while, we'd allow a lot
of connections all at once, until we'd caught up.

Security: sleep MIN_PEER_CONNECTION_INTERVAL between initial handshakes

This prevents denial of service if the local network is constrained, and
the seeders return a large number of peers.
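
A minimal sketch of these two fixes, with a hypothetical interval value and handshake function:

    use std::net::SocketAddr;
    use std::time::Duration;

    /// Hypothetical value; Zebra's actual interval may differ.
    const MIN_PEER_CONNECTION_INTERVAL: Duration = Duration::from_millis(100);

    async fn spawn_initial_handshakes(peers: Vec<SocketAddr>) {
        for addr in peers {
            // Each handshake runs in its own task, so one slow or hung peer
            // can't deadlock initialization.
            tokio::spawn(handshake(addr));
            // Pace the spawns so a constrained local network isn't flooded
            // when the seeders return a large number of peers.
            tokio::time::sleep(MIN_PEER_CONNECTION_INTERVAL).await;
        }
    }

    /// Hypothetical handshake; the real one negotiates the Zcash protocol.
    async fn handshake(_addr: SocketAddr) {}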

Only wait for ready handshakes

Drain all waiting handshakes when enough have succeeded

Refactor MetaAddr to enable security fixes

Track multiple last used times for each peer:
- Add separate untrusted_last_seen, attempt, success, and failed time fields (#1868, #1876, #1848)
- Add the new fields to the peer states, so they only appear in states where they are valid
- Insert initial seed peers in the AddressBook in the correct states

Create a new MetaAddrChange type for AddressBook changes:
- Ignore invalid state changes
    - Ignore updates to the untrusted last seen time (but update the services field)
    - If we get a gossiped or alternate change for a seed peer, use the last seen and services info
    - Once a peer has responded, don't go back to the NeverResponded... states
- Update the address book metrics

- Optimise getting the next connection address from the address book

Do an extra crawl for each handshake on startup

And whenever there aren't many recently live peers.

Remove duplicate initial crawl code

This change uses the candidate set for initial seed peers,
gossiped peers, and alternate peers.

It significantly reduces the complexity of the initialization code.
(By about 200 lines.)

Apply readiness timeout to each fanout

Also get the fanout limit from the number of recently live peers.

Launch each CandidateSet fanout in its own task

Spawn each `CandidateSet::update` in its own task

Move `CandidateSet::next` into the handshake task

Move all crawler awaits and threaded locks into spawned tasks

In this commit:
- Move sending PeerSet changes into a spawned task
- Move the locking in `CandidateSet::report_failed` into a spawned task

Increase the peer set buffer size for concurrent fanouts

Launch sync fanouts concurrently, with peer set readiness timeouts

Wait for seed peers before the first crawl

WIP: Add a timeout to crawl addr requests

This is a workaround for a zcashd response rate-limit.
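
A minimal sketch of such a timeout, with a placeholder limit rather than Zebra's actual setting:

    use std::future::Future;
    use std::time::Duration;
    use tokio::time::timeout;

    /// Bound the wait on a single crawl request: if zcashd's response
    /// rate-limit swallows the reply, give up instead of hanging.
    async fn crawl_with_timeout<F, T>(request: F) -> Option<T>
    where
        F: Future<Output = T>,
    {
        timeout(Duration::from_secs(10), request).await.ok()
    }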

Move AddressBook::lock() onto a blocking thread
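
A minimal sketch of that pattern, with illustrative stand-ins for the address book types:

    use std::sync::{Arc, Mutex};

    struct Candidate;
    struct AddressBook;
    impl AddressBook {
        /// Hypothetical accessor for the next peer to try.
        fn next_candidate(&mut self) -> Option<Candidate> {
            None
        }
    }

    /// A contended std::sync::Mutex blocks its whole thread, so take the lock
    /// on tokio's blocking pool rather than inside an async task.
    async fn next_candidate(book: Arc<Mutex<AddressBook>>) -> Option<Candidate> {
        tokio::task::spawn_blocking(move || book.lock().unwrap().next_candidate())
            .await
            .expect("spawn_blocking task panicked")
    }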

Process all ready timestamp changes each time the task runs

Wait for the initial crawl before launching the syncer

Security: Limit unverified blocks to avoid memory DoS

Also document the security implications of changing these limits.

Drop early inbound requests to avoid load shedding during network setup

Stop closing connections when the inbound service is overloaded

SECURITY: Make buffer sizes dynamically depend on the config

This change significantly increases the inbound buffer size, increasing memory
denial of service risks. However, users can reduce the buffer size using
existing related config options.

These risks are documented under the relevant configs.
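
A minimal sketch of the idea, with a hypothetical config field and an illustrative formula:

    /// Hypothetical subset of the network config; Zebra's real option names
    /// and sizing formula may differ.
    struct Config {
        peerset_initial_target_size: usize,
    }

    /// Derive the inbound buffer size from the config instead of hard-coding
    /// it: bigger peer sets need more in-flight requests, and users can bound
    /// memory use by lowering the configured target size.
    fn inbound_buffer_size(config: &Config) -> usize {
        config.peerset_initial_target_size * 2
    }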

Treat `TryRecvError::Closed` in `Inbound::poll_ready` as a fatal error

Also:
- handle errors in service readiness the same as errors in requests

Closes #1655
teor2345 (Contributor, Author) commented:

This ticket is either obsolete or not needed.

mpguerra closed this as not planned May 18, 2023