Security: Running Zebra nodes should eventually stop trying to contact peers that always fail #1865
Labels
- A-rust (Area: Updates to Rust code)
- C-bug (Category: This is a bug)
- C-security (Category: Security issues)
- I-heavy (Problems with excessive memory, disk, or CPU usage)
- I-remote-node-overload (Zebra can overload other nodes on the network)
- I-unbounded-growth (Zebra keeps using resources, without any limit)
This fix is required for NU5 mainnet activation. If the issue happens on testnet, we should be able to work around it.
Motivation
Zebra will keep trying individual `Failed` peers, even if they have never succeeded. This is a distributed denial of service risk, and it places extra load on the network.

Scheduling
We should fix this issue before NU5 mainnet activation, so this bug doesn't cause a denial of service from old Zebra versions when NU6 activates.
Suggestions
This fix depends on #1849, #1867, and #1871.
Zebra should stop trying to contact peers that haven't had a successful connection for 3 days. We've chosen this time to allow admins to restart their nodes after a weekend failure. (We might want to change this to a longer timeframe in a future upgrade, once Zebra is stable.)
Zebra should delete peers from the `AddressBook` where the:

- `last_success_time` is older than 3 days
- `last_success_time` is `None`, and the `untrusted_last_seen` is older than 3 days (requires the far-future fix in #1871, "Security: Zebra should stop believing far-future last_seen times from peers")

See `crawl_and_dial` for an example of this kind of address book task.
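To make the deletion rule concrete, here is a minimal sketch, assuming a simplified stand-in for `MetaAddr` and a plain `Vec` in place of the real `AddressBook`; the `PeerEntry` type, its fields, and `prune_old_peers` are illustrative, not Zebra's actual API:

```rust
use std::time::{Duration, SystemTime};

/// Simplified stand-in for Zebra's `MetaAddr` (hypothetical fields).
struct PeerEntry {
    last_success_time: Option<SystemTime>,
    untrusted_last_seen: Option<SystemTime>,
}

/// How long a peer may go without a successful connection before deletion.
const MAX_PEER_AGE: Duration = Duration::from_secs(3 * 24 * 60 * 60);

/// Returns true if the peer should be deleted under the 3-day rule.
fn should_delete(peer: &PeerEntry, now: SystemTime) -> bool {
    let cutoff = now - MAX_PEER_AGE;
    match (peer.last_success_time, peer.untrusted_last_seen) {
        // `last_success_time` is older than 3 days
        (Some(success), _) => success < cutoff,
        // `last_success_time` is `None`, and the `untrusted_last_seen` is older than 3 days
        (None, Some(seen)) => seen < cutoff,
        // No information at all: keep for now (a judgement call, not taken from the issue)
        (None, None) => false,
    }
}

/// Drop peers that meet the deletion criteria (stand-in for an `AddressBook` method).
fn prune_old_peers(address_book: &mut Vec<PeerEntry>, now: SystemTime) {
    address_book.retain(|peer| !should_delete(peer, now));
}
```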
Zebra should also:

- add methods to `MetaAddr` to simplify the interface to these times

To avoid sending old peers to other nodes, Zebra should:
To avoid accepting old peers from other nodes, Zebra should:
Property testing: check that `MetaAddr`s in the `AddressBook` have their `last_success_time`s and `untrusted_last_seen_time`s handled correctly.
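As a rough idea of what such a property test could look like, here is a sketch using the `proptest` crate against the simplified `should_delete` predicate from the earlier sketch (the real test would exercise Zebra's `MetaAddr` and `AddressBook` types directly):

```rust
use proptest::prelude::*;
use std::time::{Duration, SystemTime};

proptest! {
    /// Peers with a success time within the last 3 days are never deleted.
    #[test]
    fn recent_peers_are_kept(age_secs in 0u64..(3 * 24 * 60 * 60)) {
        let now = SystemTime::now();
        let peer = PeerEntry {
            last_success_time: Some(now - Duration::from_secs(age_secs)),
            untrusted_last_seen: None,
        };
        prop_assert!(!should_delete(&peer, now));
    }

    /// Peers whose only timestamp is an `untrusted_last_seen` older than 3 days are deleted.
    #[test]
    fn stale_never_successful_peers_are_deleted(extra_secs in 1u64..(365 * 24 * 60 * 60)) {
        let now = SystemTime::now();
        let peer = PeerEntry {
            last_success_time: None,
            untrusted_last_seen: Some(now - Duration::from_secs(3 * 24 * 60 * 60 + extra_secs)),
        };
        prop_assert!(should_delete(&peer, now));
    }
}
```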
For testing: add a `debug_peer_deletion_age` config that sets the deletion age and interval timer to 15 seconds.

Performance Analysis
If we check for old peers every time a new peer is requested, we could spend a lot of time checking for deletions. Instead, we should scan the address book at regular intervals in a new task.
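A minimal sketch of that periodic scan as its own task, using `tokio` (the shared `Vec` address book, the constant names, and the reuse of `PeerEntry` and `prune_old_peers` from the earlier sketch are assumptions for illustration, not Zebra's actual structure):

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, SystemTime};

/// How often the pruning task scans the address book.
/// (A test config like the suggested `debug_peer_deletion_age` could shrink this to 15 seconds.)
const PRUNE_INTERVAL: Duration = Duration::from_secs(60 * 60);

/// Spawn a background task that periodically deletes old peers,
/// instead of checking on every address book access.
fn spawn_prune_task(address_book: Arc<Mutex<Vec<PeerEntry>>>) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(PRUNE_INTERVAL);
        loop {
            // Waits for the next tick; the first tick completes immediately.
            ticker.tick().await;
            // The lock is not held across an `.await`, so a std `Mutex` is fine here.
            let mut peers = address_book.lock().expect("address book mutex poisoned");
            prune_old_peers(&mut peers, SystemTime::now());
        }
    })
}
```

Keeping the lock only for the duration of each scan means peer lookups are never blocked for longer than one pruning pass.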
Alternatives
This is a critical security issue, so we must do something.
We could keep a failure count for each peer. This design has usability issues on unreliable networks, because all the peers can fail at the same time. (Our reconnection rate limit and peer deletion timeout already limit us to ~2000 failed connections per peer over 3 days.)
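For a sense of scale (the per-peer figure below is worked out from the issue's own numbers, not taken from Zebra's config): 3 days is 259,200 seconds, so ~2000 failed connections per peer works out to roughly one attempt every ~130 seconds before the 3-day deletion window cuts the peer off.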
We could filter peers whenever the `AddressBook` is accessed, rather than deleting them. But this is tricky - it might need interior mutability. It would also require an abstraction layer, to make sure we intercept all accesses.

We could avoid deleting peers in `NeverAttempted...` states, so that we try each peer at least once before deleting it. But this would risk a memory DoS, if we get a lot of gossiped peers.

Context
`zcashd` does not have this issue. `zcashd` has a support interval of around 16 weeks between required upgrades, with new versions coming out every 6 weeks.

Follow-Up Tasks
This fix doesn't deal with denial of service from nodes that are regularly restarted; that's #1870.