Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Security: Running Zebra nodes should eventually stop trying to contact peers that always fail #1865

Closed
3 of 15 tasks
teor2345 opened this issue Mar 9, 2021 · 2 comments · Fixed by #3030
Closed
3 of 15 tasks
Assignees
Labels
A-rust Area: Updates to Rust code C-bug Category: This is a bug C-security Category: Security issues I-heavy Problems with excessive memory, disk, or CPU usage I-remote-node-overload Zebra can overload other nodes on the network I-unbounded-growth Zebra keeps using resources, without any limit

Comments

@teor2345
Copy link
Contributor

teor2345 commented Mar 9, 2021

Motivation

Zebra will keep trying individual Failed peers, even if they have never succeeded. This is a distributed denial of service risk, and it places extra load on the network.

Scheduling

We should fix this issue before NU5 mainnet activation, so this bug doesn't cause a denial of service from old Zebra versions when NU6 activates.

Suggestions

This fix depends on #1849, #1867, and #1871.

Zebra should stop trying to contact peers that haven't had a successful connection for 3 days. We've chosen this time to allow admins to restart their nodes after a weekend failure. (We might want to change this to a longer timeframe in a future upgrade, once Zebra is stable.)

Zebra should delete peers from the AddressBook where the:

Zebra should also:

  • provide an accessor method on MetaAddr to simplify the interface to these times

To avoid sending old peers to other nodes, Zebra should:

  • keep the earliest last seen times gossiped by other peers
  • avoid sending unreachable DNS seeder addresses to peers
    • these addresses don't have any times, so we have no way of removing them from the network
  • before sending unreachable gossiped addresses, subtract 30 minutes from their last seen times

To avoid accepting old peers from other nodes, Zebra should:

Property testing:

  • Generate some random MetaAddrs
  • Put them in an AddressBook
  • Run the deletion method on the address book
  • Make sure that old last_success_times and untrusted_last_seen_times are handled correctly

For testing:

  • add a debug_peer_deletion_age config that sets the deletion age and interval timer
  • add an acceptance test that sets it to 15 seconds
  • add a log when all the peers are deleted, and check for that log - "hint: upgrade or restart Zebra"

Performance Analysis

If we check for old peers every time a new peer is requested, we could spend a lot of time checking for deletions. Instead, we should scan the address book at regular intervals in a new task.

Alternatives

This is a critical security issue, so we must do something.

We could keep a failure count for each peer. This design has usability issues on unreliable networks, because all the peers can fail at the same time. (Our reconnection rate limit and peer deletion timeout already limit us to ~2000 failed connections per peer over 3 days.)

We could filter peers whenever the AddressBook is accessed, rather than deleting them. But this is tricky - it might need interior mutability. It would also require an abstraction layer, to make sure we intercept all accesses.

We could avoid deleting peers in NeverAttempted... states, so that we try each peer at least once before deleting it. But this would risk a memory DoS, if we get a lot of gossiped peers.

Context

zcashd does not have this issue.

zcashd has a support interval of around 16 weeks between required upgrades, with new versions coming out every 6 weeks.

Follow-Up Tasks

This fix doesn't deal with denial of service from nodes that are regularly restarted, that's #1870.

@teor2345 teor2345 added C-bug Category: This is a bug A-rust Area: Updates to Rust code C-security Category: Security issues I-heavy Problems with excessive memory, disk, or CPU usage I-unbounded-growth Zebra keeps using resources, without any limit I-remote-node-overload Zebra can overload other nodes on the network S-needs-triage Status: A bug report needs triage P-High labels Mar 9, 2021
@teor2345 teor2345 changed the title Zebra should eventually stop trying to contact peers that always fail Running Zebra nodes should eventually stop trying to contact peers that always fail Mar 9, 2021
@mpguerra mpguerra removed the S-needs-triage Status: A bug report needs triage label Mar 10, 2021
@teor2345

This comment has been minimized.

@teor2345 teor2345 changed the title Running Zebra nodes should eventually stop trying to contact peers that always fail Security: Running Zebra nodes should eventually stop trying to contact peers that always fail May 17, 2021
@teor2345 teor2345 added this to the 2021 Sprint 11 - Zcon2 milestone May 17, 2021
@teor2345
Copy link
Contributor Author

This fix is required for NU5 mainnet activation. If the issue happens on testnet, we should be able to work around it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-rust Area: Updates to Rust code C-bug Category: This is a bug C-security Category: Security issues I-heavy Problems with excessive memory, disk, or CPU usage I-remote-node-overload Zebra can overload other nodes on the network I-unbounded-growth Zebra keeps using resources, without any limit
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants