Security: Running Zebra nodes should eventually stop trying to contact peers that always fail #1865
Labels
- A-rust (Area: Updates to Rust code)
- C-bug (Category: This is a bug)
- C-security (Category: Security issues)
- I-heavy (Problems with excessive memory, disk, or CPU usage)
- I-remote-node-overload (Zebra can overload other nodes on the network)
- I-unbounded-growth (Zebra keeps using resources, without any limit)
This fix is required for NU5 mainnet activation. If the issue happens on testnet, we should be able to work around it.
Motivation
Zebra will keep trying individual `Failed` peers, even if they have never succeeded. This is a distributed denial of service risk, and it places extra load on the network.

Scheduling
We should fix this issue before NU5 mainnet activation, so this bug doesn't cause a denial of service from old Zebra versions when NU6 activates.
Suggestions
This fix depends on #1849, #1867, and #1871.
Zebra should stop trying to contact peers that haven't had a successful connection for 3 days. We've chosen this time to allow admins to restart their nodes after a weekend failure. (We might want to change this to a longer timeframe in a future upgrade, once Zebra is stable.)
Zebra should delete peers from the `AddressBook` where the:

- `last_success_time` is older than 3 days
- `last_success_time` is `None`, and the `untrusted_last_seen` is older than 3 days (requires the far-future fix in #1871, "Security: Zebra should stop believing far-future last_seen times from peers")

See `crawl_and_dial` for an example of this kind of address book task.
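To make the deletion rule concrete, here is a minimal sketch, assuming a simplified stand-in for `MetaAddr` and a plain `Vec` in place of the real `AddressBook`; the `PeerEntry` type, its fields, and `prune_old_peers` are illustrative, not Zebra's actual API:

```rust
use std::time::{Duration, SystemTime};

/// Simplified stand-in for Zebra's `MetaAddr` (hypothetical fields).
struct PeerEntry {
    last_success_time: Option<SystemTime>,
    untrusted_last_seen: Option<SystemTime>,
}

/// How long a peer may go without a successful connection before deletion.
const MAX_PEER_AGE: Duration = Duration::from_secs(3 * 24 * 60 * 60);

/// Returns true if the peer should be deleted under the 3-day rule.
fn should_delete(peer: &PeerEntry, now: SystemTime) -> bool {
    let cutoff = now - MAX_PEER_AGE;
    match (peer.last_success_time, peer.untrusted_last_seen) {
        // `last_success_time` is older than 3 days
        (Some(success), _) => success < cutoff,
        // `last_success_time` is `None`, and the `untrusted_last_seen` is older than 3 days
        (None, Some(seen)) => seen < cutoff,
        // No information at all: keep for now (a judgement call, not taken from the issue)
        (None, None) => false,
    }
}

/// Drop peers that meet the deletion criteria (stand-in for an `AddressBook` method).
fn prune_old_peers(address_book: &mut Vec<PeerEntry>, now: SystemTime) {
    address_book.retain(|peer| !should_delete(peer, now));
}
```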
Zebra should also:

- add methods to `MetaAddr` to simplify the interface to these times

To avoid sending old peers to other nodes, Zebra should:
To avoid accepting old peers from other nodes, Zebra should:
Property testing: check that `MetaAddr`s in the `AddressBook` have their `last_success_time`s and `untrusted_last_seen_time`s handled correctly.
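As a rough idea of what such a property test could look like, here is a sketch using the `proptest` crate against the simplified `should_delete` predicate from the earlier sketch (the real test would exercise Zebra's `MetaAddr` and `AddressBook` types directly):

```rust
use proptest::prelude::*;
use std::time::{Duration, SystemTime};

proptest! {
    /// Peers with a success time within the last 3 days are never deleted.
    #[test]
    fn recent_peers_are_kept(age_secs in 0u64..(3 * 24 * 60 * 60)) {
        let now = SystemTime::now();
        let peer = PeerEntry {
            last_success_time: Some(now - Duration::from_secs(age_secs)),
            untrusted_last_seen: None,
        };
        prop_assert!(!should_delete(&peer, now));
    }

    /// Peers whose only timestamp is an `untrusted_last_seen` older than 3 days are deleted.
    #[test]
    fn stale_never_successful_peers_are_deleted(extra_secs in 1u64..(365 * 24 * 60 * 60)) {
        let now = SystemTime::now();
        let peer = PeerEntry {
            last_success_time: None,
            untrusted_last_seen: Some(now - Duration::from_secs(3 * 24 * 60 * 60 + extra_secs)),
        };
        prop_assert!(should_delete(&peer, now));
    }
}
```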
For testing: add a `debug_peer_deletion_age` config that sets the deletion age and interval timer to 15 seconds.

Performance Analysis
If we check for old peers every time a new peer is requested, we could spend a lot of time checking for deletions. Instead, we should scan the address book at regular intervals in a new task.
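A minimal sketch of that periodic scan as its own task, using `tokio` (the shared `Vec` address book, the constant names, and the reuse of `PeerEntry` and `prune_old_peers` from the earlier sketch are assumptions for illustration, not Zebra's actual structure):

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, SystemTime};

/// How often the pruning task scans the address book.
/// (A test config like the suggested `debug_peer_deletion_age` could shrink this to 15 seconds.)
const PRUNE_INTERVAL: Duration = Duration::from_secs(60 * 60);

/// Spawn a background task that periodically deletes old peers,
/// instead of checking on every address book access.
fn spawn_prune_task(address_book: Arc<Mutex<Vec<PeerEntry>>>) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(PRUNE_INTERVAL);
        loop {
            // Waits for the next tick; the first tick completes immediately.
            ticker.tick().await;
            // The lock is not held across an `.await`, so a std `Mutex` is fine here.
            let mut peers = address_book.lock().expect("address book mutex poisoned");
            prune_old_peers(&mut peers, SystemTime::now());
        }
    })
}
```

Keeping the lock only for the duration of each scan means peer lookups are never blocked for longer than one pruning pass.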
Alternatives
This is a critical security issue, so we must do something.
We could keep a failure count for each peer. This design has usability issues on unreliable networks, because all the peers can fail at the same time. (Our reconnection rate limit and peer deletion timeout already limit us to ~2000 failed connections per peer over 3 days.)
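For a sense of scale (the per-peer figure below is worked out from the issue's own numbers, not taken from Zebra's config): 3 days is 259,200 seconds, so ~2000 failed connections per peer works out to roughly one attempt every ~130 seconds before the 3-day deletion window cuts the peer off.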
We could filter peers whenever the `AddressBook` is accessed, rather than deleting them. But this is tricky - it might need interior mutability. It would also require an abstraction layer, to make sure we intercept all accesses.

We could avoid deleting peers in `NeverAttempted...` states, so that we try each peer at least once before deleting it. But this would risk a memory DoS, if we get a lot of gossiped peers.

Context
`zcashd` does not have this issue. `zcashd` has a support interval of around 16 weeks between required upgrades, with new versions coming out every 6 weeks.

Follow-Up Tasks
This fix doesn't deal with denial of service from nodes that are regularly restarted; that's #1870.