Make concurrent dialing more aggressive #3656
@@ -31,6 +31,9 @@ use std::{sync::Arc, time::Duration};

pub use libp2p::bandwidth::BandwidthSinks;

/// Timeout after which a TCP connection attempt is considered failed.
const TCP_DIAL_TIMEOUT: Duration = Duration::from_secs(20);
dq: Is the default timeout on the order of minutes?
On my Linux box it's 2 min 10 secs.
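For illustration, here is a minimal sketch of what an application-level dial timeout does, using only the Rust standard library rather than the libp2p transport touched by this PR (the PR's actual wiring is not shown in the hunk above, so this is an assumption about the general idea, not the implementation). The helper name `dial_with_timeout` is hypothetical; `TCP_DIAL_TIMEOUT` mirrors the constant in the diff.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::time::Duration;

/// Same value as the constant added in the diff.
const TCP_DIAL_TIMEOUT: Duration = Duration::from_secs(20);

/// Hypothetical helper: give up on a TCP connection attempt after
/// `TCP_DIAL_TIMEOUT` instead of waiting for the OS default.
fn dial_with_timeout(addr: &SocketAddr) -> io::Result<TcpStream> {
    TcpStream::connect_timeout(addr, TCP_DIAL_TIMEOUT)
}

fn main() -> io::Result<()> {
    // Bind a local listener on an ephemeral port so there is something to dial.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    let stream = dial_with_timeout(&addr)?;
    println!("connected to {}", stream.peer_addr()?);
    Ok(())
}
```

A dial to an address that silently drops packets would return an error after at most 20 seconds here, rather than after the roughly two minutes quoted above.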
.max_negotiating_inbound_streams(2048)
// Increase the default dial concurrency factor 8 to 16 to help with cases where DHT
// has plenty of stale peer addresses.
.dial_concurrency_factor(NonZeroU8::new(16).expect("0 < 16 < 256; qed"));
dq: I think this is fine, but do you think there is a chance we might run into issues with the total number of file descriptors opened for the concurrent requests? 🤔
Yes, this is theoretically possible, especially for nodes with a high out-peer count. All my systems show ulimit -n as 1024. That means at most roughly 125 peers with bloated DHT records could be dialed simultaneously before the change, and roughly 60 after it.
I don't think the practical probability of hitting that many peers with bloated DHT records at once is high, but it's at least something to keep in mind.
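The file descriptor estimate above is a simple division, sketched below. The function name is hypothetical, and the model assumes that every peer being dialed uses the full concurrency factor and that each in-flight dial holds exactly one descriptor. The exact quotients are 128 and 64; the 125 and 60 quoted above presumably leave some headroom for descriptors the process already holds.

```rust
/// Upper bound on the number of peers that can be dialed at the same time
/// when each peer consumes `dial_concurrency_factor` descriptors.
/// Assumption: one file descriptor per in-flight dial, nothing else open.
fn max_peers_dialed_at_once(fd_limit: u32, dial_concurrency_factor: u32) -> u32 {
    fd_limit / dial_concurrency_factor
}

fn main() {
    // `ulimit -n` value reported in the comment.
    let fd_limit = 1024;

    for factor in [8, 16] {
        println!(
            "factor {:2}: at most {} peers dialed at once",
            factor,
            max_peers_dialed_at_once(fd_limit, factor)
        );
    }
}
```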
LGTM but I think @bkchr's suggestion of pruning addresses more aggressively would be worth exploring as well. The number of known addresses shown in #3519 (comment) looks like a bug to me.
Yes, hopefully that one is fixed by #3657, which you already reviewed.
As this doesn't help with reaching validators with stale DHT records, I don't think we should increase the default concurrency factor. As for the TCP dial timeout, I'm not that sure, but unless we see any benefits from setting it to 20 secs, I would keep everything as is. So, I'm inclined to close this PR.
This PR changes the default value for concurrent dials from 8 to 16, and sets the TCP dial timeout to 20 seconds. This should help when connecting to nodes that have a lot of stale addresses in the DHT.
This is a remediation for #3519 (collator side).
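To make the intended effect concrete, here is a rough back-of-the-envelope model, not a description of how libp2p actually schedules dials. It assumes every stale address hangs until the timeout, that exactly `factor` attempts are kept in flight, and that addresses are tried in order; the function name and the example of 40 known addresses with only the last one live are hypothetical.

```rust
use std::time::Duration;

/// Simplified model: time that passes before the first live address
/// (at 1-based `live_position` in the address list) gets a dial slot.
/// Assumption: all earlier addresses are stale and hang until `timeout`.
fn delay_before_live_dial(live_position: u32, factor: u32, timeout: Duration) -> Duration {
    assert!(live_position >= 1 && factor >= 1);
    // Number of full "waves" of stale dials that must time out first.
    let waves = (live_position - 1) / factor;
    timeout * waves
}

fn main() {
    let os_default = Duration::from_secs(130); // the 2 min 10 secs reported in review
    let pr_timeout = Duration::from_secs(20);

    for (factor, timeout) in [(8, os_default), (8, pr_timeout), (16, pr_timeout)] {
        let delay = delay_before_live_dial(40, factor, timeout);
        println!(
            "factor={:2} timeout={:3}s -> live address dialed after {}s",
            factor,
            timeout.as_secs(),
            delay.as_secs()
        );
    }
}
```

Under these assumptions, the combination of a shorter timeout and a higher factor shortens the wait considerably, though the model says nothing about peers whose DHT records contain no live address at all.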