swarm: potential issue with dial limiter #1553
Comments
Digging through the data, there are a few things I see:
Yeah, big +1 here. I honestly think 5s should be plenty here.
Yeah, this is a good idea. We will have to come up with a good heuristic here, but it should be pretty simple to do.
So I just realized that there may be another problem: Example:
This is another thing that could be fixed by making timeouts more granular or smarter.
Damn. Yeah, that dial timeout is in the wrong place. It should be per connection. However, moving the dial timeout may make things worse: a single peer with a ton of addresses could clog the dialer forever. So we may need two timeouts: a max time dialing a single peer and a max time per connection.
Not really, as we allow only 8 concurrent dials per peer at a time, but yeah, it's still an issue.
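To make the per-peer cap concrete, here is a rough sketch of a limiter built on a buffered channel. The limit of 8 comes from the comment above; `perPeerLimiter`, `run`, and `maxConcurrent` are hypothetical names, and this is not how the swarm's limiter is actually written.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// perPeerLimiter caps how many dials to one peer may run at once.
type perPeerLimiter struct {
	slots chan struct{}
}

func newPerPeerLimiter(n int) *perPeerLimiter {
	return &perPeerLimiter{slots: make(chan struct{}, n)}
}

// run blocks until a slot is free, runs dial, then frees the slot.
func (l *perPeerLimiter) run(dial func()) {
	l.slots <- struct{}{}
	defer func() { <-l.slots }()
	dial()
}

// maxConcurrent pushes total simulated dials through a limiter and
// returns the highest number that were in flight at the same time.
func maxConcurrent(limit, total int) int {
	l := newPerPeerLimiter(limit)
	var mu sync.Mutex
	var wg sync.WaitGroup
	inflight, peak := 0, 0
	for i := 0; i < total; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.run(func() {
				mu.Lock()
				inflight++
				if inflight > peak {
					peak = inflight
				}
				mu.Unlock()
				time.Sleep(time.Millisecond) // simulated dial
				mu.Lock()
				inflight--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// A peer with 50 addresses never has more than 8 dials in flight.
	fmt.Println("peak concurrent dials:", maxConcurrent(8, 50))
}
```

The cap keeps one peer from taking every slot, but the remaining dials still queue behind the first 8, which is why the delay remains an issue.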
Status: We now have this.
@whyrusleeping It's possible that in your second example there were one or more dials parked by the limiter due to FD limits being exceeded. I'm not sure where the tracing kicks in within your logging gadget, but if it starts only when the dial is actually in flight, that would explain why the trace was 1 min long but contained only two adjacent ~0.5ms spans.
So this isn't necessarily an implementation bug, but rather a design bug. While investigating connectivity issues, I put together a log scraper/coalescer that gathers information about each overall dial attempt and each of its individual dials. The full log is here, and should be pretty interesting for anyone who cares about dials.
I'm making this issue because I see a lot of dials like this one:
Where the overall dial takes about 30 seconds, but the sum of the dials involved takes only ~3.5s.
Then there's also this one:
Where the overall dial operation takes a full minute, but only contains two dials, each of which took ~0 time.