QUIC connections are culled... sometimes #9061
Comments
There is no transport prioritization logic in libp2p; it just fires connections concurrently and the first one to work is used. We don't expect any particular pattern here, so whatever it ends up doing is expected.
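For readers unfamiliar with that dialing behaviour, here is the general "dial every transport concurrently, keep the first success" pattern written out as a plain Go sketch. It only illustrates the idea described in the comment above, not go-libp2p's actual dialer; the per-transport dial functions are placeholders supplied by the caller.

```go
package dialsketch

import (
	"context"
	"errors"
	"net"
)

// dialFirst dials over every given transport concurrently and returns the
// first connection that succeeds; later successes are closed. This mirrors
// the "no prioritization, first transport to work wins" behaviour described
// above (a simplified sketch, not go-libp2p's dialer).
func dialFirst(ctx context.Context, dials ...func(context.Context) (net.Conn, error)) (net.Conn, error) {
	if len(dials) == 0 {
		return nil, errors.New("no transports to dial")
	}

	ctx, cancel := context.WithCancel(ctx)

	type result struct {
		conn net.Conn
		err  error
	}
	results := make(chan result, len(dials))
	for _, dial := range dials {
		dial := dial
		go func() {
			c, err := dial(ctx)
			results <- result{c, err}
		}()
	}

	var firstErr error
	for i := 0; i < len(dials); i++ {
		r := <-results
		if r.err == nil {
			cancel() // abort the remaining dials
			// Close any connections that still complete after the winner.
			go func(remaining int) {
				for j := 0; j < remaining; j++ {
					if late := <-results; late.err == nil {
						late.conn.Close()
					}
				}
			}(len(dials) - 1 - i)
			return r.conn, nil
		}
		if firstErr == nil {
			firstErr = r.err
		}
	}
	cancel()
	return nil, firstErr
}
```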
That's okay, and I don't really care whether I hold QUIC or TCP connections. My problem is rather about why the node decides to close a large number of connections sometimes, and what I could do to avoid this.
We're going to prioritize looking at #9041 first, and then see where that leaves this.
Well, we moved to a different provider and gave it all some more CPU, which seems to have fixed this. It's unclear if it was the networking at our old provider, or if it was CPU time...

EDIT: scratch that, we're running v0.17, but with:

EDIT2: scratch that again, sorry... we were running v0.16, but with the above settings.
@mrd0ll4r I've somehow missed:

```json
"ConnMgr": {
  "GracePeriod": "0s",
  "HighWater": 100000,
  "LowWater": 0,
  "Type": "basic"
},
```

This tells the connection manager to instantly close all connections once you are above 100K. If you don't want the connection manager to close connections at all, you can disable it:

```json
"ConnMgr": {
  "Type": "none"
},
```
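To make the watermark semantics concrete, here is a simplified Go sketch of the trimming rule the basic connection manager applies (my own illustration of the documented LowWater/HighWater/GracePeriod behaviour, not go-libp2p's actual implementation): nothing happens until the connection count exceeds HighWater; above that, unprotected connections older than GracePeriod are closed until only LowWater remain.

```go
package connmgrsketch

import "time"

// conn is a stand-in for an open libp2p connection.
type conn struct {
	opened    time.Time
	protected bool
	close     func()
}

// trim sketches the basic connection manager's rule: below HighWater nothing
// happens; above it, unprotected connections older than GracePeriod are
// closed (lowest-value first in the real implementation) until only LowWater
// connections remain.
func trim(conns []conn, lowWater, highWater int, gracePeriod time.Duration) {
	if len(conns) <= highWater {
		return
	}
	toClose := len(conns) - lowWater
	now := time.Now()
	for _, c := range conns {
		if toClose == 0 {
			return
		}
		if c.protected || now.Sub(c.opened) < gracePeriod {
			continue // grace period and protection exempt a connection
		}
		c.close()
		toClose--
	}
}
```

With LowWater 0 and GracePeriod "0s", crossing the 100K HighWater therefore closes essentially every unprotected connection at once, which matches the behaviour described above; Type "none" disables the trimming entirely.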
At the time this came up, that didn't work, and I don't know if it does now. I think I tried a few configurations with the connection manager, and this one behaved the most like I wanted, as in, it takes all the connections it can get. So far, I haven't reached 100k yet, but I'll let you know what happens if I do :)
Another update: The connection drops are back, on new hardware with new networking, so I don't think it's related to that anymore. You can have a look at one of them here: https://grafana.monitoring.ipfs.trudi.group/?orgId=1&from=1674369895849&to=1674377132096

I have access to the nodes, so I could get pprof info and whatnot, but I don't know when these things happen, so... not really.

EDIT: More info: They reappeared when we upgraded from 0.16 to 0.17.
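Since the drops are hard to catch in the act, one option is a small watcher that polls the node's metrics and saves pprof profiles the moment the peer count falls sharply. Below is a rough Go sketch of that idea; it assumes the default kubo API address 127.0.0.1:5001, the /debug/metrics/prometheus and /debug/pprof/ endpoints served there, and the ipfs_p2p_peers_total metric from the graphs in this issue. The 15-second interval and 20% drop threshold are arbitrary placeholders.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"strconv"
	"strings"
	"time"
)

const api = "http://127.0.0.1:5001" // default kubo API address (assumption)

// peerCount sums all ipfs_p2p_peers_total samples (one per label set)
// from the node's Prometheus endpoint.
func peerCount() (float64, error) {
	resp, err := http.Get(api + "/debug/metrics/prometheus")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var total float64
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "ipfs_p2p_peers_total") {
			continue
		}
		fields := strings.Fields(line)
		if v, err := strconv.ParseFloat(fields[len(fields)-1], 64); err == nil {
			total += v
		}
	}
	return total, sc.Err()
}

// saveProfile writes one pprof profile (e.g. "goroutine?debug=2", "heap") to disk.
func saveProfile(name, path string) error {
	resp, err := http.Get(api + "/debug/pprof/" + name)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	var last float64
	for range time.Tick(15 * time.Second) {
		cur, err := peerCount()
		if err != nil {
			log.Println("scrape failed:", err)
			continue
		}
		// Capture profiles when the peer count drops by more than 20%.
		if last > 0 && cur < 0.8*last {
			ts := time.Now().Format("20060102-150405")
			log.Printf("peer count dropped %.0f -> %.0f, capturing profiles", last, cur)
			_ = saveProfile("goroutine?debug=2", fmt.Sprintf("goroutines-%s.txt", ts))
			_ = saveProfile("heap", fmt.Sprintf("heap-%s.pprof", ts))
		}
		last = cur
	}
}
```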
This seems like a libp2p issue, so looping in @marten-seemann @MarcoPolo
Thanks for not giving up on this :)

Update: I upgraded to v0.18, which made the entire situation much worse. I then changed our plugin to not keep long-living Bitswap streams to everyone, which has reduced CPU time considerably. I suppose I'm hitting (other peers') resource manager limits, which may be why this has become more expensive recently. Interestingly, however: while I was running 0.17, CPU usage was fine.

Anyway, now we're really just passively sitting there and listening. Connection drops have become less frequent. So, my current theory is that it's somehow CPU related, not sure exactly how though. In general, the machine running this is never overloaded, looking at CPU time at least...

Dashboard to look at stuff: https://grafana.monitoring.ipfs.trudi.group/?orgId=1&from=now-30d&to=now&refresh=5m
Not sure I understand what the issue is here. Is it connections being killed by your node, or by remote nodes? If they're being killed by your node, the answer probably lies in the connection manager, as it's the only place in libp2p where we kill connections unless explicitly asked to by the application. If it's by the remote node, this is hard to debug, as we haven't gotten around to implementing libp2p/specs#479 yet.
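If the culling does come from the local connection manager, one mitigation on the application side is to mark the connections a measurement plugin actually depends on as protected, which exempts them from trimming. A minimal sketch against go-libp2p's ConnManager interface; peersToKeep and the "measurement" tag are made-up placeholders:

```go
package protectsketch

import (
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
)

// protectPeers exempts the given peers from connection-manager trimming.
// Protected connections are skipped when the manager culls down to LowWater.
func protectPeers(h host.Host, peersToKeep []peer.ID) {
	for _, p := range peersToKeep {
		h.ConnManager().Protect(p, "measurement") // tag name is arbitrary
	}
}

// unprotectPeers releases the protection again once it is no longer needed.
func unprotectPeers(h host.Host, peersToKeep []peer.ID) {
	for _, p := range peersToKeep {
		h.ConnManager().Unprotect(p, "measurement")
	}
}
```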
As it's a large number of connections at the same time, I'd be very surprised if it were caused by the (unsynchronized) rest of the network, so I'm guessing it's my node. Interestingly, however, it happens on both of our nodes at the same time. They run on the same host, which leads me to think it's either load or networking related, on that node.

The idea being that the network is some 40 to 50k peers at the moment, so 100k should never be hit. And indeed, I never hit 100k, but connections are still culled. On the other hand, my usage is not exactly normal, so... yeah. I don't want to take up your resources with this. It's weird and messes with our size estimates, but it's not terrible or anything.
@mrd0ll4r a lot has changed since 0.18.0, please upgrade to 0.18.1 and try (in order):

This will help us with keeping debugging/discussion on track and scoped to a specific bug.
Checklist
Installation method
built from source
Version
Config
Description
Attaching prometheus metrics for ipfs_p2p_peers_total over some time:

I have the ConnMgr configured as described in #9041, which generally seems to work.
In almost all the cases, when QUIC connections are culled, they are replaced with connections from other transports.
(The IPv6 case above had both culled, but that might have been networking-related.)
Also, as can be seen, this happens independently for IPv4 and IPv6.
In many (maybe all) of the cases, I see some other things at the same time:

- Prometheus (running on the same host, so probably not network-related) fails to get a few scrapes.
- The memory usage (go_memstats_alloc_bytes) of the node spikes, but the VM has 16 GiB of memory, and there's not much else running, so I don't think that's the root cause, maybe more of a symptom.
- CPU usage spikes (and this might actually hit the limit on the VM, which has 4 cores).
- A large GC pause occurs (this is irate(go_gc_duration_seconds_sum[1m])).
- The number of FDs jumps, which indicates that new TCP connections are opened (since QUIC probably uses one FD for all connections, and TCP uses one per connection, iirc).
I know our usage is not exactly standard, but it would still be nice to figure out what's going on here.
We're running with IPFS_FD_MAX=100000 LIBP2P_SWARM_FD_LIMIT=100000 and warn logging. I think I didn't see anything suspicious in the logs at that level, but we have quite a lot of logging going on, so that's not 100% certain.
Let me know how I can further help diagnose this issue!