ping: Repeated errors emitted from ping behaviour #4410
Comments
Also to add, a workaround for me would be to make sure that the peer condition isn't
Thank you for reporting @dariusc93. Seems like the ping handler is continuously trying to open a new stream to the remote (rust-libp2p/protocols/ping/src/handler.rs, lines 324 to 330 in e36e8ce), for that new outbound stream to then fail (rust-libp2p/protocols/ping/src/handler.rs, lines 182 to 211 in e36e8ce).
Ideally we would switch into `State::Inactive`. Do you know why the go-libp2p node drops the stream instead of denying the ping protocol? Can you post the logs of the go node here?
We do that already if the negotiation fails. I don't think we should switch to `State::Inactive` for IO errors. I am not sure we should do anything different for subsequent IO errors other than reporting them? As of #3947, it is the user's responsibility to implement a connection management policy based on the ping protocol.
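For context, such a user-side policy might look roughly like the sketch below. This is only an illustration: `MAX_PING_FAILURES` and `handle_ping_event` are made-up names, it assumes the `ping::Event` fields `peer` and `result`, and it uses `Swarm::disconnect_peer_id` as a coarse stand-in for closing the offending connection.

```rust
use std::collections::HashMap;

use libp2p::{ping, swarm::NetworkBehaviour, PeerId, Swarm};

/// Arbitrary threshold; pick whatever fits the application.
const MAX_PING_FAILURES: u32 = 3;

/// Hypothetical helper: call this for every `ping::Event` surfaced by the swarm.
fn handle_ping_event<B: NetworkBehaviour>(
    swarm: &mut Swarm<B>,
    failures: &mut HashMap<PeerId, u32>,
    event: ping::Event,
) {
    match event.result {
        // A successful ping resets the failure counter for that peer.
        Ok(_rtt) => {
            failures.remove(&event.peer);
        }
        // After too many consecutive failures, drop the peer entirely.
        Err(_error) => {
            let count = failures.entry(event.peer.clone()).or_insert(0);
            *count += 1;
            if *count >= MAX_PING_FAILURES {
                failures.remove(&event.peer);
                let _ = swarm.disconnect_peer_id(event.peer);
            }
        }
    }
}
```

A finer-grained policy could close only the affected connection via `Swarm::close_connection`, as mentioned elsewhere in this issue, instead of disconnecting the peer as a whole.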
I haven't looked too much into it yet, but I suspect it's related to the multiple connections being made, something go-libp2p doesn't like, from what I gathered. At least it is related to the ping behaviour. I can look more into it a little later though and collect the logs from the go-libp2p side.
While we can report them, if the failure rate continues to go up for that handler without resetting, I do believe it should be made inactive for obvious reasons, rather than waiting for the connection to be closed by the user, especially if that connection is actively used by other behaviours (which seems to be the case from what I can tell). Of course, if the user wishes to close it, that is their choice too, but most beginners may not know that, and may end up being subjected to a DoS from the recurring events if they run into this issue.
What would you say is a reasonable setting for this? I am struggling to come up with a good value here.
What DoS case are you seeing? The worst I can think of is some kind of log spam if the user decided to log every error that is coming in.
I would probably say anything over 10 might be reasonable, since I don't think there is a case where there would be more than 10 consecutive errors before the rate actually resets, although setting it as high as 50 could also be reasonable.
From what I've observed, the biggest one would be high CPU usage. If there is more than one handler reporting errors like described above (which can be the case), the usage would be higher, with the potential of rendering the host unusable or unstable as a result (assuming the node doesn't get killed by the system first).
I am confused about that. Why would the CPU usage be high? I think the actual bug here is that we don't have a delay in the ping state machine on an error (rust-libp2p/protocols/ping/src/handler.rs, lines 290 to 332 in b9028e8).
If we fail to open an outbound ping stream, we should still wait for the ping delay before we try again and not hammer the node with another stream.
That could be the reason for the high CPU usage: it's trying again right away, only to get an error and continue emitting events from the connection handler. I wasn't sure if it was intended to send again right after a failure though (I haven't looked at the ping spec to determine that). Something like the following should help with that:

```diff
diff --git a/protocols/ping/src/handler.rs b/protocols/ping/src/handler.rs
index 0cf5c6e5..d9f14e4a 100644
--- a/protocols/ping/src/handler.rs
+++ b/protocols/ping/src/handler.rs
@@ -149,6 +149,8 @@ pub struct Handler {
state: State,
/// The peer we are connected to.
peer: PeerId,
+ /// delay after a failure is reported
+ failure_delay: Option<Delay>,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
@@ -176,6 +178,7 @@ impl Handler {
outbound: None,
inbound: None,
state: State::Active,
+ failure_delay: None,
}
}
@@ -271,6 +274,12 @@ impl ConnectionHandler for Handler {
}
loop {
+ if let Some(delay) = self.failure_delay.as_mut() {
+ if delay.poll_unpin(cx).is_pending() {
+ return Poll::Pending;
+ }
+ }
+
// Check for outbound ping failures.
if let Some(error) = self.pending_errors.pop_back() {
log::debug!("Ping failure: {:?}", error);
@@ -283,10 +292,13 @@ impl ConnectionHandler for Handler {
// that use a single substream, since every successful ping
// resets `failures` to `0`.
if self.failures > 1 {
+ self.failure_delay = Some(Delay::new(Duration::from_secs(5)));
return Poll::Ready(ConnectionHandlerEvent::NotifyBehaviour(Err(error)));
}
}
+ self.failure_delay = None;
+
// Continue outbound pings.
match self.outbound.take() {
Some(OutboundState::Ping(mut ping)) => match ping.poll_unpin(cx) {
```

(The duration is just an example; it could be set to a second, which should be more than enough, but it might be better to move it elsewhere.) I still do feel that if the failure rate does continue to increase, the handler should be made inactive, since it makes no sense for it to continue opening a stream if all that will be returned is an error after so many tries. Would there be any objections against such an implementation? E.g. if the rate reaches 20, set it to inactive. That should be more than enough.
Thanks for the suggestion. Mind opening that as a PR? The thing I'd change is that, instead of introducing another delay, we can simply poll the same interval we already have, also in the failure case. Regarding the max failures, I am a bit unsure:
The idea of the ping protocol is that it is extremely simple. If a node cannot even handle that, the connection is likely broken and you can / should close it. Whether you do that after the first, second or 20th error is up to you, but I don't see why we should add an extra code path just for the ping protocol to be disabled. With the above bug-fix, we should no longer exhaust the resource limits of other nodes as easily.
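For illustration, reusing the existing interval could look roughly like the standalone sketch below. This is not the actual handler code; `PingTimer` and its methods are made-up names, and the only assumption is that one `Delay` drives both the regular ping interval and the back-off after a failed outbound stream.

```rust
use std::{
    task::{Context, Poll},
    time::Duration,
};

use futures::FutureExt;
use futures_timer::Delay;

/// Simplified sketch: a single timer paces both regular pings and retries.
struct PingTimer {
    delay: Delay,
    period: Duration,
}

impl PingTimer {
    fn new(period: Duration) -> Self {
        Self { delay: Delay::new(period), period }
    }

    /// On an outbound failure, re-arm the existing timer instead of opening a
    /// new stream immediately, so the next attempt waits a full period.
    fn on_failure(&mut self) {
        self.delay.reset(self.period);
    }

    /// Ready when it is time to open the next outbound ping stream.
    fn poll_next_ping(&mut self, cx: &mut Context<'_>) -> Poll<()> {
        match self.delay.poll_unpin(cx) {
            Poll::Ready(()) => {
                self.delay.reset(self.period);
                Poll::Ready(())
            }
            Poll::Pending => Poll::Pending,
        }
    }
}
```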
I went ahead and made a PR (though I will go back shortly and change it to use the interval there), but I will leave out the snippet regarding the state change after reaching a specific failure rate, since the code above should prevent the high CPU usage. While I can agree that a connection in this state is likely broken and should be closed, I have also seen other protocols like gossipsub operate on the same connection, so it's hard to really say.
Summary
I don't know if this is considered a bug in the handler of the ping protocol, or if something is going on between go-libp2p and rust-libp2p, but dialing a go-libp2p node that is discovered via mDNS (in my case, indirectly by adding an explicit peer to gossipsub) causes one or more of the connections to report a "connection is closed" and/or "eof" error from the ping event. The other node does support the ping protocol, but even so, the error would continue to be emitted unless the connection is closed via
Swarm::close_connection
. However, this may or may not yield undesirable results when interacting with the node in some way if the connection handler is still operational in a different behaviour or protocol.
Debug Output
A lot is omitted because the error repeats; I can provide the sample code used if requested.
Possible Solution
Maybe change the state of the connection handler to inactive to prevent more errors from being reported to the swarm, especially if there is no change in the failure rate or the failure rate is high (anything over 10 might be worth making inactive?), though this doesn't exactly target the cause of the "connection is closed" error.
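As a rough illustration only, that could look something like the sketch below; the `State` enum shape, the `MAX_FAILURES` threshold, and the `on_ping_failure` helper are assumptions for the example, not the handler's actual API.

```rust
/// Illustrative only: stop pinging after too many consecutive failures.
/// The names below are made up for this sketch.
const MAX_FAILURES: u32 = 10;

enum State {
    /// No further outbound pings; `reported` avoids emitting repeated events.
    Inactive { reported: bool },
    /// Outbound pings continue as usual.
    Active,
}

/// Called whenever an outbound ping fails.
fn on_ping_failure(failures: &mut u32, state: &mut State) {
    *failures += 1;
    if *failures >= MAX_FAILURES {
        // The handler stops opening new ping streams; the connection itself
        // stays open for other protocols (e.g. gossipsub) to keep using.
        *state = State::Inactive { reported: false };
    }
}
```

Whether the handler or the user should make this call is exactly the trade-off discussed in the comments above.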
Version
Would you like to work on fixing this bug?
Maybe, depending on the best way to go about this besides changing the state