htlcswitch: introduce new LinkLivenessChecker interface and ping implementation #2992
Comments
I can take this on.
With the new synchronous link hand-off, I don't think this is needed anymore? @joostjager @cfromknecht?
I added the following comment on #6467 though it could be applicable here.
We'd likely want to do this as well: #3003. That'll ensure that we actually disconnect if peers don't reply to ping messages. There's likely some shared surface area between the implementation of that and this.
Bumping this, might be the solution to detect some of the stale Tor connections we run into at times.
Maybe this can also prevent cascading force closes (FC) from happening in the first place, similar to #7683.
@BhaagBoseDK potentially, if the cause is a stale TCP connection we haven't identified. If the cascade is due to things taking longer than the CLTV delta to resolve on the outgoing link, then there's no way to avoid also force closing the incoming link, since at that point there's a race condition.
Given the other issues we've seen with stuck state machines in lnd, does it make sense to detect this based on the state machine and not (just) ping/pong? Bit of a bandaid but at least it'll resolve the worst case outcome.
@TheBlueMatt so far I haven't seen any logs or goroutine stack traces that indicate the state machine is actually deadlocked or live locked on something. Assuming the main loop is unblocked and just waiting on a signal, then it'll disconnect after 1 minute (configurable). At that point, assuming either side is actually reachable (tor caveats apply, etc), then a reconnect should happen eventually. The idea of this issue was that we might be able to detect a stale connection sooner, and just avoid sending out a new HTLC altogether (cancel back a potential forward). In the scenario above, if the link doesn't come back in time, then we should cancel back the HTLC (HTLC got in the mailbox of the outgoing link, but it never ACK'd it so we cancel back): lines 695 to 700 and lines 515 to 553 in 51a23b0.
I'm not super familiar with lnd's internals, so I have no idea what would cause it, all I can say is I've seen two instances of confirmed-LND peers which were at least alive enough to ping-pong for days across a TCP socket (and one was able to log Indeed, disconnecting from a peer that isn't responsive to ping for 10-20 seconds is pretty important :)
I am currently working on this issue. The current state of things is that #3003 has been solved with #7828, awaiting merge. This will ensure that invalid or untimely pong responses cause LND to disconnect from the peer. Implicitly this should mean that every connection we have is live. This information can be as stale as the ping interval, though, and we may wish to require a more stringent version of this check for messages that update the channel state. It seems that the main bulk of this work is to make the htlcswitch verify the liveness of the peer before attempting to queue any irrevocable messages for sending.
There are potentially other things that can be done here though. While we do enforce pong responses for our peers as of #7828, it need not be our only source of information when it comes to assessing liveness. In fact, any message we receive from the peer should imply that we have a live connection to them. However, if we need to assess whether the Link itself is live then there may be additional things we need to check.
fwiw I've seen one or two cases of a (claimed) LND 16.3/16.4 peer which appears to stop responding to channel commitment messages for hours still :(.
The behavior as of 0.16.3 is to actually disconnect/recycle when we detect that things are stalled (no reply to a If for w/e reason the lnd node is the one that's stalled (exactly why, idk), then I'd expect w/e other implementation to attempt to recycle the connection. If that doesn't do the trick, then maybe would point to a hardware/networking/tor/database issue with the peer. In all these instances, is the peer solely accepting and creating connections over Tor?
No, in this case it was a Big Iron node processing a pretty substantial volume that was sending some update messages and we send a CS and then things went silent for a day. They claimed a day later to be on 16.4rc, but it's possible they had just upgraded or were misinformed. We've also now shipped code to disconnect after a minute or so of silence so I doubt we'll nag about this one again, but even a minute is a UX-destroying timeline :(.
We probably need a different issue to cover that since this one is about LND's ability to assess the presence of the peer, not its ability to respond to peer messages.
Today in lnd we don't attempt to perform any sort of liveness check on the target link before we attempt to initiate or forward a multi-hop payment. It could be the case that by the time we go to send the UpdateAddHTLC and corresponding CommitSig message, the link has already died (peer disconnect). Today, if this happens, the switch doesn't detect it and the HTLC dangles there uncommitted. Although it wasn't added to the outgoing commitment transaction, we don't yet cancel it back on the incoming link. In general, the switch should be aware that this change was never actually fully committed and cancel it back, but the current auto-resend mechanism in the link after a commitment diff is written prevents it from doing this. A candidate for lower hanging fruit is to implement this LinkLivenessChecker to attempt to prevent something like this from happening in the first place.
A potential interface resembles something like the following:
The abstract interface will allow the switch to determine if a link is even eligible to forward an HTLC. Depending on the initial implementation, we may also want to combine this with the concept of an "active" or "eligible to forward" link.
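As a minimal sketch of what such an interface could look like: the names ChannelID, IsLinkLive and staticChecker below are assumptions made for illustration and are not taken from lnd's actual code.

```go
package main

// ChannelID is a simplified stand-in for a 32-byte channel identifier
// (an assumption for this sketch, not lnd's real type).
type ChannelID [32]byte

// LinkLivenessChecker is a hypothetical shape for the interface: the
// switch asks it whether a link is believed live before using that link.
type LinkLivenessChecker interface {
	// IsLinkLive reports whether the link for chanID is currently
	// believed to be live and therefore eligible to carry a new HTLC.
	IsLinkLive(chanID ChannelID) bool
}

// staticChecker is a trivial implementation backed by a fixed set of
// channels marked live. It exists only to show how the interface is used.
type staticChecker map[ChannelID]bool

// IsLinkLive returns true only for channels explicitly marked live.
func (s staticChecker) IsLinkLive(chanID ChannelID) bool {
	return s[chanID]
}
```

Keeping the interface this small would leave the switch free to swap in a ping/pong based implementation later without changing its forwarding code.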
A simple liveness checking mechanism, which is used by some of the other implementations, is to send a Ping message to the channel peer if we haven't heard from them in a while. The time period that constitutes "a while" has yet to be parameterized. If we hear a Pong message back from the remote peer, then we can assume that it's "live".
It's worth noting that the above mechanism doesn't really tell us if the link is live or not, but rather whether the incoming message ingestion loop of our peer is. A channel's state machine could be borked, yet the node is still able to read in messages with no problem.
Steps To Completion
Create the new LinkLivenessChecker interface.
Implement a ping/pong checker for this new interface.
Consolidate the existing active and eligible to forward logic within the switch with this new method.
During the process of forwarding, the switch should now (indirectly or directly) query this new interface.
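The last step, querying the checker during forwarding, might look something like this sketch. The forwardGate type, the linkChecker interface, the deliver callback, and the use of uint64 IDs are all assumptions for illustration; lnd's real switch is considerably more involved.

```go
package main

import "errors"

// errLinkNotLive is returned when the outgoing link fails the liveness
// check, so the caller can cancel the HTLC back on the incoming link.
var errLinkNotLive = errors.New("outgoing link is not live")

// linkChecker is the hypothetical liveness query used by the gate.
type linkChecker interface {
	IsLinkLive(chanID uint64) bool
}

// liveSet is a trivial linkChecker backed by a set of live channel IDs.
type liveSet map[uint64]bool

// IsLinkLive returns true only for channels present in the set.
func (s liveSet) IsLinkLive(chanID uint64) bool { return s[chanID] }

// forwardGate checks liveness before handing an HTLC to the outgoing link.
type forwardGate struct {
	checker linkChecker
	// deliver hands the HTLC to the outgoing link's mailbox (assumed).
	deliver func(chanID, htlcID uint64) error
}

// Forward refuses to queue the HTLC if the outgoing link is not live.
// Nothing has been committed at that point, so cancelling back is safe.
func (g *forwardGate) Forward(chanID, htlcID uint64) error {
	if !g.checker.IsLinkLive(chanID) {
		return errLinkNotLive
	}
	return g.deliver(chanID, htlcID)
}
```

Doing the check before delivery means the HTLC never reaches the outgoing link's mailbox when the link is believed dead, which is the dangling case described above.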