
htlcswitch: introduce new LinkLivenessChecker interface and ping implementation #2992

Open
4 tasks
Roasbeef opened this issue Apr 23, 2019 · 16 comments
Labels
enhancement (improvements to existing features / behaviour), htlcswitch, intermediate (suitable for developers moderately familiar with the codebase and LN), p2p (code related to the peer-to-peer behaviour), P3 (might get fixed, nice to have)

Comments

@Roasbeef
Member

Today in lnd we don't attempt to perform any sort of liveness check for the target link before we attempt to initiate or forward a multi-hop payment. It could be the case that by the time we go to send the UpdateAddHTLC and corresponding CommitSig message, the link has already died (peer disconnect). Today, if this happens, then the switch doesn't detect it and the HTLC dangles there uncommitted. Although it was never added to the outgoing commitment transaction, we don't yet cancel it back on the incoming link. In general, the switch should be aware that this change was never actually fully committed and cancel it back, but the current auto-resend mechanism in the link after a commitment diff is written prevents it from doing this. A candidate for lower-hanging fruit is to implement this LinkLivenessChecker to attempt to prevent something like this from happening in the first place.

A potential interface resembles something like the following:

type LinkLivenessChecker interface {
    IsLive(chanPoint wire.OutPoint, chanPeer lnpeer.Peer) bool
}

The abstract interface will allow the switch to determine if a link is even eligible to forward an HTLC. Depending on the initial implementation, we may also want to combine this with the concept of an "active" or "eligible to forward" link.

A simple liveness checking mechanism, which is used by some of the other implementations, is to send a Ping message to the channel peer if we haven't heard from them in a while. The time period that constitutes "a while" has yet to be parameterized. If we hear a Pong message back from the remote peer, then we can assume that it's "live".

It's worth noting that the above mechanism doesn't really tell us if the link is live or not, but rather whether the incoming message ingestion loop of our peer is. A channel's state machine could be borked, yet the node still able to read in messages without a problem.
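
To make the shape of this more concrete, here's a minimal sketch of what a ping-based checker could look like. Everything in it is hypothetical: the pingChecker type, the pingPeer interface, and the staleThreshold/pongTimeout knobs are stand-ins, and the simplified outPoint and pingPeer types are used in place of wire.OutPoint and lnpeer.Peer purely to keep the sketch self-contained.

    package livenesscheck

    import "time"

    // outPoint is a simplified stand-in for wire.OutPoint, used only to keep
    // this sketch self-contained.
    type outPoint struct {
        Hash  [32]byte
        Index uint32
    }

    // pingPeer is a hypothetical subset of peer behaviour a ping-based checker
    // would need: when we last heard from the peer, and a way to perform a
    // blocking ping/pong round trip.
    type pingPeer interface {
        LastReceived() time.Time
        PingAndWait(timeout time.Duration) error
    }

    // LinkLivenessChecker mirrors the interface proposed above, but against
    // the simplified types in this sketch.
    type LinkLivenessChecker interface {
        IsLive(chanPoint outPoint, chanPeer pingPeer) bool
    }

    // pingChecker implements LinkLivenessChecker using the ping/pong heuristic
    // described above.
    type pingChecker struct {
        // staleThreshold is the "a while" parameter: how long we tolerate
        // silence from a peer before requiring a fresh pong.
        staleThreshold time.Duration

        // pongTimeout bounds how long IsLive blocks waiting for the pong.
        pongTimeout time.Duration
    }

    func (p *pingChecker) IsLive(_ outPoint, chanPeer pingPeer) bool {
        // If we've heard anything from the peer recently, assume the
        // connection is live and skip the round trip.
        if time.Since(chanPeer.LastReceived()) < p.staleThreshold {
            return true
        }

        // Otherwise require an explicit ping/pong exchange within the timeout
        // before declaring the peer live.
        return chanPeer.PingAndWait(p.pongTimeout) == nil
    }

A real implementation would also need to decide how the checker learns about inbound traffic in the first place, and whether a Pong received for an unrelated ping counts.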

Steps To Completion

  • Create the new LinkLivenessChecker interface.

  • Implement a ping/pong checker for this new interface.

  • Consolidate the existing "active" and "eligible to forward" logic within the switch with this new method.

  • During the process of forwarding, the switch should now (indirectly or directly) query this new interface; a rough sketch follows below.
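
As a rough illustration of that last step, the forwarding path could consult the checker before handing the add to the outgoing link. This continues the hypothetical sketch above (and adds an errors import to the same package); none of the names below correspond to actual htlcswitch types or methods.

    // htlcPacket and outgoingLink are minimal stand-ins so the sketch
    // compiles on its own; they do not correspond to real htlcswitch types.
    type htlcPacket struct{}

    type outgoingLink interface {
        ChanPoint() outPoint
        Peer() pingPeer
        HandleSwitchPacket(*htlcPacket) error
    }

    type exampleSwitch struct {
        checker            LinkLivenessChecker
        lookupOutgoingLink func(*htlcPacket) (outgoingLink, error)
        failBack           func(*htlcPacket, error) error
    }

    var errPeerNotLive = errors.New("outgoing peer failed liveness check")

    func (s *exampleSwitch) forwardHTLC(pkt *htlcPacket) error {
        outLink, err := s.lookupOutgoingLink(pkt)
        if err != nil {
            return err
        }

        // New step: refuse to queue the add on a link whose peer fails the
        // liveness check, failing the HTLC back to the incoming link instead
        // of letting it dangle uncommitted.
        if !s.checker.IsLive(outLink.ChanPoint(), outLink.Peer()) {
            return s.failBack(pkt, errPeerNotLive)
        }

        return outLink.HandleSwitchPacket(pkt)
    }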

@Roasbeef Roasbeef added the enhancement, p2p, intermediate, htlcswitch, and P3 labels on Apr 23, 2019
@mlerner
Contributor

mlerner commented Oct 15, 2019

I can take this on.

@Roasbeef
Member Author

With the new synchronous link hand-off, I don't think this is needed anymore? @joostjager @cfromknecht?

@BhaagBoseDK
Contributor

I added the following comment on #6467, though it could be applicable here.

#6467 (comment)

@Roasbeef
Member Author

We'd likely also want to do #3003, as that'll ensure that we'll actually D/C if peers don't reply to pong messages. There's likely some shared surface area w.r.t. the implementation of that and this.

@Roasbeef
Member Author

Bumping this, might be the solution to detect some of the stale Tor connections we run into at times.

@BhaagBoseDK
Contributor

Maybe this can also prevent a cascade FC from happening in the first place, similar to #7683.

@Roasbeef
Member Author

@BhaagBoseDK potentially, if the cause is a stale TCP connection we haven't identified.

If the cascade is due to things taking longer than the CLTV delta to resolve on the outgoing link, then there's no way to avoid also force closing the incoming link, since at that point there's a race condition.

@TheBlueMatt

TheBlueMatt commented May 15, 2023

Given the other issues we've seen with stuck state machines in lnd, does it make sense to detect this based on the state machine and not (just) ping/pong? Bit of a bandaid but at least it'll resolve the worst case outcome.

@Roasbeef
Member Author

@TheBlueMatt so far I haven't seen any logs or goroutine stack traces that indicate the state machine is actually deadlocked or live locked on something. Assuming the main loop is unblocked and just waiting on a signal, then it'll disconnect after 1 minute (configurable). At that point, assuming either side is actually reachable (Tor caveats apply, etc.), a reconnect should happen eventually.

The idea of this issue was that we might be able to detect a stale connection sooner, and just avoid sending out a new HTLC altogether (cancel back a potential forward).

In the scenario above, if the link doesn't come back in time, then we should cancel back the HTLC (HTLC got in the mailbox of the outgoing link, but it never ACK'd it so we cancel back):

lnd/htlcswitch/mailbox.go

Lines 695 to 700 in 51a23b0

// FailAdd fails an UpdateAddHTLC that exists within the mailbox, removing it
// from the in-memory replay buffer. This will prevent the packet from being
// delivered after the link restarts if the switch has remained online. The
// generated LinkError will show an OutgoingFailureDownstreamHtlcAdd
// FailureDetail.
func (m *memoryMailBox) FailAdd(pkt *htlcPacket) {

lnd/htlcswitch/mailbox.go

Lines 515 to 553 in 51a23b0

// If we have a pending Add, we'll also construct the
// deadline so we can fail it back if we are unable to
// deliver any message in time. We also dereference the
// nextAdd's packet, since we will need access to it in
// the case we are delivering it and/or if the deadline
// expires.
//
// NOTE: It's possible after this point for add to be
// nil, but this can only occur when addOutbox is also
// nil, hence we won't accidentally deliver a nil
// packet.
if nextAdd != nil {
    add = nextAdd.pkt
    deadline = nextAdd.deadline(m.cfg.clock)
}

select {
case pktOutbox <- nextRep:
    m.pktCond.L.Lock()
    // Only advance the repHead if this Settle or
    // Fail is still at the head of the queue.
    if m.repHead != nil && m.repHead == nextRepEl {
        m.repHead = m.repHead.Next()
    }
    m.pktCond.L.Unlock()

case addOutbox <- add:
    m.pktCond.L.Lock()
    // Only advance the addHead if this Add is still
    // at the head of the queue.
    if m.addHead != nil && m.addHead == nextAddEl {
        m.addHead = m.addHead.Next()
    }
    m.pktCond.L.Unlock()

case <-deadline:
    log.Debugf("Expiring add htlc with "+
        "keystone=%v", add.keystone())
    m.FailAdd(add)

@TheBlueMatt

@TheBlueMatt so far I haven't seen any logs or goroutine stack traces that indicate the state machine is actually deadlocked or live locked on something. Assuming the main loop is unblocked and just waiting on a signal, then it'll disconnect after 1 minute (configurable).

I'm not super familiar with lnd's internals, so I have no idea what would cause it; all I can say is that I've seen two instances of confirmed-LND peers which were at least alive enough to ping-pong for days across a TCP socket (and one was able to log failing link: unable to complete dance with error: remote unresponsive). So maybe the bug is simply that the disconnection there doesn't work right? Anyway, it sounds like this isn't the right issue to discuss it, so we should move back to 7682.

Indeed, disconnecting from a peer that isn't responsive to ping for 10-20 seconds is pretty important :)

@Roasbeef Roasbeef added the P1 label and removed the P3 label on May 31, 2023
@saubyk saubyk modified the milestones: v0.18.0, v0.17.1 May 31, 2023
@saubyk saubyk modified the milestones: v0.17.2, v0.17.1 Jun 15, 2023
@ProofOfKeags
Collaborator

I am currently working on this issue. The current state of things is that #3003 has been solved with #7828, awaiting merge. This will ensure that invalid or untimely pong responses cause LND to disconnect from the peer. Implicitly this should mean that every connection we have is live. This information can be as stale as the ping interval, though, and we may wish to require a more stringent version of this check for messages that update the channel state.

It seems that the main bulk of this work is to make the htlcswitch verify the liveness of the peer before attempting to queue any irrevocable messages for sending.

There are potentially other things that can be done here though. While we do enforce pong responses for our peers as of #7828, it need not be our only source of information when it comes to assessing liveness. In fact, any message we receive from the peer should imply that we have a live connection to them. However, if we need to assess whether the Link itself is live then there may be additional things we need to check.
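
One way the "any received message implies a live connection" idea could be turned into a concrete signal is a small tracker, updated from the peer's read loop, that a liveness check could consult before forcing a fresh ping/pong round trip. The sketch below is hypothetical; none of its names exist in lnd, and it relies only on the standard sync and time packages.

    package livenesscheck

    import (
        "sync"
        "time"
    )

    // peerActivityTracker is a hypothetical helper that records when we last
    // received any message from each peer, so that all inbound traffic (not
    // just pongs) counts as evidence of a live connection.
    type peerActivityTracker struct {
        mu           sync.Mutex
        lastReceived map[string]time.Time // keyed by the peer's serialized pubkey
    }

    // MarkReceived would be called from the peer's read loop whenever any
    // message is successfully read off the wire.
    func (t *peerActivityTracker) MarkReceived(peerKey string) {
        t.mu.Lock()
        defer t.mu.Unlock()

        if t.lastReceived == nil {
            t.lastReceived = make(map[string]time.Time)
        }
        t.lastReceived[peerKey] = time.Now()
    }

    // SinceLastReceived reports how long it has been since we heard from the
    // peer, and false if we have never heard from it at all.
    func (t *peerActivityTracker) SinceLastReceived(peerKey string) (time.Duration, bool) {
        t.mu.Lock()
        defer t.mu.Unlock()

        last, ok := t.lastReceived[peerKey]
        if !ok {
            return 0, false
        }
        return time.Since(last), true
    }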

@ProofOfKeags
Collaborator

The current plan is to ship #7828 and see whether that solves some of these problems. We will potentially revive this effort once we can better define what, in addition to what #7828 accomplishes, is needed to solve the underlying problem.

@TheBlueMatt

fwiw I've seen one or two cases of a (claimed) LND 16.3/16.4 peer which appears to stop responding to channel commitment messages for hours still :(.

@Roasbeef
Member Author

Roasbeef commented Aug 1, 2023

fwiw I've seen one or two cases of a (claimed) LND 16.3/16.4 peer which appears to stop responding to channel commitment messages for hours still :(

The behavior as of 0.16.3 is to actually disconnect/recycle when we detect that things are stalled (no reply to a commit_sig message after 1 minute). The introduction of this behavior actually helped us expose an existing concurrency bug which was cool.

If for w/e reason the lnd node is the one that's stalled (exactly why, idk), then I'd expect w/e other implementation to attempt to recycle the connection. If that doesn't do the trick, then that would maybe point to a hardware/networking/tor/database issue with the peer. In all these instances, is the peer solely accepting and creating connections over Tor?
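
For readers unfamiliar with the mechanism being described, the "disconnect/recycle when things are stalled" behaviour boils down to a watchdog timer armed when the commit_sig goes out. The sketch below is only a hypothetical illustration of that pattern, not the actual lnd code; all names are made up and it uses nothing beyond the standard time package.

    package stallsketch

    import "time"

    // commitStallWatchdog is a hypothetical sketch of the "disconnect if the
    // peer doesn't answer our commit_sig in time" behaviour.
    type commitStallWatchdog struct {
        timeout    time.Duration
        disconnect func() // requests that the connection be recycled
    }

    // watch is armed when a commit_sig is sent and returns a cancel function
    // that the caller invokes once the peer's reply arrives. If no reply
    // arrives before the timeout, the connection is recycled.
    func (w *commitStallWatchdog) watch() (cancel func()) {
        done := make(chan struct{})

        go func() {
            select {
            case <-time.After(w.timeout):
                // The peer stalled: ask for the connection to be recycled.
                w.disconnect()
            case <-done:
                // Reply arrived in time; nothing to do.
            }
        }()

        return func() { close(done) }
    }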

@TheBlueMatt

TheBlueMatt commented Aug 1, 2023

No, in this case it was a Big Iron node processing a pretty substantial volume that was sending some update messages; we sent a CS and then things went silent for a day. They claimed a day later to be on 16.4rc, but it's possible they had just upgraded or were misinformed. We've also now shipped code to disconnect after a minute or so of silence, so I doubt we'll nag about this one again, but even a minute is a UX-destroying timeline :(.

@ProofOfKeags
Collaborator

fwiw I've seen one or two cases of a (claimed) LND 16.3/16.4 peer which appears to stop responding to channel commitment messages for hours still :(.

We probably need a different issue to cover that since this one is about LND's ability to assess the presence of the peer, not its ability to respond to peer messages.

@saubyk saubyk removed this from the High Priority milestone Aug 8, 2023
@Roasbeef Roasbeef added the P3 label and removed the P1 label on Aug 9, 2023