
htlcswitch: introduce new LinkLivenessChecker interface and ping implementation #2992

Open
4 tasks
Roasbeef opened this issue Apr 23, 2019 · 16 comments
Labels
enhancement (improvements to existing features / behaviour), htlcswitch, intermediate (suitable for developers moderately familiar with the codebase and LN), p2p (code related to the peer-to-peer behaviour), P3 (might get fixed, nice to have)

Comments

@Roasbeef
Member

Today in lnd we don't attempt to perform any sort of liveness check for the target link before we attempt to initiate or forward a multi-hop payment. It could be the case that by the time we go to send the UpdateAddHTLC and corresponding CommitSig message, the link has already died (peer disconnect). Today, if this happens, then the switch doesn't detect it and the HTLC dangles there uncommitted. Although it was never added to the outgoing commitment transaction, we don't yet cancel it back on the incoming link. In general, the switch should be aware that this change was never actually fully committed and cancel it back, but the current auto-resend mechanism in the link after a commitment diff is written prevents it from doing this. A candidate for lower-hanging fruit is to implement this LinkLivenessChecker to attempt to prevent something like this from happening in the first place.

A potential interface resembles something like the following:

type LinkLivenessChecker interface {
    IsLive(chanPoint wire.OutPoint, chanPeer lnpeer.Peer) bool
}

The abstract interface will allow the switch to determine if a link is even eligible to forward an HTLC. Depending on the initial implementation, we may also want to combine this with the concept of an "active" or "eligible to forward" link.

A simple liveness checking mechanism, which is used by some of the other implementations, is to send a Ping message to the channel peer if we haven't heard from them in a while. The time period that constitutes "a while" has yet to be parameterized. If we hear a Pong message back from the remote peer, then we can assume that it's "live".

It's worth noting that the above mechanism doesn't really tell us if the link is live or not, but rather whether the incoming message ingestion loop of our peer is. A channel's state machine could be borked, yet the node still able to read in messages without a problem.
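
To make the shape of this more concrete, here's a minimal sketch of what a ping-based checker could look like. Everything in it is hypothetical: the pingChecker type, the pingPeer interface, and the staleThreshold/pongTimeout knobs are stand-ins, and the simplified outPoint and pingPeer types are used in place of wire.OutPoint and lnpeer.Peer purely to keep the sketch self-contained.

    package livenesscheck

    import "time"

    // outPoint is a simplified stand-in for wire.OutPoint, used only to keep
    // this sketch self-contained.
    type outPoint struct {
        Hash  [32]byte
        Index uint32
    }

    // pingPeer is a hypothetical subset of peer behaviour a ping-based checker
    // would need: when we last heard from the peer, and a way to perform a
    // blocking ping/pong round trip.
    type pingPeer interface {
        LastReceived() time.Time
        PingAndWait(timeout time.Duration) error
    }

    // LinkLivenessChecker mirrors the interface proposed above, but against
    // the simplified types in this sketch.
    type LinkLivenessChecker interface {
        IsLive(chanPoint outPoint, chanPeer pingPeer) bool
    }

    // pingChecker implements LinkLivenessChecker using the ping/pong heuristic
    // described above.
    type pingChecker struct {
        // staleThreshold is the "a while" parameter: how long we tolerate
        // silence from a peer before requiring a fresh pong.
        staleThreshold time.Duration

        // pongTimeout bounds how long IsLive blocks waiting for the pong.
        pongTimeout time.Duration
    }

    func (p *pingChecker) IsLive(_ outPoint, chanPeer pingPeer) bool {
        // If we've heard anything from the peer recently, assume the
        // connection is live and skip the round trip.
        if time.Since(chanPeer.LastReceived()) < p.staleThreshold {
            return true
        }

        // Otherwise require an explicit ping/pong exchange within the timeout
        // before declaring the peer live.
        return chanPeer.PingAndWait(p.pongTimeout) == nil
    }

A real implementation would also need to decide how the checker learns about inbound traffic in the first place, and whether a Pong received for an unrelated ping counts.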

Steps To Completion

  • Create the new LinkLivenessChecker interface.

  • Implement a ping/pong checker for this new interface.

  • Consolidate the existing "active" and "eligible to forward" logic within the switch with this new method.

  • During the process of forwarding, the switch should now (indirectly or directly) query this new interface; a rough sketch follows below.
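
As a rough illustration of that last step, the forwarding path could consult the checker before handing the add to the outgoing link. This continues the hypothetical sketch above (and adds an errors import to the same package); none of the names below correspond to actual htlcswitch types or methods.

    // htlcPacket and outgoingLink are minimal stand-ins so the sketch
    // compiles on its own; they do not correspond to real htlcswitch types.
    type htlcPacket struct{}

    type outgoingLink interface {
        ChanPoint() outPoint
        Peer() pingPeer
        HandleSwitchPacket(*htlcPacket) error
    }

    type exampleSwitch struct {
        checker            LinkLivenessChecker
        lookupOutgoingLink func(*htlcPacket) (outgoingLink, error)
        failBack           func(*htlcPacket, error) error
    }

    var errPeerNotLive = errors.New("outgoing peer failed liveness check")

    func (s *exampleSwitch) forwardHTLC(pkt *htlcPacket) error {
        outLink, err := s.lookupOutgoingLink(pkt)
        if err != nil {
            return err
        }

        // New step: refuse to queue the add on a link whose peer fails the
        // liveness check, failing the HTLC back to the incoming link instead
        // of letting it dangle uncommitted.
        if !s.checker.IsLive(outLink.ChanPoint(), outLink.Peer()) {
            return s.failBack(pkt, errPeerNotLive)
        }

        return outLink.HandleSwitchPacket(pkt)
    }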

@Roasbeef Roasbeef added the enhancement, p2p, intermediate, htlcswitch, and P3 labels on Apr 23, 2019
@mlerner
Contributor

mlerner commented Oct 15, 2019

I can take this on.

@Roasbeef
Member Author

With the new synchronous link hand-off, I don't think this is needed anymore? @joostjager @cfromknecht?

@BhaagBoseDK
Contributor

I added the following comment on #6467, though it could be applicable here.

#6467 (comment)

@Roasbeef
Member Author

We'd likely also want to do #3003, as that'll ensure that we'll actually D/C if peers don't reply to pong messages. There's likely some shared surface area w.r.t. the implementation of that and this.

@Roasbeef
Member Author

Bumping this, might be the solution to detect some of the stale Tor connections we run into at times.

@BhaagBoseDK
Contributor

Maybe this can also prevent a cascade FC from happening in the first place, similar to #7683.

@Roasbeef
Member Author

@BhaagBoseDK potentially, if the cause is a stale TCP connection we haven't identified.

If the cascade is due to things taking longer than the CLTV delta to resolve on the outgoing link, then there's no way to avoid also force closing the incoming link, since at that point there's a race condition.

@TheBlueMatt

TheBlueMatt commented May 15, 2023

Given the other issues we've seen with stuck state machines in lnd, does it make sense to detect this based on the state machine and not (just) ping/pong? Bit of a bandaid but at least it'll resolve the worst case outcome.

@Roasbeef
Member Author

@TheBlueMatt so far I haven't seen any logs or goroutine stack traces that indicate the state machine is actually deadlocked or live locked on something. Assuming the main loop is unblocked and just waiting on a signal, then it'll disconnect after 1 minute (configurable). At that point, assuming either side is actually reachable (Tor caveats apply, etc.), a reconnect should happen eventually.

The idea of this issue was that we might be able to detect a stale connection sooner, and just avoid sending out a new HTLC altogether (cancel back a potential forward).

In the scenario above, if the link doesn't come back in time, then we should cancel back the HTLC (HTLC got in the mailbox of the outgoing link, but it never ACK'd it so we cancel back):

lnd/htlcswitch/mailbox.go

Lines 695 to 700 in 51a23b0

// FailAdd fails an UpdateAddHTLC that exists within the mailbox, removing it
// from the in-memory replay buffer. This will prevent the packet from being
// delivered after the link restarts if the switch has remained online. The
// generated LinkError will show an OutgoingFailureDownstreamHtlcAdd
// FailureDetail.
func (m *memoryMailBox) FailAdd(pkt *htlcPacket) {

lnd/htlcswitch/mailbox.go

Lines 515 to 553 in 51a23b0

// If we have a pending Add, we'll also construct the
// deadline so we can fail it back if we are unable to
// deliver any message in time. We also dereference the
// nextAdd's packet, since we will need access to it in
// the case we are delivering it and/or if the deadline
// expires.
//
// NOTE: It's possible after this point for add to be
// nil, but this can only occur when addOutbox is also
// nil, hence we won't accidentally deliver a nil
// packet.
if nextAdd != nil {
    add = nextAdd.pkt
    deadline = nextAdd.deadline(m.cfg.clock)
}

select {
case pktOutbox <- nextRep:
    m.pktCond.L.Lock()
    // Only advance the repHead if this Settle or
    // Fail is still at the head of the queue.
    if m.repHead != nil && m.repHead == nextRepEl {
        m.repHead = m.repHead.Next()
    }
    m.pktCond.L.Unlock()

case addOutbox <- add:
    m.pktCond.L.Lock()
    // Only advance the addHead if this Add is still
    // at the head of the queue.
    if m.addHead != nil && m.addHead == nextAddEl {
        m.addHead = m.addHead.Next()
    }
    m.pktCond.L.Unlock()

case <-deadline:
    log.Debugf("Expiring add htlc with "+
        "keystone=%v", add.keystone())
    m.FailAdd(add)

@TheBlueMatt

@TheBlueMatt so far I haven't seen any logs or goroutine stack traces that indicate the state machine is actually deadlocked or live locked on something. Assuming the main loop is unblocked and just waiting on a signal, then it'll disconnect after 1 minute (configurable).

I'm not super familiar with lnd's internals, so I have no idea what would cause it; all I can say is that I've seen two instances of confirmed-LND peers which were at least alive enough to ping-pong for days across a TCP socket (and one was able to log failing link: unable to complete dance with error: remote unresponsive). So maybe the bug is simply that the disconnection there doesn't work right? Anyway, it sounds like this isn't the right issue to discuss it, so we should move back to 7682.

Indeed, disconnecting from a peer that isn't responsive to ping for 10-20 seconds is pretty important :)

@Roasbeef Roasbeef added the P1 label and removed the P3 label on May 31, 2023
@saubyk saubyk modified the milestones: v0.18.0, v0.17.1 May 31, 2023
@saubyk saubyk modified the milestones: v0.17.2, v0.17.1 Jun 15, 2023
@ProofOfKeags
Collaborator

I am currently working on this issue. The current state of things is that #3003 has been solved with #7828, awaiting merge. This will ensure that invalid or untimely pong responses cause LND to disconnect from the peer. Implicitly this should mean that every connection we have is live. This information can be as stale as the ping interval, though, and we may wish to require a more stringent version of this check for messages that update the channel state.

It seems that the main bulk of this work is to make the htlcswitch verify the liveness of the peer before attempting to queue any irrevocable messages for sending.

There are potentially other things that can be done here though. While we do enforce pong responses for our peers as of #7828, it need not be our only source of information when it comes to assessing liveness. In fact, any message we receive from the peer should imply that we have a live connection to them. However, if we need to assess whether the Link itself is live then there may be additional things we need to check.
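
One way the "any received message implies a live connection" idea could be turned into a concrete signal is a small tracker, updated from the peer's read loop, that a liveness check could consult before forcing a fresh ping/pong round trip. The sketch below is hypothetical; none of its names exist in lnd, and it relies only on the standard sync and time packages.

    package livenesscheck

    import (
        "sync"
        "time"
    )

    // peerActivityTracker is a hypothetical helper that records when we last
    // received any message from each peer, so that all inbound traffic (not
    // just pongs) counts as evidence of a live connection.
    type peerActivityTracker struct {
        mu           sync.Mutex
        lastReceived map[string]time.Time // keyed by the peer's serialized pubkey
    }

    // MarkReceived would be called from the peer's read loop whenever any
    // message is successfully read off the wire.
    func (t *peerActivityTracker) MarkReceived(peerKey string) {
        t.mu.Lock()
        defer t.mu.Unlock()

        if t.lastReceived == nil {
            t.lastReceived = make(map[string]time.Time)
        }
        t.lastReceived[peerKey] = time.Now()
    }

    // SinceLastReceived reports how long it has been since we heard from the
    // peer, and false if we have never heard from it at all.
    func (t *peerActivityTracker) SinceLastReceived(peerKey string) (time.Duration, bool) {
        t.mu.Lock()
        defer t.mu.Unlock()

        last, ok := t.lastReceived[peerKey]
        if !ok {
            return 0, false
        }
        return time.Since(last), true
    }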

@ProofOfKeags
Collaborator

The current plan is to ship #7828 and see whether that solves some of these problems. We will potentially revive this effort once we can better define what, in addition to what #7828 accomplishes, is needed to solve the underlying problem.

@TheBlueMatt

fwiw I've seen one or two cases of a (claimed) LND 16.3/16.4 peer which appears to stop responding to channel commitment messages for hours still :(.

@Roasbeef
Member Author

Roasbeef commented Aug 1, 2023

fwiw I've seen one or two cases of a (claimed) LND 16.3/16.4 peer which appears to stop responding to channel commitment messages for hours still :(

The behavior as of 0.16.3 is to actually disconnect/recycle when we detect that things are stalled (no reply to a commit_sig message after 1 minute). The introduction of this behavior actually helped us expose an existing concurrency bug which was cool.

If for w/e reason the lnd node is the one that's stalled (exactly why, idk), then I'd expect w/e other implementation to attempt to recycle the connection. If that doesn't do the trick, then that would maybe point to a hardware/networking/tor/database issue with the peer. In all these instances, is the peer solely accepting and creating connections over Tor?
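
For readers unfamiliar with the mechanism being described, the "disconnect/recycle when things are stalled" behaviour boils down to a watchdog timer armed when the commit_sig goes out. The sketch below is only a hypothetical illustration of that pattern, not the actual lnd code; all names are made up and it uses nothing beyond the standard time package.

    package stallsketch

    import "time"

    // commitStallWatchdog is a hypothetical sketch of the "disconnect if the
    // peer doesn't answer our commit_sig in time" behaviour.
    type commitStallWatchdog struct {
        timeout    time.Duration
        disconnect func() // requests that the connection be recycled
    }

    // watch is armed when a commit_sig is sent and returns a cancel function
    // that the caller invokes once the peer's reply arrives. If no reply
    // arrives before the timeout, the connection is recycled.
    func (w *commitStallWatchdog) watch() (cancel func()) {
        done := make(chan struct{})

        go func() {
            select {
            case <-time.After(w.timeout):
                // The peer stalled: ask for the connection to be recycled.
                w.disconnect()
            case <-done:
                // Reply arrived in time; nothing to do.
            }
        }()

        return func() { close(done) }
    }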

@TheBlueMatt

TheBlueMatt commented Aug 1, 2023

No, in this case it was a Big Iron node processing a pretty substantial volume that was sending some update messages; we sent a CS and then things went silent for a day. They claimed a day later to be on 16.4rc, but it's possible they had just upgraded or were misinformed. We've also now shipped code to disconnect after a minute or so of silence, so I doubt we'll nag about this one again, but even a minute is a UX-destroying timeline :(.

@ProofOfKeags
Collaborator

fwiw I've seen one or two cases of a (claimed) LND 16.3/16.4 peer which appears to stop responding to channel commitment messages for hours still :(.

We probably need a different issue to cover that since this one is about LND's ability to assess the presence of the peer, not its ability to respond to peer messages.

@saubyk saubyk removed this from the High Priority milestone Aug 8, 2023
@Roasbeef Roasbeef added the P3 label and removed the P1 label on Aug 9, 2023