
[bug]: Consider failing HTLC backwards before upstream claims on-chain #7683

Open
BhaagBoseDK opened this issue May 10, 2023 · 29 comments
Labels
force closes P1 MUST be fixed or reviewed
@BhaagBoseDK
Contributor

Background

Consider an HTLC chain

Peer A -> Peer B -> Offline Peer

And assume Peer B force closes on the Offline Peer because the HTLC is missing from the remote commitment when the HTLC expires.

The force close transaction is stuck in the mempool for 144 blocks (the CLTV delta of Peer B).

Now, after 144 blocks, Peer A will also force close on Peer B simply because Peer B has not failed the HTLC backward.

This causes a cascade of force closes under current mempool conditions (especially with peers that use a shorter CLTV delta).

There is a similar case with LDK -> lightningdevkit/rust-lightning#2275
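
For concreteness, here is a minimal sketch of the timing involved (Go; the heights are hypothetical, only the 144-block CLTV delta is taken from this report):

package main

import "fmt"

func main() {
    outgoingExpiry := uint32(800_000) // expiry of the Peer B -> Offline Peer HTLC (hypothetical height)
    cltvDeltaB := uint32(144)         // Peer B's CLTV delta

    // Peer B force closes at the outgoing expiry. The incoming (Peer A -> Peer B)
    // HTLC expires one CLTV delta later; if the force close is still unconfirmed
    // and nothing has been failed back by then, Peer A force closes as well.
    incomingExpiry := outgoingExpiry + cltvDeltaB
    fmt.Printf("Peer B on-chain at height %d, Peer A on-chain at height %d\n",
        outgoingExpiry, incomingExpiry)
}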

Logs:
Peer B force closes on an offline peer after HTLC expiry.

/home/umbrel/lndlog/lnd.log.754.gz:2023-05-08 06:18:56.124 [INF] CNCT: ChannelArbitrator(0b228050fd8eeecf22073086a8885faf0c4d2bc02ad9480f23767368da411905:0): immediately failing htlc=3837313232636338363662613938653434353830343430613764383636646666 from remote commitment
/home/umbrel/lndlog/lnd.log.754.gz:2023-05-08 06:18:56.204 [INF] CNCT: ChannelArbitrator(0b228050fd8eeecf22073086a8885faf0c4d2bc02ad9480f23767368da411905:0): force closing chan

The force close transaction is still in mempool.
144 blocks later, Peer A also force closed in a cascade.

/home/umbrel/lndlog/lnd.log.754.gz:2023-05-09 07:17:35.021 [INF] CNCT: Unilateral close of ChannelPoint(4d8ef40c52865f007816a151510343bb89d8d36e2dd7e34edc4748a03027e087:0) detected
/home/umbrel/lndlog/lnd.log.754.gz:2023-05-09 07:17:35.048 [WRN] HSWC: ChannelLink(4d8ef40c52865f007816a151510343bb89d8d36e2dd7e34edc4748a03027e087:0): remote peer has closed on-chain
/home/umbrel/lndlog/lnd.log.754.gz:2023-05-09 07:17:35.050 [INF] HSWC: ChannelLink(4d8ef40c52865f007816a151510343bb89d8d36e2dd7e34edc4748a03027e087:0): exited
/home/umbrel/lndlog/lnd.log.754.gz:2023-05-09 07:17:35.050 [INF] CNCT: ChannelArbitrator(4d8ef40c52865f007816a151510343bb89d8d36e2dd7e34edc4748a03027e087:0): remote party has closed channel out on-chain

The second force close would have been prevented if the HTLC had been failed backward by Peer B after it force closed on the Offline Peer.

Your environment

  • version of lnd
    "version": "0.16.2-beta commit=v0.16.2-beta",

  • which operating system (uname -a on *Nix)

Linux umbrel 5.10.17-v8+ #1421 SMP PREEMPT Thu May 27 14:01:37 BST 2021 aarch64 GNU/Linux

Steps to reproduce

See background.

Expected behaviour

When Peer B force closes on the offline/forwarding peer, it should immediately fail the HTLC backward to prevent Peer A from force closing.

Actual behaviour

A cascade of force closes down the chain.

@BhaagBoseDK BhaagBoseDK added bug Unintended code behaviour needs triage labels May 10, 2023
@ellemouton
Collaborator

ellemouton commented May 10, 2023

The thing is that Peer B wants to make sure that they can claim the timeout path before they fail back the HTLC. Otherwise there is a chance that the offline peer comes back online just in time and claims the success path. That would mean a loss of funds for Peer B if they have already failed the HTLC back to Peer A.

@TheBlueMatt

TheBlueMatt commented May 10, 2023

Sure, but if you're already out of time on the backwards path you run that risk anyway? We're thinking about this on the LDK end and I'm a bit torn, but it does seem like the "common case" here is the cascading failure, not the attack, though it's possible that changes with package relay.

@BhaagBoseDK
Contributor Author

The offline peer should only be able to claim until the HTLC expiry. That should have no bearing on when the FC transaction is confirmed on-chain.

@ellemouton
Collaborator

Sure, but if you're already out of time on the backwards path you run that risk anyway?

Ah, that is a good point.

the offline peer should only be able to claim until the HTLC expiry.

Unfortunately that is not possible to enforce with Bitcoin Script. After the HTLC expiry, the output becomes a free-for-all if the preimage is known.
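
To make that concrete, here is a hedged toy model (Go, not lnd code and not the actual output script) of the two spend paths on an offered HTLC output: the preimage path carries no height restriction, only the timeout path is gated by the expiry, so after expiry the two paths simply race.

package htlcpaths

type htlcOutput struct {
    cltvExpiry  uint32 // absolute expiry height of the HTLC
    hasPreimage bool   // whether the remote peer knows the payment preimage
}

// remoteCanClaim: the remote peer can sweep via the preimage path at any
// height, even long after expiry, as long as the output is still unspent.
func remoteCanClaim(o htlcOutput) bool {
    return o.hasPreimage
}

// localCanTimeout: the local node can only use the timeout path once the
// expiry height is reached, so after expiry the two paths simply race.
func localCanTimeout(o htlcOutput, currentHeight uint32) bool {
    return currentHeight >= o.cltvExpiry
}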

@BhaagBoseDK
Contributor Author

BhaagBoseDK commented May 10, 2023

Another point is this specific case of "missing HTLC in remote commitment". In the FC from Peer B -> Offline Peer, the HTLC was not in the remote commitment (therefore there is no possibility for the offline peer to come back and claim it later). The FC transaction does not even have the HTLC.
8f58a419830c62f9e708b6c47b5541c044a19a1cdc64c4eb0c903311d6282fdd
In this case the HTLC could have been safely failed back to Peer A.

@ziggie1984
Collaborator

Another point is this specific case of "missing HTLC in remote commitment". In the FC from Peer B -> Offline Peer, the HTLC was not in the remote commitment (therefore there is no possibility for the offline peer to come back and claim it later). The FC transaction does not even have the HTLC.
8f58a419830c62f9e708b6c47b5541c044a19a1cdc64c4eb0c903311d6282fdd
In this case the HTLC could have been safely failed back to Peer A.

Not really, because you have to make sure that your commitment (without the HTLC) is confirmed. Your peer may hold a valid commitment transaction with the HTLC included (at least you have to assume it might; you cannot be sure it did not receive the HTLC just because it was offline), which means that HTLC could very well end up confirmed if your peer has the preimage and decides to go on-chain.

@ziggie1984
Collaborator

Ok, I was having the same case with an incoming and outgoing HTLC being stuck because the outgoing HTLC was going on-chain (and did not confirm before the incoming HTLC ran into its timeout). But luckily my incoming HTLC was failed back because of a positive side effect of the interceptor. Basically the interceptor will fail all incoming HTLCs which are close to expiry [13 blocks away] (https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/interceptable_switch.go#L470). That's exactly what happened in my case: it canceled the HTLC exactly 13 blocks before the timeout.

I think the important code part is here:

https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/interceptable_switch.go#L293

Here we cancel incoming HTLCs even though their outgoing counterpart may not be resolved yet; at least, we do not check whether there is an outgoing HTLC on the downstream channel.

This failing of an incoming HTLC while the outgoing one is still stuck is pretty new (9 months) and really hard to test in regtest mode (it requires filling the mempool with unconfirmed transactions). Could you look into whether my analysis is correct, @joostjager? :)

What I am saying is basically that when you have an interceptor running, it will fail back incoming HTLCs even though their outgoing counterpart is not resolved yet. I think that's good, because otherwise your peer will force close on you anyway and you will lose the second channel.
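
In other words (a rough sketch with hypothetical names, not the actual lnd API), the check boils down to failing the incoming HTLC back once the chain tip gets within the reject delta of its expiry, without consulting the outgoing HTLC on the downstream channel:

package interceptsketch

// shouldRejectIncoming is a hedged sketch of the expiry check described
// above: the incoming HTLC is failed back once the current height comes
// within rejectDelta blocks of its CLTV expiry.
func shouldRejectIncoming(incomingExpiry, currentHeight, rejectDelta uint32) bool {
    return currentHeight+rejectDelta >= incomingExpiry
}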

@yyforyongyu
Member

Given the setup Peer Alice -> Peer Bob -> Offline Peer Charlie, if Charlie is offline the whole time, then yes, it's safe to cancel the HTLC, but you can't be sure of that. If Charlie comes online after the FC, there are two scenarios:

  1. The HTLC hasn't timed out yet: Charlie can claim the HTLC via the preimage path, and Bob will extract the preimage from the mempool and settle the incoming HTLC with Alice.
  2. The HTLC has timed out and the sweeping tx is not yet confirmed: Charlie can still claim it by racing against Bob. In this case, Bob will extract the preimage and settle the HTLC with Alice.

This means Bob would not lose the HTLC if Charlie decides to come online and claim it for whatever reason. However, if Bob cancels the HTLC with Alice after the FC, he is at risk of losing it if Charlie decides to cheat.

So IMO canceling early is not a good choice. Instead, assuming this is an anchor channel, the most feasible way is to fee bump the force close tx.
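
For reference, a hedged sketch of the CPFP arithmetic behind "fee bump the force close tx" (Go, not lnd code): the anchor-spending child must pay enough fee that the parent and child together reach the target package fee rate.

package cpfpsketch

// childFeeSat returns the fee (in sats) the anchor-spending child needs so
// that the parent+child package pays targetSatPerVbyte overall.
func childFeeSat(parentFeeSat, parentVsize, childVsize, targetSatPerVbyte uint64) uint64 {
    requiredFee := targetSatPerVbyte * (parentVsize + childVsize)
    if requiredFee <= parentFeeSat {
        return 0 // the parent already meets the target on its own
    }
    return requiredFee - parentFeeSat
}

For example (made-up numbers): a 700 vbyte commitment paying 9 sat/vbyte carries 6,300 sats of fee; to reach a 30 sat/vbyte package rate with a 200 vbyte child, the child must pay 30*(700+200) - 6,300 = 20,700 sats.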

@yyforyongyu yyforyongyu added force closes mempool and removed bug Unintended code behaviour labels May 10, 2023
@Crypt-iQ
Collaborator

Think we can close this

@BhaagBoseDK
Contributor Author

Ok, I was having the same case with an incoming and outgoing HTLC being stuck because the outgoing HTLC was going on-chain (and did not confirm before the incoming HTLC ran into its timeout). But luckily my incoming HTLC was failed back because of a positive side effect of the interceptor. Basically the interceptor will fail all incoming HTLCs which are close to expiry [13 blocks away] (https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/interceptable_switch.go#L470). That's exactly what happened in my case: it canceled the HTLC exactly 13 blocks before the timeout.

I think the important code part is here:

https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/interceptable_switch.go#L293

Here we cancel incoming HTLCs even though their outgoing counterpart may not be resolved yet; at least, we do not check whether there is an outgoing HTLC on the downstream channel.

This failing of an incoming HTLC while the outgoing one is still stuck is pretty new (9 months) and really hard to test in regtest mode (it requires filling the mempool with unconfirmed transactions). Could you look into whether my analysis is correct, @joostjager? :)

What I am saying is basically that when you have an interceptor running, it will fail back incoming HTLCs even though their outgoing counterpart is not resolved yet. I think that's good, because otherwise your peer will force close on you anyway and you will lose the second channel.

If this is possible with the interceptor, why not in standard lnd?

@ziggie1984
Collaborator

This means Bob would not lose the HTLC if Charlie decides to come online and claim it for whatever reason. However, if Bob cancels the HTLC with Alice after the FC, he is at risk of losing it if Charlie decides to cheat.

So IMO canceling early is not a good choice. Instead, assuming this is an anchor channel, the most feasible way is to fee bump the force close tx.

Not sure if you read my comment, but having an active interceptor will cancel it back even though the downstream HTLC is not resolved. I think it's unintended behaviour (see my comment above); should I investigate it further, @yyforyongyu?

@Crypt-iQ
Collaborator

I think there are some false assumptions going on here. LND will cancel back dust HTLCs (i.e. those not on the commitment tx) here:

case htlc.OutputIndex < 0:
    log.Tracef("ChannelArbitrator(%v): immediately "+
        "failing dust htlc=%x", c.cfg.ChanPoint,
        htlc.RHash[:])

    actionMap[HtlcFailNowAction] = append(
        actionMap[HtlcFailNowAction], htlc,
    )
which then gets failed back to the incoming channel here:
for _, htlc := range htlcs {
    failMsg := ResolutionMsg{
        SourceChan: c.cfg.ShortChanID,
        HtlcIndex:  htlc.HtlcIndex,
        Failure:    failureMsg,
    }

    msgsToSend = append(msgsToSend, failMsg)
}

So either Peer A force closed for another reason, or there is a separate bug.

@Crypt-iQ
Collaborator

Reopening for discussion

@Crypt-iQ Crypt-iQ reopened this May 11, 2023
@BhaagBoseDK
Contributor Author

BhaagBoseDK commented May 11, 2023

This FC is still not confirmed in the mempool, which is why Peer B has not removed/failed the HTLC with Peer A.

This is not true; the log line "immediately failing..." means that the HTLC was dust and was failed backwards.

The HTLC is missing from the remote commitment because the peer is offline and therefore has not acked the HTLC.

There would be two commitments: the remote pending commitment and the remote commitment. The HTLC would be in the remote pending commitment.

@yyforyongyu
Member

@ziggie1984 yes please!

@Crypt-iQ
Collaborator

Accidentally edited instead of commenting, but here's my comment:

This FC is still not confirmed in the mempool, which is why Peer B has not removed/failed the HTLC with Peer A.

This is not true; the log line "immediately failing..." means that the HTLC was dust and was failed backwards.

The HTLC is missing from the remote commitment because the peer is offline and therefore has not acked the HTLC.

There would be two commitments: the remote pending commitment and the remote commitment. The HTLC would be in the remote pending commitment.

@BhaagBoseDK
Contributor Author

Well, in that edit you seem to have removed relevant information.

-> The HTLC in question was 20015 sats. Is that dust? (See the sketch below.)
-> The HTLC was in the remote pending commitment. So upon expiry, Peer B force closed with the offline peer. See txn 8f58a419830c62f9e708b6c47b5541c044a19a1cdc64c4eb0c903311d6282fdd. You can see the HTLC is not present in this force close (because it was not acked by the offline peer).
-> This txn was not confirmed for 144 blocks due to the congested mempool. So after 144 blocks (the CLTV delta of Peer B), Peer A force closed on Peer B. See txn 8dcdcb446b3cbfc38e6164e03592c4593654d29426e27c036d4948f7403d509a. The HTLC is present in this transaction, indicating it was not failed back.
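
On the dust question above, here is a hedged sketch of the BOLT 3 "trimmed to dust" rule (Go; the weight constant is the non-anchor HTLC-timeout weight, and the numbers below are assumptions rather than this channel's actual parameters):

package dustsketch

const htlcTimeoutWeight = 663 // BOLT 3 HTLC-timeout tx weight (non-anchor commitments)

// offeredHTLCIsDust reports whether an offered HTLC is trimmed from the
// commitment transaction: it is dust when its amount cannot cover the dust
// limit plus the fee of its second-stage HTLC-timeout transaction.
func offeredHTLCIsDust(amountSat, dustLimitSat, feeratePerKw uint64) bool {
    htlcTimeoutFee := feeratePerKw * htlcTimeoutWeight / 1000
    return amountSat < dustLimitSat+htlcTimeoutFee
}

Plugging in rough numbers (assumptions, not this channel's actual parameters): with a 546 sat dust limit, a 20,015 sat offered HTLC only becomes dust once the commitment fee rate exceeds roughly 29,000 sat/kw (about 117 sat/vbyte), so at ordinary commitment fee rates it would stay on the commitment transaction.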

@Crypt-iQ
Collaborator

The relevant log line is here:

log.Infof("ChannelArbitrator(%v): immediately failing "+
...meaning that the HTLC is failed back, but there is perhaps a bug somewhere in the code which we can't diagnose without logs.

@ziggie1984
Collaborator

ziggie1984 commented May 15, 2023

I analysed this situation further and can conclude that LND will not cancel back the HTLC (if it's not dust) and will hold onto it until the peer FCs the outgoing HTLC (when no interceptor is registered).

With or without a registered interceptor, LND will fail the incoming HTLC back without verifying that the outgoing HTLC is still active, if and only if the incoming HTLC runs into the RejectDelta of 13 blocks AND the ChannelLink is reconnected.

Scenario: Alice => Bob => Carol

Bob has an increased RejectDelta: https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/interceptable_switch.go#L562

I replaced it with (40+78).

Now Carol creates a hold invoice and Bob registers an interceptor. I then mine 3 blocks so that I come within Bob's RejectDelta, and I only need to disconnect/reconnect and the incoming HTLC fails even though the outgoing one is still not resolved.

Log of Bob (as expected):

2023-05-15 10:53:58.060 [DBG] HSWC: Interception rejected because htlc expires too soon: circuit=(Chan ID=204:1:0, HTLC ID=3), height=216, incoming_timeout=333
2023-05-15 10:53:58.060 [DBG] HSWC: ChannelLink(3bb535672973053a3184cf77ced48583204c4252521221518c05e619fbcccd19:0): queueing removal of FAIL closed circuit: (Chan ID=204:1:0, HTLC ID=3)->(Chan ID=199:2:0, HTLC ID=0)

Now I cancel back the hold invoice on Carol's node:

Now the logs show as expected on Bob's node:

2023-05-15 11:03:09.644 [ERR] HSWC: unable to find target channel for HTLC fail: channel ID = 199:2:0, HTLC ID = 1
2023-05-15 11:03:09.644 [ERR] HSWC: Unhandled error while reforwarding htlc settle/fail over htlcswitch: unable to find target channel for HTLC fail: channel ID = 199:2:0, HTLC ID = 1
2023-05-15 11:03:10.114 [DBG] HSWC: Sent 0 satoshis and received 0 satoshis in the last 10 seconds (0.100000 tx/sec)

Before fixing this issue, I would like to propose a config setting where the node runner can decide whether they are willing to bear the risk and cancel back incoming HTLCs while the outgoing HTLC is still unresolved (for example when it may not be worth sweeping because chain fees are too high). Otherwise I find this "bug" kind of handy for now, since it lets me cancel back when my outgoing HTLCs are not resolved in time.

To fix this issue we definitely need to check whether there is still an outgoing HTLC at play before canceling back.
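
A trivial sketch of that check (Go, hypothetical names, incorporating the opt-in config idea above):

package failbacksketch

// okToFailIncomingEarly: only fail the incoming HTLC back early when its
// outgoing counterpart is already resolved, unless the node runner has
// explicitly opted into the risk via a (hypothetical) config flag.
func okToFailIncomingEarly(outgoingResolved, operatorAcceptsRisk bool) bool {
    return outgoingResolved || operatorAcceptsRisk
}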

@Crypt-iQ
Collaborator

@BhaagBoseDK can you share logs so we can diagnose the bug when we get a chance?

@Crypt-iQ
Collaborator

@BhaagBoseDK can you save your logs for the specific htlc and for this time so somebody can look at it when this gets prioritized?

@SunnySarahNode

I managed to catch a very good case.

Me: LND 0.17.3, bitcoind 25 (fully indexed, mempool = 3 GB), Ubuntu, clearnet only
My peer: unknown, tor only

I have an HTLC in our channel:

{
	"incoming": false,
	"amount": "54572",
	"hash_lock": "c99e...83f6",
	"expiration_height": 821335,
	"htlc_index": "87642",
	"forwarding_channel": "896115171070509057",
	"forwarding_htlc_index": "78605"
}

The known part of the route is:

(1) (someone)
(2 - me) 03423790614f023e3c0cdaa654a3578e919947e4c3a14bf5044e7c787ebd11af1a
(3 - my peer) 021720a04a2094ccff4c56bd6ab20f7e36e0af17cb0d3b90ea00ce0f07bd51cf8c
(4) 0284e3ca3753632c51a7d9a156370161ce2a19af41dbf4966eecf74bf3f7ba0a79
(5) (someone)

The channel between (3) and (4) was FCed: tx ce960ba459e62fbbe9178130de89fb595afa8ffb390b954d3e3f3aaf4e0f3f56

The relevant HTLC - in the channel between (3) and (4) - went on-chain:

[screenshot of the HTLC output being claimed on-chain]

In our channel - between (2) and (3) - this HTLC is still alive and its status does not change. There are no records in the log (HSWC is in DEBUG mode) mentioning this HTLC. Other contracts appear and get resolved as usual in this channel, both nodes are up and online, our channel is active and enabled on both sides, and my HSWC works as usual.

Obviously the channel closing transaction will not be mined before the HTLC expires, and our channel - between (2) and (3) - is doomed to be FCed.

Reconnecting or restarting the nodes (both mine and my peer's) doesn't help.

Question 0 is: do I understand the situation correctly?

...and if yes...

Question 1 is: how was this guy (4) able to FC the channel between (3) and (4) with a fee of 9 sat/vbyte while the normal fee at that moment was more than 100? Can I do the same with my channels? ;)

Question 2 is: I definitely don't want to pay a 100,000+ sat fee for this ability of that guy, who is not even my peer. Can we somehow avoid such situations?

@ziggie1984
Collaborator

ziggie1984 commented Dec 13, 2023

Question 2 is: I definitely don't want to pay a 100,000+ sat fee for this ability of that guy, who is not even my peer. Can we somehow avoid such situations?

Make sure you reconnect to peer (3) when the HTLC approaches the block deadline + 13 blocks; only then will your peer fail the HTLC back, and no FC will happen on your channel.

Question 1 is: how was this guy (4) able to FC the channel between (3) and (4) with a fee of 9 sat/vbyte while the normal fee at that moment was more than 100? Can I do the same with my channels? ;)

The max anchor commit fee rate defaults to 10 sat/vbyte, but I am wondering why the channel was not CPFPed; maybe it was already purged out of the mempool by the respective nodes, in which case he will not be able to bump it.
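
For reference, the setting being referred to here is, as far as I can tell, max-commit-fee-rate-anchors in lnd.conf (it defaults to 10 sat/vbyte). A rough example, with the value chosen arbitrarily:

[Application Options]
; Assumption: this is the knob referenced above; the default is 10 sat/vbyte.
; A higher value lets anchor-channel commitment transactions pay a higher fee
; rate, at the cost of reserving more channel balance for fees.
max-commit-fee-rate-anchors=30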

@bitromortac
Collaborator

The max anchor commit fee rate defaults to 10 sat/vbyte, but I am wondering why the channel was not CPFPed; maybe it was already purged out of the mempool by the respective nodes, in which case he will not be able to bump it.

I have seen many of these cases where the commitment fee rate is at around 10 sat/vbyte (see #8271), although it should be higher (#8240 (comment)).

@SunnySarahNode

SunnySarahNode commented Dec 14, 2023

@ziggie1984, Thank you for your answer.

Make sure you reconnect to peer (3) when the HTLC approaches the block deadline + 13 blocks; only then will your peer fail the HTLC back, and no FC will happen on your channel.

Practice shows that in such cases reconnection does not help, but restarting the node shortly before the expiration of the HTLC does. Obviously there is some difference between a simple reconnect and what happens after a restart. I'll try to collect some logs and come back when I find something interesting.

@ziggie1984
Collaborator

I have seen many of these cases where the commitment fee rate is at around 10 sat/vbyte (see #8271), although it should be higher (#8240 (comment)).

Good input. I looked at it as well: #8271 is definitely not the right behavior during channel opening, but the fee negotiation for normal UpdateFee messages should cap at the min relay fee. There might always be a mismatch between the two peers, though: the initiator might have a larger mempool while the node force closing the channel might not, so we can end up in a situation where the non-initiator cannot bump the fee of the commitment. Not sure if there is really a fix for this for now, because not accepting fee updates might cause problems. 🤔

@ziggie1984
Collaborator

Practice shows that in such cases reconnection does not help, but restarting the node shortly before the expiration of the HTLC does. Obviously there is some difference between a simple reconnect and what happens after a restart. I'll try to collect some logs and come back when I find something interesting.

That would be great. Are you verifying that you disconnect the peer and then connect it again? The link needs to be torn down for this to work.

@SunnySarahNode

That would be great. Are you verifying that you disconnect the peer and then connect it again? The link needs to be torn down for this to work.

Of course. Disconnect and connect the peer again.

10 participants