Unilateral closure due to HTLC deadline does not fail upstream HTLCs #4649
We can't close the HTLC upstream until we close out the HTLC downstream. That means seeing the commitment on-chain to a certain depth: remember, we sent a signature for the HTLC, so the peer might drop a commitment tx containing it. So, what's onchaind doing? I expect it would see our own commitment tx (I'm guessing the peer doesn't close at the same time), and then after 3 confirms it is happy that the HTLC is definitely missing and closes it out upstream. For the other HTLC (1543), it's presumably in the commitment tx, so that will actually need to be spent after it times out (again, by onchaind), which, once it reaches "reasonable depth" (3), should get closed out. Of course, it's possible that our timings are too tight (insufficient HTLC delay), or we're lowballing on fees for various things (e.g. the HTLC spend), but it would be good to check that this is indeed what is happening.
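For those following along, here is a minimal sketch of the rule being described, in C but with invented names (this is not onchaind's actual code):

```c
/* A minimal sketch (invented names, not onchaind's real code) of the
 * rule described above: once our own commitment tx is buried to a
 * "reasonable depth" (3 confirmations), any HTLC that is provably
 * absent from it can be failed back on the upstream channel. */
#include <stdbool.h>
#include <stddef.h>

#define REASONABLE_DEPTH 3

struct htlc_stub {
	unsigned int cltv_expiry;     /* absolute timeout height */
	bool in_confirmed_commitment; /* was it in the tx we saw on-chain? */
	bool failed_upstream;         /* have we already failed it back? */
};

static void resolve_missing_htlcs(struct htlc_stub *stubs, size_t n,
				  unsigned int commitment_depth)
{
	if (commitment_depth < REASONABLE_DEPTH)
		return;
	for (size_t i = 0; i < n; i++) {
		if (!stubs[i].in_confirmed_commitment
		    && !stubs[i].failed_upstream) {
			/* Tell lightningd to send update_fail_htlc on
			 * the corresponding upstream channel. */
			stubs[i].failed_upstream = true;
		}
	}
}
```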
@rustyrussell: I thought you might say something like that, but no: even after my node had observed 3 confirmations on its commitment transaction (which contained no HTLCs), it still didn't fail the upstream HTLC.
It may be worth noting (I'm not sure what it means) that, at the time of the above log messages,
To be clear: the downstream node was hosed and never sent any new commitments after HTLC 1539 (which was successfully removed), so HTLCs 1540, 1541, 1542, 1543, and 1544 were not in the commitment that my node dropped to chain because I had never received from the peer any new commitment containing them.
@rustyrussell: I have another occurrence of this same bug. It's not an issue of on-chain confirmations taking too long. There is a real bug in C-Lightning. Allow me to elucidate. This occurrence was brought to my attention by the operator of the upstream LND node, c-otto.de, which was forced to unilaterally close a channel with my C-Lightning node because my node had neglected to fail an upstream HTLC at the appropriate time after an associated downstream channel had been unilaterally closed by a downstream peer.
The play-by-play
The upstream node adds HTLC 44689:
We add HTLC 5245 to the downstream channel:
The downstream node never responds with its revocation. In fact, it never sends us any more messages in the current connection. At this point, we cannot be certain whether the downstream node has seen our new commitment. Some time later, another payment comes in from another upstream node, to be forwarded over the same downstream node, so we add HTLC 5246 to the downstream channel:
This time we didn't send a commitment. Note that we actually haven't heard anything from the downstream node in quite some time at this point. Eventually, our attempt to add downstream HTLC 5246 times out, and we kill our connection with the downstream node. We correctly fail HTLC 5246 (and indeed correctly fail the corresponding upstream HTLC), but we cannot fail HTLC 5245 at this point because we don't know whether the downstream node has seen our new commitment.
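A minimal sketch of the constraint at work here, again with invented names rather than C-Lightning's real API: an outgoing HTLC can only be failed back upstream once we are certain the downstream peer can no longer enforce it.

```c
/* Hedged sketch (invented names): once we have sent a commitment_signed
 * covering an HTLC but have not received the matching revoke_and_ack,
 * the peer may hold a valid commitment containing it, so its fate is
 * unknown until the close resolves on-chain. */
#include <stdbool.h>

enum downstream_htlc_fate {
	NEVER_COMMITTED,  /* peer never signed for it: safe to fail now */
	MAYBE_COMMITTED,  /* we signed, no revocation back: fate unknown */
	RESOLVED_ONCHAIN  /* closing tx confirmed deeply enough */
};

static bool can_fail_upstream(enum downstream_htlc_fate fate)
{
	return fate == NEVER_COMMITTED || fate == RESOLVED_ONCHAIN;
}
```

HTLC 5246 falls into the never-committed case, which is why it could be failed immediately; HTLC 5245 sits in the fate-unknown case until the close resolves on-chain.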
When the downstream node reconnects, we attempt to restore downstream HTLC 5245, but the downstream node is still borked and never sends a revocation:
It's like talking to a brick wall. Ahhh, LND. Some time later, the downstream node unilaterally closes the channel, and we correctly note that we need to know whether downstream HTLC 5245 was included in the commitment that was broadcast to the Bitcoin network:
We analyze the on-chain transaction and find that downstream HTLC 5245 was not included. (There are only two outputs.)
So now we wait for two more confirmations:
Now at 3 confirmations, despite noting the missing downstream HTLC, we never fail the corresponding upstream HTLC as we ought to. We can even confirm that upstream HTLC 44689 is still pending at this point, as the connection with the upstream node was restarted a few hours later, and HTLC 44689 is noted…
Eventually, the upstream node is forced to unilaterally close the channel when upstream HTLC 44689 hits its deadline. This is our fault, as we should have failed that HTLC many hours ago.
It looks like this happened again last night. My channel to @whitslack was active, the nodes were connected, but my node still had to force-close the channel due to an HTLC timeout.
@C-Otto: Apologies for any inconvenience/expense my node has caused for you. My node is in a very sorry state of affairs right now after Comcast changed my IP addresses about a week ago.
According to my logs, the channel/connection was alive and healthy, though. As far as I can see, the IP troubles are unrelated to this bug.
@C-Otto: Your connection with my node was alive and healthy, but that's not what this bug is about. This bug causes my node to neglect to fail an HTLC with your node after another of my channels with some other peer has to be force-closed.
Yes, exactly. The IP change issue caused your node to experience a force-close of some channel that was the next hop of some HTLC coming from my node. Due to this issue your node didn't fail the HTLC after the force-close, causing my node to force-close its channel to your node (so that a single HTLC led to two force-closes on your node).
@C-Otto: Yes, correct. Sorry, I thought you were arguing that my IP troubles were unrelated to our reoccurrence of this bug. This bug is unrelated to the node announcement propagation issue, but my inability to get reconnected to my channel peers is certainly increasing the likelihood that this bug will manifest.
I have tried to reproduce this with no success :( It seems onchaind is working, but we don't close the corresponding incoming HTLC. There are several cases where this can happen:
None of these should happen here, and none do (I tried reconnecting or restarting, too). So it's a puzzle. I will definitely add more debugging in case this happens again, and tomorrow I'll do an audit and see if there are any unexpected ways these things could happen.
My tree for testing is here, for those playing along at home: https://github.com/rustyrussell/lightning/tree/guilt/htlc-upstream-close-test
I recently had a very similar issue with two lnd nodes (one of them mine), caused by a non-confirming/delayed force-close tx: lightningnetwork/lnd#4215 Could that also be the case here?
I don't think so, because the force-close did confirm. Go back and read my play-by-play again. The upstream HTLC should have been failed when the downstream force-closure reached 3 confirmations, but it was not.
We get sent three corresponding arrays:
1. htlc stubs
2. whether we want to know if they're missing
3. whether to wait 3 blocks or tell us immediately

We then sorted the htlc stubs by CLTV, *but didn't sort the corresponding arrays*. This fixes that the simplest way possible, and probably also:
Fixes: ElementsProject#4649
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: onchaind: we sometimes failed to close upstream htlcs if more than one HTLC is in flight during unilateral close.
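A simplified illustration of the parallel-array bug that commit message describes (the types and function names here are invented for the sketch; the real code lives in onchaind):

```c
/* Three parallel arrays arrive together, but only the stubs were
 * sorted by CLTV, so the boolean flags no longer lined up with
 * their HTLCs. */
#include <stdbool.h>
#include <stdlib.h>

struct htlc_stub { unsigned int cltv_expiry; };

static int cmp_cltv(const void *a, const void *b)
{
	const struct htlc_stub *ha = a, *hb = b;
	return (int)ha->cltv_expiry - (int)hb->cltv_expiry;
}

/* Buggy: stubs get a new order; tell_if_missing[] and
 * tell_immediately[] are still indexed by the OLD order. */
static void sort_buggy(struct htlc_stub *stubs, size_t n,
		       bool *tell_if_missing, bool *tell_immediately)
{
	(void)tell_if_missing;   /* never permuted: the bug */
	(void)tell_immediately;  /* never permuted: the bug */
	qsort(stubs, n, sizeof(*stubs), cmp_cltv);
}

/* One simple fix: bundle each stub with its flags so a single sort
 * keeps all three in step. */
struct htlc_with_flags {
	struct htlc_stub stub;
	bool tell_if_missing;
	bool tell_immediately;
};

static int cmp_cltv_bundled(const void *a, const void *b)
{
	const struct htlc_with_flags *ha = a, *hb = b;
	return (int)ha->stub.cltv_expiry - (int)hb->stub.cltv_expiry;
}

static void sort_fixed(struct htlc_with_flags *htlcs, size_t n)
{
	qsort(htlcs, n, sizeof(*htlcs), cmp_cltv_bundled);
}
```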
Sorry this took so long! I found an issue which can cause this problem: 8641a5b. Of course, I can't be sure that is the only problem, but I'm tentatively marking this fixed by that PR.
@rustyrussell: Oh man. That absolutely looks like a smoking gun. Excellent find!
Issue and Steps to Reproduce
The downstream HTLC gets stuck in the RCVD_ADD_REVOCATION state because the peer is a buggy LND that never sends a new commitment.
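For context, here is the offered-HTLC half of the HTLC state progression, paraphrased from memory of C-Lightning's common/htlc.h and therefore illustrative rather than authoritative:

```c
/* Order shown from memory of common/htlc.h; treat as illustrative.
 * The bug report concerns an HTLC stuck at RCVD_ADD_REVOCATION:
 * the peer revoked its old commitment but never signed a new one
 * containing the HTLC. */
enum offered_htlc_state {
	SENT_ADD_HTLC,           /* we sent update_add_htlc */
	SENT_ADD_COMMIT,         /* we sent commitment_signed covering it */
	RCVD_ADD_REVOCATION,     /* peer sent revoke_and_ack; stuck here */
	RCVD_ADD_ACK_COMMIT,     /* peer sent its own commitment_signed */
	SENT_ADD_ACK_REVOCATION  /* we sent revoke_and_ack: fully committed */
};
```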
Case Study
And now, some annotated log excerpts demonstrating a real-world example of this scenario:
Routed payment arrives on channel A
Forwarded payment goes out over channel B
Due to lightningnetwork/lnd#5506, the peer did not respond with a WIRE_REVOKE_AND_ACK (and indeed had gone out to lunch for the remainder of this connection).
Downstream peer reconnects and ACKs the downstream HTLC
However, the buggy peer never sends a new commitment, so the downstream HTLC never transitions into the RCVD_ADD_ACK_COMMIT state.
Downstream channel is force-closed by a different HTLC hitting its deadline
Here is the bug! Although the downstream channel was unilaterally closed here, the upstream HTLC was never failed/removed from the upstream channel.
Please note: the HTLC that hit its deadline was 1543, not the 1542 we've been following. (The fact that it's a different HTLC may or may not be relevant to the bug.)
Upstream peer reconnects
Here, we can see that the upstream HTLC is still in the RCVD_ADD_ACK_REVOCATION state over 33 hours after it was added. C-Lightning should have sent a WIRE_UPDATE_FAIL_HTLC for this HTLC when it unilaterally closed the downstream channel, but it did not.
Upstream peer unilaterally closes due to HTLC deadline
Damn.
getinfo output
This is on version 0.10.0. No other getinfo output is relevant.