Bad commit_sig signature #4152

Closed
darosior opened this issue Oct 22, 2020 · 8 comments · Fixed by #4480

@darosior

az0re on IRC just reported it. Unfortunately, GitHub won't let them get here because they use Tor.
I told them to send the (private) signature, transaction, and feerate by email.


cdecker commented Oct 26, 2020

Very interesting indeed. Looking at the transaction whose signature failed, it seems the transaction that ultimately made it on-chain is identical to the one whose signature failed. The fact that the unilateral close was ours suggests that the two consecutive states look identical, narrowing this down to a feerate change being the cause of the new commitment. However, we also add and remove an HTLC, and the two are identical except for the payment_hash:

       "htlcs": [
	  {
	     "direction": "out",
	     "id": 95,
	     "msatoshi": 125000125,
	     "amount_msat": "125000125msat",
	     "expiry": 653923,
	     "payment_hash": "hashA",
	     "state": "SENT_ADD_ACK_REVOCATION"
	  },
	  {
	     "direction": "out",
	     "id": 96,
	     "msatoshi": 125000125,
	     "amount_msat": "125000125msat",
	     "expiry": 653923,
	     "payment_hash": "hashB",
	     "state": "RCVD_REMOVE_HTLC"
	  }
       ]

So it could be that the unilateral close was really the one with hashB, and the new (failed) commitment state was the one with hashA. The counterparty is an lnd node, with which we have had synchronization issues in the past during rapid successions of HTLCs, suggesting that we might have desynced in between the two commitments.
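
As a minimal, hypothetical sketch (Python; not c-lightning's actual commitment construction), this is why a one-HTLC disagreement produces a "Bad commit_sig signature": if the two peers build the commitment transaction from different HTLC sets, the digests they sign differ, and each side's signature fails to verify against the other's view of the state.

    import hashlib

    def commitment_digest(feerate, payment_hashes):
        # Stand-in for the sighash of the commitment transaction: any
        # difference in the HTLC set changes the digest being signed.
        preimage = "feerate=%d|htlcs=%s" % (feerate, ",".join(sorted(payment_hashes)))
        return hashlib.sha256(preimage.encode()).hexdigest()

    # Our view: HTLC 96 (hashB) already removed, HTLC 95 (hashA) still live.
    # (The feerate value here is purely illustrative.)
    ours = commitment_digest(253, ["hashA"])
    # A desynced peer still signing the state that contains both HTLCs:
    theirs = commitment_digest(253, ["hashA", "hashB"])

    assert ours != theirs  # the peer's signature matches `theirs`, not `ours`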


0e319b6 commented Nov 18, 2020

This happened again with a different peer, this time with an empty htlcs field, using git cd7d5cd. I will send another email with the listpeers object and relevant log excerpt.


0e319b6 commented Apr 14, 2021

This happened yet again, using git bbfcae6, this time with completely full HTLC slots. All but two of those HTLCs were incoming, in state RCVD_REMOVE_REVOCATION, and the two outgoing HTLCs were in state SENT_ADD_ACK_REVOCATION. This one might be a simple race, where both sides see max - 1 HTLCs in flight and so simultaneously think it's OK to add one more. They then both issue an HTLC and are unable to deal with the resulting channel state, which has more than the maximum number of allowed in-flight HTLCs. Being a bit flexible about the maximum number of allowed in-flight HTLCs might help.
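
A hypothetical illustration of that race (Python; the limit of 30 is the default mentioned later in this thread, the rest is made up): each peer checks the limit before seeing the other's crossed update_add_htlc, so both adds go through.

    MAX_HTLCS = 30        # default concurrent-HTLC limit
    htlcs_in_flight = 29  # both peers agree: exactly one slot left

    # Each side checks *before* seeing the other's update_add_htlc...
    we_can_add = htlcs_in_flight < MAX_HTLCS    # True
    they_can_add = htlcs_in_flight < MAX_HTLCS  # also True

    if we_can_add and they_can_add:
        # ...so the two adds cross on the wire, and once both are applied
        # the channel holds one HTLC more than the negotiated maximum.
        htlcs_in_flight += 2

    print(htlcs_in_flight, "in flight; limit was", MAX_HTLCS)  # 31 > 30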

This bug has cost me a ton of money, and fixing it is clearly not a priority. Despite my significant investment in C-Lightning infrastructure, I have started planning to migrate to an LND node. I am not willing to keep burning satoshis like this. I will also warn other people not to use C-Lightning on mainnet until this bug is fixed so they don't get burned, either.

@rustyrussell rustyrussell self-assigned this Apr 16, 2021
@rustyrussell rustyrussell added this to the v0.10.1 milestone Apr 16, 2021
@rustyrussell

OK, the good news is I've caught one of these in my logs, too! Finally...

@cdecker's idea that it's around feerate changes seems to be borne out here. It looks like we were updating fees, the two sides were at different levels, and then when we tried updating fees again (while the first change was still in flux), something b0rked.

There's a simple workaround, however: we can avoid changing fees again until they're completely quiescent. We know the fee logic works in the simple cases; otherwise we wouldn't keep channels open at all.

That means even if LND's or our fee logic doesn't work in complex cases (and I'm still tracking this down to try to get the exact cause here), we won't trigger it in the meantime.
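
A sketch of the shape such a guard could take (assumed names and structure, not the actual patch; the commits referenced below contain the real change): refuse to send a new update_fee until the previous feerate change has been committed and revoked by both sides.

    class FeeState:
        """Tracks at most one feerate change in flight per channel."""

        def __init__(self, feerate):
            self.feerate = feerate
            self.pending = None  # feerate sent but not yet quiescent

        def try_update_fee(self, new_feerate):
            if self.pending is not None:
                return False  # a change is still in flight: don't stack another
            self.pending = new_feerate
            # ... send update_fee(new_feerate) and commitment_signed here ...
            return True

        def on_change_quiescent(self):
            # Both sides have committed and revoked the new rate.
            self.feerate, self.pending = self.pending, None

With this one-in-flight rule, the overlapping fee changes seen in the logs above would simply be deferred until the previous one settled, rather than stacking while still in flux.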

@rustyrussell

> This bug has cost me a ton of money, and fixing it is clearly not a priority. Despite my significant investment in C-Lightning infrastructure, I have started planning to migrate to an LND node. I am not willing to keep burning satoshis like this. I will also warn other people not to use C-Lightning on mainnet until this bug is fixed so they don't get burned, either.

I totally understand! It's a nasty bug, and because it wasn't happening all the time, it didn't hit the top of our TODO list. However, now that I've got some more clues, I am pretty sure I can work around it (for existing peers) and make sure that we're actually doing the right thing (fixing LND or c-lightning, whichever is wrong).


cdecker commented Apr 16, 2021

> There's a simple workaround, however: we can avoid changing fees again until they're completely quiescent. We know the fee logic works in the simple cases; otherwise we wouldn't keep channels open at all.

What are you referring to when you say quiescent? Do you mean no feerate update in flight, no HTLCs in flight, or no commitment in flight? I noticed that the reports often have a couple of HTLCs in flight, and some are even close to the default of 30 concurrent HTLCs. Some of these nodes seem incredibly busy, so waiting for there to be no HTLCs might starve our fee adjustment. Would remembering the last couple of feerates and just trying them out work?

@rustyrussell

Was planning on "no feerates in flight". In my logs, there are multiple fee changes going on at the same time...


0e319b6 commented Apr 16, 2021

> I totally understand! It's a nasty bug, and because it wasn't happening all the time, it didn't hit the top of our TODO list. However, now that I've got some more clues, I am pretty sure I can work around it (for existing peers) and make sure that we're actually doing the right thing (fixing LND or c-lightning, whichever is wrong).

Super glad this might be resolved soon. I really don't want to migrate to LND, so if I stop bleeding satoshis then I am absolutely going to stay on C-Lightning. Thanks for looking at this!

rustyrussell added a commit to rustyrussell/lightning that referenced this issue Apr 18, 2021
There are several reports of desynchronization with LND here; a simple
approach is to only have one feerate change in flight at any time.

Even if this turns out to be our fault, it's been a historic area of
confusion, so this restriction seems reasonable.

Fixes: ElementsProject#4152
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
rustyrussell added a commit that referenced this issue Apr 24, 2021
There are several reports of desynchronization with LND here; a simple
approach is to only have one feerate change in flight at any time.

Even if this turns out to be our fault, it's been a historic area of
confusion, so this restriction seems reasonable.

Changelog-Fixed: Protocol: Don't create more than one feerate change at a time, as this seems to desync with LND.
Fixes: #4152
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>