-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add next_funding_txid
awareness for dual-fund opens
#6824
Add next_funding_txid
awareness for dual-fund opens
#6824
Conversation
40c05ef
to
d020f9e
Compare
There is a description of this protocol somewhere? or the BOLT is already describing this proposal? |
|
wallet/wallet.h
Outdated
/* We save channels that aren't yet committed :( */ | ||
BUILD_ASSERT(DUALOPEND_OPEN_INIT == 1); | ||
return s; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely not. Not again.
You need a new state, DUALOPEND_OPEN_COMMITTED. When you add it, the compiler will tell you everywhere you have to change it.
lightningd/peer_control.c
Outdated
bool too_early_v2(const struct channel *channel) | ||
{ | ||
return channel->state == DUALOPEND_OPEN_INIT | ||
&& !channel->last_tx; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is exactly the kind of vague nomenclature I had to clean up. If you add a new state, you don't need this.
lightningd/dual_open_control.c
Outdated
/* we check validity in dualopend, this is a sanity check */ | ||
assert(bitcoin_txid_eq(&txid, &inflight_txid)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a rule (loosely followed!) we prefer to fail the channel if a subd does something weird, rahter than kill lightningd.
/* Only after we've updated/saved our psbt do we check | ||
* for peer connected */ | ||
if (!channel->owner) | ||
return command_fail(cmd, FUNDING_PEER_NOT_CONNECTED, | ||
"Peer not connected"); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!!WARNING!! This will be messy after rebase, since now in master we use param_check! So if we are simply running the check
RPC, we still want to do this check.
I'm not sure how to handle this in general. I mean, the command still does something, though it "fails": that just seems... weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that it's weird that we save incoming data to disk, but it makes fundchannel
work really nicely across reconnects so unsure if that's the worst thing??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agree that it's best from where we are, but worth thinking for future.
In theory there are two things going on: saving the info, and telling the peer. This is a bit like close
, so, you could argue we shouldn't actually here, or even block until peer is told.
openingd/dualopend.c
Outdated
return NULL; | ||
} | ||
tal_free(wscript); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commit is just weird... the only danger with tmpctx is when you've got an inner io_loop, which is already an anti-pattern, and there's not one here that I can see?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK, i'll drop it!
openingd/dualopend.c
Outdated
static char *do_reconnect_commit_sigs(const struct state *state, | ||
const struct tx_state *tx_state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you allocate for the caller, you must use the ctx pattern...
openingd/dualopend.c
Outdated
static void do_reconnect_dance(struct state *state) | ||
static bool do_reconnect_dance(struct state *state) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect you did this because your previous commit did return false
and that didn't compile? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it's b/c i wanted reconnected
to be false if we aborted; regardless i'm getting rid of the idea of returning a bool entirely...
Also, valgrind failures in test_multifunding_v1_v2_mixed:
|
Nice work, I just tested that against eclair, and here is the report (there are a few things to fix, but otherwise it's looking good and we've understood the spec the same way).
|
d020f9e
to
96f1bfd
Compare
Ok this still has some bugs + I'm expecting a few tests to fail but:
Nice catch. This was causing a crash in our tests also; knowing to look for this as a root cause was very helpful. Thank you! I've patched it not to do this.
Oh no you caught me red-handed. 🙈 I did wonder if you'd catch me doing this... seems like the answer was yes! It was a quick fix 80e8503, should send ASAP for both sides now :)
I think this has been patched (there was a crash happening that I patched that think explains this), but haven't tested conclusively.
Tested + patched with the latest updates. Thanks @rustyrussell + @t-bast for the extremely fast review!! |
@@ -195,6 +195,7 @@ | |||
"CLOSINGD_SIGEXCHANGE": 4, | |||
"DUALOPEND_AWAITING_LOCKIN": 10, | |||
"DUALOPEND_OPEN_COMMITTED": 12, | |||
"DUALOPEND_OPEN_COMMIT_READY": 13, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to be careful with this: the enum numeric values are stored in the DB, so adding to the end is the right thing to do here (not to reorder states, causing faulty loads from DB), however IIRC there are places where we check for sets of states by using <
and >
on the numeric values. This would obviously fail.
Can we have a quick check through the code to see if there are numeric comparisons in the code on values for this enum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iiuc those have all been updated to be comprehensive swtich statements in a clean up Rusty did earlier this year, but I'll do a sweep to double check!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good news: you cannot get this wrong, as we have channel_state_in_db() which indeed, is patched correctly, AND I removed all those numeric comparisons last release for exactly this reason (when the new splicing state was added, I had the same concerns!).
Confirmed, this is now working fine 👍
Heh, did you think you could trick my testing skills? This is indeed fixed,
I'm still seeing that crash, with the following logs:
|
c332489
to
37fa464
Compare
me, friday: this is a cute, well written, <1k line patch |
37fa464
to
ec4b34e
Compare
From the spec: Once peers are ready to exchange commitment signatures, they must remember the details of the funding transaction to allow resuming the signatures exchange if a disconnection happens. Basically this means we add channels to the database before we've gotten commitments for them; it's nice that there's now a state for commitments recevied but we now save the channel prior to that. This commit makes it possible to track the pre-commit-rcvd but not quite open-init state.
What if the last_tx is empty for the channel? We're about to let the channels not have last_txs at start.
We're going to add the commitment transaction data at a different time than when we init a new inflight. Split them up!
we now save them before we get the commitment data.
If we get an error on a channel that doesn't have commitments yet, we can just delete it.
Here, we split up what was "commit_received" into two phases: - commit-ready, where we're about to send our commitment tx to peer - commit-received, when we've gotten the commitment tx from our peer This lets us do the right thing (as far as the spec is concerned) with returning the correct 'next_funding_txid' on reconnect (later commits).
We don't actually use this internal to this method? Weird. Anyway, if we don't want/need it allow the caller to signal that by passing in NULL, if desired.
When we reconnect, if we get a note from the peer that they dont know about a pending inflight, we need to be able to clean it up so we can restart/re-negotiate a new RBF etc. This adds a cleanup method to remove any inflights for a channel without a last_tx (commitment tx)
If you resend us a commitment tx, and we already have one, we check that it's correct!
(ones that are missing last_txs)
depending on the state, we might - forget the channel - drop it to chain - reconnect via dualopend
We need to keep track of if we've gotten the last negotiation's commitment sigs, for reconnect logic (helps us know what messages to send in the reconnect case)
You can't publish a tx you don't have!
Changelog-Changed: RPC `listpeerchannels`.`inflights` may sometimes not include `scratch_txid` (mandatory -> optional)
We're going to need to reuse this for reconnect; make the method standalone in that it can figure out what to send to HSMD independent of where it's located in the setup call flow.
Let's make it easier to build remote commitments (we're going to need this for reconnects soon!)
Move common code for verifying a commitment sig from peer into one place. On reconnects, we'll need to verify peer's commitments. Changelog-None.
A bit gratuitous, but it's a bit cleaner on a whole?
If we get a commitment-signed message from a peer, outside of a normal flow, process it! We're about to send these during reconnect, so we need to be able to handle them!
If you get the right series of disconnects, it's possible for your peer to send you a tx-sigs even though the current state of the channel open is that you've seen the funding open on chain (your channel_ready[LOCAL] = true) In this case, if we haven't marked that we've seen the tx sigs yet, we go ahead and mark them as seen and just ignore this tx-sigs msg.
In the case where you're echoing back a tx-abort, just let it through. Not doing this causes problems in the case where your node has forgotten about an in-progress open. This fixes the following problem: - you send a tx-abort (even tho you have marked tx-sigs as received) - peer echos it back (we echo back tx-aborts always) - you throw an error because you're already in a tx-abort unallowed state In this commit, we allow for echos to come thru no matter our current state and this fixes things/makes them work as expected.
Here we conform to the specification, which requires that we handle next-funding-id in a specific way. Note that we were already sending it, but now we actually correctly handle its presence. Changelog-Changed: Spec: dual-funding now follows the next-funding-id rules.
Makes it easier to see why things are failing in the logs.
ec4b34e
to
2866b4e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments only, on the last dev_disconnect hack.
@@ -195,6 +195,7 @@ | |||
"CLOSINGD_SIGEXCHANGE": 4, | |||
"DUALOPEND_AWAITING_LOCKIN": 10, | |||
"DUALOPEND_OPEN_COMMITTED": 12, | |||
"DUALOPEND_OPEN_COMMIT_READY": 13, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good news: you cannot get this wrong, as we have channel_state_in_db() which indeed, is patched correctly, AND I removed all those numeric comparisons last release for exactly this reason (when the new splicing state was added, I had the same concerns!).
/* Only after we've updated/saved our psbt do we check | ||
* for peer connected */ | ||
if (!channel->owner) | ||
return command_fail(cmd, FUNDING_PEER_NOT_CONNECTED, | ||
"Peer not connected"); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agree that it's best from where we are, but worth thinking for future.
In theory there are two things going on: saving the info, and telling the peer. This is a bit like close
, so, you could argue we shouldn't actually here, or even block until peer is told.
if (!updated) { | ||
log_info(channel->log, "Already had sigs, skipping notif"); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!!Thanks!! 🧡 I saw this intermittantly and couldn't figure out what was happening!
connectd/multiplex.c
Outdated
case DEV_DISCONNECT_DROP: | ||
/* try again? */ | ||
return read_hdr_from_peer(peer_conn, peer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is strictly impossible, since our connections always send in order. So now we ignore a msg, that contract is broken. You're using it in a very specific way, so that's OK, but "don't use this" is probably needed.
Can you do it on the send side (i.e. don't send this msg)? Would be significant code reduction...
I just want to say that this code was a delight to review. 🧡 ! |
Let's test that things stay together! One cool thing to note is that now we sort of "magically" recover from pretty brutal disconnects! Very nice!
When we got our peer's sigs, if we were the remote, we would re-notify the plugin, which in turn would re-send the tx-sigs to use. In the case of CLN, we'd then - break, because we'd re-forward the sigs to the `openchannel` plugin, which was then in the wrong state (MULTIFUNDCHANNEL_SIGNED) spenderp: plugins/spender/openchannel.c:598: json_peer_sigs: Assertion `dest->state == MULTIFUNDCHANNEL_SECURED' failed. spenderp: FATAL SIGNAL 6 (version 5880d59-modded) In the case of eclair, they'd just see our 2nd TX_SIGS message and @t-bast would complain: > This test works, with one minor issue: on reconnection, cln sends its tx_signatures twice (duplicate?). This commit does two things: - has the openchannel / spender plugin log a broken instead of crashing when the state is not what we're expecting - stops us from calling the `funder` plugin if this is a replay/second receipt of commit-sigs.
Originally the accepter waited for the peer to send us their commitment sigs before we send ours; this changes things so that the accepter sends their commitment sigs ASAP. This test fails: when cln is not the channel initiator, it waits for the other node to send commit_sig before sending its own commit_sig. There is no reason to do that, both nodes should send commit_sig immediately after exchanging tx_complete? Otherwise it's a missed opportunity to finalize the channel creation on reconnection, because in that case cln hasn't saved the channel and fails it on reconnection. Reported-By: @t-bast
Now that we save the commitment sigs immediately, we have to drop the connection elsewhere in the flow to get the state where only one peer remembers.
We don't let go of the `msg` on error, which triggers a memleak warning! lightningd-2 2023-10-31T19:54:06.582Z **BROKEN** lightningd: MEMLEAK: 0x55ae3615b498 lightningd-2 2023-10-31T19:54:06.582Z **BROKEN** lightningd: label=openingd/dualopend_wiregen.c:919:u8[] lightningd-2 2023-10-31T19:54:06.582Z **BROKEN** lightningd: alloc: lightningd-2 2023-10-31T19:54:06.685Z **BROKEN** lightningd: ccan/ccan/tal/tal.c:477 (tal_alloc_) lightningd-2 2023-10-31T19:54:06.686Z **BROKEN** lightningd: ccan/ccan/tal/tal.c:506 (tal_alloc_arr_) lightningd-2 2023-10-31T19:54:06.686Z **BROKEN** lightningd: openingd/dualopend_wiregen.c:919 (towire_dualopend_send_tx_sigs) lightningd-2 2023-10-31T19:54:06.686Z **BROKEN** lightningd: lightningd/dual_open_control.c:1122 (openchannel2_sign_hook_cb) lightningd-2 2023-10-31T19:54:06.686Z **BROKEN** lightningd: lightningd/plugin_hook.c:194 (plugin_hook_call_next) lightningd-2 2023-10-31T19:54:06.687Z **BROKEN** lightningd: lightningd/plugin_hook.c:169 (plugin_hook_callback) lightningd-2 2023-10-31T19:54:06.687Z **BROKEN** lightningd: lightningd/plugin.c:660 (plugin_response_handle) lightningd-2 2023-10-31T19:54:06.687Z **BROKEN** lightningd: lightningd/plugin.c:772 (plugin_read_json_one) lightningd-2 2023-10-31T19:54:06.687Z **BROKEN** lightningd: lightningd/plugin.c:823 (plugin_read_json) lightningd-2 2023-10-31T19:54:06.687Z **BROKEN** lightningd: ccan/ccan/io/io.c:59 (next_plan) lightningd-2 2023-10-31T19:54:06.687Z **BROKEN** lightningd: ccan/ccan/io/io.c:407 (do_plan) lightningd-2 2023-10-31T19:54:06.687Z **BROKEN** lightningd: ccan/ccan/io/io.c:417 (io_ready) lightningd-2 2023-10-31T19:54:06.687Z **BROKEN** lightningd: ccan/ccan/io/poll.c:453 (io_loop) lightningd-2 2023-10-31T19:54:06.687Z **BROKEN** lightningd: lightningd/io_loop_with_timers.c:22 (io_loop_with_timers) lightningd-2 2023-10-31T19:54:06.688Z **BROKEN** lightningd: lightningd/lightningd.c:1333 (main) lightningd-2 2023-10-31T19:54:06.688Z **BROKEN** lightningd: ../sysdeps/nptl/libc_start_call_main.h:58 (__libc_start_call_main) lightningd-2 2023-10-31T19:54:06.688Z **BROKEN** lightningd: ../csu/libc-start.c:392 (__libc_start_main_impl) lightningd-2 2023-10-31T19:54:06.688Z **BROKEN** lightningd: parents:
If we disconnect, we lose the open_attempt record. Which is fine, but we should prevent the user from starting another RBF if the last one isn't done yet!
Here we make sure we can drop the initial tx to chain, and that an inflight txid that's missing its commitment sigs is properly ignored.
results in an error.
We weren't blocking if the tx-sigs arrived before the commitment sigs. This was causing problems in the openchannel (spender plugin) spenderp: FATAL SIGNAL 11 (version v23.08.1-404-g62ff475-modded) 0x559836dc98ba send_backtrace common/daemon.c:33 0x559836dc9951 crashdump common/daemon.c:75 0x7f37f42c351f ??? ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0 0x7f37f441ac92 ??? ../sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S:83 0x559836db7760 bitcoin_txid_eq ./bitcoin/tx.h:29 0x559836db7760 collect_sigs plugins/spender/openchannel.c:509 0x559836db81de check_sigs_ready plugins/spender/openchannel.c:531 0x559836db84dd json_peer_sigs plugins/spender/openchannel.c:611 0x559836dbcad7 ld_command_handle plugins/libplugin.c:1611 0x559836dbcd9d ld_read_json_one plugins/libplugin.c:1721 0x559836dbce29 ld_read_json plugins/libplugin.c:1741 0x559836ef3bff next_plan ccan/ccan/io/io.c:59 0x559836ef40da do_plan ccan/ccan/io/io.c:407 0x559836ef4177 io_ready ccan/ccan/io/io.c:417 0x559836ef5b14 io_loop ccan/ccan/io/poll.c:453 0x559836dbd48d plugin_main plugins/libplugin.c:1948 0x559836db22bf main plugins/spender/main.c:35 0x7f37f42aad8f __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 0x7f37f42aae3f __libc_start_main_impl ../csu/libc-start.c:392 0x559836da3774 ??? ???:0 0xffffffffffffffff ??? ???:0 2023-10-31T15:15:57.458Z INFO plugin-spenderp: Killing plugin: exited during normal operation 2023-10-31T15:15:57.458Z **BROKEN** plugin-spenderp: Plugin marked as important, shutting down lightningd! 2023-10-31T15:15:57.458Z DEBUG lightningd: io_break: lightningd_exit 2023-10-31T15:15:57.458Z DEBUG lightningd: io_loop_with_timers: main 2023-10-31T15:15:57.458Z DEBUG connectd: REPLY WIRE_CONNECTD_START_SHUTDOWN_REPLY with 0 fds 2023-10-31T15:15:57.458Z DEBUG lightningd: io_break: connectd_start_shutdown_reply 2023-10-31T15:15:57.458Z DEBUG 021ccce7bc396996c8f3b7bfeb1e30c6600269517026a74adfe2217b7187879797-dualopend-chan#1: Status closed, but not exited. Killing 2023-10-31T15:15:57.458Z DEBUG lightningd: Command returned result after jcon close 2023-10-31T15:15:57.458Z INFO 021ccce7bc396996c8f3b7bfeb1e30c6600269517026a74adfe2217b7187879797-chan#1: Unsaved peer failed. Deleting channel. 2023-10-31T15:15:57.464Z DEBUG lightningd: io_break: destroy_plugin 2023-10-31T15:15:57.464Z DEBUG connectd: Shutting down 2023-10-31T15:15:57.464Z DEBUG gossipd: Shutting down 2023-10-31T15:15:57.464Z DEBUG hsmd: Shutting down Reported-By: @t-bast
2866b4e
to
cd5ccd5
Compare
I ran tests again |
This commit reworks how reconnects work for dual-funding/v2 channel opens.
We migrate to using the
funding_next_txid
protocol that @t-bast developed (see lightning/bolts@cd3c99e). It works very nicely!PR also includes some cleanups and tidying in the
dualopend.c
daemon.Short overview of changes:
channel_inflights
with blank/emptylast_tx
(the commitment tx field).commit_ready
is called (as the spec requires) before we send ourcommitment_signed
.commit_rcvd
is called after we've received the peer'scommitment_signed
.next_funding_txid
sent by a peer to resend them the requestedcommitment_signed
+tx_signatures
, as requested/required.commitment_signed
out of order (prior to this we'd fail since we'd only accept a commitment_signed in a specific part of the initial channel fow).It looks like a lot of commits, but they should be fairly easy to review? Here's hoping any way.