
Proper connectd cleanups #5261

Merged: 38 commits merged into ElementsProject:master from guilt/connectd-cleanups on Jul 19, 2022

Conversation

@rustyrussell (Contributor) commented May 16, 2022

I applied some bandaids for the v0.11.1 release, but this fixes the issues properly, and also addresses some things I found along the way.

  1. connectd was not allowed to simply disconnect peers; it could only do so when told to by lightningd. This meant a complex dance if, e.g., we got a reconnection: we had to ask lightningd to tell us to disconnect. It turns out, however, that it's not that hard to have lightningd clean up when told by connectd, which makes the logic in connectd much simpler!
  2. When lightningd wanted to talk to a peer, it would ask connectd to allocate file descriptors for that channel id and return them. This again required a callback dance and added complexity. Instead, lightningd now allocates the file descriptors and hands them down for connectd to attach.
  3. We don't ever tell connectd to explicitly disconnect (except from the "disconnect" RPC command, or some cases where we can't allocate a subdaemon fd): it does so on send/receive of a warning/error already. This changed a number of tests which assumed disconnection, but many tests had to be reworked anyway.
  4. Unless another connection comes in, or we are explicitly told to disconnect, we give 5 seconds' grace for any remaining messages to/from the subds to drain (see the sketch below). This is much more reliable and friendly.
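
To make point 4 concrete, here is a minimal sketch of such a drain grace period in plain C. The names and structure are hypothetical; the real connectd drives this from its ccan/io event loop and timers:

```c
#include <stdbool.h>
#include <time.h>

#define DRAIN_GRACE_SECONDS 5

struct drain_state {
    struct timespec deadline;   /* when the grace period ends */
    bool forced;                /* a new connection or explicit disconnect skips the grace */
};

static void start_drain(struct drain_state *d)
{
    clock_gettime(CLOCK_MONOTONIC, &d->deadline);
    d->deadline.tv_sec += DRAIN_GRACE_SECONDS;
    d->forced = false;
}

/* Polled from the event loop: keep relaying subd<->peer traffic until the
 * grace expires, another connection comes in, or we're told to disconnect. */
static bool drain_expired(const struct drain_state *d)
{
    struct timespec now;

    if (d->forced)
        return true;
    clock_gettime(CLOCK_MONOTONIC, &now);
    if (now.tv_sec != d->deadline.tv_sec)
        return now.tv_sec > d->deadline.tv_sec;
    return now.tv_nsec >= d->deadline.tv_nsec;
}
```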

With this rework, we seem to be reliably able to run the tests I used to locate the issues, so those are included at the end.

But since disconnection is no longer instantaneous, many tests had to be fixed (under CI, that is; they pass on my build machine!).

Aspirational list of things this should fix:

Fixes: #5353
Fixes: #5366
Fixes: #5371

@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch 4 times, most recently from 5209547 to 5e20ce5 Compare May 19, 2022 07:36
@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch from 5e20ce5 to 18f76ba Compare June 7, 2022 10:21
@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch 2 times, most recently from 171fd41 to c9fec4f Compare June 16, 2022 07:21
@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch from c9fec4f to d6aa721 Compare July 8, 2022 13:50
@rustyrussell rustyrussell added this to the v0.12 milestone Jul 8, 2022
@niftynei (Contributor) commented Jul 8, 2022

maybe #5366 ?

@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch from d6aa721 to ac2bd2e Compare July 9, 2022 02:49
@rustyrussell (Contributor, Author) commented:

Rebased, and fixed extra Python whitespace.

@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch 3 times, most recently from 2a4a574 to 6af7457 Compare July 12, 2022 12:22
@rustyrussell (Contributor, Author) commented:

Rebased on master, slowly fixing new flakes I added!

@niftynei (Contributor) left a review comment:


Read through the changes in detail; I didn't have a chance to dig into the causes of the test failures though, which makes me think there may be something here that I missed.

Overall, the removal of the flag checks in connectd is quite nice, I found the new control flow a lot easier to reason about.

A bunch of very tiny spelling nits; otherwise LGTM.

Review threads (resolved):
- connectd/multiplex.c (outdated)
- lightningd/peer_control.c (outdated)
- lightningd/peer_control.c (outdated)
- tests/test_opening.py
@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch from 0a54278 to 6930eba Compare July 13, 2022 06:07
@rustyrussell (Contributor, Author) commented:

I've removed many of my test workarounds, since I discovered that we needed to be looser in terminating connections: give all the subds and the peer a chance to drain out if we can.

And applied Lisa's spelling fixes! Me typoe!

@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch 3 times, most recently from 45cd886 to 8b5c508 Compare July 14, 2022 10:40
@rustyrussell (Contributor, Author) commented:

Rebased on master for static channel backups clash...

@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch 2 times, most recently from deadab1 to 8c5bb57 Compare July 15, 2022 12:02
Setting it to 0xfffff... is just confusing.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
```
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: FATAL SIGNAL 11 (version e0507aa)
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: common/daemon.c:38 (send_backtrace) 0x56319736e437
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: common/daemon.c:46 (crashdump) 0x56319736e48a
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0 ((null)) 0x7fc37721151f
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/subd.c:839 (subd_send_msg) 0x5631973425f3
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/subd.c:859 (subd_req_) 0x5631973426e4
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/peer_control.c:2967 (peer_dev_memleak) 0x56319732ca93
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/memdump.c:296 (json_memleak) 0x56319730fc97
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:630 (command_exec) 0x563197306f9d
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:765 (rpc_command_hook_final) 0x5631973075d5
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/plugin_hook.c:278 (plugin_hook_call_) 0x56319733db47
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:853 (plugin_hook_call_rpc_command) 0x5631973079d4
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:957 (parse_request) 0x563197307efb
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:1054 (read_json) 0x563197308361
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x5631973de46c
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x5631973df0a1
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x5631973df0e3
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ccan/ccan/io/poll.c:453 (io_loop) 0x5631973e147f
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/io_loop_with_timers.c:22 (io_loop_with_timers) 0x563197305041
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/lightningd.c:1184 (main) 0x56319730b58d
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ../sysdeps/nptl/libc_start_call_main.h:58 (__libc_start_call_main) 0x7fc3771f8d8f
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ../csu/libc-start.c:392 (__libc_start_main_impl) 0x7fc3771f8e3f
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x5631972dec54
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0xffffffffffffffff
lightningd-3 2022-07-14T08:18:38.976Z DEBUG   connectd: drain_peer
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's directly a product of "does it have a current owner subdaemon"
and "does that subdaemon talk to peers", so create a helper function
which just evaluates that instead.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
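
A rough sketch of the kind of helper described above; the struct and field names here are illustrative, not the actual lightningd ones:

```c
#include <stdbool.h>
#include <stddef.h>

struct subd_info {
    bool talks_to_peer;     /* does this subdaemon own the peer connection? */
};

struct channel_info {
    const struct subd_info *owner;  /* NULL if there is no current owner subdaemon */
};

/* The old flag was exactly "has an owner subdaemon" AND "that subdaemon
 * talks to peers", so evaluate that directly instead of caching it. */
static bool channel_talks_to_peer(const struct channel_info *channel)
{
    return channel->owner != NULL && channel->owner->talks_to_peer;
}
```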
@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch 7 times, most recently from 7297a00 to 82bd474 Compare July 18, 2022 06:05
@rustyrussell rustyrussell requested a review from cdecker as a code owner July 18, 2022 10:22
Before this patch:
1. connectd says it's connected (peer_connected)
2. we tell connectd we want to talk about each channel (peer_make_active)
3. connectd gives us an fd for each channel, and we connect it to a subd (peer_active)
4. OR, connectd says the peer sent something about a channel we didn't tell it about, with an fd (peer_active)

Now:
1. connectd says it's connected (peer_connected)
2. we start all the appropriate subds and tell connectd which channels/fds to attach (peer_connect_subd).
3. if connectd says the peer sent something about a channel we didn't tell it about, we either tell
   it to hang up (peer_final_msg), or connect a new opening daemon (peer_connect_subd).

This is the minimal-size patch, which is why we create socket pairs in
so many places to use the existing functions.  Many cleanups are
possible, since the new flow is so simple.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
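
For illustration, a standalone sketch of the socketpair hand-off just described: lightningd creates the pair, keeps one end for the subdaemon, and would send the other down to connectd to attach to the peer. How the fd actually travels to connectd (SCM_RIGHTS over the existing control connection) is not shown:

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fds[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        perror("socketpair");
        return 1;
    }

    /* fds[0]: handed to the subdaemon (e.g. channeld) as its peer fd.
     * fds[1]: sent to connectd, which splices it onto the live peer
     * connection.  Neither transfer is shown here. */
    printf("subd end = %d, connectd end = %d\n", fds[0], fds[1]);

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```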
This allows us to detect when lightningd hasn't seen our latest
disconnect/reconnect; in particular, we would hit the following pattern:

1. lightningd says to connect a subd.
2. connectd disconnects and reconnects.
3. connectd reads message, connects subd.
4. lightningd reads disconnect and reconnect, sends msg to connect to subd again.
5. connectd asserts because subd is already connected.

This way connectd can tell if lightningd is talking about the previous
connection, and ignore it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
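
A minimal sketch of the idea: give each connection a counter, have lightningd echo it back in its requests, and let connectd ignore requests that refer to an earlier connection. The names are hypothetical, not the actual wire fields:

```c
#include <stdbool.h>
#include <stdint.h>

struct peer_conn {
    uint64_t connect_counter;   /* bumped on every connect/reconnect */
};

static void peer_reconnected(struct peer_conn *peer)
{
    peer->connect_counter++;
}

/* lightningd includes the counter it last saw; if it doesn't match the
 * current connection, the request is about a stale one and can be ignored. */
static bool request_is_stale(const struct peer_conn *peer, uint64_t counter_in_msg)
{
    return counter_in_msg != peer->connect_counter;
}
```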
Now this passes.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A subtle case I hadn't come across before: if a child tal_resizes()
its parent while the parent is being deleted, tal gets confused.

The subd destructor does this using tal_arr_remove() on peer->subds,
which is currently being freed:

```
==61056== Invalid read of size 8
==61056==    at 0x185632: del_tree (tal.c:417)
==61056==    by 0x18560D: del_tree (tal.c:412)
==61056==    by 0x185957: tal_free (tal.c:486)
==61056==    by 0x1183BC: peer_discard (connectd.c:1861)
==61056==    by 0x11869E: recv_req (connectd.c:1942)
==61056==    by 0x12774B: handle_read (daemon_conn.c:35)
==61056==    by 0x173453: next_plan (io.c:59)
==61056==    by 0x17405B: do_plan (io.c:407)
==61056==    by 0x17409D: io_ready (io.c:417)
==61056==    by 0x176390: io_loop (poll.c:453)
==61056==    by 0x118A68: main (connectd.c:2082)
==61056==  Address 0x4bd8850 is 16 bytes inside a block of size 48 free'd
==61056==    at 0x483DFAF: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==61056==    by 0x1860E6: tal_resize_ (tal.c:699)
==61056==    by 0x1373DD: tal_arr_remove_ (utils.c:184)
==61056==    by 0x11D508: destroy_subd (multiplex.c:930)
==61056==    by 0x1850A4: notify (tal.c:240)
==61056==    by 0x1855BB: del_tree (tal.c:402)
==61056==    by 0x18560D: del_tree (tal.c:412)
==61056==    by 0x18560D: del_tree (tal.c:412)
==61056==    by 0x185957: tal_free (tal.c:486)
==61056==    by 0x1183BC: peer_discard (connectd.c:1861)
==61056==    by 0x11869E: recv_req (connectd.c:1942)
==61056==    by 0x12774B: handle_read (daemon_conn.c:35)
```

So simply make the subds children of `peer`, not of the `peer->subds`
array.  The only effect is that drain_peer() can't simply free the
subds array, but must free the subds one at a time.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
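
For readers unfamiliar with the hazard, this standalone snippet reproduces the problematic pattern described above (it assumes ccan/tal is available, and it demonstrates the bug being avoided, not code to copy): a child's destructor tal_resize()s its parent while that parent is being freed:

```c
#include <ccan/tal/tal.h>

struct child {
    int **parent_arr;   /* points at the variable holding our tal parent */
};

static void destroy_child(struct child *c)
{
    /* Resizing the parent from the child's destructor, while the parent
     * is itself being deleted, is exactly what confuses tal. */
    tal_resize(c->parent_arr, 0);
}

int main(void)
{
    int *arr = tal_arr(NULL, int, 4);
    struct child *c = tal(arr, struct child);   /* c is a child of arr */

    c->parent_arr = &arr;
    tal_add_destructor(c, destroy_child);

    /* Freeing arr runs destroy_child part-way through the free, giving
     * the kind of invalid read shown in the valgrind trace above. */
    tal_free(arr);
    return 0;
}
```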
Give them time to process any final messages!  If there's a reconnect,
then we need to clean them up immediately of course.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Very unusual, but it can happen, and we don't free:

```
lightningd-1 2022-07-12T04:21:22.591Z DEBUG   gossipd: REPLY WIRE_GOSSIPD_DEV_MEMLEAK_REPLY with 0 fds
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd: MEMLEAK: 0x55e73123d008
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:   label=common/gossip_store.c:92:u8[]
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:   backtrace:
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:     ccan/ccan/tal/tal.c:442 (tal_alloc_)
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:     ccan/ccan/tal/tal.c:471 (tal_alloc_arr_)
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:     common/gossip_store.c:92 (gossip_store_next)
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:     connectd/multiplex.c:433 (maybe_from_gossip_store)
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:     connectd/multiplex.c:856 (write_to_peer)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     ccan/ccan/io/io.c:59 (next_plan)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     ccan/ccan/io/io.c:407 (do_plan)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     ccan/ccan/io/io.c:423 (io_ready)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     ccan/ccan/io/poll.c:453 (io_loop)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     connectd/connectd.c:2083 (main)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     ../sysdeps/nptl/libc_start_call_main.h:58 (__libc_start_call_main)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     ../csu/libc-start.c:392 (__libc_start_main_impl)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:   parents:
```
…disconnect, or peer disconnects.

We want to avoid lost messages in the common cases.

This generalizes our drain code by giving each subd 5 seconds to
close itself, while continuing to allow it to send us traffic (if the
peer is still connected) and continuing to send it traffic.

We continue to send traffic *out* to the peer (if it's still
connected) until all subds are gone.  We still have a 5-second timer
to close the connection to the peer.

We don't do this "drain period" on reconnects: we kill
immediately.

We fix up one test which was looking for the "disconnect" message
explicitly.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
From @niftynei.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I saw this in test_bech32_funding: the peer disconnected (and gossipd was told) before lightningd
relayed a local_channel_announcement from the subd:

```
lightningd-1 2022-07-14T08:46:32.352Z DEBUG   020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d-channeld-chan#1: billboard: Funding transaction locked. Channel announced.
lightningd-1 2022-07-14T08:46:33.353Z DEBUG   connectd: drain_peer
lightningd-1 2022-07-14T08:46:33.353Z DEBUG   connectd: drain_peer draining subd!
lightningd-1 2022-07-14T08:46:33.353Z DEBUG   020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d-lightningd: peer_disconnect_done
lightningd-1 2022-07-14T08:46:33.354Z INFO    020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d-channeld-chan#1: Peer connection lost
lightningd-1 2022-07-14T08:46:33.354Z INFO    020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d-chan#1: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
lightningd-1 2022-07-14T08:46:33.354Z DEBUG   connectd: maybe_free_peer freeing peer!
lightningd-1 2022-07-14T08:46:33.354Z DEBUG   0228af54fd951097caa2ceea5546a37bcc7d7f746e1cb7cb549e3edcd1797a1d80-hsmd: Got WIRE_HSMD_CUPDATE_SIG_REQ
lightningd-1 2022-07-14T08:46:33.354Z DEBUG   plugin-funder: Cleaning up inflights for peer id 020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d
lightningd-1 2022-07-14T08:46:33.354Z **BROKEN** gossipd: Unknown peer 020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d for local_channel_announcement
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It usually works out due to other reconnections, but I noticed this
while diagnosing another test.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In particular, when onchaind finishes with a channel, and we delete it, we
would forget about the peer, even if it's still connected.  That leads to a
surprise if we are activated because of something it sends:

```
2022-07-16T09:07:51.8668176Z lightningd-1 2022-07-16T08:54:32.497Z INFO    022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-chan#1: Peer transient failure in AWAITING_UNILATERAL: Disconnected
2022-07-16T09:07:51.8668717Z lightningd-1 2022-07-16T08:54:32.497Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-lightningd: Will try reconnect in 1 seconds
2022-07-16T09:07:51.8669323Z lightningd-1 2022-07-16T08:54:32.497Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-chan#1: Peer has reconnected, state AWAITING_UNILATERAL: connecting subd
2022-07-16T09:07:51.8671225Z lightningd-1 2022-07-16T08:54:32.497Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-chan#1: Telling connectd to send error 001185a48d443eae6fbcc679accd4d497c4183b711f2cd204c0b50acd3cd76fda08d00936368616e6e656c643a207265636569766564204552524f52206572726f72206368616e6e656c20383561343864343433656165366662636336373961636364346434393763343138336237313166326364323034633062353061636433636437366664613038643a20466f726369626c7920636c6f7365642062792060636c6f73656020636f6d6d616e642074696d656f7574
2022-07-16T09:07:51.8671786Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: peer_in WIRE_GOSSIP_TIMESTAMP_FILTER
2022-07-16T09:07:51.8672270Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-hsmd: Got WIRE_HSMD_ECDH_REQ
2022-07-16T09:07:51.8673027Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-onchaind-chan#1: billboard: All outputs resolved: waiting 0 more blocks before forgetting channel
2022-07-16T09:07:51.8673419Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   gossipd: REPLY WIRE_GOSSIPD_GET_ADDRS_REPLY with 0 fds
2022-07-16T09:07:51.8673954Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: peer_out WIRE_GOSSIP_TIMESTAMP_FILTER
2022-07-16T09:07:51.8674298Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   hsmd: Client: Received message 1 from client
2022-07-16T09:07:51.8674811Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-gossipd: seeker: chosen as startup peer
2022-07-16T09:07:51.8675330Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Activating for message WIRE_ERROR
2022-07-16T09:07:51.8675825Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-hsmd: Got WIRE_HSMD_ECDH_REQ
...
2022-07-16T09:07:51.9503144Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: FATAL SIGNAL 11 (version 6e6b41d)
2022-07-16T09:07:51.9503563Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: common/daemon.c:38 (send_backtrace) 0x5620882dbffb
2022-07-16T09:07:51.9503970Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: common/daemon.c:46 (crashdump) 0x5620882dc04d
2022-07-16T09:07:51.9504534Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0 ((null)) 0x7ffb3abdf08f
2022-07-16T09:07:51.9504973Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/peer_control.c:1378 (peer_spoke) 0x5620882aedd7
2022-07-16T09:07:51.9505418Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/connect_control.c:492 (connectd_msg) 0x56208828e9db
2022-07-16T09:07:51.9505835Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/subd.c:557 (sd_msg_read) 0x5620882bd89a
2022-07-16T09:07:51.9506236Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x5620883318d4
2022-07-16T09:07:51.9506618Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x562088331da1
2022-07-16T09:07:51.9507021Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x562088331e3e
2022-07-16T09:07:51.9507428Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ccan/ccan/io/poll.c:453 (io_loop) 0x5620883337d3
2022-07-16T09:07:51.9507945Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/io_loop_with_timers.c:22 (io_loop_with_timers) 0x5620882969f5
2022-07-16T09:07:51.9508368Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/lightningd.c:1190 (main) 0x56208829a7bb
2022-07-16T09:07:51.9508804Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ../csu/libc-start.c:308 (__libc_start_main) 0x7ffb3abc0082
2022-07-16T09:07:51.9509172Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x56208827d32d
2022-07-16T09:07:51.9509552Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0xffffffffffffffff
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Disconnecting a peer after openingd fails is not instantaneous:
we abort the open, so openingd sends out a WIRE_ERROR which makes
connectd close the connection.

As a result this test fails often.  The simplest fix is to wait for a
second in multifundchannel before retrying, which is also robust
against behaviour changes if we decide *not* to disconnect in future.

Also make sure that addrhint ownership is correct, since this can
lead to a use-after-free if we filter dests.

```
tests/test_connection.py::test_multifunding_best_effort FAILED                                                    [100%]

======================================================= FAILURES ========================================================
_____________________________________________ test_multifunding_best_effort _____________________________________________

node_factory = <pyln.testing.utils.NodeFactory object at 0x7f6c0c95c1c0>
bitcoind = <pyln.testing.utils.BitcoinD object at 0x7f6c0c92a880>

    @pytest.mark.openchannel('v1')
    @pytest.mark.developer("disconnect=... needs DEVELOPER=1")
    def test_multifunding_best_effort(node_factory, bitcoind):
        '''
        Check that best_effort flag works.
        '''
        disconnects = ["-WIRE_INIT",
                       "-WIRE_ACCEPT_CHANNEL",
                       "-WIRE_FUNDING_SIGNED"]
        l1 = node_factory.get_node()
        l2 = node_factory.get_node()
        l3 = node_factory.get_node(disconnect=disconnects)
        l4 = node_factory.get_node()
    
        l1.fundwallet(2000000)
    
        destinations = [{"id": '{}@localhost:{}'.format(l2.info['id'], l2.port),
                         "amount": 50000},
                        {"id": '{}@localhost:{}'.format(l3.info['id'], l3.port),
                         "amount": 50000},
                        {"id": '{}@localhost:{}'.format(l4.info['id'], l4.port),
                         "amount": 50000}]
    
        for i, d in enumerate(disconnects):
            # Should succeed due to best-effort flag.
>           l1.rpc.multifundchannel(destinations, minchannels=2)

tests/test_connection.py:2070: 
...
>           raise RpcError(method, payload, resp['error'])
E           pyln.client.lightning.RpcError: RPC call failed: method: multifundchannel, payload: {'destinations': [{'id': '022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59@localhost:41023', 'amount': 50000}, {'id': '035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d@localhost:41977', 'amount': 50000}, {'id': '0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199@localhost:34943', 'amount': 50000}], 'minchannels': 2}, error: {'code': 305, 'message': 'Peer not connected at start', 'data': {'id': '0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199', 'method': 'fundchannel_start'}}
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…erfering redirects.

In various places, we assumed that when `connected` is false,
everything is finished.  This is not true: we should wait for the
state we expect.

In addition, various places allowed reconnections, which interfered
with the logic; suppress them.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…it on sending error/warning.

Connectd already does this when we *receive* an error or warning; now
it does so on send as well.  This causes a slight behavior change: we don't
disconnect when we close a channel, for example (our behaviour here
has been inconsistent across versions, depending on the code).

When connectd is told to disconnect, it now does so immediately, and
doesn't wait for subds to drain etc.  That simplifies the manual
disconnect case, which now cleans up as it would from any other
disconnection when connectd says it's disconnected.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
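
A rough sketch of the "close after sending an error or warning" idea. The message-type constants follow BOLT #1; the flag and function names are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WIRE_WARNING 1
#define WIRE_ERROR   17

struct out_conn {
    bool close_after_flush;     /* set once we queue an error or warning */
};

/* BOLT #1 messages start with a big-endian u16 type. */
static uint16_t wire_msg_type(const uint8_t *msg, size_t len)
{
    if (len < 2)
        return 0;
    return (uint16_t)((msg[0] << 8) | msg[1]);
}

static void queue_peer_msg(struct out_conn *conn, const uint8_t *msg, size_t len)
{
    if (wire_msg_type(msg, len) == WIRE_ERROR
        || wire_msg_type(msg, len) == WIRE_WARNING)
        conn->close_after_flush = true;

    /* ...the actual enqueue for the peer's write loop is omitted... */
}
```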
The only places which should call try_reconnect now are the "connect"
command, and the disconnect path when it decides there's still an
active channel.

This introduces one subtlety: if we disconnect when there's no active
channel, but then the subd makes one, we have to catch that case!

This temporarily reverts "slow" reconnections to fast ones: see next
patch.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Just keep a flag on the peer, and delay reconnection longer if that is set.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
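
A tiny sketch of the flag-based delay; the field name and the delay values are illustrative only:

```c
#include <stdbool.h>

struct reconnect_state {
    bool delay_reconnect;   /* set when the last disconnect was deliberate */
};

static unsigned int reconnect_delay_seconds(const struct reconnect_state *p)
{
    /* "Slow" reconnects wait much longer than the normal fast retry. */
    return p->delay_reconnect ? 60 : 1;
}
```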
Dualopend is not listening to the peer fd when it hangs up, so it doesn't
notice it's gone.  We don't clean up the channel until it's done (usually
a good thing: it could be about to lock it in), but this harms us
here.

Fix the test failure and add a comment.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We close the channel because the payment fulfilment is not totally done, and
we generate 6 blocks, causing it to hit the CLTV deadline.

```
        # send some payments, mine a block or two
        inv = l2.rpc.invoice(10**4, '1', 'no_1')
        l1.rpc.pay(inv['bolt11'])
    
        # l2 attempts to close a channel that it leased, should fail
        with pytest.raises(RpcError, match=r'Peer leased this channel from us'):
            l2.rpc.close(l1.get_channel_scid(l2))
    
        bitcoind.generate_block(6)
        sync_blockheight(bitcoind, [l1, l2])
        # make sure we're at the right place for the csv lock
>       l2.daemon.wait_for_log('Blockheight: SENT_ADD_ACK_COMMIT->RCVD_ADD_ACK_REVOCATION LOCAL now 115')

tests/test_closing.py:823: 
...
lightningd-2 2022-07-17T13:15:34.242Z DEBUG   lightningd: Adding block 115: 39d95061935e9fc42b04c86ae60d0cf157765aff4c040f3a8d0b7888db19e015
lightningd-2 2022-07-17T13:15:34.244Z UNUSUAL 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-chan#2: Peer permanent failure in CHANNELD_NORMAL: Fulfilled HTLC 0 SENT_REMOVE_COMMIT cltv 115 hit deadline
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Otherwise connectd complains we didn't close, e.g.
tests/test_connection.py::test_funding_cancel_race:

```

lightningd-1 2022-07-17T14:43:56.813Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: Send error reason: Cancel channel by our RPC command before funding transaction broadcast.
lightningd-1 2022-07-17T14:43:56.867Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: peer_out WIRE_ERROR
lightningd-2 2022-07-17T14:43:56.926Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-channeld-chan#1: peer_in WIRE_ERROR
lightningd-2 2022-07-17T14:43:56.951Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-channeld-chan#1: billboard perm: Received error channel d11396cbb8de10f02ee8d76ff6265bad0eefa7e43b4f540f14dfaab851aa3606: Cancel channel by our RPC command before funding transaction broadcast.
lightningd-1 2022-07-17T14:43:56.952Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: Status closed, but not exited. Killing
lightningd-2 2022-07-17T14:43:56.976Z UNUSUAL 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-chan#1: Peer permanent failure in CHANNELD_AWAITING_LOCKIN: channeld: received ERROR error channel d11396cbb8de10f02ee8d76ff6265bad0eefa7e43b4f540f14dfaab851aa3606: Cancel channel by our RPC command before funding transaction broadcast., forget channel
DEBUG:root:Received response for fundchannel_cancel call: {'jsonrpc': '2.0', 'id': 1, 'result': {'cancelled': 'Channel open canceled by RPC(after fundchannel_complete)'}}
DEBUG:root:Received response for fundchannel_cancel call: {'jsonrpc': '2.0', 'id': 1, 'result': {'cancelled': 'Channel open canceled by RPC(after fundchannel_complete)'}}
DEBUG:root:{
  "id": 1,
  "result": {
    "cancelled": "Channel open canceled by RPC(after fundchannel_complete)"
  }
}
DEBUG:root:{
  "id": 1,
  "result": {
    "cancelled": "Channel open canceled by RPC(after fundchannel_complete)"
  }
}
DEBUG:root:{
  "id": 1,
  "method": "txdiscard",
  "params": {
    "txid": "0736aa51b8aadf140f544f3be4a7ef0ead5b26f66fd7e82ef010deb8cb9613d1"
  }
}
lightningd-1 2022-07-17T14:43:57.022Z DEBUG   connectd: drain_peer
DEBUG:root:Calling txdiscard with payload {'txid': '0736aa51b8aadf140f544f3be4a7ef0ead5b26f66fd7e82ef010deb8cb9613d1'}
lightningd-1 2022-07-17T14:43:57.024Z DEBUG   connectd: drain_peer draining subd!
lightningd-1 2022-07-17T14:43:57.069Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-lightningd: peer_disconnect_done
lightningd-1 2022-07-17T14:43:57.082Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-lightningd: Not reconnecting: no active channel
...
lightningd-1 2022-07-17T14:44:01.877Z **BROKEN** 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-connectd: Subd did not close, forcing close
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch from eb1d3c1 to 76b5fc8 Compare July 18, 2022 12:12
@niftynei (Contributor) commented:

ACK 76b5fc8

@niftynei niftynei merged commit 0363c62 into ElementsProject:master Jul 19, 2022