Proper connectd cleanups #5261
Conversation
maybe #5366?
Rebased, and fixed python extra whitespace.
Rebased on master, slowly fixing new flakes I added!
Read through the changes in detail, didn't have a chance to dig into the causes of the test failures though, which makes me think maybe there's something here that I missed.
Overall, the removal of the flag checks in connectd is quite nice, I found the new control flow a lot easier to reason about.
Bunch of very tiny spelling nits, otherwise lgtm
I've removed many of my test workarounds, since I discovered that we needed to be looser in terminating connections: give all the subds and the peer a chance to drain out if we can. And applied Lisa's sp fixes! Me typoe!
Rebased on master for static channel backups clash...
Setting it to 0xfffff... is just confusing. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
```
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: FATAL SIGNAL 11 (version e0507aa)
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: common/daemon.c:38 (send_backtrace) 0x56319736e437
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: common/daemon.c:46 (crashdump) 0x56319736e48a
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0 ((null)) 0x7fc37721151f
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/subd.c:839 (subd_send_msg) 0x5631973425f3
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/subd.c:859 (subd_req_) 0x5631973426e4
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/peer_control.c:2967 (peer_dev_memleak) 0x56319732ca93
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/memdump.c:296 (json_memleak) 0x56319730fc97
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:630 (command_exec) 0x563197306f9d
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:765 (rpc_command_hook_final) 0x5631973075d5
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/plugin_hook.c:278 (plugin_hook_call_) 0x56319733db47
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:853 (plugin_hook_call_rpc_command) 0x5631973079d4
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:957 (parse_request) 0x563197307efb
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:1054 (read_json) 0x563197308361
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x5631973de46c
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x5631973df0a1
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x5631973df0e3
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ccan/ccan/io/poll.c:453 (io_loop) 0x5631973e147f
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/io_loop_with_timers.c:22 (io_loop_with_timers) 0x563197305041
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/lightningd.c:1184 (main) 0x56319730b58d
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ../sysdeps/nptl/libc_start_call_main.h:58 (__libc_start_call_main) 0x7fc3771f8d8f
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ../csu/libc-start.c:392 (__libc_start_main_impl) 0x7fc3771f8e3f
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x5631972dec54
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0xffffffffffffffff
lightningd-3 2022-07-14T08:18:38.976Z DEBUG connectd: drain_peer
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's directly a product of "does it have a current owner subdaemon" and "does that subdaemon talk to peers", so create a helper function which just evaluates that instead. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
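A minimal sketch of what such a helper might look like (the struct fields and the name `channel_is_connected` are illustrative stand-ins, not necessarily the real lightningd definitions):
```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the real lightningd types. */
struct subd {
	bool talks_to_peer;
};

struct channel {
	struct subd *owner;	/* NULL if no subdaemon currently owns the channel */
};

/* "Connected" is just: there is an owning subdaemon, and it talks to the peer. */
static bool channel_is_connected(const struct channel *channel)
{
	return channel->owner != NULL && channel->owner->talks_to_peer;
}
```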
Before this patch:
1. connectd says it's connected (peer_connected)
2. we tell connectd we want to talk about each channel (peer_make_active)
3. connectd gives us an fd for each channel, and we connect it to a subd (peer_active)
4. OR, connectd says it sent something about a channel we didn't tell it about, with an fd (peer_active)

Now:
1. connectd says it's connected (peer_connected)
2. we start all appropriate subds, and tell connectd what channels/fds to connect them to (peer_connect_subd).
3. if connectd says it sent something about a channel we didn't tell it about, we either tell it to hang up (peer_final_msg), or connect a new opening daemon (peer_connect_subd).

This is the minimal-size patch, which is why we create socket pairs in so many places to use the existing functions. Many cleanups are possible, since the new flow is so simple.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
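For illustration, the "create socket pairs to use the existing functions" trick mentioned above amounts to something like the following hedged sketch (the function name is made up, not actual lightningd code):
```c
#include <sys/socket.h>

/* Make a connected pair of fds so code that expects "an fd to the peer"
 * can keep working: one end is handed to the subdaemon, the other is
 * passed to connectd. Returns 0 on success, -1 on error. */
static int make_peer_fd_pair(int *connectd_fd, int *subd_fd)
{
	int fds[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
		return -1;
	*connectd_fd = fds[0];
	*subd_fd = fds[1];
	return 0;
}
```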
This allows us to detect when lightningd hasn't seen our latest disconnect/reconnect; in particular, we would hit the following pattern:
1. lightningd says to connect a subd.
2. connectd disconnects and reconnects.
3. connectd reads message, connects subd.
4. lightningd reads disconnect and reconnect, sends msg to connect to subd again.
5. connectd asserts because subd is already connected.

This way connectd can tell if lightningd is talking about the previous connection, and ignore it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
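One way to picture this (a sketch with invented names; the real message fields and types differ): connectd bumps a per-peer counter on every new connection, includes it in what it tells lightningd, and ignores anything lightningd sends back with a stale counter:
```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-peer state: the counter identifies the connection. */
struct peer_conn_state {
	uint64_t connectd_counter;	/* incremented for each new connection */
};

/* lightningd echoes the counter it last saw; a mismatch means it is
 * talking about a connection that has already been torn down. */
static bool msg_is_for_current_connection(const struct peer_conn_state *peer,
					  uint64_t msg_counter)
{
	return msg_counter == peer->connectd_counter;
}
```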
Now this passes. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A subtle case I hadn't come across before: if a child tal_resizes() its parent while the parent is being deleted, tal gets confused. The subd destructor does this using tal_arr_remove() on peer->subds, which is currently being freed:
```
==61056== Invalid read of size 8
==61056==    at 0x185632: del_tree (tal.c:417)
==61056==    by 0x18560D: del_tree (tal.c:412)
==61056==    by 0x185957: tal_free (tal.c:486)
==61056==    by 0x1183BC: peer_discard (connectd.c:1861)
==61056==    by 0x11869E: recv_req (connectd.c:1942)
==61056==    by 0x12774B: handle_read (daemon_conn.c:35)
==61056==    by 0x173453: next_plan (io.c:59)
==61056==    by 0x17405B: do_plan (io.c:407)
==61056==    by 0x17409D: io_ready (io.c:417)
==61056==    by 0x176390: io_loop (poll.c:453)
==61056==    by 0x118A68: main (connectd.c:2082)
==61056==  Address 0x4bd8850 is 16 bytes inside a block of size 48 free'd
==61056==    at 0x483DFAF: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==61056==    by 0x1860E6: tal_resize_ (tal.c:699)
==61056==    by 0x1373DD: tal_arr_remove_ (utils.c:184)
==61056==    by 0x11D508: destroy_subd (multiplex.c:930)
==61056==    by 0x1850A4: notify (tal.c:240)
==61056==    by 0x1855BB: del_tree (tal.c:402)
==61056==    by 0x18560D: del_tree (tal.c:412)
==61056==    by 0x18560D: del_tree (tal.c:412)
==61056==    by 0x185957: tal_free (tal.c:486)
==61056==    by 0x1183BC: peer_discard (connectd.c:1861)
==61056==    by 0x11869E: recv_req (connectd.c:1942)
==61056==    by 0x12774B: handle_read (daemon_conn.c:35)
```
So simply make the subds children of `peer`, not the `peer->subds` array. The only effect is that drain_peer() can't simply free the subds array, but must free the subds one at a time.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
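The resulting one-at-a-time cleanup is roughly the following (a simplified sketch using ccan/tal with illustrative types, not the exact connectd source):
```c
#include <ccan/tal/tal.h>

struct subd;	/* each subd's destructor removes it from peer->subds */

struct peer {
	struct subd **subds;	/* tal array of pointers; the subds themselves
				 * are tal-children of the peer, not of this array */
};

/* Instead of tal_free(peer->subds), free each subd individually: its
 * destructor shrinks the array, so we keep freeing the first entry. */
static void free_subds_one_at_a_time(struct peer *peer)
{
	while (tal_count(peer->subds) != 0)
		tal_free(peer->subds[0]);
}
```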
Give them time to process any final messages! If there's a reconnect, then we need to clean them up immediately of course. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Very unusual, but it can happen, and we don't free:
```
lightningd-1 2022-07-12T04:21:22.591Z DEBUG gossipd: REPLY WIRE_GOSSIPD_DEV_MEMLEAK_REPLY with 0 fds
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd: MEMLEAK: 0x55e73123d008
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd: label=common/gossip_store.c:92:u8[]
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd: backtrace:
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd: ccan/ccan/tal/tal.c:442 (tal_alloc_)
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd: ccan/ccan/tal/tal.c:471 (tal_alloc_arr_)
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd: common/gossip_store.c:92 (gossip_store_next)
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd: connectd/multiplex.c:433 (maybe_from_gossip_store)
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd: connectd/multiplex.c:856 (write_to_peer)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd: ccan/ccan/io/io.c:59 (next_plan)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd: ccan/ccan/io/io.c:407 (do_plan)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd: ccan/ccan/io/io.c:423 (io_ready)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd: ccan/ccan/io/poll.c:453 (io_loop)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd: connectd/connectd.c:2083 (main)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd: ../sysdeps/nptl/libc_start_call_main.h:58 (__libc_start_call_main)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd: ../csu/libc-start.c:392 (__libc_start_main_impl)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd: parents:
```
…disconnect, or peer disconnects. We want to avoid lost messages in the common cases.

This generalizes our drain code, by giving the subds each 5 seconds to close themselves, while continuing to allow them to send us traffic (if the peer is still connected) and continuing to send them traffic. We continue to send traffic *out* to the peer (if it's still connected) until all subds are gone. We still have a 5 second timer to close the connection to the peer.

We don't do this "drain period" on reconnects: we kill immediately.

We fix up one test which was looking for the "disconnect" message explicitly.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
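In sketch form, the policy is simply the following (names, structure and constant are illustrative only, not the actual connectd code):
```c
#include <stdbool.h>

#define SUBD_DRAIN_SECONDS 5	/* the 5-second window described above */

/* Illustrative only: how long to keep relaying traffic before force-closing.
 * A normal disconnect gets the grace period; a reconnect gets none, because
 * the old connection must die immediately. */
static unsigned int drain_grace_seconds(bool is_reconnect)
{
	return is_reconnect ? 0 : SUBD_DRAIN_SECONDS;
}
```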
From @niftynei. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I saw this in test_bech32_funding: the peer disconnected and told gossipd before lightningd relayed a local_channel_announcement from the subd:
```
lightningd-1 2022-07-14T08:46:32.352Z DEBUG 020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d-channeld-chan#1: billboard: Funding transaction locked. Channel announced.
lightningd-1 2022-07-14T08:46:33.353Z DEBUG connectd: drain_peer
lightningd-1 2022-07-14T08:46:33.353Z DEBUG connectd: drain_peer draining subd!
lightningd-1 2022-07-14T08:46:33.353Z DEBUG 020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d-lightningd: peer_disconnect_done
lightningd-1 2022-07-14T08:46:33.354Z INFO 020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d-channeld-chan#1: Peer connection lost
lightningd-1 2022-07-14T08:46:33.354Z INFO 020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d-chan#1: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
lightningd-1 2022-07-14T08:46:33.354Z DEBUG connectd: maybe_free_peer freeing peer!
lightningd-1 2022-07-14T08:46:33.354Z DEBUG 0228af54fd951097caa2ceea5546a37bcc7d7f746e1cb7cb549e3edcd1797a1d80-hsmd: Got WIRE_HSMD_CUPDATE_SIG_REQ
lightningd-1 2022-07-14T08:46:33.354Z DEBUG plugin-funder: Cleaning up inflights for peer id 020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d
lightningd-1 2022-07-14T08:46:33.354Z **BROKEN** gossipd: Unknown peer 020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d for local_channel_announcement
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It usually works out due to other reconnections, but I noticed this while diagnosing another test. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In particular, when onchaind finishes with a channel, and we delete it, we would forget about the peer, even if it's still connected. That leads to a surprise if we are activated because of something it sends:
```
2022-07-16T09:07:51.8668176Z lightningd-1 2022-07-16T08:54:32.497Z INFO 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-chan#1: Peer transient failure in AWAITING_UNILATERAL: Disconnected
2022-07-16T09:07:51.8668717Z lightningd-1 2022-07-16T08:54:32.497Z DEBUG 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-lightningd: Will try reconnect in 1 seconds
2022-07-16T09:07:51.8669323Z lightningd-1 2022-07-16T08:54:32.497Z DEBUG 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-chan#1: Peer has reconnected, state AWAITING_UNILATERAL: connecting subd
2022-07-16T09:07:51.8671225Z lightningd-1 2022-07-16T08:54:32.497Z DEBUG 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-chan#1: Telling connectd to send error 001185a48d443eae6fbcc679accd4d497c4183b711f2cd204c0b50acd3cd76fda08d00936368616e6e656c643a207265636569766564204552524f52206572726f72206368616e6e656c20383561343864343433656165366662636336373961636364346434393763343138336237313166326364323034633062353061636433636437366664613038643a20466f726369626c7920636c6f7365642062792060636c6f73656020636f6d6d616e642074696d656f7574
2022-07-16T09:07:51.8671786Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: peer_in WIRE_GOSSIP_TIMESTAMP_FILTER
2022-07-16T09:07:51.8672270Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-hsmd: Got WIRE_HSMD_ECDH_REQ
2022-07-16T09:07:51.8673027Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-onchaind-chan#1: billboard: All outputs resolved: waiting 0 more blocks before forgetting channel
2022-07-16T09:07:51.8673419Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG gossipd: REPLY WIRE_GOSSIPD_GET_ADDRS_REPLY with 0 fds
2022-07-16T09:07:51.8673954Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: peer_out WIRE_GOSSIP_TIMESTAMP_FILTER
2022-07-16T09:07:51.8674298Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG hsmd: Client: Received message 1 from client
2022-07-16T09:07:51.8674811Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-gossipd: seeker: chosen as startup peer
2022-07-16T09:07:51.8675330Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Activating for message WIRE_ERROR
2022-07-16T09:07:51.8675825Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-hsmd: Got WIRE_HSMD_ECDH_REQ
...
2022-07-16T09:07:51.9503144Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: FATAL SIGNAL 11 (version 6e6b41d)
2022-07-16T09:07:51.9503563Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: common/daemon.c:38 (send_backtrace) 0x5620882dbffb
2022-07-16T09:07:51.9503970Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: common/daemon.c:46 (crashdump) 0x5620882dc04d
2022-07-16T09:07:51.9504534Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0 ((null)) 0x7ffb3abdf08f
2022-07-16T09:07:51.9504973Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/peer_control.c:1378 (peer_spoke) 0x5620882aedd7
2022-07-16T09:07:51.9505418Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/connect_control.c:492 (connectd_msg) 0x56208828e9db
2022-07-16T09:07:51.9505835Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/subd.c:557 (sd_msg_read) 0x5620882bd89a
2022-07-16T09:07:51.9506236Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x5620883318d4
2022-07-16T09:07:51.9506618Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x562088331da1
2022-07-16T09:07:51.9507021Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x562088331e3e
2022-07-16T09:07:51.9507428Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ccan/ccan/io/poll.c:453 (io_loop) 0x5620883337d3
2022-07-16T09:07:51.9507945Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/io_loop_with_timers.c:22 (io_loop_with_timers) 0x5620882969f5
2022-07-16T09:07:51.9508368Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/lightningd.c:1190 (main) 0x56208829a7bb
2022-07-16T09:07:51.9508804Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ../csu/libc-start.c:308 (__libc_start_main) 0x7ffb3abc0082
2022-07-16T09:07:51.9509172Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x56208827d32d
2022-07-16T09:07:51.9509552Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0xffffffffffffffff
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Disconnecting a peer after openingd fails is not instantaneous: we abort the open, so openingd sends out a WIRE_ERROR which makes connectd close the connection. As a result, this test fails often.

The simplest fix is to wait for a second in multifundchannel before retrying, which is also robust against behaviour changes if we decide *not* to disconnect in future.

Also make sure that addrhint ownership is correct, since this can lead to a use-after-free if we filter dests.
```
tests/test_connection.py::test_multifunding_best_effort FAILED [100%]

======================================================= FAILURES ========================================================
_____________________________________________ test_multifunding_best_effort _____________________________________________

node_factory = <pyln.testing.utils.NodeFactory object at 0x7f6c0c95c1c0>
bitcoind = <pyln.testing.utils.BitcoinD object at 0x7f6c0c92a880>

    @pytest.mark.openchannel('v1')
    @pytest.mark.developer("disconnect=... needs DEVELOPER=1")
    def test_multifunding_best_effort(node_factory, bitcoind):
        '''
        Check that best_effort flag works.
        '''
        disconnects = ["-WIRE_INIT", "-WIRE_ACCEPT_CHANNEL", "-WIRE_FUNDING_SIGNED"]
        l1 = node_factory.get_node()
        l2 = node_factory.get_node()
        l3 = node_factory.get_node(disconnect=disconnects)
        l4 = node_factory.get_node()
        l1.fundwallet(2000000)
        destinations = [{"id": '{}@localhost:{}'.format(l2.info['id'], l2.port),
                         "amount": 50000},
                        {"id": '{}@localhost:{}'.format(l3.info['id'], l3.port),
                         "amount": 50000},
                        {"id": '{}@localhost:{}'.format(l4.info['id'], l4.port),
                         "amount": 50000}]
        for i, d in enumerate(disconnects):
            # Should succeed due to best-effort flag.
>           l1.rpc.multifundchannel(destinations, minchannels=2)

tests/test_connection.py:2070:
...
>           raise RpcError(method, payload, resp['error'])
E           pyln.client.lightning.RpcError: RPC call failed: method: multifundchannel, payload: {'destinations': [{'id': '022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59@localhost:41023', 'amount': 50000}, {'id': '035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d@localhost:41977', 'amount': 50000}, {'id': '0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199@localhost:34943', 'amount': 50000}], 'minchannels': 2}, error: {'code': 305, 'message': 'Peer not connected at start', 'data': {'id': '0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199', 'method': 'fundchannel_start'}}
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…erfering redirects. In various places, we assumed that when `connected` is false, everything is finished. This is not true: we should wait for the state we expect. In addition, various places allow reconnections, which interfered with the logic; suppress them. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…it on sending error/warning. Connectd already does this when we *receive* an error or warning, but now do it on send. This causes some slight behavior change: we don't disconnect when we close a channel, for example (our behaviour here has been inconsistent across versions, depending on the code). When connectd is told to disconnect, it now does so immediately, and doesn't wait for subds to drain etc. That simplifies the manual disconnect case, which now cleans up as it would from any other disconnection when connectd says it's disconnected. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The only places which should call try_reconnect now are the "connect" command, and the disconnect path when it decides there's still an active channel. This introduces one subtlety: if we disconnect when there's no active channel, but then the subd makes one, we have to catch that case! This temporarily reverts "slow" reconnections to fast ones: see next patch. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Just keep a flag on the peer, and delay connection longer if that is set. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
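A minimal sketch of that idea (the field name and delay values are invented, purely illustrative):
```c
#include <stdbool.h>

struct peer_reconnect_state {
	bool reconnect_slowly;	/* set when we want to back off */
};

/* Choose the reconnect delay based on the flag kept on the peer. */
static unsigned int reconnect_delay_seconds(const struct peer_reconnect_state *p)
{
	return p->reconnect_slowly ? 60 : 1;	/* example values only */
}
```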
Dualopend is not listening to the peer fd when it hangs up, so doesn't notice it's gone. We don't clean up the channel until it's done (usually a good thing: it could be about to lock it in), but this harms us here. Fix the test failure and add a comment. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We close the channel because the payment fulfilment is not totally done, and we generate 6 blocks, causing it to hit the CLTV deadline.
```
    # send some payments, mine a block or two
    inv = l2.rpc.invoice(10**4, '1', 'no_1')
    l1.rpc.pay(inv['bolt11'])
    # l2 attempts to close a channel that it leased, should fail
    with pytest.raises(RpcError, match=r'Peer leased this channel from us'):
        l2.rpc.close(l1.get_channel_scid(l2))
    bitcoind.generate_block(6)
    sync_blockheight(bitcoind, [l1, l2])
    # make sure we're at the right place for the csv lock
>   l2.daemon.wait_for_log('Blockheight: SENT_ADD_ACK_COMMIT->RCVD_ADD_ACK_REVOCATION LOCAL now 115')

tests/test_closing.py:823:
...
lightningd-2 2022-07-17T13:15:34.242Z DEBUG lightningd: Adding block 115: 39d95061935e9fc42b04c86ae60d0cf157765aff4c040f3a8d0b7888db19e015
lightningd-2 2022-07-17T13:15:34.244Z UNUSUAL 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-chan#2: Peer permanent failure in CHANNELD_NORMAL: Fulfilled HTLC 0 SENT_REMOVE_COMMIT cltv 115 hit deadline
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Otherwise connectd complains we didn't close, e.g. tests/test_connection.py::test_funding_cancel_race:
```
lightningd-1 2022-07-17T14:43:56.813Z DEBUG 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: Send error reason: Cancel channel by our RPC command before funding transaction broadcast.
lightningd-1 2022-07-17T14:43:56.867Z DEBUG 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: peer_out WIRE_ERROR
lightningd-2 2022-07-17T14:43:56.926Z DEBUG 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-channeld-chan#1: peer_in WIRE_ERROR
lightningd-2 2022-07-17T14:43:56.951Z DEBUG 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-channeld-chan#1: billboard perm: Received error channel d11396cbb8de10f02ee8d76ff6265bad0eefa7e43b4f540f14dfaab851aa3606: Cancel channel by our RPC command before funding transaction broadcast.
lightningd-1 2022-07-17T14:43:56.952Z DEBUG 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: Status closed, but not exited. Killing
lightningd-2 2022-07-17T14:43:56.976Z UNUSUAL 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-chan#1: Peer permanent failure in CHANNELD_AWAITING_LOCKIN: channeld: received ERROR error channel d11396cbb8de10f02ee8d76ff6265bad0eefa7e43b4f540f14dfaab851aa3606: Cancel channel by our RPC command before funding transaction broadcast., forget channel
DEBUG:root:Received response for fundchannel_cancel call: {'jsonrpc': '2.0', 'id': 1, 'result': {'cancelled': 'Channel open canceled by RPC(after fundchannel_complete)'}}
DEBUG:root:Received response for fundchannel_cancel call: {'jsonrpc': '2.0', 'id': 1, 'result': {'cancelled': 'Channel open canceled by RPC(after fundchannel_complete)'}}
DEBUG:root:{ "id": 1, "result": { "cancelled": "Channel open canceled by RPC(after fundchannel_complete)" } }
DEBUG:root:{ "id": 1, "result": { "cancelled": "Channel open canceled by RPC(after fundchannel_complete)" } }
DEBUG:root:{ "id": 1, "method": "txdiscard", "params": { "txid": "0736aa51b8aadf140f544f3be4a7ef0ead5b26f66fd7e82ef010deb8cb9613d1" } }
lightningd-1 2022-07-17T14:43:57.022Z DEBUG connectd: drain_peer
DEBUG:root:Calling txdiscard with payload {'txid': '0736aa51b8aadf140f544f3be4a7ef0ead5b26f66fd7e82ef010deb8cb9613d1'}
lightningd-1 2022-07-17T14:43:57.024Z DEBUG connectd: drain_peer draining subd!
lightningd-1 2022-07-17T14:43:57.069Z DEBUG 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-lightningd: peer_disconnect_done
lightningd-1 2022-07-17T14:43:57.082Z DEBUG 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-lightningd: Not reconnecting: no active channel
...
lightningd-1 2022-07-17T14:44:01.877Z **BROKEN** 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-connectd: Subd did not close, forcing close
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
ACK 76b5fc8
I did some band-aids for the v0.11.1 release, but this fixes the issues properly, and also addresses some things I found along the way.
With this rework, we seem to be reliably able to run the tests I used to locate the issues, so those are included at the end.
But disconnection is no longer instantaneous, so many tests had to be fixed (under CI, anyway; they pass on my build machine!).
Aspirational list of things this should fix:
Fixes: #5353
Fixes: #5366
Fixes: #5371