
Proper connectd cleanups #5261

Merged: 38 commits merged into ElementsProject:master from guilt/connectd-cleanups on Jul 19, 2022

Conversation

@rustyrussell (Contributor) commented May 16, 2022

I applied some bandaids for the v0.11.1 release, but this fixes the issues properly, and also addresses some things I found along the way.

  1. connectd was not allowed to simply disconnect peers; it could only do so when told to by lightningd. This meant a complex dance if, e.g., we got a reconnection: we had to ask lightningd to tell us to disconnect. It turns out, however, that it's not that hard to have lightningd clean up when told by connectd, which makes the logic in connectd much simpler!
  2. When lightningd wanted to talk to a peer, it would ask connectd to allocate file descriptors for that channel id and return them. This again required a callback dance and added complexity. Instead, lightningd now allocates the file descriptors and hands them down for connectd to attach.
  3. We don't ever tell connectd to explicitly disconnect (except from the "disconnect" RPC command, or some cases where we can't allocate a subdaemon fd): it does so on send/receive of a warning/error already. This changed a number of tests which assumed disconnection, but many tests had to be reworked anyway.
  4. Unless another connection comes in, or we are explicitly told to disconnect, we give 5 seconds' grace for any remaining messages to/from the subds to drain (see the sketch below). This is much more reliable and friendly.
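
To make point 4 concrete, here is a minimal sketch of such a drain grace period in plain C. The names and structure are hypothetical; the real connectd drives this from its ccan/io event loop and timers:

```c
#include <stdbool.h>
#include <time.h>

#define DRAIN_GRACE_SECONDS 5

struct drain_state {
    struct timespec deadline;   /* when the grace period ends */
    bool forced;                /* a new connection or explicit disconnect skips the grace */
};

static void start_drain(struct drain_state *d)
{
    clock_gettime(CLOCK_MONOTONIC, &d->deadline);
    d->deadline.tv_sec += DRAIN_GRACE_SECONDS;
    d->forced = false;
}

/* Polled from the event loop: keep relaying subd<->peer traffic until the
 * grace expires, another connection comes in, or we're told to disconnect. */
static bool drain_expired(const struct drain_state *d)
{
    struct timespec now;

    if (d->forced)
        return true;
    clock_gettime(CLOCK_MONOTONIC, &now);
    if (now.tv_sec != d->deadline.tv_sec)
        return now.tv_sec > d->deadline.tv_sec;
    return now.tv_nsec >= d->deadline.tv_nsec;
}
```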

With this rework, we seem to be reliably able to run the tests I used to locate the issues, so those are included at the end.

But since disconnection is no longer instantaneous, many tests had to be fixed (under CI, that is; they pass on my build machine!).

Aspirational list of things this should fix:

Fixes: #5353
Fixes: #5366
Fixes: #5371

@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch 4 times, most recently from 5209547 to 5e20ce5 Compare May 19, 2022 07:36
@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch from 5e20ce5 to 18f76ba Compare June 7, 2022 10:21
@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch 2 times, most recently from 171fd41 to c9fec4f Compare June 16, 2022 07:21
@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch from c9fec4f to d6aa721 Compare July 8, 2022 13:50
@rustyrussell rustyrussell added this to the v0.12 milestone Jul 8, 2022
@niftynei (Contributor) commented Jul 8, 2022

maybe #5366 ?

@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch from d6aa721 to ac2bd2e Compare July 9, 2022 02:49
@rustyrussell (Contributor, Author) commented:

Rebased, and fixed extra Python whitespace.

@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch 3 times, most recently from 2a4a574 to 6af7457 Compare July 12, 2022 12:22
@rustyrussell (Contributor, Author) commented:

Rebased on master, slowly fixing new flakes I added!

@niftynei (Contributor) left a review comment:


Read through the changes in detail; I didn't have a chance to dig into the causes of the test failures though, which makes me think there may be something here that I missed.

Overall, the removal of the flag checks in connectd is quite nice, I found the new control flow a lot easier to reason about.

A bunch of very tiny spelling nits; otherwise LGTM.

Review threads (resolved):
- connectd/multiplex.c (outdated)
- lightningd/peer_control.c (outdated)
- lightningd/peer_control.c (outdated)
- tests/test_opening.py
@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch from 0a54278 to 6930eba Compare July 13, 2022 06:07
@rustyrussell (Contributor, Author) commented:

I've removed many of my test workarounds, since I discovered that we needed to be looser in terminating connections: give all the subds and the peer a chance to drain out if we can.

And applied Lisa's spelling fixes! Me typoe!

@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch 3 times, most recently from 45cd886 to 8b5c508 Compare July 14, 2022 10:40
@rustyrussell (Contributor, Author) commented:

Rebased on master for static channel backups clash...

@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch 2 times, most recently from deadab1 to 8c5bb57 Compare July 15, 2022 12:02
Setting it to 0xfffff... is just confusing.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
```
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: FATAL SIGNAL 11 (version e0507aa)
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: common/daemon.c:38 (send_backtrace) 0x56319736e437
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: common/daemon.c:46 (crashdump) 0x56319736e48a
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0 ((null)) 0x7fc37721151f
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/subd.c:839 (subd_send_msg) 0x5631973425f3
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/subd.c:859 (subd_req_) 0x5631973426e4
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/peer_control.c:2967 (peer_dev_memleak) 0x56319732ca93
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/memdump.c:296 (json_memleak) 0x56319730fc97
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:630 (command_exec) 0x563197306f9d
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:765 (rpc_command_hook_final) 0x5631973075d5
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/plugin_hook.c:278 (plugin_hook_call_) 0x56319733db47
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:853 (plugin_hook_call_rpc_command) 0x5631973079d4
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:957 (parse_request) 0x563197307efb
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:1054 (read_json) 0x563197308361
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x5631973de46c
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x5631973df0a1
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x5631973df0e3
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ccan/ccan/io/poll.c:453 (io_loop) 0x5631973e147f
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/io_loop_with_timers.c:22 (io_loop_with_timers) 0x563197305041
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: lightningd/lightningd.c:1184 (main) 0x56319730b58d
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ../sysdeps/nptl/libc_start_call_main.h:58 (__libc_start_call_main) 0x7fc3771f8d8f
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: ../csu/libc-start.c:392 (__libc_start_main_impl) 0x7fc3771f8e3f
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x5631972dec54
lightningd-1 2022-07-14T08:18:38.972Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0xffffffffffffffff
lightningd-3 2022-07-14T08:18:38.976Z DEBUG   connectd: drain_peer
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's directly a product of "does it have a current owner subdaemon"
and "does that subdaemon talk to peers", so create a helper function
which just evaluates that instead.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
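
A rough sketch of the kind of helper described above; the struct and field names here are illustrative, not the actual lightningd ones:

```c
#include <stdbool.h>
#include <stddef.h>

struct subd_info {
    bool talks_to_peer;     /* does this subdaemon own the peer connection? */
};

struct channel_info {
    const struct subd_info *owner;  /* NULL if there is no current owner subdaemon */
};

/* The old flag was exactly "has an owner subdaemon" AND "that subdaemon
 * talks to peers", so evaluate that directly instead of caching it. */
static bool channel_talks_to_peer(const struct channel_info *channel)
{
    return channel->owner != NULL && channel->owner->talks_to_peer;
}
```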
@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch 7 times, most recently from 7297a00 to 82bd474 Compare July 18, 2022 06:05
@rustyrussell rustyrussell requested a review from cdecker as a code owner July 18, 2022 10:22
Before this patch:
1. connectd says it's connected (peer_connected)
2. we tell connectd we want to talk about each channel (peer_make_active)
3. connectd gives us an fd for each channel, and we connect it to a subd (peer_active)
4. OR, connectd says the peer sent something about a channel we didn't tell it about, with an fd (peer_active)

Now:
1. connectd says it's connected (peer_connected)
2. we start all the appropriate subds and tell connectd which channels/fds to attach (peer_connect_subd).
3. if connectd says the peer sent something about a channel we didn't tell it about, we either tell
   it to hang up (peer_final_msg), or connect a new opening daemon (peer_connect_subd).

This is the minimal-size patch, which is why we create socket pairs in
so many places to use the existing functions.  Many cleanups are
possible, since the new flow is so simple.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
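
For illustration, a standalone sketch of the socketpair hand-off just described: lightningd creates the pair, keeps one end for the subdaemon, and would send the other down to connectd to attach to the peer. How the fd actually travels to connectd (SCM_RIGHTS over the existing control connection) is not shown:

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fds[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        perror("socketpair");
        return 1;
    }

    /* fds[0]: handed to the subdaemon (e.g. channeld) as its peer fd.
     * fds[1]: sent to connectd, which splices it onto the live peer
     * connection.  Neither transfer is shown here. */
    printf("subd end = %d, connectd end = %d\n", fds[0], fds[1]);

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```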
This allows us to detect when lightningd hasn't seen our latest
disconnect/reconnect; in particular, we would hit the following pattern:

1. lightningd says to connect a subd.
2. connectd disconnects and reconnects.
3. connectd reads message, connects subd.
4. lightningd reads disconnect and reconnect, sends msg to connect to subd again.
5. connectd asserts because subd is already connected.

This way connectd can tell if lightningd is talking about the previous
connection, and ignore it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
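
A minimal sketch of the idea: give each connection a counter, have lightningd echo it back in its requests, and let connectd ignore requests that refer to an earlier connection. The names are hypothetical, not the actual wire fields:

```c
#include <stdbool.h>
#include <stdint.h>

struct peer_conn {
    uint64_t connect_counter;   /* bumped on every connect/reconnect */
};

static void peer_reconnected(struct peer_conn *peer)
{
    peer->connect_counter++;
}

/* lightningd includes the counter it last saw; if it doesn't match the
 * current connection, the request is about a stale one and can be ignored. */
static bool request_is_stale(const struct peer_conn *peer, uint64_t counter_in_msg)
{
    return counter_in_msg != peer->connect_counter;
}
```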
Now this passes.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A subtle case I hadn't come across before: if a child tal_resizes()
its parent while the parent is being deleted, tal gets confused.

The subd destructor does this using tal_arr_remove() on peer->subds,
which is currently being freed:

```
==61056== Invalid read of size 8
==61056==    at 0x185632: del_tree (tal.c:417)
==61056==    by 0x18560D: del_tree (tal.c:412)
==61056==    by 0x185957: tal_free (tal.c:486)
==61056==    by 0x1183BC: peer_discard (connectd.c:1861)
==61056==    by 0x11869E: recv_req (connectd.c:1942)
==61056==    by 0x12774B: handle_read (daemon_conn.c:35)
==61056==    by 0x173453: next_plan (io.c:59)
==61056==    by 0x17405B: do_plan (io.c:407)
==61056==    by 0x17409D: io_ready (io.c:417)
==61056==    by 0x176390: io_loop (poll.c:453)
==61056==    by 0x118A68: main (connectd.c:2082)
==61056==  Address 0x4bd8850 is 16 bytes inside a block of size 48 free'd
==61056==    at 0x483DFAF: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==61056==    by 0x1860E6: tal_resize_ (tal.c:699)
==61056==    by 0x1373DD: tal_arr_remove_ (utils.c:184)
==61056==    by 0x11D508: destroy_subd (multiplex.c:930)
==61056==    by 0x1850A4: notify (tal.c:240)
==61056==    by 0x1855BB: del_tree (tal.c:402)
==61056==    by 0x18560D: del_tree (tal.c:412)
==61056==    by 0x18560D: del_tree (tal.c:412)
==61056==    by 0x185957: tal_free (tal.c:486)
==61056==    by 0x1183BC: peer_discard (connectd.c:1861)
==61056==    by 0x11869E: recv_req (connectd.c:1942)
==61056==    by 0x12774B: handle_read (daemon_conn.c:35)
```

So simply make the subds children of `peer`, not of the `peer->subds`
array.  The only effect is that drain_peer() can't simply free the
subds array, but must free the subds one at a time.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
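
For readers unfamiliar with the hazard, this standalone snippet reproduces the problematic pattern described above (it assumes ccan/tal is available, and it demonstrates the bug being avoided, not code to copy): a child's destructor tal_resize()s its parent while that parent is being freed:

```c
#include <ccan/tal/tal.h>

struct child {
    int **parent_arr;   /* points at the variable holding our tal parent */
};

static void destroy_child(struct child *c)
{
    /* Resizing the parent from the child's destructor, while the parent
     * is itself being deleted, is exactly what confuses tal. */
    tal_resize(c->parent_arr, 0);
}

int main(void)
{
    int *arr = tal_arr(NULL, int, 4);
    struct child *c = tal(arr, struct child);   /* c is a child of arr */

    c->parent_arr = &arr;
    tal_add_destructor(c, destroy_child);

    /* Freeing arr runs destroy_child part-way through the free, giving
     * the kind of invalid read shown in the valgrind trace above. */
    tal_free(arr);
    return 0;
}
```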
Give them time to process any final messages!  If there's a reconnect,
then we need to clean them up immediately of course.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Very unusual, but it can happen, and we don't free:

```
lightningd-1 2022-07-12T04:21:22.591Z DEBUG   gossipd: REPLY WIRE_GOSSIPD_DEV_MEMLEAK_REPLY with 0 fds
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd: MEMLEAK: 0x55e73123d008
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:   label=common/gossip_store.c:92:u8[]
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:   backtrace:
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:     ccan/ccan/tal/tal.c:442 (tal_alloc_)
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:     ccan/ccan/tal/tal.c:471 (tal_alloc_arr_)
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:     common/gossip_store.c:92 (gossip_store_next)
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:     connectd/multiplex.c:433 (maybe_from_gossip_store)
lightningd-1 2022-07-12T04:21:22.645Z **BROKEN** connectd:     connectd/multiplex.c:856 (write_to_peer)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     ccan/ccan/io/io.c:59 (next_plan)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     ccan/ccan/io/io.c:407 (do_plan)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     ccan/ccan/io/io.c:423 (io_ready)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     ccan/ccan/io/poll.c:453 (io_loop)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     connectd/connectd.c:2083 (main)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     ../sysdeps/nptl/libc_start_call_main.h:58 (__libc_start_call_main)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:     ../csu/libc-start.c:392 (__libc_start_main_impl)
lightningd-1 2022-07-12T04:21:22.646Z **BROKEN** connectd:   parents:
```
…disconnect, or peer disconnects.

We want to avoid lost messages in the common cases.

This generalizes our drain code by giving each subd 5 seconds to
close itself, while continuing to allow it to send us traffic (if the
peer is still connected) and continuing to send it traffic.

We continue to send traffic *out* to the peer (if it's still
connected) until all subds are gone.  We still have a 5-second timer
to close the connection to the peer.

We don't do this "drain period" on reconnects: we kill
immediately.

We fix up one test which was looking for the "disconnect" message
explicitly.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
From @niftynei.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I saw this in test_bech32_funding: the peer disconnected (and gossipd was told) before lightningd
relayed a local_channel_announcement from the subd:

```
lightningd-1 2022-07-14T08:46:32.352Z DEBUG   020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d-channeld-chan#1: billboard: Funding transaction locked. Channel announced.
lightningd-1 2022-07-14T08:46:33.353Z DEBUG   connectd: drain_peer
lightningd-1 2022-07-14T08:46:33.353Z DEBUG   connectd: drain_peer draining subd!
lightningd-1 2022-07-14T08:46:33.353Z DEBUG   020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d-lightningd: peer_disconnect_done
lightningd-1 2022-07-14T08:46:33.354Z INFO    020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d-channeld-chan#1: Peer connection lost
lightningd-1 2022-07-14T08:46:33.354Z INFO    020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d-chan#1: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
lightningd-1 2022-07-14T08:46:33.354Z DEBUG   connectd: maybe_free_peer freeing peer!
lightningd-1 2022-07-14T08:46:33.354Z DEBUG   0228af54fd951097caa2ceea5546a37bcc7d7f746e1cb7cb549e3edcd1797a1d80-hsmd: Got WIRE_HSMD_CUPDATE_SIG_REQ
lightningd-1 2022-07-14T08:46:33.354Z DEBUG   plugin-funder: Cleaning up inflights for peer id 020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d
lightningd-1 2022-07-14T08:46:33.354Z **BROKEN** gossipd: Unknown peer 020ba48216be53051ba8c661c641b5d9c3547c44bfcc43bf4d8362f0dfce0e950d for local_channel_announcement
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It usually works out due to other reconnections, but I noticed this
while diagnosing another test.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In particular, when onchaind finishes with a channel, and we delete it, we
would forget about the peer, even if it's still connected.  That leads to a
surprise if we are activated because of something it sends:

```
2022-07-16T09:07:51.8668176Z lightningd-1 2022-07-16T08:54:32.497Z INFO    022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-chan#1: Peer transient failure in AWAITING_UNILATERAL: Disconnected
2022-07-16T09:07:51.8668717Z lightningd-1 2022-07-16T08:54:32.497Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-lightningd: Will try reconnect in 1 seconds
2022-07-16T09:07:51.8669323Z lightningd-1 2022-07-16T08:54:32.497Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-chan#1: Peer has reconnected, state AWAITING_UNILATERAL: connecting subd
2022-07-16T09:07:51.8671225Z lightningd-1 2022-07-16T08:54:32.497Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-chan#1: Telling connectd to send error 001185a48d443eae6fbcc679accd4d497c4183b711f2cd204c0b50acd3cd76fda08d00936368616e6e656c643a207265636569766564204552524f52206572726f72206368616e6e656c20383561343864343433656165366662636336373961636364346434393763343138336237313166326364323034633062353061636433636437366664613038643a20466f726369626c7920636c6f7365642062792060636c6f73656020636f6d6d616e642074696d656f7574
2022-07-16T09:07:51.8671786Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: peer_in WIRE_GOSSIP_TIMESTAMP_FILTER
2022-07-16T09:07:51.8672270Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-hsmd: Got WIRE_HSMD_ECDH_REQ
2022-07-16T09:07:51.8673027Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-onchaind-chan#1: billboard: All outputs resolved: waiting 0 more blocks before forgetting channel
2022-07-16T09:07:51.8673419Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   gossipd: REPLY WIRE_GOSSIPD_GET_ADDRS_REPLY with 0 fds
2022-07-16T09:07:51.8673954Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: peer_out WIRE_GOSSIP_TIMESTAMP_FILTER
2022-07-16T09:07:51.8674298Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   hsmd: Client: Received message 1 from client
2022-07-16T09:07:51.8674811Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-gossipd: seeker: chosen as startup peer
2022-07-16T09:07:51.8675330Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Activating for message WIRE_ERROR
2022-07-16T09:07:51.8675825Z lightningd-2 2022-07-16T08:54:32.497Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-hsmd: Got WIRE_HSMD_ECDH_REQ
...
2022-07-16T09:07:51.9503144Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: FATAL SIGNAL 11 (version 6e6b41d)
2022-07-16T09:07:51.9503563Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: common/daemon.c:38 (send_backtrace) 0x5620882dbffb
2022-07-16T09:07:51.9503970Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: common/daemon.c:46 (crashdump) 0x5620882dc04d
2022-07-16T09:07:51.9504534Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0 ((null)) 0x7ffb3abdf08f
2022-07-16T09:07:51.9504973Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/peer_control.c:1378 (peer_spoke) 0x5620882aedd7
2022-07-16T09:07:51.9505418Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/connect_control.c:492 (connectd_msg) 0x56208828e9db
2022-07-16T09:07:51.9505835Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/subd.c:557 (sd_msg_read) 0x5620882bd89a
2022-07-16T09:07:51.9506236Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x5620883318d4
2022-07-16T09:07:51.9506618Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x562088331da1
2022-07-16T09:07:51.9507021Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x562088331e3e
2022-07-16T09:07:51.9507428Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ccan/ccan/io/poll.c:453 (io_loop) 0x5620883337d3
2022-07-16T09:07:51.9507945Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/io_loop_with_timers.c:22 (io_loop_with_timers) 0x5620882969f5
2022-07-16T09:07:51.9508368Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: lightningd/lightningd.c:1190 (main) 0x56208829a7bb
2022-07-16T09:07:51.9508804Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: ../csu/libc-start.c:308 (__libc_start_main) 0x7ffb3abc0082
2022-07-16T09:07:51.9509172Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x56208827d32d
2022-07-16T09:07:51.9509552Z lightningd-2 2022-07-16T08:54:32.737Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0xffffffffffffffff
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Disconnecting a peer after openingd fails is not instantaneous:
we abort the open, so openingd sends out a WIRE_ERROR which makes
connectd close the connection.

As a result this test fails often.  The simplest fix is to wait for a
second in multifundchannel before retrying, which is also robust
against behaviour changes if we decide *not* to disconnect in future.

Also make sure that addrhint ownership is correct, since this can
lead to a use-after-free if we filter dests.

```
tests/test_connection.py::test_multifunding_best_effort FAILED                                                    [100%]

======================================================= FAILURES ========================================================
_____________________________________________ test_multifunding_best_effort _____________________________________________

node_factory = <pyln.testing.utils.NodeFactory object at 0x7f6c0c95c1c0>
bitcoind = <pyln.testing.utils.BitcoinD object at 0x7f6c0c92a880>

    @pytest.mark.openchannel('v1')
    @pytest.mark.developer("disconnect=... needs DEVELOPER=1")
    def test_multifunding_best_effort(node_factory, bitcoind):
        '''
        Check that best_effort flag works.
        '''
        disconnects = ["-WIRE_INIT",
                       "-WIRE_ACCEPT_CHANNEL",
                       "-WIRE_FUNDING_SIGNED"]
        l1 = node_factory.get_node()
        l2 = node_factory.get_node()
        l3 = node_factory.get_node(disconnect=disconnects)
        l4 = node_factory.get_node()
    
        l1.fundwallet(2000000)
    
        destinations = [{"id": '{}@localhost:{}'.format(l2.info['id'], l2.port),
                         "amount": 50000},
                        {"id": '{}@localhost:{}'.format(l3.info['id'], l3.port),
                         "amount": 50000},
                        {"id": '{}@localhost:{}'.format(l4.info['id'], l4.port),
                         "amount": 50000}]
    
        for i, d in enumerate(disconnects):
            # Should succeed due to best-effort flag.
>           l1.rpc.multifundchannel(destinations, minchannels=2)

tests/test_connection.py:2070: 
...
>           raise RpcError(method, payload, resp['error'])
E           pyln.client.lightning.RpcError: RPC call failed: method: multifundchannel, payload: {'destinations': [{'id': '022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59@localhost:41023', 'amount': 50000}, {'id': '035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d@localhost:41977', 'amount': 50000}, {'id': '0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199@localhost:34943', 'amount': 50000}], 'minchannels': 2}, error: {'code': 305, 'message': 'Peer not connected at start', 'data': {'id': '0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199', 'method': 'fundchannel_start'}}
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…erfering redirects.

In various places, we assumed that when `connected` is false,
everything is finished.  This is not true: we should wait for the
state we expect.

In addition, various places allowed reconnections, which interfered
with the logic; suppress them.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…it on sending error/warning.

Connectd already does this when we *receive* an error or warning; now
it does so on send as well.  This causes a slight behavior change: we don't
disconnect when we close a channel, for example (our behaviour here
has been inconsistent across versions, depending on the code).

When connectd is told to disconnect, it now does so immediately, and
doesn't wait for subds to drain etc.  That simplifies the manual
disconnect case, which now cleans up as it would from any other
disconnection when connectd says it's disconnected.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
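
A rough sketch of the "close after sending an error or warning" idea. The message-type constants follow BOLT #1; the flag and function names are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WIRE_WARNING 1
#define WIRE_ERROR   17

struct out_conn {
    bool close_after_flush;     /* set once we queue an error or warning */
};

/* BOLT #1 messages start with a big-endian u16 type. */
static uint16_t wire_msg_type(const uint8_t *msg, size_t len)
{
    if (len < 2)
        return 0;
    return (uint16_t)((msg[0] << 8) | msg[1]);
}

static void queue_peer_msg(struct out_conn *conn, const uint8_t *msg, size_t len)
{
    if (wire_msg_type(msg, len) == WIRE_ERROR
        || wire_msg_type(msg, len) == WIRE_WARNING)
        conn->close_after_flush = true;

    /* ...the actual enqueue for the peer's write loop is omitted... */
}
```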
The only places which should call try_reconnect now are the "connect"
command, and the disconnect path when it decides there's still an
active channel.

This introduces one subtlety: if we disconnect when there's no active
channel, but then the subd makes one, we have to catch that case!

This temporarily reverts "slow" reconnections to fast ones: see next
patch.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Just keep a flag on the peer, and delay reconnection longer if that is set.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
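
A tiny sketch of the flag-based delay; the field name and the delay values are illustrative only:

```c
#include <stdbool.h>

struct reconnect_state {
    bool delay_reconnect;   /* set when the last disconnect was deliberate */
};

static unsigned int reconnect_delay_seconds(const struct reconnect_state *p)
{
    /* "Slow" reconnects wait much longer than the normal fast retry. */
    return p->delay_reconnect ? 60 : 1;
}
```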
Dualopend is not listening to the peer fd when it hangs up, so it doesn't
notice it's gone.  We don't clean up the channel until it's done (usually
a good thing: it could be about to lock it in), but this harms us
here.

Fix the test failure and add a comment.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We close the channel because the payment fulfilment is not totally done, and
we generate 6 blocks, causing it to hit the CLTV deadline.

```
        # send some payments, mine a block or two
        inv = l2.rpc.invoice(10**4, '1', 'no_1')
        l1.rpc.pay(inv['bolt11'])
    
        # l2 attempts to close a channel that it leased, should fail
        with pytest.raises(RpcError, match=r'Peer leased this channel from us'):
            l2.rpc.close(l1.get_channel_scid(l2))
    
        bitcoind.generate_block(6)
        sync_blockheight(bitcoind, [l1, l2])
        # make sure we're at the right place for the csv lock
>       l2.daemon.wait_for_log('Blockheight: SENT_ADD_ACK_COMMIT->RCVD_ADD_ACK_REVOCATION LOCAL now 115')

tests/test_closing.py:823: 
...
lightningd-2 2022-07-17T13:15:34.242Z DEBUG   lightningd: Adding block 115: 39d95061935e9fc42b04c86ae60d0cf157765aff4c040f3a8d0b7888db19e015
lightningd-2 2022-07-17T13:15:34.244Z UNUSUAL 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-chan#2: Peer permanent failure in CHANNELD_NORMAL: Fulfilled HTLC 0 SENT_REMOVE_COMMIT cltv 115 hit deadline
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Otherwise connectd complains we didn't close, e.g.
tests/test_connection.py::test_funding_cancel_race:

```

lightningd-1 2022-07-17T14:43:56.813Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: Send error reason: Cancel channel by our RPC command before funding transaction broadcast.
lightningd-1 2022-07-17T14:43:56.867Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: peer_out WIRE_ERROR
lightningd-2 2022-07-17T14:43:56.926Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-channeld-chan#1: peer_in WIRE_ERROR
lightningd-2 2022-07-17T14:43:56.951Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-channeld-chan#1: billboard perm: Received error channel d11396cbb8de10f02ee8d76ff6265bad0eefa7e43b4f540f14dfaab851aa3606: Cancel channel by our RPC command before funding transaction broadcast.
lightningd-1 2022-07-17T14:43:56.952Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-channeld-chan#1: Status closed, but not exited. Killing
lightningd-2 2022-07-17T14:43:56.976Z UNUSUAL 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-chan#1: Peer permanent failure in CHANNELD_AWAITING_LOCKIN: channeld: received ERROR error channel d11396cbb8de10f02ee8d76ff6265bad0eefa7e43b4f540f14dfaab851aa3606: Cancel channel by our RPC command before funding transaction broadcast., forget channel
DEBUG:root:Received response for fundchannel_cancel call: {'jsonrpc': '2.0', 'id': 1, 'result': {'cancelled': 'Channel open canceled by RPC(after fundchannel_complete)'}}
DEBUG:root:Received response for fundchannel_cancel call: {'jsonrpc': '2.0', 'id': 1, 'result': {'cancelled': 'Channel open canceled by RPC(after fundchannel_complete)'}}
DEBUG:root:{
  "id": 1,
  "result": {
    "cancelled": "Channel open canceled by RPC(after fundchannel_complete)"
  }
}
DEBUG:root:{
  "id": 1,
  "result": {
    "cancelled": "Channel open canceled by RPC(after fundchannel_complete)"
  }
}
DEBUG:root:{
  "id": 1,
  "method": "txdiscard",
  "params": {
    "txid": "0736aa51b8aadf140f544f3be4a7ef0ead5b26f66fd7e82ef010deb8cb9613d1"
  }
}
lightningd-1 2022-07-17T14:43:57.022Z DEBUG   connectd: drain_peer
DEBUG:root:Calling txdiscard with payload {'txid': '0736aa51b8aadf140f544f3be4a7ef0ead5b26f66fd7e82ef010deb8cb9613d1'}
lightningd-1 2022-07-17T14:43:57.024Z DEBUG   connectd: drain_peer draining subd!
lightningd-1 2022-07-17T14:43:57.069Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-lightningd: peer_disconnect_done
lightningd-1 2022-07-17T14:43:57.082Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-lightningd: Not reconnecting: no active channel
...
lightningd-1 2022-07-17T14:44:01.877Z **BROKEN** 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-connectd: Subd did not close, forcing close
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell rustyrussell force-pushed the guilt/connectd-cleanups branch from eb1d3c1 to 76b5fc8 Compare July 18, 2022 12:12
@niftynei (Contributor) commented:

ACK 76b5fc8

@niftynei niftynei merged commit 0363c62 into ElementsProject:master Jul 19, 2022