
lightningd: make shutdown safe again #5932

Conversation

SimonVrouwe
Collaborator

Another draft to demonstrate the issues with the current shutdown sequence mentioned in #5859 (zero response so far).
This is work in progress: it still contains debug code and will probably fail some tests.

But the tests added in 79e49b6 and 91ad510 demonstrate concrete examples.

TL;DR: If any plugin subscribes to the shutdown notification, you create a potential backdoor during shutdown which allows peers to connect, restart channels, open new channels etc. at a time when plugins (and their hooks) may already be gone and lightningd is half dead.

Please tell me I am dreaming and this is not an issue.

This is more convenient when debugging lightningd in a pyln-testing (pytest) setup.
Now you don't need to pause and run a separate `gdb -ex 'attach <lightningd_pid>'`.
…utdown sequence

 test_connected_hook_shutdown
 test_openchannel_shutdown
Now try to restart channeld while shutting down, via a reconnect.
See if it fixes these errors in /tests:

lightningd: FATAL SIGNAL 11 (version v22.11.1-147-g9a1fb18)
0x5639421e4c9c send_backtrace
        common/daemon.c:33
0x5639421e4d41 crashdump
        common/daemon.c:46
0x7fa866a68d5f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x5639422075f2 db_begin_transaction_
        db/exec.c:124
0x563942152088 sendrawtx_callback
        lightningd/bitcoind.c:295
0x5639421ad24e plugin_response_handle
        lightningd/plugin.c:590
0x5639421ad499 plugin_read_json_one
        lightningd/plugin.c:701
0x5639421ad679 plugin_read_json
        lightningd/plugin.c:746
0x56394229cb53 next_plan
        ccan/ccan/io/io.c:59
0x56394229d6fd do_plan
        ccan/ccan/io/io.c:407
0x56394229d73b io_ready
        ccan/ccan/io/io.c:417
0x56394229f951 io_loop
        ccan/ccan/io/poll.c:453
0x5639421b1c3f shutdown_plugins
        lightningd/plugin.c:2168
0x56394217ee76 main
        lightningd/lightningd.c:1280
0x7fa866a53d09 __libc_start_main
        ../csu/libc-start.c:308
0x5639421513c9 ???
        ???:0
0xffffffffffffffff ???
        ???:0
Log dumped in crash.log.20230126125518
Yep, again. This prevents every JSON-RPC callback (incl. db_write hook).
So we **should** be able to close the database earlier? <== Nope

Hopefully fixes some tests that were still broken:
ERROR tests/test_closing.py::test_closing_while_disconnected - ValueError:
ERROR tests/test_db.py::test_block_backfill - ValueError:
ERROR tests/test_plugin.py::test_channel_state_changed_bilateral - ValueError:

These crashed while handling a bcli response that needed gossipd, which was already dead:

lightningd: FATAL SIGNAL 11 (version v22.11.1-148-g727b2e5)
0x55bbc8a65c69 send_backtrace
        common/daemon.c:33
0x55bbc8a65d0e crashdump
        common/daemon.c:46
0x7fb01e740d5f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x55bbc8a392c5 subd_send_msg
        lightningd/subd.c:839
0x55bbc89f087e gossipd_notify_spends
        lightningd/gossip_control.c:281
0x55bbc89d6a80 topo_update_spends
        lightningd/chaintopology.c:690
0x55bbc89d6d0f add_tip
        lightningd/chaintopology.c:729
0x55bbc89d7245 get_new_block
        lightningd/chaintopology.c:823
0x55bbc89d3431 getrawblockbyheight_callback
        lightningd/bitcoind.c:383
0x55bbc8a2e24e plugin_response_handle
        lightningd/plugin.c:590
0x55bbc8a2e499 plugin_read_json_one
        lightningd/plugin.c:701
0x55bbc8a2e679 plugin_read_json
        lightningd/plugin.c:746
0x55bbc8b1db13 next_plan
        ccan/ccan/io/io.c:59
0x55bbc8b1e6bd do_plan
        ccan/ccan/io/io.c:407
0x55bbc8b1e6fb io_ready
        ccan/ccan/io/io.c:417
0x55bbc8b20911 io_loop
        ccan/ccan/io/poll.c:453
0x55bbc8a32c0c shutdown_plugins
        lightningd/plugin.c:2168
0x55bbc89ffe44 main
        lightningd/lightningd.c:1271
0x7fb01e72bd09 __libc_start_main
        ../csu/libc-start.c:308
0x55bbc89d23c9 ???
        ???:0
0xffffffffffffffff ???
        ???:0
…er closing db

In the second shutdown_plugins call, trigger EOF on plugin stdin.

This correctly handles corner cases in shutdown_plugins:
- in a non-dev build, built-in plugins don't subscribe to shutdown,
  but we still need a (brief) io_loop to notify interested db_write
  plugins
- only an awaited-for plugin should call io_break
- always send EOF to db_write plugins in the final call (subscribed or not)

This adds two helper functions:
 plugin_registered_db_write_hook
 jsonrpc_command_del

and a new plugin state: SHUTDOWN

Inspired by ElementsProject#4790; credit also to ZmnSCPxj for mentioning the EOF idea in ElementsProject#4785.
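
For illustration, here is a minimal sketch of what the first helper could look like, assuming lightningd's plugin_hook_by_name() lookup and its hook_instance list are reachable from plugin.c; names and details are assumptions, not necessarily the PR's actual code:

```
/* Sketch (assumed internals): has this plugin registered the db_write
 * hook?  Such plugins must be kept alive until the database is closed. */
static bool plugin_registered_db_write_hook(const struct plugin *plugin)
{
	const struct plugin_hook *hook = plugin_hook_by_name("db_write");

	if (!hook)
		return false;
	for (size_t i = 0; i < tal_count(hook->hooks); i++) {
		if (hook->hooks[i]->plugin == plugin)
			return true;
	}
	return false;
}
```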

Hopefully this fixes the "Nope" from the previous commit, i.e. these:
ERROR tests/test_invoices.py::test_amountless_invoice - ValueError:
ERROR tests/test_misc.py::test_htlc_out_timeout - ValueError:
ERROR tests/test_pay.py::test_sendpay_msatoshi_arg - ValueError:
ERROR tests/test_pay.py::test_pay_peer - ValueError:
ERROR tests/test_pay.py::test_listpays_with_filter_by_status - ValueError:
ERROR tests/test_pay.py::test_channel_receivable - ValueError:

which were still triggering db_write's:
lightningd: FATAL SIGNAL 11 (version v22.11.1-149-g93f1458-modded)
0x555e78c30d0f send_backtrace
        common/daemon.c:33
0x555e78c30db4 crashdump
        common/daemon.c:46
0x7fd20cc60d5f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x555e78c53665 db_begin_transaction_
        db/exec.c:124
0x555e78bc7557 read_json
        lightningd/jsonrpc.c:1090
0x555e78ce8bc3 next_plan
        ccan/ccan/io/io.c:59
0x555e78ce976d do_plan
        ccan/ccan/io/io.c:407
0x555e78ce97ab io_ready
        ccan/ccan/io/io.c:417
0x555e78ceb9c1 io_loop
        ccan/ccan/io/poll.c:453
0x555e78bfdcb2 shutdown_plugins
        lightningd/plugin.c:2179
0x555e78bcae76 main
        lightningd/lightningd.c:1281
0x7fd20cc4bd09 __libc_start_main
        ../csu/libc-start.c:308
0x555e78b9d3c9 ???
        ???:0
0xffffffffffffffff ???
        ???:0
Log dumped in crash.log.20230126160704

Not from plugin_response_handle!
What are they? It seems every parse_request() in read_json() is wrapped in a db transaction:
```
	if (!in_transaction) {
		db_begin_transaction(jcon->ld->wallet->db);
		in_transaction = true;
	}
	parse_request(jcon, jcon->input_toks);
```

But wasn't jsonrpc_stop_listening supposed to block new requests? <== Well, sort of: it only blocks new rpc connections.
But (surprise) the built-in plugins reuse their rpc connection and don't close it, so their requests go through!
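
One way to plug that hole would be to guard the transaction above with the LD_STATE_SHUTDOWN state this PR introduces; a rough sketch, not the actual patch:

```
	/* Sketch: once lightningd is shutting down, don't open a new db
	 * transaction for late requests arriving over the long-lived rpc
	 * connections the built-in plugins keep open. */
	if (jcon->ld->state != LD_STATE_SHUTDOWN) {
		if (!in_transaction) {
			db_begin_transaction(jcon->ld->wallet->db);
			in_transaction = true;
		}
		parse_request(jcon, jcon->input_toks);
	}
```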
…k to pass

Needed by db_write plugins that are kept alive.
Note that strcmp() returns 0 if strings are equal.
…ode -5

jsonrpc_stop_listening() alone isn't enough to block new requests, because
some (built-in) plugins always keep their rpc connection open.
@cdecker cdecker self-requested a review January 30, 2023 13:03
Member

@cdecker cdecker left a comment

Thanks @SimonVrouwe for all the efforts. This issue is definitely real, and especially for busier nodes this can mean the difference between a working backup and one that lags behind.

It is likely that even backups that would block a restart because of the off-by-one issue introduced by late transactions would still be recoverable (those last transactions should never be a new commitment, iirc), but they are really tedious and disconcerting when they block a restart.

Would it make sense to have an early abort for plugin returns and RPC entrypoints to return a shutdown error instead of starting a new DB transaction? That way we could have just one global that determines if we want to re-enter the DB or not, and not have to do all the special handling (yes, that'd also mean we abort the last TX we just sent to the plugin, but that's ok, the backup is always allowed to be 1 ahead).
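
A minimal sketch of that single-global idea, with hypothetical names (not an actual implementation):

```
#include <stdbool.h>

/* Sketch: one flag, set once when shutdown reaches the point of no
 * return, consulted everywhere we would otherwise open a new db
 * transaction. */
static bool db_closed_for_shutdown;

void db_mark_closed_for_shutdown(void)
{
	db_closed_for_shutdown = true;
}

/* Plugin-response handlers and RPC entrypoints check this and answer
 * with a "shutting down" error instead of calling db_begin_transaction. */
bool db_still_writable(void)
{
	return !db_closed_for_shutdown;
}
```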

```
 * this state but that is hard to proof, so we keep db_write plugins alive
 * longer and still allow db_write hook responses. */
if (plugin->plugins->ld->state == LD_STATE_SHUTDOWN &&
    strcmp(request->method, "db_write") != 0) {
```
We have a tiny helper: streq
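
With streq (from ccan/str/str.h), the check above would read:

```
	if (plugin->plugins->ld->state == LD_STATE_SHUTDOWN &&
	    !streq(request->method, "db_write")) {
```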

@rustyrussell
Contributor

Well, for a start, as soon as we're shutting down, we should tell connectd not to allow new connections, or send us connectd_peer_spoke messages.

That's minimal and catches the obvious problems?

Here's my attempt:

https://github.com/rustyrussell/lightning/tree/guilt/stop-new-connections-on-shutdown
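
Roughly, that minimal approach boils down to one message to connectd when shutdown starts. A sketch, where subd_send_msg is the existing helper (visible in the backtraces above) and the towire_connectd_start_shutdown wrapper name is an assumption:

```
	/* Sketch: as soon as we enter shutdown, tell connectd to stop
	 * accepting new connections and to stop sending us
	 * connectd_peer_spoke for inbound peers. */
	subd_send_msg(ld->connectd,
		      take(towire_connectd_start_shutdown(NULL)));
```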

@rustyrussell
Contributor

I think this was addressed by #5951
