lightningd: make shutdown safe again #5932
Conversation
This is more convenient when debugging lightningd in a pyln-testing (pytest) setup: you no longer need to pause and run a separate `gdb -ex 'attach <lightningd_pid>'`.
…utdown sequence test_connected_hook_shutdown test_openchannel_shutdown
Now try restarting channeld while shutting down, via a reconnect
…en shutdown plugins
See if it fixes these errors in /tests:

```
lightningd: FATAL SIGNAL 11 (version v22.11.1-147-g9a1fb18)
0x5639421e4c9c send_backtrace common/daemon.c:33
0x5639421e4d41 crashdump common/daemon.c:46
0x7fa866a68d5f ??? ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x5639422075f2 db_begin_transaction_ db/exec.c:124
0x563942152088 sendrawtx_callback lightningd/bitcoind.c:295
0x5639421ad24e plugin_response_handle lightningd/plugin.c:590
0x5639421ad499 plugin_read_json_one lightningd/plugin.c:701
0x5639421ad679 plugin_read_json lightningd/plugin.c:746
0x56394229cb53 next_plan ccan/ccan/io/io.c:59
0x56394229d6fd do_plan ccan/ccan/io/io.c:407
0x56394229d73b io_ready ccan/ccan/io/io.c:417
0x56394229f951 io_loop ccan/ccan/io/poll.c:453
0x5639421b1c3f shutdown_plugins lightningd/plugin.c:2168
0x56394217ee76 main lightningd/lightningd.c:1280
0x7fa866a53d09 __libc_start_main ../csu/libc-start.c:308
0x5639421513c9 ??? ???:0
0xffffffffffffffff ??? ???:0
Log dumped in crash.log.20230126125518
```
Yep, again. This prevents every JSON-RPC callback (incl. the db_write hook). So we **should** be able to close the database earlier? <== Nope

Hopefully fixes some tests that were still broken:

```
ERROR tests/test_closing.py::test_closing_while_disconnected - ValueError:
ERROR tests/test_db.py::test_block_backfill - ValueError:
ERROR tests/test_plugin.py::test_channel_state_changed_bilateral - ValueError:
```

which crashed while handling a bcli response that needed gossipd, which was already dead:

```
lightningd: FATAL SIGNAL 11 (version v22.11.1-148-g727b2e5)
0x55bbc8a65c69 send_backtrace common/daemon.c:33
0x55bbc8a65d0e crashdump common/daemon.c:46
0x7fb01e740d5f ??? ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x55bbc8a392c5 subd_send_msg lightningd/subd.c:839
0x55bbc89f087e gossipd_notify_spends lightningd/gossip_control.c:281
0x55bbc89d6a80 topo_update_spends lightningd/chaintopology.c:690
0x55bbc89d6d0f add_tip lightningd/chaintopology.c:729
0x55bbc89d7245 get_new_block lightningd/chaintopology.c:823
0x55bbc89d3431 getrawblockbyheight_callback lightningd/bitcoind.c:383
0x55bbc8a2e24e plugin_response_handle lightningd/plugin.c:590
0x55bbc8a2e499 plugin_read_json_one lightningd/plugin.c:701
0x55bbc8a2e679 plugin_read_json lightningd/plugin.c:746
0x55bbc8b1db13 next_plan ccan/ccan/io/io.c:59
0x55bbc8b1e6bd do_plan ccan/ccan/io/io.c:407
0x55bbc8b1e6fb io_ready ccan/ccan/io/io.c:417
0x55bbc8b20911 io_loop ccan/ccan/io/poll.c:453
0x55bbc8a32c0c shutdown_plugins lightningd/plugin.c:2168
0x55bbc89ffe44 main lightningd/lightningd.c:1271
0x7fb01e72bd09 __libc_start_main ../csu/libc-start.c:308
0x55bbc89d23c9 ??? ???:0
0xffffffffffffffff ??? ???:0
```
…er closing db in second shutdown_plugin call, trigger EOF in plugin stdin.

This correctly handles corner cases in shutdown_plugins:
- in a non-dev build, builtin plugins don't subscribe to shutdown, but we still need a (brief) io_loop to notify interested db_write plugins
- only an awaited-for plugin should call io_break
- always send EOF to db_write plugins in the final call (subscribed or not)

This adds two helper functions, plugin_registered_db_write_hook and jsonrpc_command_del, and a new plugin state: SHUTDOWN. Inspired by ElementsProject#4790, and credit to ZmnSCPxj for mentioning the EOF idea in ElementsProject#4785 (a standalone sketch of that mechanism follows below).

Hopefully this fixes the "Nope" from the previous commit, i.e. these:

```
ERROR tests/test_invoices.py::test_amountless_invoice - ValueError:
ERROR tests/test_misc.py::test_htlc_out_timeout - ValueError:
ERROR tests/test_pay.py::test_sendpay_msatoshi_arg - ValueError:
ERROR tests/test_pay.py::test_pay_peer - ValueError:
ERROR tests/test_pay.py::test_listpays_with_filter_by_status - ValueError:
ERROR tests/test_pay.py::test_channel_receivable - ValueError:
```

which were still triggering db_write's:

```
lightningd: FATAL SIGNAL 11 (version v22.11.1-149-g93f1458-modded)
0x555e78c30d0f send_backtrace common/daemon.c:33
0x555e78c30db4 crashdump common/daemon.c:46
0x7fd20cc60d5f ??? ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x555e78c53665 db_begin_transaction_ db/exec.c:124
0x555e78bc7557 read_json lightningd/jsonrpc.c:1090
0x555e78ce8bc3 next_plan ccan/ccan/io/io.c:59
0x555e78ce976d do_plan ccan/ccan/io/io.c:407
0x555e78ce97ab io_ready ccan/ccan/io/io.c:417
0x555e78ceb9c1 io_loop ccan/ccan/io/poll.c:453
0x555e78bfdcb2 shutdown_plugins lightningd/plugin.c:2179
0x555e78bcae76 main lightningd/lightningd.c:1281
0x7fd20cc4bd09 __libc_start_main ../csu/libc-start.c:308
0x555e78b9d3c9 ??? ???:0
0xffffffffffffffff ??? ???:0
Log dumped in crash.log.20230126160704
```

Not from plugin_response_handle! So what are they? It seems every parse_request() in read_json() is wrapped in a db transaction:

```
if (!in_transaction) {
	db_begin_transaction(jcon->ld->wallet->db);
	in_transaction = true;
}
parse_request(jcon, jcon->input_toks);
```

But wasn't jsonrpc_stop_listening supposed to block new requests? <== Well, sort of: it only blocks new RPC connections. But (surprise) the builtin plugins reuse their RPC connection and never close it, so their requests go through!
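The EOF mechanism itself is easy to demonstrate in isolation. Here is a minimal, self-contained sketch in plain POSIX C — not lightningd's actual plugin plumbing, all setup hypothetical: the parent closes the write end of the child's stdin pipe, and the child (standing in for a db_write plugin) sees EOF and exits cleanly on its own.

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int fds[2];
	pid_t pid;

	if (pipe(fds) != 0) {
		perror("pipe");
		return 1;
	}
	pid = fork();
	if (pid == 0) {
		/* Child: stands in for a db_write plugin reading requests
		 * from its stdin. */
		char buf[64];
		ssize_t n;
		close(fds[1]);
		while ((n = read(fds[0], buf, sizeof(buf))) > 0)
			; /* ...handle db_write requests here... */
		/* n == 0 means EOF: the parent closed our stdin. */
		fprintf(stderr, "plugin: got EOF on stdin, exiting cleanly\n");
		exit(0);
	}
	/* Parent: deliver one last request, then signal shutdown via EOF. */
	close(fds[0]);
	write(fds[1], "final db_write\n", strlen("final db_write\n"));
	close(fds[1]); /* <-- this close is the EOF signal to the plugin */
	waitpid(pid, NULL, 0);
	return 0;
}
```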
…k to pass. Needed by db_write plugins that are kept alive. Note that strcmp() returns 0 when the strings are equal.
…ode -5. jsonrpc_stop_listening() alone isn't enough to block new requests, because some (builtin) plugins always keep their RPC connection open.
Thanks @SimonVrouwe for all the efforts. This issue is definitely real, and especially for busier nodes this can mean the difference between a working backup and one that lags behind.
It is likely that even backups that'd block a restart because of the off-by-one issue introduced by late transactions would still be recoverable (those last transactions should never be a new commitment, iirc), but they are really tedious and disconcerting when they block a restart.
Would it make sense to have an early abort for plugin returns and RPC entrypoints to return a shutdown error instead of starting a new DB transaction? That way we could have just one global that determines if we want to re-enter the DB or not, and not have to do all the special handling (yes, that'd also mean we abort the last TX we just sent to the plugin, but that's ok, the backup is always allowed to be 1 ahead).
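A minimal sketch of that suggestion — all names hypothetical, not the actual lightningd API: a single global flag, flipped once when shutdown begins and checked at every entrypoint before a DB transaction is opened, returning a shutdown error instead.

```
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical single global: set once when shutdown begins. */
static bool shutting_down;

/* Checked at every RPC entrypoint and plugin-response handler *before*
 * db_begin_transaction(), instead of special-casing each code path. */
static bool refuse_request(const char *method)
{
	if (shutting_down) {
		printf("{\"error\": \"lightningd is shutting down\","
		       " \"method\": \"%s\"}\n", method);
		return true;
	}
	return false;
}

int main(void)
{
	shutting_down = true;
	if (!refuse_request("sendrawtx"))
		printf("would db_begin_transaction() and handle sendrawtx\n");
	return 0;
}
```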
```
 * this state but that is hard to prove, so we keep db_write plugins alive
 * longer and still allow db_write hook responses. */
if (plugin->plugins->ld->state == LD_STATE_SHUTDOWN &&
    strcmp(request->method, "db_write") != 0) {
```
We have a tiny helper: streq
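For reference, ccan's streq() is just an equality wrapper around strcmp(), so the quoted condition could presumably be written with !streq() instead — a sketch, assuming the surrounding fields exist as quoted:

```
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* ccan/str defines streq() essentially as: */
static inline bool streq(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}

int main(void)
{
	/* The quoted condition would then read:
	 *   if (plugin->plugins->ld->state == LD_STATE_SHUTDOWN &&
	 *       !streq(request->method, "db_write")) {
	 */
	printf("%d\n", !streq("getinfo", "db_write")); /* prints 1 */
	return 0;
}
```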
Well, for a start, as soon as we're shutting down, we should tell connectd not to allow new connections, or send us connectd_peer_spoke messages. That's minimal and catches the obvious problems? Here's my attempt: https://github.com/rustyrussell/lightning/tree/guilt/stop-new-connections-on-shutdown
I think this was addressed by #5951
Another draft to demonstrate the issues with the current shutdown sequence mentioned in #5859 (zero response so far)
This is work-in-progress, still contains debug code and will probably fail some tests.
But the tests added in 79e49b6 and 91ad510 demonstrate concrete examples.
TL;DR: If any plugin subscribes to shutdown, you create a potential backdoor during shutdown which allows peers to connect, restart channels, open new channels etc., at a time when plugins (+ their hooks) may already be gone and `lightningd` is half dead. Please tell me I am dreaming and this is not an issue.