
lightningd: make shutdown safe again #5932

Conversation

SimonVrouwe
Collaborator

Another draft to demonstrate the issues with the current shutdown sequence mentioned in #5859 (zero response so far).
This is work in progress: it still contains debug code and will probably fail some tests.

But the tests added in 79e49b6 and 91ad510 demonstrate concrete examples.

TL;DR: If any plugin subscribes to the shutdown notification, you create a potential backdoor during shutdown which allows peers to connect, restart channels, open new channels etc. at a time when plugins (and their hooks) may already be gone and lightningd is half dead.

Please tell me I am dreaming and this is not an issue.

This is more convenient when debugging lightningd in a pyln-testing (pytest) setup.
Now you don't need to pause and run a separate `gdb -ex 'attach <lightningd_pid>'`.
…utdown sequence

 test_connected_hook_shutdown
 test_openchannel_shutdown
Now try to restart channeld while shutting down, via a reconnect.
See if it fixes these errors in /tests:

lightningd: FATAL SIGNAL 11 (version v22.11.1-147-g9a1fb18)
0x5639421e4c9c send_backtrace
        common/daemon.c:33
0x5639421e4d41 crashdump
        common/daemon.c:46
0x7fa866a68d5f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x5639422075f2 db_begin_transaction_
        db/exec.c:124
0x563942152088 sendrawtx_callback
        lightningd/bitcoind.c:295
0x5639421ad24e plugin_response_handle
        lightningd/plugin.c:590
0x5639421ad499 plugin_read_json_one
        lightningd/plugin.c:701
0x5639421ad679 plugin_read_json
        lightningd/plugin.c:746
0x56394229cb53 next_plan
        ccan/ccan/io/io.c:59
0x56394229d6fd do_plan
        ccan/ccan/io/io.c:407
0x56394229d73b io_ready
        ccan/ccan/io/io.c:417
0x56394229f951 io_loop
        ccan/ccan/io/poll.c:453
0x5639421b1c3f shutdown_plugins
        lightningd/plugin.c:2168
0x56394217ee76 main
        lightningd/lightningd.c:1280
0x7fa866a53d09 __libc_start_main
        ../csu/libc-start.c:308
0x5639421513c9 ???
        ???:0
0xffffffffffffffff ???
        ???:0
Log dumped in crash.log.20230126125518
Yep, again. This prevents every JSON-RPC callback (incl. db_write hook).
So we **should** be able to close the database earlier? <== Nope

Hopefully fixes some tests that were still broken:
ERROR tests/test_closing.py::test_closing_while_disconnected - ValueError:
ERROR tests/test_db.py::test_block_backfill - ValueError:
ERROR tests/test_plugin.py::test_channel_state_changed_bilateral - ValueError:

These crashed while handling a bcli response that needed gossipd, which was already dead:

lightningd: FATAL SIGNAL 11 (version v22.11.1-148-g727b2e5)
0x55bbc8a65c69 send_backtrace
        common/daemon.c:33
0x55bbc8a65d0e crashdump
        common/daemon.c:46
0x7fb01e740d5f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x55bbc8a392c5 subd_send_msg
        lightningd/subd.c:839
0x55bbc89f087e gossipd_notify_spends
        lightningd/gossip_control.c:281
0x55bbc89d6a80 topo_update_spends
        lightningd/chaintopology.c:690
0x55bbc89d6d0f add_tip
        lightningd/chaintopology.c:729
0x55bbc89d7245 get_new_block
        lightningd/chaintopology.c:823
0x55bbc89d3431 getrawblockbyheight_callback
        lightningd/bitcoind.c:383
0x55bbc8a2e24e plugin_response_handle
        lightningd/plugin.c:590
0x55bbc8a2e499 plugin_read_json_one
        lightningd/plugin.c:701
0x55bbc8a2e679 plugin_read_json
        lightningd/plugin.c:746
0x55bbc8b1db13 next_plan
        ccan/ccan/io/io.c:59
0x55bbc8b1e6bd do_plan
        ccan/ccan/io/io.c:407
0x55bbc8b1e6fb io_ready
        ccan/ccan/io/io.c:417
0x55bbc8b20911 io_loop
        ccan/ccan/io/poll.c:453
0x55bbc8a32c0c shutdown_plugins
        lightningd/plugin.c:2168
0x55bbc89ffe44 main
        lightningd/lightningd.c:1271
0x7fb01e72bd09 __libc_start_main
        ../csu/libc-start.c:308
0x55bbc89d23c9 ???
        ???:0
0xffffffffffffffff ???
        ???:0
…er closing db

In the second shutdown_plugins call, trigger EOF on plugin stdin.

This correctly handles corner cases in shutdown_plugins:
- in a non-dev build, built-in plugins don't subscribe to shutdown,
  but we still need a (brief) io_loop to notify interested db_write
  plugins
- only an awaited-for plugin should call io_break
- always send EOF to db_write plugins in the final call (subscribed or not)

This adds two helper functions:
 plugin_registered_db_write_hook
 jsonrpc_command_del

and a new plugin state: SHUTDOWN

Inspired by ElementsProject#4790; credit also to ZmnSCPxj for mentioning the EOF idea in ElementsProject#4785.
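
For illustration, here is a minimal sketch of what the first helper could look like, assuming lightningd's plugin_hook_by_name() lookup and its hook_instance list are reachable from plugin.c; names and details are assumptions, not necessarily the PR's actual code:

```
/* Sketch (assumed internals): has this plugin registered the db_write
 * hook?  Such plugins must be kept alive until the database is closed. */
static bool plugin_registered_db_write_hook(const struct plugin *plugin)
{
	const struct plugin_hook *hook = plugin_hook_by_name("db_write");

	if (!hook)
		return false;
	for (size_t i = 0; i < tal_count(hook->hooks); i++) {
		if (hook->hooks[i]->plugin == plugin)
			return true;
	}
	return false;
}
```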

Hopefully this fixes the "Nope" from the previous commit, i.e. these:
ERROR tests/test_invoices.py::test_amountless_invoice - ValueError:
ERROR tests/test_misc.py::test_htlc_out_timeout - ValueError:
ERROR tests/test_pay.py::test_sendpay_msatoshi_arg - ValueError:
ERROR tests/test_pay.py::test_pay_peer - ValueError:
ERROR tests/test_pay.py::test_listpays_with_filter_by_status - ValueError:
ERROR tests/test_pay.py::test_channel_receivable - ValueError:

which were still triggering db_write's:
lightningd: FATAL SIGNAL 11 (version v22.11.1-149-g93f1458-modded)
0x555e78c30d0f send_backtrace
        common/daemon.c:33
0x555e78c30db4 crashdump
        common/daemon.c:46
0x7fd20cc60d5f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x555e78c53665 db_begin_transaction_
        db/exec.c:124
0x555e78bc7557 read_json
        lightningd/jsonrpc.c:1090
0x555e78ce8bc3 next_plan
        ccan/ccan/io/io.c:59
0x555e78ce976d do_plan
        ccan/ccan/io/io.c:407
0x555e78ce97ab io_ready
        ccan/ccan/io/io.c:417
0x555e78ceb9c1 io_loop
        ccan/ccan/io/poll.c:453
0x555e78bfdcb2 shutdown_plugins
        lightningd/plugin.c:2179
0x555e78bcae76 main
        lightningd/lightningd.c:1281
0x7fd20cc4bd09 __libc_start_main
        ../csu/libc-start.c:308
0x555e78b9d3c9 ???
        ???:0
0xffffffffffffffff ???
        ???:0
Log dumped in crash.log.20230126160704

Not from plugin_response_handle!
What are they? It seems every parse_request() in read_json() is wrapped in a db transaction:
```
	if (!in_transaction) {
		db_begin_transaction(jcon->ld->wallet->db);
		in_transaction = true;
	}
	parse_request(jcon, jcon->input_toks);
```

But wasn't jsonrpc_stop_listening supposed to block new requests? <== Well, sort of: it only blocks new rpc connections.
But (surprise) the built-in plugins reuse their rpc connection and don't close it, so their requests go through!
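
One way to plug that hole would be to guard the transaction above with the LD_STATE_SHUTDOWN state this PR introduces; a rough sketch, not the actual patch:

```
	/* Sketch: once lightningd is shutting down, don't open a new db
	 * transaction for late requests arriving over the long-lived rpc
	 * connections the built-in plugins keep open. */
	if (jcon->ld->state != LD_STATE_SHUTDOWN) {
		if (!in_transaction) {
			db_begin_transaction(jcon->ld->wallet->db);
			in_transaction = true;
		}
		parse_request(jcon, jcon->input_toks);
	}
```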
…k to pass

Needed by db_write plugins that are kept alive.
Note that strcmp() returns 0 if strings are equal.
…ode -5

jsonrpc_stop_listening() alone isn't enough to block new requests, because
some (built-in) plugins always keep their rpc connection open.
@cdecker cdecker self-requested a review January 30, 2023 13:03
Member

@cdecker cdecker left a comment

Thanks @SimonVrouwe for all the efforts. This issue is definitely real, and especially for busier nodes this can mean the difference between a working backup and one that lags behind.

It is likely that even backups that would block a restart because of the off-by-one issue introduced by late transactions would still be recoverable (those last transactions should never be a new commitment, iirc), but they are really tedious and disconcerting when they block a restart.

Would it make sense to have an early abort for plugin returns and RPC entrypoints to return a shutdown error instead of starting a new DB transaction? That way we could have just one global that determines if we want to re-enter the DB or not, and not have to do all the special handling (yes, that'd also mean we abort the last TX we just sent to the plugin, but that's ok, the backup is always allowed to be 1 ahead).
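
A minimal sketch of that single-global idea, with hypothetical names (not an actual implementation):

```
#include <stdbool.h>

/* Sketch: one flag, set once when shutdown reaches the point of no
 * return, consulted everywhere we would otherwise open a new db
 * transaction. */
static bool db_closed_for_shutdown;

void db_mark_closed_for_shutdown(void)
{
	db_closed_for_shutdown = true;
}

/* Plugin-response handlers and RPC entrypoints check this and answer
 * with a "shutting down" error instead of calling db_begin_transaction. */
bool db_still_writable(void)
{
	return !db_closed_for_shutdown;
}
```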

```
 * this state but that is hard to proof, so we keep db_write plugins alive
 * longer and still allow db_write hook responses. */
if (plugin->plugins->ld->state == LD_STATE_SHUTDOWN &&
    strcmp(request->method, "db_write") != 0) {
```
We have a tiny helper: streq
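
With streq (from ccan/str/str.h), the check above would read:

```
	if (plugin->plugins->ld->state == LD_STATE_SHUTDOWN &&
	    !streq(request->method, "db_write")) {
```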

@rustyrussell
Contributor

Well, for a start, as soon as we're shutting down, we should tell connectd not to allow new connections, or send us connectd_peer_spoke messages.

That's minimal and catches the obvious problems?

Here's my attempt:

https://github.com/rustyrussell/lightning/tree/guilt/stop-new-connections-on-shutdown
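
Roughly, that minimal approach boils down to one message to connectd when shutdown starts. A sketch, where subd_send_msg is the existing helper (visible in the backtraces above) and the towire_connectd_start_shutdown wrapper name is an assumption:

```
	/* Sketch: as soon as we enter shutdown, tell connectd to stop
	 * accepting new connections and to stop sending us
	 * connectd_peer_spoke for inbound peers. */
	subd_send_msg(ld->connectd,
		      take(towire_connectd_start_shutdown(NULL)));
```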

@rustyrussell
Contributor

I think this was addressed by #5951
