Security: Stop RocksDB threads calling unexpected code when `zebrad` exits #3133

teor2345 · 2021-12-02T21:25:17Z

Description

When I terminate zebrad start using Ctrl-C (SIGINT) on Linux, I get pthread errors.

I can reliably generate these errors during startup and the initial peer connections. But they seem much less frequent after initialization.

This is a serious security risk, because calling pure virtual methods is unexpected code execution. So compilers could optimise it into something dangerous, or something that could be exploited. (That doesn't happen in this case - instead, it prints a warning. But there are no guarantees.)

Initial Analysis

We recently made a few changes to Zebra's startup and shutdown behaviour (#3057, #3071, #3078, #3091).

Some of these errors seem to come from C++ code ("pure virtual method called"), so it's probably a combination of:

rocksdb, and
the quick shutdown change from Re-order Zebra startup, so slow services are launched last #3091: https://github.com/ZcashFoundation/zebra/blame/68d7198e9ff269c6329c03ed9fec187660167780/zebrad/src/components/tokio.rs#L62-L63

PR #3091 also stopped zebrad deleting ephemeral state directories. We should fix that issue as part of this bug fix.

Suggested Design

Precautions:

upgrade to the latest version of RocksDB - Use the latest bindgen version in zcash_script and rocksdb #2981

Shut down in this order:

wait for RocksDB to finish, by calling CancelAllBackgroundWork() when Zebra wants to exit: https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ ,
drop the database: https://github.com/facebook/rocksdb/wiki/Known-Issues ,
delete the database files on disk, if it is ephemeral,
exit Zebra

We can do this via a shutdown method in the state service and finalized state, which is called:

when the state is dropped,
when the state reaches the debug_stop_at_height, or
using a Shutdown request to the state, in the start and copy-state (both state services) commands.

After we've waited for shutdown to finish, it is safe to exit Zebra.

Other state requests could be processed while we are exiting Zebra:

return errors for any requests and readiness checks after shutdown is called

Rejected Alternatives

Rust's sys::at_exit handler has been removed:
rust-lang/rust#84115

There's no guarantee that libc::atexit runs before RocksDB's static destructors, just like there's no guarantee of the exit order between Rust and C:
https://stackoverflow.com/questions/35980148/why-does-an-atexit-handler-panic-when-it-accesses-stdout

Steps to Reproduce

I tried this:

Run zebrad start
Press Ctrl-C (SIGINT)

zebrad start &
kill -SIGINT $!

Expected Behaviour

I expected to see this happen:

zebrad terminates quickly, without errors.

Actual Behaviour

Instead, this happened:

zebrad terminates quickly, with errors like:

pthread lock: Invalid argument                                                                                                                                                                 
Aborted (core dumped)

pure virtual method called
terminate called without an active exception
Aborted (core dumped)

pthread destroy mutex: Device or resource busy
Aborted (core dumped)

Zebra Logs

$ /home/dev/.cargo/bin/zebrad start                                                                                                                                         
Dec 03 07:04:11.132  INFO {zebrad="022808d" net="Main"}: zebrad::commands::start: Starting zebrad                                                                                              
Dec 03 07:04:11.132  INFO {zebrad="022808d" net="Main"}: zebrad::commands::start: config=ZebradConfig { consensus: Config { checkpoint_sync: false }, metrics: MetricsSection { endpoint_addr: 
None }, network: Config { listen_addr: 0.0.0.0:8233, network: Mainnet, initial_mainnet_peers: {"dnsseed.z.cash:8233", "dnsseed.str4d.xyz:8233", "mainnet.seeder.zfnd.org:8233", "mainnet.is.yol
o.money:8233"}, initial_testnet_peers: {"testnet.seeder.zfnd.org:18233", "testnet.is.yolo.money:18233", "dnsseed.testnet.z.cash:18233"}, peerset_initial_target_size: 50, crawl_new_peer_interv
al: 60s }, state: Config { cache_dir: "/home/dev/.cache/zebra", ephemeral: false, debug_stop_at_height: None }, tracing: TracingSection { use_color: true, filter: None, endpoint_addr: None, f
lamegraph: None, use_journald: false }, sync: SyncSection { max_concurrent_block_requests: 50, lookahead_limit: 2000 }, mempool: Config { debug_enable_at_height: None, tx_cost_limit: 80000000
, eviction_memory_time: 3600s } }                                                                                                                                                              
Dec 03 07:04:11.132  INFO {zebrad="022808d" net="Main"}: zebrad::commands::start: initializing node state                                                                                      
Dec 03 07:04:11.132  INFO {zebrad="022808d" net="Main"}: zebra_state::config: the open file limit is at or above the specified limit new_limit=1024 current_limit=1024 hard_rlimit=Some(524288)
Dec 03 07:04:11.144  INFO {zebrad="022808d" net="Main"}: zebra_state::service::finalized_state: Opened Zebra state cache at /home/dev/.cache/zebra/state/v11/mainnet                           
Dec 03 07:04:11.144  INFO {zebrad="022808d" net="Main"}: zebra_state::service::finalized_state: reading cached tip height                                                                      
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}: zebra_state::service::finalized_state: loaded Zebra state cache tip=None                                                              
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}: zebra_state::service: starting legacy chain check                                                                                     
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}: zebra_state::service: no legacy chain found                                                                                           
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}: zebrad::commands::start: initializing network                                                                                         
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}:open_listener{addr=0.0.0.0:8233}: zebra_network::peer_set::initialize: Trying to open Zcash protocol endpoint at 0.0.0.0:8233...       
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}:open_listener{addr=0.0.0.0:8233}: zebra_network::peer_set::initialize: Opened Zcash protocol endpoint at 0.0.0.0:8233        
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}:add_initial_peers: zebra_network::config: resolved seed peer IP addresses seed="dnsseed.z.cash:8233" remote_ip_count=25
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}:add_initial_peers: zebra_network::config: resolved seed peer IP addresses seed="mainnet.seeder.zfnd.org:8233" remote_ip_count=25
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}:add_initial_peers: zebra_network::config: resolved seed peer IP addresses seed="mainnet.is.yolo.money:8233" remote_ip_count=50
^CDec 03 07:04:12.083  INFO {zebrad="022808d" net="Main"}:sig{kind=SignalKind(2) name="SIGINT"}: zebrad::signal: received SIGINT, starting shutdown
pthread destroy mutex: Device or resource busy
Aborted (core dumped)

Environment

Zebra Version

$ zebrad version
zebrad 1.0.0-beta.2

Operating System

$ uname -a
Linux ... 5.4.122 #1-NixOS SMP Wed May 26 10:05:21 UTC 2021 x86_64 GNU/Linux

The text was updated successfully, but these errors were encountered:

mpguerra · 2021-12-15T13:37:11Z

Hey team! Please add your planning poker estimate with ZenHub @conradoplg @dconnolly @jvff @oxarbitrage @teor2345 @upbqdn

oxarbitrage · 2022-01-22T15:19:17Z

It seems a lot of things are going on here and i spent quite some good amount of time trying several alternatives.

Warnings as pthread lock: Invalid argument, pure virtual method called can be fully eliminated if we allow the runtime to shutdown. We can do this by just removing this lines of code: https://github.com/ZcashFoundation/zebra/blob/v1.0.0-beta.3/zebrad/src/components/tokio.rs#L62-L63
Using the runtime will call the destructors that will allow us to add self.db.cancel_all_background_work(true|false) and a database drop into https://github.com/ZcashFoundation/zebra/blob/v1.0.0-beta.3/zebra-state/src/service/finalized_state.rs#L822. This should result in a maybe faster and cleaner shutdown of the database.

The downside of this are mainly:

Zebra can still display zebra specific errors like buffer's worker closed unexpectedly before shutting down but this are not crashes (we could ignore these or workaround them, they are more manageable as they are in our side).
The runtime shutdown will not work properly (gets stuck) if Zebra is downloading the parameters.
The runtime shutdown can take a few seconds to shutdown depending on what zebra is doing.

To avoid the last 2 points the non runtime shutdown( https://github.com/ZcashFoundation/zebra/blob/v1.0.0-beta.3/zebrad/src/components/tokio.rs#L62-L63) was introduced.

However as this introduces the c++ errors the suggested design was to keep the non runtime shutdown but close the database properly from the application hoping to get rid of them. There were a few options proposed by this, i personally created a service request into the state that will close the database and i was able to call it at shutdown.

The result was a clean database cleanup without using the runtime destructor but unfortunatly the C++ issues at shutdown do not disapear. So there is something else that causes them, it could be still the rocksdb database but i am not so sure.

Another approach to try could be to lunch all the tasks in a thread pool (https://doc.rust-lang.org/book/ch20-03-graceful-shutdown-and-cleanup.html) and try to `abort()`` them at shutdown, this is while we don't use the runtime shutdown. The result of doing that is uncertain so i am not pushing forward for it.

In my opinion here is what we can do:

Use the runtime shutdown that will call destructors.
Add proper database close to FinalizedState destructor.
Download parameters:
- Do not do it from zebrad start but from zebrad download. zcashd do it in a separated command (https://github.com/zcash/zcash/blob/master/zcutil/fetch-params.sh). I understand this may need changes to some CI.
- Download parameters in zebra start before anything else and not in parallel. By not having other tasks around while downloading could make the shutdown more manageable.

oxarbitrage · 2022-01-22T16:15:33Z

There is another case happening here that seems to be independent on if we are shutting down zebra using the runtime or not. This is a hang, when ctrl-c is received zebra will not do anything else but will not exit either:

...
Jan 22 12:54:47.949  INFO sync:extend_tips:push_extend_tips_length{sync_length=498}: zebrad::components::sync::recent_sync_lengths: sending recent sync lengths recent_lengths=[498, 498, 498, 498]
Jan 22 12:54:51.033  INFO sync:extend_tips:checkpoint: zebra_consensus::checkpoint: verified checkpoint range block_count=400 current_range=(Excluded(Height(124873)), Included(Height(125273)))
^C^C
^C^C^C^C^C^C^C^C^C^C
... time passes but no output until i send SIGKILL from outside ...
Killed
$

teor2345 · 2022-01-23T22:49:30Z

Download parameters:

Do not do it from zebrad start but from zebrad download. zcashd do it in a separated command (https://github.com/zcash/zcash/blob/master/zcutil/fetch-params.sh). I understand this may need changes to some CI.

Download parameters in zebra start before anything else and not in parallel. By not having other tasks around while downloading could make the shutdown more manageable.

I'm not sure if this change is needed.

Have you tried:
https://docs.rs/tokio/latest/tokio/runtime/struct.Runtime.html#method.shutdown_timeout

If you call it here, the runtime should stop waiting for all tasks to finish:
https://github.com/ZcashFoundation/zebra/blob/v1.0.0-beta.3/zebrad/src/components/tokio.rs#L59

~~We might also need to fix #3378.~~ Done!

These issues aren't scheduled until the lightwalletd work:

The runtime shutdown can take a few seconds to shutdown

This is a hang, when ctrl-c is received zebra will not do anything else but will not exit either:

They are tracked in #1351.

oxarbitrage · 2022-01-24T20:07:46Z

I had been trying today with shutdown_timeout(). It was promising but unfortunately it will not fix the issue (pthread lock: Invalid argument and similar will keep happening).

teor2345 · 2022-01-24T22:52:45Z

I had been trying today with shutdown_timeout(). It was promising but unfortunately it will not fix the issue (pthread lock: Invalid argument and similar will keep happening).

That's really strange, because this is the recommended method, and other projects use it successfully.

Did you also delete the calls to exit()?
Can you post a draft PR, in case there is some other bug causing the issue?

oxarbitrage · 2022-01-24T23:05:08Z

#3392

I didn't deleted the exits, i will try.

oxarbitrage · 2022-01-24T23:34:37Z

On top the above PR i removed the full shutdown implementation in our side that is doing the exits: https://github.com/ZcashFoundation/zebra/blob/v1.0.0-beta.3/zebrad/src/components/tokio.rs#L46-L71

It didnt worked, i am still seeing:

...
pure virtual method called
terminate called without an active exception
Aborted (core dumped)

occasionally.

teor2345 added C-bug Category: This is a bug S-needs-triage Status: A bug report needs triage P-Medium I-crash Zebra crashes (without a panic) I-usability Zebra is hard to understand or use labels Dec 2, 2021

teor2345 changed the title ~~Zebra aborts with a pthread error if shut down during initialization~~ Wait for RocksDB to finish before exiting Zebra Dec 3, 2021

mpguerra removed the S-needs-triage Status: A bug report needs triage label Dec 17, 2021

teor2345 added C-security Category: Security issues I-bad-code and removed C-bug Category: This is a bug P-Medium labels Dec 19, 2021

teor2345 mentioned this issue Dec 19, 2021

Add a copy-state zebrad command, which copies blocks between two state services #3175

Merged

1 task

teor2345 changed the title ~~Wait for RocksDB to finish before exiting Zebra~~ Security: Stop RocksDB threads calling unexpected code when zebrad exits Dec 20, 2021

mpguerra assigned oxarbitrage Jan 17, 2022

oxarbitrage mentioned this issue Jan 22, 2022

Document zebrad download command #3383

Closed

teor2345 mentioned this issue Jan 23, 2022

Always shut down the tokio runtime before exiting zebra #3384

Closed

teor2345 mentioned this issue Jan 25, 2022

Security: Stop RocksDB or tokio calling unexpected code when zebrad exits #3392

Merged

teor2345 self-assigned this Jan 25, 2022

teor2345 mentioned this issue Jan 25, 2022

Security: Wait for RocksDB tasks to exit on shutdown oxarbitrage/zebra#164

Merged

3 tasks

mergify bot closed this as completed in #3392 Jan 26, 2022

mpguerra mentioned this issue Jan 27, 2022

Tracking: zebrad Usability #2310

Closed

49 tasks

teor2345 mentioned this issue Oct 19, 2022

feat(rpc): add getblockhash rpc method #4967

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Security: Stop RocksDB threads calling unexpected code when `zebrad` exits #3133

Security: Stop RocksDB threads calling unexpected code when `zebrad` exits #3133

teor2345 commented Dec 2, 2021 •

edited

Loading

mpguerra commented Dec 15, 2021

oxarbitrage commented Jan 22, 2022

oxarbitrage commented Jan 22, 2022

teor2345 commented Jan 23, 2022 •

edited

Loading

oxarbitrage commented Jan 24, 2022

teor2345 commented Jan 24, 2022

oxarbitrage commented Jan 24, 2022

oxarbitrage commented Jan 24, 2022

Security: Stop RocksDB threads calling unexpected code when zebrad exits #3133

Security: Stop RocksDB threads calling unexpected code when zebrad exits #3133

Comments

teor2345 commented Dec 2, 2021 • edited Loading

Description

Initial Analysis

Suggested Design

Rejected Alternatives

Steps to Reproduce

Expected Behaviour

Actual Behaviour

Zebra Logs

Environment

Zebra Version

Operating System

mpguerra commented Dec 15, 2021

oxarbitrage commented Jan 22, 2022

oxarbitrage commented Jan 22, 2022

teor2345 commented Jan 23, 2022 • edited Loading

oxarbitrage commented Jan 24, 2022

teor2345 commented Jan 24, 2022

oxarbitrage commented Jan 24, 2022

oxarbitrage commented Jan 24, 2022

Security: Stop RocksDB threads calling unexpected code when `zebrad` exits #3133

Security: Stop RocksDB threads calling unexpected code when `zebrad` exits #3133

teor2345 commented Dec 2, 2021 •

edited

Loading

teor2345 commented Jan 23, 2022 •

edited

Loading