Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Security: Stop RocksDB threads calling unexpected code when zebrad exits #3133

Closed
Tracked by #2310
teor2345 opened this issue Dec 2, 2021 · 8 comments · Fixed by #3392
Closed
Tracked by #2310

Security: Stop RocksDB threads calling unexpected code when zebrad exits #3133

teor2345 opened this issue Dec 2, 2021 · 8 comments · Fixed by #3392
Assignees
Labels
C-security Category: Security issues I-crash Zebra crashes (without a panic) I-usability Zebra is hard to understand or use

Comments

@teor2345
Copy link
Contributor

teor2345 commented Dec 2, 2021

Description

When I terminate zebrad start using Ctrl-C (SIGINT) on Linux, I get pthread errors.

I can reliably generate these errors during startup and the initial peer connections. But they seem much less frequent after initialization.

This is a serious security risk, because calling pure virtual methods is unexpected code execution. So compilers could optimise it into something dangerous, or something that could be exploited. (That doesn't happen in this case - instead, it prints a warning. But there are no guarantees.)

Initial Analysis

We recently made a few changes to Zebra's startup and shutdown behaviour (#3057, #3071, #3078, #3091).

Some of these errors seem to come from C++ code ("pure virtual method called"), so it's probably a combination of:

PR #3091 also stopped zebrad deleting ephemeral state directories. We should fix that issue as part of this bug fix.

Suggested Design

Precautions:

Shut down in this order:

  1. wait for RocksDB to finish, by calling CancelAllBackgroundWork() when Zebra wants to exit: https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ ,
  2. drop the database: https://github.com/facebook/rocksdb/wiki/Known-Issues ,
  3. delete the database files on disk, if it is ephemeral,
  4. exit Zebra

We can do this via a shutdown method in the state service and finalized state, which is called:

  • when the state is dropped,
  • when the state reaches the debug_stop_at_height, or
  • using a Shutdown request to the state, in the start and copy-state (both state services) commands.

After we've waited for shutdown to finish, it is safe to exit Zebra.

Other state requests could be processed while we are exiting Zebra:

  • return errors for any requests and readiness checks after shutdown is called

Rejected Alternatives

Rust's sys::at_exit handler has been removed:
rust-lang/rust#84115

There's no guarantee that libc::atexit runs before RocksDB's static destructors, just like there's no guarantee of the exit order between Rust and C:
https://stackoverflow.com/questions/35980148/why-does-an-atexit-handler-panic-when-it-accesses-stdout

Steps to Reproduce

I tried this:

  1. Run zebrad start
  2. Press Ctrl-C (SIGINT)
zebrad start &
kill -SIGINT $!

Expected Behaviour

I expected to see this happen:

zebrad terminates quickly, without errors.

Actual Behaviour

Instead, this happened:

zebrad terminates quickly, with errors like:

pthread lock: Invalid argument                                                                                                                                                                 
Aborted (core dumped)  
pure virtual method called
terminate called without an active exception
Aborted (core dumped)
pthread destroy mutex: Device or resource busy
Aborted (core dumped)

Zebra Logs

$ /home/dev/.cargo/bin/zebrad start                                                                                                                                         
Dec 03 07:04:11.132  INFO {zebrad="022808d" net="Main"}: zebrad::commands::start: Starting zebrad                                                                                              
Dec 03 07:04:11.132  INFO {zebrad="022808d" net="Main"}: zebrad::commands::start: config=ZebradConfig { consensus: Config { checkpoint_sync: false }, metrics: MetricsSection { endpoint_addr: 
None }, network: Config { listen_addr: 0.0.0.0:8233, network: Mainnet, initial_mainnet_peers: {"dnsseed.z.cash:8233", "dnsseed.str4d.xyz:8233", "mainnet.seeder.zfnd.org:8233", "mainnet.is.yol
o.money:8233"}, initial_testnet_peers: {"testnet.seeder.zfnd.org:18233", "testnet.is.yolo.money:18233", "dnsseed.testnet.z.cash:18233"}, peerset_initial_target_size: 50, crawl_new_peer_interv
al: 60s }, state: Config { cache_dir: "/home/dev/.cache/zebra", ephemeral: false, debug_stop_at_height: None }, tracing: TracingSection { use_color: true, filter: None, endpoint_addr: None, f
lamegraph: None, use_journald: false }, sync: SyncSection { max_concurrent_block_requests: 50, lookahead_limit: 2000 }, mempool: Config { debug_enable_at_height: None, tx_cost_limit: 80000000
, eviction_memory_time: 3600s } }                                                                                                                                                              
Dec 03 07:04:11.132  INFO {zebrad="022808d" net="Main"}: zebrad::commands::start: initializing node state                                                                                      
Dec 03 07:04:11.132  INFO {zebrad="022808d" net="Main"}: zebra_state::config: the open file limit is at or above the specified limit new_limit=1024 current_limit=1024 hard_rlimit=Some(524288)
Dec 03 07:04:11.144  INFO {zebrad="022808d" net="Main"}: zebra_state::service::finalized_state: Opened Zebra state cache at /home/dev/.cache/zebra/state/v11/mainnet                           
Dec 03 07:04:11.144  INFO {zebrad="022808d" net="Main"}: zebra_state::service::finalized_state: reading cached tip height                                                                      
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}: zebra_state::service::finalized_state: loaded Zebra state cache tip=None                                                              
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}: zebra_state::service: starting legacy chain check                                                                                     
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}: zebra_state::service: no legacy chain found                                                                                           
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}: zebrad::commands::start: initializing network                                                                                         
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}:open_listener{addr=0.0.0.0:8233}: zebra_network::peer_set::initialize: Trying to open Zcash protocol endpoint at 0.0.0.0:8233...       
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}:open_listener{addr=0.0.0.0:8233}: zebra_network::peer_set::initialize: Opened Zcash protocol endpoint at 0.0.0.0:8233        
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}:add_initial_peers: zebra_network::config: resolved seed peer IP addresses seed="dnsseed.z.cash:8233" remote_ip_count=25
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}:add_initial_peers: zebra_network::config: resolved seed peer IP addresses seed="mainnet.seeder.zfnd.org:8233" remote_ip_count=25
Dec 03 07:04:11.145  INFO {zebrad="022808d" net="Main"}:add_initial_peers: zebra_network::config: resolved seed peer IP addresses seed="mainnet.is.yolo.money:8233" remote_ip_count=50
^CDec 03 07:04:12.083  INFO {zebrad="022808d" net="Main"}:sig{kind=SignalKind(2) name="SIGINT"}: zebrad::signal: received SIGINT, starting shutdown
pthread destroy mutex: Device or resource busy
Aborted (core dumped)

Environment

Zebra Version

$ zebrad version
zebrad 1.0.0-beta.2

Operating System

$ uname -a
Linux ... 5.4.122 #1-NixOS SMP Wed May 26 10:05:21 UTC 2021 x86_64 GNU/Linux
@teor2345 teor2345 added C-bug Category: This is a bug S-needs-triage Status: A bug report needs triage P-Medium I-crash Zebra crashes (without a panic) I-usability Zebra is hard to understand or use labels Dec 2, 2021
@teor2345 teor2345 changed the title Zebra aborts with a pthread error if shut down during initialization Wait for RocksDB to finish before exiting Zebra Dec 3, 2021
@mpguerra
Copy link
Contributor

@mpguerra mpguerra removed the S-needs-triage Status: A bug report needs triage label Dec 17, 2021
@teor2345 teor2345 added C-security Category: Security issues I-bad-code and removed C-bug Category: This is a bug P-Medium labels Dec 19, 2021
@teor2345 teor2345 changed the title Wait for RocksDB to finish before exiting Zebra Security: Stop RocksDB threads calling unexpected code when zebrad exits Dec 20, 2021
@oxarbitrage
Copy link
Contributor

It seems a lot of things are going on here and i spent quite some good amount of time trying several alternatives.

Warnings as pthread lock: Invalid argument, pure virtual method called can be fully eliminated if we allow the runtime to shutdown. We can do this by just removing this lines of code: https://github.com/ZcashFoundation/zebra/blob/v1.0.0-beta.3/zebrad/src/components/tokio.rs#L62-L63
Using the runtime will call the destructors that will allow us to add self.db.cancel_all_background_work(true|false) and a database drop into https://github.com/ZcashFoundation/zebra/blob/v1.0.0-beta.3/zebra-state/src/service/finalized_state.rs#L822. This should result in a maybe faster and cleaner shutdown of the database.

The downside of this are mainly:

  • Zebra can still display zebra specific errors like buffer's worker closed unexpectedly before shutting down but this are not crashes (we could ignore these or workaround them, they are more manageable as they are in our side).
  • The runtime shutdown will not work properly (gets stuck) if Zebra is downloading the parameters.
  • The runtime shutdown can take a few seconds to shutdown depending on what zebra is doing.

To avoid the last 2 points the non runtime shutdown( https://github.com/ZcashFoundation/zebra/blob/v1.0.0-beta.3/zebrad/src/components/tokio.rs#L62-L63) was introduced.

However as this introduces the c++ errors the suggested design was to keep the non runtime shutdown but close the database properly from the application hoping to get rid of them. There were a few options proposed by this, i personally created a service request into the state that will close the database and i was able to call it at shutdown.

The result was a clean database cleanup without using the runtime destructor but unfortunatly the C++ issues at shutdown do not disapear. So there is something else that causes them, it could be still the rocksdb database but i am not so sure.

Another approach to try could be to lunch all the tasks in a thread pool (https://doc.rust-lang.org/book/ch20-03-graceful-shutdown-and-cleanup.html) and try to `abort()`` them at shutdown, this is while we don't use the runtime shutdown. The result of doing that is uncertain so i am not pushing forward for it.

In my opinion here is what we can do:

  • Use the runtime shutdown that will call destructors.
  • Add proper database close to FinalizedState destructor.
  • Download parameters:
    • Do not do it from zebrad start but from zebrad download. zcashd do it in a separated command (https://github.com/zcash/zcash/blob/master/zcutil/fetch-params.sh). I understand this may need changes to some CI.
    • Download parameters in zebra start before anything else and not in parallel. By not having other tasks around while downloading could make the shutdown more manageable.

@oxarbitrage
Copy link
Contributor

There is another case happening here that seems to be independent on if we are shutting down zebra using the runtime or not. This is a hang, when ctrl-c is received zebra will not do anything else but will not exit either:

...
Jan 22 12:54:47.949  INFO sync:extend_tips:push_extend_tips_length{sync_length=498}: zebrad::components::sync::recent_sync_lengths: sending recent sync lengths recent_lengths=[498, 498, 498, 498]
Jan 22 12:54:51.033  INFO sync:extend_tips:checkpoint: zebra_consensus::checkpoint: verified checkpoint range block_count=400 current_range=(Excluded(Height(124873)), Included(Height(125273)))
^C^C
^C^C^C^C^C^C^C^C^C^C
... time passes but no output until i send SIGKILL from outside ...
Killed
$

@teor2345
Copy link
Contributor Author

teor2345 commented Jan 23, 2022

Download parameters:

  • Do not do it from zebrad start but from zebrad download. zcashd do it in a separated command (https://github.com/zcash/zcash/blob/master/zcutil/fetch-params.sh). I understand this may need changes to some CI.

  • Download parameters in zebra start before anything else and not in parallel. By not having other tasks around while downloading could make the shutdown more manageable.

I'm not sure if this change is needed.

Have you tried:
https://docs.rs/tokio/latest/tokio/runtime/struct.Runtime.html#method.shutdown_timeout

If you call it here, the runtime should stop waiting for all tasks to finish:
https://github.com/ZcashFoundation/zebra/blob/v1.0.0-beta.3/zebrad/src/components/tokio.rs#L59

We might also need to fix #3378. Done!

These issues aren't scheduled until the lightwalletd work:

  • The runtime shutdown can take a few seconds to shutdown

This is a hang, when ctrl-c is received zebra will not do anything else but will not exit either:

They are tracked in #1351.

@oxarbitrage
Copy link
Contributor

I had been trying today with shutdown_timeout(). It was promising but unfortunately it will not fix the issue (pthread lock: Invalid argument and similar will keep happening).

@teor2345
Copy link
Contributor Author

I had been trying today with shutdown_timeout(). It was promising but unfortunately it will not fix the issue (pthread lock: Invalid argument and similar will keep happening).

That's really strange, because this is the recommended method, and other projects use it successfully.

Did you also delete the calls to exit()?
Can you post a draft PR, in case there is some other bug causing the issue?

@oxarbitrage
Copy link
Contributor

#3392

I didn't deleted the exits, i will try.

@oxarbitrage
Copy link
Contributor

On top the above PR i removed the full shutdown implementation in our side that is doing the exits: https://github.com/ZcashFoundation/zebra/blob/v1.0.0-beta.3/zebrad/src/components/tokio.rs#L46-L71

It didnt worked, i am still seeing:

...
pure virtual method called
terminate called without an active exception
Aborted (core dumped)

occasionally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-security Category: Security issues I-crash Zebra crashes (without a panic) I-usability Zebra is hard to understand or use
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants