
PVF: Refactor security code into Linux-specific module #2321

Closed
mrcnski opened this issue Nov 14, 2023 · 24 comments · Fixed by #3047
Labels
C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be. T0-node This PR/Issue is related to the topic “node”.

Comments

@mrcnski
Contributor

mrcnski commented Nov 14, 2023

More of a general thought than a comment on this very PR: the security module has a lot of code gated with #[cfg(target_os = "linux")]. Indeed, all the security features we use are only available on Linux. Thus, we assume that only the validator running on Linux is secure. Why not gate the whole module with #[cfg(target_os = "linux")] then? On other platforms, there could be a single big fat warning like "you're not running Linux so you're not safe", and you wouldn't have to gate literally everything with like #[cfg_attr(not(target_os = "linux"), allow(unused_variables))], etc.

Originally posted by @s0me0ne-unkn0wn in #2304 (review)
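The suggestion above can be sketched roughly as follows (module and function names here are hypothetical, not the actual node code): gate the entire module on Linux and compile a single warning stub everywhere else, instead of sprinkling `#[cfg]` attributes throughout.

```rust
// Sketch only: gate the whole security module on Linux and provide a
// stub with "one big fat warning" on other platforms.

#[cfg(target_os = "linux")]
mod security {
    pub fn check_security() -> bool {
        // The real module would probe landlock, seccomp, namespaces, etc.
        true
    }
}

#[cfg(not(target_os = "linux"))]
mod security {
    pub fn check_security() -> bool {
        eprintln!("WARNING: you're not running Linux, so you're not secure");
        false
    }
}

fn main() {
    println!("secure: {}", security::check_security());
}
```

With this layout, callers interact with one `security` API and the per-platform branching lives in exactly one place.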

@mrcnski mrcnski added T0-node This PR/Issue is related to the topic “node”. C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be. labels Nov 14, 2023
@maksimryndin
Contributor

Hi @mrcnski! :) Can I take this issue and research it a little?

@mrcnski
Contributor Author

mrcnski commented Jan 19, 2024

Certainly @maksimryndin! But heads up: this will be a challenging refactor. And you would need to get that Linux amd64 environment set up for this. Still want to do it? :)

@maksimryndin
Contributor

> Certainly @maksimryndin! But heads up: this will be a challenging refactor. And you would need to get that Linux amd64 environment set up for this. Still want to do it? :)

:) Definitely! Actually, yesterday I spawned a QEMU-emulated x86 machine on Mac and also took a free low-resource Oracle VPS for x86 experiments. Of course, emulated x86 is sloooow but with such compilation times the difference is acceptable while reading the book =)

@mrcnski
Contributor Author

mrcnski commented Jan 19, 2024

@maksimryndin Cool! If you have time, mind quickly sharing how you got qemu working? Could be useful for me or someone else someday.

@maksimryndin
Contributor

> @maksimryndin Cool! If you have time, mind quickly sharing how you got qemu working? Could be useful for me or someone else someday.

For sure!:) I use https://mac.getutm.app and downloaded Ubuntu x86 ISO from the official website. UTM is a convenient wrapper around QEMU with good defaults. I spawn VMs for sandboxed development. For example, my main dev env is aarch64 ubuntu on Mac M1.

Of course, emulated x86 is really slow, but I will try with that setup. If it's too painful, I'll try to find alternatives.

@s0me0ne-unkn0wn
Contributor

@maksimryndin beware that QEMU is not just slow, sometimes it also brings in some unpleasant surprises, so it's always better to test features as low-level as security sandboxing on real hardware.

@maksimryndin
Contributor

> @maksimryndin beware that QEMU is not just slow, sometimes it also brings in some unpleasant surprises, so it's always better to test features as low-level as security sandboxing on real hardware.

Thank you @s0me0ne-unkn0wn for an important warning! I will definitely try with the real hardware.

@maksimryndin
Contributor

maksimryndin commented Jan 23, 2024

Hi @mrcnski @s0me0ne-unkn0wn ! :)

Progress so far

Current zombienet test run (without any change) - are these warnings suspicious?
[screenshot: zombienet test run output showing warnings]

Questions so far

  1. Why do we restrict security features to x86_64? The seccompiler and landlock crates also support aarch64, and aarch64 Linux distros are common among cloud providers. I would also like to relax the target-arch constraint to include aarch64 and test with it.
  2. Why does seccomp use a blacklist and not a whitelist filter? Perhaps it is worth profiling with strace?
  3. How do we limit worker memory and CPU usage to prevent DDoS attacks on validators? Should we use cgroups for that?

@mrcnski
Contributor Author

mrcnski commented Jan 23, 2024

Impressive work @maksimryndin!

> Current zombienet test run (without any change) - are these warnings suspicious?

I always see some warnings like this. As far as I can tell they don't affect the tests. I never looked into it, but it might be a good idea to raise an issue (if there isn't one already for those warnings).


> Why do we restrict security features to x86_64?

That is a great question! We only restrict to x86_64 for seccomp, the other security features work on any recent Linux. We can support seccomp on aarch as well, but because some syscall numbers are different on different architectures, we want to at least have an aarch CI job (there is an issue about this in a private ci_cd repo). This would make sure things work as expected on that architecture. Reason being, if we failed to allow some legitimate syscall it could break consensus, which can lead to disputes and slashing and economic damage to validators and the ecosystem. There is a TODO in the code about removing the architecture gate for seccomp once the CI job is in place - I'll leave it up to you to find it. :)
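The risk described above comes from the fact that syscall numbers are architecture-specific. The sketch below is purely illustrative (the numbers are taken from the Linux kernel's syscall tables; the helper function is hypothetical): a filter written against x86_64 numbers cannot be assumed correct on aarch64 without testing.

```rust
// Illustrative only: the same syscall has a different number per
// architecture, which is why an untested arch could break consensus.
fn socket_syscall_number(arch: &str) -> Option<i64> {
    match arch {
        "x86_64" => Some(41),   // __NR_socket on x86_64
        "aarch64" => Some(198), // __NR_socket on aarch64 (generic table)
        _ => None,              // unknown/untested architecture
    }
}

fn main() {
    for arch in ["x86_64", "aarch64", "riscv64"] {
        println!("{arch}: {:?}", socket_syscall_number(arch));
    }
}
```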

> Why does seccomp use a blacklist and not a whitelist filter? Perhaps it is worth profiling with strace?

We unfortunately don't have full control over which syscalls we make. We depend on wasmtime, which is an external project and can introduce new syscalls any time we update it. This doesn't make it impossible, but it is a challenging engineering effort, and we would have to get it right due to the potential consequences to consensus mentioned above.

There was some progress being made on a whitelist. It was dropped as we plan on replacing wasmtime with PolkaVM, which has zero dependencies, controls all its syscalls, and already uses seccomp as part of its extensive sandboxing. Since there was still some work left to do to make a whitelist on top of wasmtime a reality, it was deemed not worth the trouble and risk.

The small blacklist is enough to prevent networking, which is a mitigation against our biggest attack vector (stolen keys), and we are reasonably sure that honest validation won't ever attempt any form of networking.
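A minimal deny-list sketch in the spirit described above (not the actual Polkadot filter; the constant and function names are made up): enumerate the networking syscalls to forbid and treat everything else as allowed. Numbers are for x86_64 Linux.

```rust
// Deny-list sketch: block networking syscalls, allow everything else.
// Syscall numbers are from the x86_64 Linux syscall table.
const BLOCKED_SYSCALLS: &[(i64, &str)] = &[
    (41, "socket"),
    (42, "connect"),
    (43, "accept"),
    (49, "bind"),
    (50, "listen"),
];

fn is_allowed(nr: i64) -> bool {
    !BLOCKED_SYSCALLS.iter().any(|&(n, _)| n == nr)
}

fn main() {
    assert!(!is_allowed(41)); // socket: blocked
    assert!(is_allowed(1));   // write: allowed
    println!("filter ok");
}
```

The trade-off is exactly as the comment says: a short deny list is robust against wasmtime introducing new benign syscalls, at the cost of not catching everything a whitelist would.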

I did some profiling with strace as part of this. :) But what did you have in mind exactly?

> How do we limit worker memory and CPU usage to prevent DDoS attacks on validators? Should we use cgroups for that?

There is some work being done on limiting worker memory and we considered using cgroups for it. We implemented a custom memory tracker and are collecting data to determine a safe limit (cc @s0me0ne-unkn0wn). Then the limit can be enacted through governance as an executor parameter, and enforced through the memory tracker (IIRC this was deemed more reliable than cgroups, since the amount of memory detected depends a lot on how it is allocated).

About CPU usage, I'm not sure how we'd limit it apart from having a timeout, which we do. :) Is that what you had in mind?


Good questions! I tried to document as much as possible in the implementer's guide, as well as in issues and PRs in this repo as well as the old polkadot repo. You might have to do some digging for answers when I'm gone though. :P

@mrcnski
Contributor Author

mrcnski commented Jan 23, 2024

> I would also like to relax the arch target constraint and include aarch64 and test with it.

With all the above said, it is fine if you relax this constraint locally for testing purposes!

@s0me0ne-unkn0wn
Contributor

Running a validator is a CPU-intensive task; we're explicitly asking validator owners to allocate at least four dedicated CPU cores to the validator. So limiting the CPU load doesn't make much sense, as it's supposed to be loaded.

As for supporting aarch64... Well, sandboxing is not the only reason we assume x64 Linux is the only secure platform for a validator. Bringing in another platform also brings in some determinism issues. When we compile PVFs from Wasm to native, we obviously get different native code on different platforms. I mean, it does the same, but in a different way. And, as PVF code is untrusted, a malicious actor could use that difference to craft code that executes on one platform and fails on another, leading to consensus breakage. It's not a concern right now exactly because nearly every validator runs on x64 Linux. So introducing the sandboxing on aarch64 doesn't solve the entire problem and thus is not worth it.

We hope to overcome that with PolkaVM, which is going to have an aarch64 backend someday, perfectly controllable from our side. Thus, we could eliminate those determinism issues at the VM level and support aarch64 as well.

@mrcnski
Contributor Author

mrcnski commented Jan 23, 2024

> It's not a concern right now exactly because nearly every validator runs on x64 Linux.

Yeah, I forgot to mention that 99% of validators are on x86_64, which you can see here.

@s0me0ne-unkn0wn
Contributor

About memory limiting. Yes, it's possible to achieve with cgroups. But again, it's all about determinism. We wanted to measure as precisely as possible how much memory the PVF preparation takes, not the whole worker doing some side tasks as well. Thus, we crafted this limiting allocator and checked its behavior is deterministic, at least in single-threaded compilation mode.
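A tracking allocator of the kind described can be sketched as below. This is a minimal illustration, not the actual Polkadot implementation: wrap the system allocator and record the peak number of live bytes, which a limit could then be checked against.

```rust
// Sketch: a global allocator wrapper that tracks current and peak usage.
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

struct TrackingAllocator;

static CURRENT: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = System.alloc(layout);
        if !ptr.is_null() {
            // Record the new high-water mark if we exceeded it.
            let now = CURRENT.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout);
        CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static ALLOC: TrackingAllocator = TrackingAllocator;

fn main() {
    let v: Vec<u8> = vec![0; 1 << 20]; // allocate 1 MiB
    drop(v);
    println!("peak bytes: {}", PEAK.load(Ordering::Relaxed));
}
```

Because the accounting happens inside the process's own allocator rather than at the OS level, the measurement is independent of how the kernel maps pages, which is one way to read the determinism argument above.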

@maksimryndin
Contributor

maksimryndin commented Jan 23, 2024

Thank you very much for the detailed answers @mrcnski @s0me0ne-unkn0wn ! :)

> I never looked into it, but it might be a good idea to raise an issue (if there isn't one already for those warnings).

Hm, @mrcnski do you mean an issue in this repo? Or zombienet repo? (as they may be more experienced with such tests?)

> if we failed to allow some legitimate syscall it could break consensus, which can lead to disputes and slashing and economic damage to validators and the ecosystem

Sure, I realize the consequences, and have seen the comments. But as far as I understand regarding syscalls, the libc crate abstracts away the numbers for different archs (https://github.com/rust-lang/libc/blob/40741baa1d892518fd3c39795e962058ff558fb9/src/unix/linux_like/linux/gnu/b64/aarch64/mod.rs#L794 vs https://github.com/rust-lang/libc/blob/40741baa1d892518fd3c39795e962058ff558fb9/src/unix/linux_like/linux/gnu/b64/x86_64/not_x32.rs#L114). Of course, a smoke test is required. But after @s0me0ne-unkn0wn's note on the lack of determinism at the assembly level (as far as I understood) and @mrcnski's on the validator arch distribution stats, aarch64 seems like a good learning exercise :))

> It was dropped as we plan on #2461, which has zero dependencies, controls all its syscalls, and already uses seccomp as part of its extensive sandboxing.

Cool :). I think it is worth adding a comment on PolkaVM and the motivation for black- vs whitelisting to clarify (I can add it with the PR). What do you think?

> I did some profiling with strace as part of this. :)

No doubt :) I thought about running zombienet and attaching strace to the running processes to check which syscalls they issue. But I see that the situation is more complex than I assumed :)

> Thus, we crafted this limiting allocator and checked its behavior is deterministic, at least in single-threaded compilation mode.

I am highly impressed by the overall level of engineering, guys @mrcnski @s0me0ne-unkn0wn 🥇

@burdges

burdges commented Feb 8, 2024

We've some pushback on landlock, so maybe imposing that should be better justified? Or made optional?

@s0me0ne-unkn0wn
Contributor

> We've some pushback on landlock

What pushback exactly? I mean, do we have an issue?

@mrcnski
Contributor Author

mrcnski commented Feb 8, 2024

Landlock is optional. Well, provided that another filesystem restriction is present, like pivot_root, which it is on most systems.
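The "optional, provided another restriction is present" policy can be sketched like this. The struct and function names are hypothetical, not the actual node code: what matters is that at least one filesystem-isolation mechanism must be available.

```rust
// Sketch of the policy: landlock is optional as long as another
// filesystem restriction (e.g. unshare + pivot_root) can be applied.
#[derive(Clone, Copy)]
struct FsSecurity {
    can_enable_landlock: bool,
    can_unshare_and_pivot_root: bool,
}

fn filesystem_restricted(caps: FsSecurity) -> bool {
    caps.can_enable_landlock || caps.can_unshare_and_pivot_root
}

fn main() {
    // e.g. an older kernel without landlock but with namespace support:
    let caps = FsSecurity {
        can_enable_landlock: false,
        can_unshare_and_pivot_root: true,
    };
    println!("filesystem restricted: {}", filesystem_restricted(caps));
}
```

Under this policy a node lacking both mechanisms would refuse to run as a secure validator unless explicitly overridden.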

@burdges

burdges commented Feb 8, 2024

We've no issue, but one of the secops guys brought up recent conversations. Anyway the concerns are virtualization and kernels being shipped without support. As it's optional, then this sounds like just a documentation issue, or maybe a question of defaults.

@d3vr4nd0m

d3vr4nd0m commented Feb 9, 2024

The announcement in the matrix rooms sounded like it's gonna get enforced, and many ppl (incl. me) migrated to a full VM instead of LXC :(

Anyway let me recap why I think this would be wrong:

  • the polkadot binary has no visibility of operational aspects, it should not take over/replace OS/Distro/Hypervisor security controls
  • the polkadot binary should ideally run on any OS/Kernel regardless of CPU architecture as long as performance benchmarks are met. Leave ops/secops to validators. this helps with CPU/OS diversity and therefore improves security
  • the chain is permissionless/BFT, validators have skin in the game so everyone is incentivized to keep their validators as secure as possible
  • pandoras box: whats next? selinux enforcement? secure enclaves for keystore (sgx/sev) and restrict by CPU? TPM2.0-only like windows 11? I mean the list is endless...
  • a minor one: the systemd unit file with the correct syscalls from the parity repos seems unmaintained, using the binary and unit file straight from the repos throws errors, see add missing syscalls for workers #2212 so I guess most ppl out there using the repo have no landlock anyway

Thanks for clarifying this.

@s0me0ne-unkn0wn
Contributor

> the polkadot binary has no visibility of operational aspects, it should not take over/replace OS/Distro/Hypervisor security controls

It doesn't replace anything; it just uses OS-provided facilities designed exactly for the user-level applications requiring a higher security level. And, being a bleeding edge technology, Polkadot requires bleeding edge security facilities ;)

> the polkadot binary should ideally run on any OS/Kernel regardless of CPU architecture as long as performance benchmarks are met. Leave ops/secops to validators. this helps with CPU/OS diversity and therefore improves security

That's not entirely true in practice. Here I gave one brief explanation of why we cannot easily allow different architectures for validators. There were a lot more discussions on that topic; one of the most comprehensive is in #1434.

> the chain is permissionless/BFT, validators have skin in the game so everyone is incentivized to keep their validators as secure as possible

I think here we should differentiate between "validator as a host" and "validator as software". Yes, it's in the best interests of a validator's operator to keep the host as secure as possible, but it won't help if the software is not secure. From the software side, two main security concerns must be addressed: 1) keep the validator itself secure (that is, do not allow to steal its keys or influence the validation process itself), and 2) keep network participants isolated (do not allow malicious actions to interfere with honest actions).

The validator software executes untrusted code sourced from network participants; not every network participant is guaranteed to be honest. It was less of a concern when we only had classic parachains and auctions. A high auction price is a strong disincentive for any malicious actions. The situation has changed with the introduction of on-demand parachains and Coretime. We must be very careful with the untrusted PVF code not being able to make it to the validator keys or to get into some state of another PVF execution. That can only be addressed in software, no matter how security-concerned the host operator is.

> pandoras box: whats next? selinux enforcement? secure enclaves for keystore (sgx/sev) and restrict by CPU? TPM2.0-only like windows 11? I mean the list is endless...

I'd say it's a valid concern; when you've already started security sandboxing, you always feel it's not enough :) Hopefully, PolkaVM will introduce a lot more security determinism and stable requirements. Until then, I believe we won't be imposing more security requirements on validators unless some critical vulnerability comes up.

All in all, anyone feeling brave enough is equipped with the --insecure-validator-i-know-what-i-do command line flag and is free to run a validator even on a Raspberry Pi. While the participants behave honestly, that will work perfectly. Otherwise, it's better to be on the safe side.

@mrcnski
Contributor Author

mrcnski commented Feb 9, 2024

Thanks for sharing your concerns @d3vr4nd0m. I'm curious what was the announcement? I don't work with Parity anymore or represent them, but I worked on this area, so just to share a few points:

  • A default level of security around untrusted PVFs was a hard-stop requirement for on-demand to launch.
  • The binary does what it can to self-enforce security policies, on a best-effort basis. The goal was definitely not to replace operator-enforced security, but to provide another layer of defense-in-depth.
  • Landlock was designed specifically to allow applications to securely sandbox themselves. It's a bit unfortunate that it's not present everywhere, so it should be optional.
  • But I should have clarified in my last comment that even the "required" security capabilities can be bypassed with a flag --insecure-validator-i-know-what-i-do. Some of the self-security cannot be applied in container environments, for example. Even with that flag the binary still does what it can on a best-effort, but it doesn't generate errors.
  • The binary only applies its sandboxing around running untrusted code. The restrictions are very strict, too strict to apply to the whole node, and indeed it would be better for the operator to secure the node in general.

Maybe Parity's messaging should clarify some of these points. For example, operators should not relax their security hygiene assuming that the self-security is a replacement. Hope that's clear.

@d3vr4nd0m

d3vr4nd0m commented Feb 9, 2024

Hey all, so I guess it all boils down to a somewhat ambiguous announcement in the matrix rooms which created a decent amount of FUD and mass-migration from 'unsupported' container (notably LXC) setups to KVM or podman/docker with seccomp. As long as landlock is optional/best-effort, then perfect. Keep in mind that even on bare-metal/full-VM setups, most binary installs from the repos probably keep getting this error and won't have landlock due to the wrong/missing syscalls in the systemd unit file provided by the repo.

[screenshot: landlock-related error output]

@mrcnski
Contributor Author

mrcnski commented Feb 9, 2024

Apologies, with "by default" I had in mind that it can be bypassed - but I see now that that wording is not clear. In general it's hard to succinctly describe all the intricacies here, which was my intention with linking to that wiki page for more details. Perhaps going forwards the wiki page can be improved (I'd recommend raising an issue with specific concerns about any ambiguity), and apologies again.

@d3vr4nd0m

thanks for clarification @mrcnski and @s0me0ne-unkn0wn

github-merge-queue bot pushed a commit that referenced this issue Feb 11, 2024
resolve #2321

- [x] refactor the `security` module into a conditionally compiled module
- [x] rename `amd64` to `x86-64` for consistency with the conditional compilation guards and remove the reference to a particular vendor
- [x] run unit tests and zombienet

---------

Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
@github-project-automation github-project-automation bot moved this from Backlog to Completed in parachains team board Feb 11, 2024
Projects
Status: Completed

5 participants