
PVF: Refactor security code into Linux-specific module #2321

Closed
mrcnski opened this issue Nov 14, 2023 · 24 comments · Fixed by #3047
Labels
C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be. T0-node This PR/Issue is related to the topic “node”.

Comments

@mrcnski
Contributor

mrcnski commented Nov 14, 2023

More of a general thought than a comment on this very PR: the security module has a lot of code gated with #[cfg(target_os = "linux")]. Indeed, all the security features we use are only available on Linux. Thus, we assume that only the validator running on Linux is secure. Why not gate the whole module with #[cfg(target_os = "linux")] then? On other platforms, there could be a single big fat warning like "you're not running Linux so you're not safe", and you wouldn't have to gate literally everything with like #[cfg_attr(not(target_os = "linux"), allow(unused_variables))], etc.

Originally posted by @s0me0ne-unkn0wn in #2304 (review)
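The suggestion above can be sketched roughly as follows (module and function names here are hypothetical, not the actual node code): gate the entire module on Linux and compile a single warning stub everywhere else, instead of sprinkling `#[cfg]` attributes throughout.

```rust
// Sketch only: gate the whole security module on Linux and provide a
// stub with "one big fat warning" on other platforms.

#[cfg(target_os = "linux")]
mod security {
    pub fn check_security() -> bool {
        // The real module would probe landlock, seccomp, namespaces, etc.
        true
    }
}

#[cfg(not(target_os = "linux"))]
mod security {
    pub fn check_security() -> bool {
        eprintln!("WARNING: you're not running Linux, so you're not secure");
        false
    }
}

fn main() {
    println!("secure: {}", security::check_security());
}
```

With this layout, callers interact with one `security` API and the per-platform branching lives in exactly one place.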

@mrcnski mrcnski added T0-node This PR/Issue is related to the topic “node”. C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be. labels Nov 14, 2023
@maksimryndin
Contributor

Hi @mrcnski! :) Can I take this issue and research it a little?

@mrcnski
Contributor Author

mrcnski commented Jan 19, 2024

Certainly @maksimryndin! But heads up: this will be a challenging refactor. And you would need to get that Linux amd64 environment set up for this. Still want to do it? :)

@maksimryndin
Contributor

> Certainly @maksimryndin! But heads up: this will be a challenging refactor. And you would need to get that Linux amd64 environment set up for this. Still want to do it? :)

:) Definitely! Actually, yesterday I spawned a QEMU-emulated x86 machine on Mac and also took a free low-resource Oracle VPS for x86 experiments. Of course, emulated x86 is sloooow but with such compilation times the difference is acceptable while reading the book =)

@mrcnski
Contributor Author

mrcnski commented Jan 19, 2024

@maksimryndin Cool! If you have time, mind quickly sharing how you got qemu working? Could be useful for me or someone else someday.

@maksimryndin
Contributor

> @maksimryndin Cool! If you have time, mind quickly sharing how you got qemu working? Could be useful for me or someone else someday.

For sure!:) I use https://mac.getutm.app and downloaded Ubuntu x86 ISO from the official website. UTM is a convenient wrapper around QEMU with good defaults. I spawn VMs for sandboxed development. For example, my main dev env is aarch64 ubuntu on Mac M1.

Of course, emulated x86 is really slow, but I will try with that setup. If it's too painful, I'll try to find alternatives.

@s0me0ne-unkn0wn
Contributor

@maksimryndin beware that QEMU is not just slow, sometimes it also brings in some unpleasant surprises, so it's always better to test features as low-level as security sandboxing on real hardware.

@maksimryndin
Contributor

> @maksimryndin beware that QEMU is not just slow, sometimes it also brings in some unpleasant surprises, so it's always better to test features as low-level as security sandboxing on real hardware.

Thank you @s0me0ne-unkn0wn for an important warning! I will definitely try with the real hardware.

@maksimryndin
Contributor

maksimryndin commented Jan 23, 2024

Hi @mrcnski @s0me0ne-unkn0wn ! :)

Progress so far

Current zombienet test run (without any change) - are these warnings suspicious?
[screenshot: zombienet test run output showing warnings]

Questions so far

  1. Why do we restrict security features to x86_64? The seccompiler and landlock crates also support aarch64, and aarch64 Linux distros are common among cloud providers. I would also like to relax the target-arch constraint to include aarch64 and test with it.
  2. Why does seccomp use a blacklist and not a whitelist filter? Perhaps it is worth profiling with strace?
  3. How do we limit worker memory and CPU usage to prevent DDoS attacks on validators? Should we use cgroups for that?

@mrcnski
Contributor Author

mrcnski commented Jan 23, 2024

Impressive work @maksimryndin!

> Current zombienet test run (without any change) - are these warnings suspicious?

I always see some warnings like this. As far as I can tell they don't affect the tests. I never looked into it, but it might be a good idea to raise an issue (if there isn't one already for those warnings).


> Why do we restrict security features to x86_64?

That is a great question! We only restrict to x86_64 for seccomp, the other security features work on any recent Linux. We can support seccomp on aarch as well, but because some syscall numbers are different on different architectures, we want to at least have an aarch CI job (there is an issue about this in a private ci_cd repo). This would make sure things work as expected on that architecture. Reason being, if we failed to allow some legitimate syscall it could break consensus, which can lead to disputes and slashing and economic damage to validators and the ecosystem. There is a TODO in the code about removing the architecture gate for seccomp once the CI job is in place - I'll leave it up to you to find it. :)
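The risk described above comes from the fact that syscall numbers are architecture-specific. The sketch below is purely illustrative (the numbers are taken from the Linux kernel's syscall tables; the helper function is hypothetical): a filter written against x86_64 numbers cannot be assumed correct on aarch64 without testing.

```rust
// Illustrative only: the same syscall has a different number per
// architecture, which is why an untested arch could break consensus.
fn socket_syscall_number(arch: &str) -> Option<i64> {
    match arch {
        "x86_64" => Some(41),   // __NR_socket on x86_64
        "aarch64" => Some(198), // __NR_socket on aarch64 (generic table)
        _ => None,              // unknown/untested architecture
    }
}

fn main() {
    for arch in ["x86_64", "aarch64", "riscv64"] {
        println!("{arch}: {:?}", socket_syscall_number(arch));
    }
}
```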

> Why does seccomp use a blacklist and not a whitelist filter? Perhaps it is worth profiling with strace?

We unfortunately don't have full control over which syscalls we make. We depend on wasmtime, which is an external project and can introduce new syscalls any time we update it. This doesn't make it impossible, but it is a challenging engineering effort, and we would have to get it right due to the potential consequences to consensus mentioned above.

There was some progress being made on a whitelist. It was dropped as we plan on replacing wasmtime with PolkaVM, which has zero dependencies, controls all its syscalls, and already uses seccomp as part of its extensive sandboxing. Since there was still some work left to do to make a whitelist on top of wasmtime a reality, it was deemed not worth the trouble and risk.

The small blacklist is enough to prevent networking, which is a mitigation against our biggest attack vector (stolen keys), and we are reasonably sure that honest validation won't ever attempt any form of networking.
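A minimal deny-list sketch in the spirit described above (not the actual Polkadot filter; the constant and function names are made up): enumerate the networking syscalls to forbid and treat everything else as allowed. Numbers are for x86_64 Linux.

```rust
// Deny-list sketch: block networking syscalls, allow everything else.
// Syscall numbers are from the x86_64 Linux syscall table.
const BLOCKED_SYSCALLS: &[(i64, &str)] = &[
    (41, "socket"),
    (42, "connect"),
    (43, "accept"),
    (49, "bind"),
    (50, "listen"),
];

fn is_allowed(nr: i64) -> bool {
    !BLOCKED_SYSCALLS.iter().any(|&(n, _)| n == nr)
}

fn main() {
    assert!(!is_allowed(41)); // socket: blocked
    assert!(is_allowed(1));   // write: allowed
    println!("filter ok");
}
```

The trade-off is exactly as the comment says: a short deny list is robust against wasmtime introducing new benign syscalls, at the cost of not catching everything a whitelist would.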

I did some profiling with strace as part of this. :) But what did you have in mind exactly?

> How do we limit worker memory and CPU usage to prevent DDoS attacks on validators? Should we use cgroups for that?

There is some work being done on limiting worker memory and we considered using cgroups for it. We implemented a custom memory tracker and are collecting data to determine a safe limit (cc @s0me0ne-unkn0wn). Then the limit can be enacted through governance as an executor parameter, and enforced through the memory tracker (IIRC this was deemed more reliable than cgroups, since the amount of memory detected depends a lot on how it is allocated).

About CPU usage, I'm not sure how we'd limit it apart from having a timeout, which we do. :) Is that what you had in mind?


Good questions! I tried to document as much as possible in the implementer's guide, as well as in issues and PRs in this repo as well as the old polkadot repo. You might have to do some digging for answers when I'm gone though. :P

@mrcnski
Contributor Author

mrcnski commented Jan 23, 2024

> I would also like to relax the arch target constraint and include aarch64 and test with it.

With all the above said, it is fine if you relax this constraint locally for testing purposes!

@s0me0ne-unkn0wn
Contributor

Running a validator is a CPU-intensive task; we're explicitly asking validator owners to allocate at least four dedicated CPU cores to the validator. So limiting the CPU load doesn't make much sense, as it's supposed to be loaded.

As for supporting aarch64... Well, sandboxing is not the only reason we assume x64 Linux is the only secure platform for a validator. Bringing in another platform also brings in some determinism issues. When we compile PVFs from Wasm to native, we obviously get different native code on different platforms. I mean, it does the same, but in a different way. And, as PVF code is untrusted, a malicious actor could use that difference to craft code that executes on one platform and fails on another, leading to consensus breakage. It's not a concern right now exactly because nearly every validator runs on x64 Linux. So introducing the sandboxing on aarch64 doesn't solve the entire problem and thus is not worth it.

We hope to overcome that with PolkaVM, which is going to have an aarch64 backend someday, perfectly controllable from our side. Thus, we could eliminate those determinism issues at the VM level and support aarch64 as well.

@mrcnski
Contributor Author

mrcnski commented Jan 23, 2024

> It's not a concern right now exactly because nearly every validator runs on x64 Linux.

Yeah, I forgot to mention that 99% of validators are on x86_64, which you can see here.

@s0me0ne-unkn0wn
Contributor

About memory limiting. Yes, it's possible to achieve with cgroups. But again, it's all about determinism. We wanted to measure as precisely as possible how much memory the PVF preparation takes, not the whole worker doing some side tasks as well. Thus, we crafted this limiting allocator and checked its behavior is deterministic, at least in single-threaded compilation mode.
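A tracking allocator of the kind described can be sketched as below. This is a minimal illustration, not the actual Polkadot implementation: wrap the system allocator and record the peak number of live bytes, which a limit could then be checked against.

```rust
// Sketch: a global allocator wrapper that tracks current and peak usage.
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

struct TrackingAllocator;

static CURRENT: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = System.alloc(layout);
        if !ptr.is_null() {
            // Record the new high-water mark if we exceeded it.
            let now = CURRENT.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout);
        CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static ALLOC: TrackingAllocator = TrackingAllocator;

fn main() {
    let v: Vec<u8> = vec![0; 1 << 20]; // allocate 1 MiB
    drop(v);
    println!("peak bytes: {}", PEAK.load(Ordering::Relaxed));
}
```

Because the accounting happens inside the process's own allocator rather than at the OS level, the measurement is independent of how the kernel maps pages, which is one way to read the determinism argument above.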

@maksimryndin
Contributor

maksimryndin commented Jan 23, 2024

Thank you very much for the detailed answers @mrcnski @s0me0ne-unkn0wn ! :)

> I never looked into it, but it might be a good idea to raise an issue (if there isn't one already for those warnings).

Hm, @mrcnski do you mean an issue in this repo? Or zombienet repo? (as they may be more experienced with such tests?)

> if we failed to allow some legitimate syscall it could break consensus, which can lead to disputes and slashing and economic damage to validators and the ecosystem

Sure, I realize the consequences, and have seen the comments. But as far as I understand regarding syscalls, the libc crate abstracts away the numbers for different archs (https://github.com/rust-lang/libc/blob/40741baa1d892518fd3c39795e962058ff558fb9/src/unix/linux_like/linux/gnu/b64/aarch64/mod.rs#L794 vs https://github.com/rust-lang/libc/blob/40741baa1d892518fd3c39795e962058ff558fb9/src/unix/linux_like/linux/gnu/b64/x86_64/not_x32.rs#L114). Of course, a smoke test is required. But after @s0me0ne-unkn0wn's note on the lack of determinism at the assembly level (as far as I understood) and @mrcnski's on the validator arch distribution stats, aarch64 seems like a good learning exercise :))

> It was dropped as we plan on #2461, which has zero dependencies, controls all its syscalls, and already uses seccomp as part of its extensive sandboxing.

Cool :). I think it is worth adding a comment on PolkaVM and the motivation for black- vs whitelisting to clarify (I can add it with the PR). What do you think?

> I did some profiling with strace as part of this. :)

No doubt :) I thought about running zombienet and attaching strace to the running processes to check which syscalls they issue. But I see that the situation is more complex than I assumed :)

> Thus, we crafted this limiting allocator and checked its behavior is deterministic, at least in single-threaded compilation mode.

I am highly impressed by the overall level of engineering, guys @mrcnski @s0me0ne-unkn0wn 🥇

@burdges

burdges commented Feb 8, 2024

We've some pushback on landlock, so maybe imposing that should be better justified? Or made optional?

@s0me0ne-unkn0wn
Contributor

> We've some pushback on landlock

What pushback exactly? I mean, do we have an issue?

@mrcnski
Contributor Author

mrcnski commented Feb 8, 2024

Landlock is optional. Well, provided that another filesystem restriction is present, like pivot_root, which it is on most systems.
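The "optional, provided another restriction is present" policy can be sketched like this. The struct and function names are hypothetical, not the actual node code: what matters is that at least one filesystem-isolation mechanism must be available.

```rust
// Sketch of the policy: landlock is optional as long as another
// filesystem restriction (e.g. unshare + pivot_root) can be applied.
#[derive(Clone, Copy)]
struct FsSecurity {
    can_enable_landlock: bool,
    can_unshare_and_pivot_root: bool,
}

fn filesystem_restricted(caps: FsSecurity) -> bool {
    caps.can_enable_landlock || caps.can_unshare_and_pivot_root
}

fn main() {
    // e.g. an older kernel without landlock but with namespace support:
    let caps = FsSecurity {
        can_enable_landlock: false,
        can_unshare_and_pivot_root: true,
    };
    println!("filesystem restricted: {}", filesystem_restricted(caps));
}
```

Under this policy a node lacking both mechanisms would refuse to run as a secure validator unless explicitly overridden.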

@burdges

burdges commented Feb 8, 2024

We've no issue, but one of the secops guys brought up recent conversations. Anyway the concerns are virtualization and kernels being shipped without support. As it's optional, then this sounds like just a documentation issue, or maybe a question of defaults.

@d3vr4nd0m

d3vr4nd0m commented Feb 9, 2024

The announcement in the matrix rooms sounded like it's gonna get enforced, and many ppl (incl. me) migrated to a full VM instead of LXC :(

Anyway let me recap why I think this would be wrong:

  • the polkadot binary has no visibility of operational aspects, it should not take over/replace OS/Distro/Hypervisor security controls
  • the polkadot binary should ideally run on any OS/Kernel regardless of CPU architecture as long as performance benchmarks are met. Leave ops/secops to validators. this helps with CPU/OS diversity and therefore improves security
  • the chain is permissionless/BFT, validators have skin in the game so everyone is incentivized to keep their validators as secure as possible
  • pandoras box: whats next? selinux enforcement? secure enclaves for keystore (sgx/sev) and restrict by CPU? TPM2.0-only like windows 11? I mean the list is endless...
  • a minor one: the systemd unit file with the correct syscalls from the parity repos seems unmaintained, using the binary and unit file straight from the repos throws errors, see add missing syscalls for workers #2212 so I guess most ppl out there using the repo have no landlock anyway

Thanks for clarifying this.

@s0me0ne-unkn0wn
Contributor

> the polkadot binary has no visibility of operational aspects, it should not take over/replace OS/Distro/Hypervisor security controls

It doesn't replace anything; it just uses OS-provided facilities designed exactly for the user-level applications requiring a higher security level. And, being a bleeding edge technology, Polkadot requires bleeding edge security facilities ;)

> the polkadot binary should ideally run on any OS/Kernel regardless of CPU architecture as long as performance benchmarks are met. Leave ops/secops to validators. this helps with CPU/OS diversity and therefore improves security

That's not entirely true in practice. Here I gave one brief explanation of why we cannot easily allow different architectures for validators. There were a lot more discussions on that topic; one of the most comprehensive is in #1434.

> the chain is permissionless/BFT, validators have skin in the game so everyone is incentivized to keep their validators as secure as possible

I think here we should differentiate between "validator as a host" and "validator as software". Yes, it's in the best interests of a validator's operator to keep the host as secure as possible, but it won't help if the software is not secure. From the software side, two main security concerns must be addressed: 1) keep the validator itself secure (that is, do not allow to steal its keys or influence the validation process itself), and 2) keep network participants isolated (do not allow malicious actions to interfere with honest actions).

The validator software executes untrusted code sourced from network participants; not every network participant is guaranteed to be honest. It was less of a concern when we only had classic parachains and auctions. A high auction price is a strong disincentive for any malicious actions. The situation has changed with the introduction of on-demand parachains and Coretime. We must be very careful with the untrusted PVF code not being able to make it to the validator keys or to get into some state of another PVF execution. That can only be addressed in software, no matter how security-concerned the host operator is.

> pandoras box: whats next? selinux enforcement? secure enclaves for keystore (sgx/sev) and restrict by CPU? TPM2.0-only like windows 11? I mean the list is endless...

I'd say it's a valid concern; when you've already started security sandboxing, you always feel it's not enough :) Hopefully, PolkaVM will introduce a lot more security determinism and stable requirements. Until then, I believe we won't be imposing more security requirements on validators unless some critical vulnerability comes up.

All in all, anyone feeling brave enough is equipped with the --insecure-validator-i-know-what-i-do command line flag and is free to run a validator even on a Raspberry Pi. While the participants behave honestly, that will work perfectly. Otherwise, it's better to be on the safe side.

@mrcnski
Contributor Author

mrcnski commented Feb 9, 2024

Thanks for sharing your concerns @d3vr4nd0m. I'm curious what was the announcement? I don't work with Parity anymore or represent them, but I worked on this area, so just to share a few points:

  • A default level of security around untrusted PVFs was a hard-stop requirement for on-demand to launch.
  • The binary does what it can to self-enforce security policies, on a best-effort basis. The goal was definitely not to replace operator-enforced security, but to provide another layer of defense-in-depth.
  • Landlock was designed specifically to allow applications to securely sandbox themselves. It's a bit unfortunate that it's not present everywhere, so it should be optional.
  • But I should have clarified in my last comment that even the "required" security capabilities can be bypassed with a flag --insecure-validator-i-know-what-i-do. Some of the self-security cannot be applied in container environments, for example. Even with that flag the binary still does what it can on a best-effort, but it doesn't generate errors.
  • The binary only applies its sandboxing around running untrusted code. The restrictions are very strict, too strict to apply to the whole node, and indeed it would be better for the operator to secure the node in general.

Maybe Parity's messaging should clarify some of these points. For example, operators should not relax their security hygiene assuming that the self-security is a replacement. Hope that's clear.

@d3vr4nd0m

d3vr4nd0m commented Feb 9, 2024

Hey all, so I guess it all boils down to a somewhat ambiguous announcement in the matrix rooms which created a decent amount of FUD and mass-migration from 'unsupported' container (notably LXC) setups to KVM or podman/docker with seccomp. As long as landlock is optional/best-effort, then perfect. Keep in mind that even on bare-metal/full-VM setups, most binary installs from the repos probably keep getting this error and won't have landlock due to the wrong/missing syscalls in the systemd unit file provided by the repo.

[screenshot: landlock-related error output]

@mrcnski
Contributor Author

mrcnski commented Feb 9, 2024

Apologies, with "by default" I had in mind that it can be bypassed - but I see now that that wording is not clear. In general it's hard to succinctly describe all the intricacies here, which was my intention with linking to that wiki page for more details. Perhaps going forwards the wiki page can be improved (I'd recommend raising an issue with specific concerns about any ambiguity), and apologies again.

@d3vr4nd0m

thanks for clarification @mrcnski and @s0me0ne-unkn0wn

github-merge-queue bot pushed a commit that referenced this issue Feb 11, 2024
resolve #2321

- [x] refactor the `security` module into a conditionally compiled module
- [x] rename `amd64` to `x86-64` for consistency with the conditional compilation guards and remove the reference to a particular vendor
- [x] run unit tests and zombienet

---------

Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
@github-project-automation github-project-automation bot moved this from Backlog to Completed in parachains team board Feb 11, 2024
Projects
Status: Completed

5 participants