Fault tolerance #2607

Closed
cjao opened this issue Feb 22, 2021 · 2 comments

@cjao

cjao commented Feb 22, 2021

rpm-ostreed is a single point of failure that can potentially cripple vital functions, such as installing/removing packages or applying OS updates. Hence rpm-ostree-based systems need additional safeguards before they can be safely used by non-technical users.

My laptop running Silverblue 33 is normally set to update automatically in the background using rpm-ostreed-automatic. Recently, however, rpm-ostree managed to wedge itself in a mess from which it was unable to extricate itself (#2548). If this bug had afflicted non-technical users, their systems would have been frozen in time and unable to receive any security updates. This fragility would be unacceptable for production systems; rpm-ostreed must never fail.

Since test suites could still unintentionally let bugs through (who thought about #2548 before it occurred?), the system itself should be designed to automatically recover from problems with rpm-ostreed. For instance, could crashes of rpm-ostreed (such as reported recently in #2603) trigger an automatic rollback?

@cgwalters
Member

Hi, thanks for the issue! But I think it's the responsibility of the OS vendor to ship ostree commits (filesystem trees) that have been tested as a coherent unit together. Enabling that is in fact one of the major goals of the project, and we actually do it with e.g. Fedora/RHEL CoreOS. Our testing system caught this bug and prevented it from shipping there. But Fedora IoT and Silverblue do not currently have integrated gating test systems.

Since you are discussing Silverblue, see https://pagure.io/fedora-qa/os-autoinst-distri-fedora/issue/217

As far as automatic rollbacks, see https://github.com/fedora-iot/greenboot
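To illustrate the kind of health check that could feed an automatic rollback, here is a minimal sketch, not taken from this thread. It assumes that a failing boot-time check (non-zero exit status) is what triggers the rollback, and that `rpm-ostree status --json` failing or returning no deployments is a reasonable proxy for a wedged daemon. The names `daemon_healthy` and `main` are hypothetical, and how the check would be installed into greenboot is left out.

```python
"""Illustrative health check for rpm-ostreed (hypothetical, not greenboot's own code)."""
import json
import subprocess


def daemon_healthy(run=subprocess.run, timeout=60):
    """Return True if `rpm-ostree status --json` succeeds and lists deployments.

    `run` is injectable so the check can be exercised without rpm-ostree installed.
    """
    try:
        result = run(
            ["rpm-ostree", "status", "--json"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or the daemon never answered.
        return False
    if result.returncode != 0:
        return False
    try:
        status = json.loads(result.stdout)
    except json.JSONDecodeError:
        return False
    # Assumption: a healthy daemon reports at least one deployment.
    return isinstance(status, dict) and bool(status.get("deployments"))


def main(run=subprocess.run):
    """Map the health result to a process exit code (0 = healthy, 1 = unhealthy)."""
    return 0 if daemon_healthy(run=run) else 1
```

A wrapper would call `sys.exit(main())` so that the exit status reflects the result. Whether a single failed status query should be enough to roll back is a policy decision that this sketch does not settle.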

@cgwalters
Member

I wanted to clarify this a bit more: I am very embarrassed by the libsolv issue, but rpm-ostree issues are for things that need to be solved here versus elsewhere. And this issue, I think, is more of a Fedora issue than an rpm-ostree issue. Here we provide all the tools and techniques needed to solve this problem; they just aren't wired up together there.

This also relates a lot to https://github.com/cgwalters/fedora-silverblue-config which bases Silverblue on FCOS (although in this case we should actually use the FCOS lockfiles to validate that it's the same libsolv + rpm-ostree etc.)
