Fault tolerance #2607

cjao · 2021-02-22T13:35:55Z

rpm-ostreed is a single point of failure that can potentially cripple vital functions, such as installing/removing packages or applying OS updates. Hence rpm-ostree-based systems need additional safeguards before they can be safely used by non-technical users.

My laptop running Silverblue 33 is normally set to update automatically in the background using rpm-ostreed-automatic. Recently, however, rpm-ostree managed to wedge itself in a mess from which it was unable to extricate itself (#2548). If this bug had afflicted non-technical users, their systems would have been frozen in time and unable to receive any security updates. This fragility would be unacceptable for production systems; rpm-ostreed must never fail.

Since test suites could still unintentionally let bugs through (who thought about #2548 before it occurred?), the system itself should be designed to automatically recover from problems with rpm-ostreed. For instance, could crashes of rpm-ostreed (such as reported recently in #2603) trigger an automatic rollback?

cgwalters · 2021-02-22T14:18:50Z

Hi, thanks for the issue! But I think it's the responsibility of the OS vendor to ship ostree commits (filesystem trees) that have been tested as a coherent unit together. Enabling that is in fact one of the major goals of the project, and we actually do it with e.g. Fedora/RHEL CoreOS. Our testing system caught this bug and prevented it from shipping there. But Fedora IoT and Silverblue do not currently have integrated gating test systems.

Since you are discussing Silverblue, see https://pagure.io/fedora-qa/os-autoinst-distri-fedora/issue/217

As for automatic rollbacks, see https://github.com/fedora-iot/greenboot
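To make the rollback idea concrete, here is a minimal sketch of the kind of health check that a framework like greenboot could run at boot. It is an illustration under stated assumptions, not something taken from greenboot or rpm-ostree: the function name `daemon_responds` is hypothetical, and the check assumes that a successful `rpm-ostree status --json` call with parseable JSON output is a reasonable sign that rpm-ostreed is working. How the script is wired into greenboot is left out.

```python
import json
import subprocess
import sys


def daemon_responds(cmd=("rpm-ostree", "status", "--json"), timeout=60):
    """Return True if `cmd` exits with status 0 and prints parseable JSON.

    Hypothetical helper: the default command assumes that querying
    rpm-ostree for its status is a fair proxy for daemon health.
    """
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or the call hung past the timeout.
        return False
    if result.returncode != 0:
        return False
    try:
        json.loads(result.stdout)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == "__main__":
    # A non-zero exit status signals a failed check to whatever runs this.
    sys.exit(0 if daemon_responds() else 1)
```

The command is a parameter so that the check can be exercised on a machine without rpm-ostree by passing a stand-in command.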

cgwalters · 2021-02-23T12:49:38Z

I wanted to clarify this a bit more: I am very embarrassed by the libsolv issue, but rpm-ostree issues are for things that need to be solved here rather than elsewhere. And I think this issue is more of a Fedora issue than an rpm-ostree issue. Here we provide all the tools and techniques needed to solve this problem; they just aren't wired up together there.

This also relates a lot to https://github.com/cgwalters/fedora-silverblue-config which bases Silverblue on FCOS (although in this case we should actually use the FCOS lockfiles to validate that it's the same libsolv + rpm-ostree etc.)

cgwalters closed this as completed Feb 22, 2021
