update: Always use podman pull+cp #2751

Closed · wants to merge 1 commit

Conversation

cgwalters (Member)

A while ago we switched to using `oc image extract` in order
to reduce the I/O writes done to the host, but it turned out
that doesn't yet work in disconnected environments that need
ImageContentSourcePolicy.

Now, in https://bugzilla.redhat.com/show_bug.cgi?id=2000195 we discovered
that the podman fallback broke due to `user.*` extended attributes
in the content (which will hopefully be removed soon).

But still, a good part of the value proposition of OpenShift is that we
work *consistently* across platforms. Having two ways to apply
OS updates is not worth the maintenance overhead.

Eventually this flow will be more native to rpm-ostree; xref
coreos/fedora-coreos-tracker#812
and
https://github.com/ostreedev/ostree-rs-ext/#module-container-encapsulate-ostree-commits-in-ocidocker-images
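
For context, a minimal Go sketch of the "podman pull + cp" flow the PR title refers to: pull the image into local container storage, create a stopped container from it, then copy its contents onto the host. This is not the MCO's actual code; the image reference, paths, and helper names are illustrative.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// run executes a command and wraps any failure with its combined output.
func run(args ...string) error {
	out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s: %s: %w", strings.Join(args, " "), out, err)
	}
	return nil
}

// podmanExtract pulls image and copies its root filesystem into destDir.
func podmanExtract(image, destDir string) error {
	// The pull writes the image into local container storage first; this is
	// the extra host I/O that `oc image extract` was introduced to avoid.
	if err := run("podman", "pull", image); err != nil {
		return err
	}
	// Create (but never start) a container so there is a filesystem to copy from.
	out, err := exec.Command("podman", "create", image).Output()
	if err != nil {
		return fmt.Errorf("podman create: %w", err)
	}
	cid := strings.TrimSpace(string(out))
	defer run("podman", "rm", cid) // best-effort cleanup; error ignored
	// Copy everything out of the container onto the host.
	return run("podman", "cp", cid+":/", destDir)
}

func main() {
	// Illustrative image name and destination, not real MCO values.
	if err := podmanExtract("quay.io/example/machine-os-content:latest", "/run/mco-extract"); err != nil {
		fmt.Println("extract failed:", err)
	}
}
```

The double write (pull into container storage, then copy out) is the I/O cost the PR accepts in exchange for a single, consistently tested update path.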

openshift-ci bot (Contributor) commented Sep 2, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cgwalters
To complete the pull request process, please assign kikisdeliveryservice after the PR has been reviewed.
You can assign the PR to them by writing /assign @kikisdeliveryservice in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yuqi-zhang (Contributor)

Looking at #1941 it seems we had SELinux concerns as well. If anything, we should wait until 4.10 branches.

cc @sinnykumari for some more context.

@cgwalters (Member, Author) commented Sep 2, 2021

> Looking at #1941 it seems we had SELinux concerns as well.

Yeah, a definite possibility. (I didn't test this before submitting BTW)

But the thing is, if there are SELinux or really any other issues, then those would also break in the disconnected case, and we need that to work. So I think it's better to have one way to do it and ensure our single OS update path is tested on every run of e.g. e2e-aws.

(edit: and accept the cost of this extra I/O for control plane and other hosts for now)

@cgwalters (Member, Author)

While e2e-agnostic-upgrade failed, AFAICS it failed on a known flake - the OS update part worked.

@sinnykumari (Contributor)

If I understand correctly, PR coreos/coreos-assembler#2406 is the actual fix for BZ#2000195, and this PR is a cleanup to keep one flow for OS image extraction?

I am not convinced that we should go back to using podman for all cases. Since 4.6, `oc image extract` has been the default way of extracting the OS image, and effort was made to keep the code neat and make the update process faster. Today, `podman cp` is just hacky code in the MCO to work around disconnected environments and other fallback cases, and it will automatically stop being used once oc supports the missing use cases.

Debugging a failure shouldn't be difficult with the current two flows because podman is always the fallback.
Also, as noted in BZ#2000195, the issue shouldn't occur in 4.8+; perhaps we should keep the current OS update workflow and improve things when we rework OS updates as part of the MCBS changes.
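
A hypothetical sketch of the two-path flow described above: try `oc image extract` first (the default since 4.6), and fall back to the podman pull+cp path only when it fails, e.g. on disconnected clusters that need ImageContentSourcePolicy. It reuses `run()` and `podmanExtract()` from the earlier sketch; `extractOSImage` is an illustrative name, not the MCO's real API.

```go
// extractOSImage tries the streaming path first, then falls back to podman.
func extractOSImage(image, destDir string) error {
	// Preferred path: stream the image's layers straight into destDir
	// without pulling into local container storage.
	if err := run("oc", "image", "extract", image, "--path", "/:"+destDir); err == nil {
		return nil
	}
	// Fallback path: podman pull + cp, which this PR proposes making the
	// only path.
	return podmanExtract(image, destDir)
}
```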

@cgwalters (Member, Author)

> If I understand correctly, PR coreos/coreos-assembler#2406 is the actual fix for BZ#2000195, and this PR is a cleanup to keep one flow for OS image extraction?

Yeah, I hope so anyway; it hasn't been verified yet.

> Debugging a failure shouldn't be difficult with the current two flows because podman is always the fallback.

I think the problem isn't so much debugging it (though that was a bit tricky in this case) as the fact that a customer OS update hit it in the field. What if it turned out the podman fallback had been silently broken? Then it'd be much harder to work around; we'd need to ship them a custom MCD, etc.

Hmmm...I think we don't even have an informing job for disconnected in Prow? That seems like a large oversight.

So my argument to merge this is basically that disconnected is important and we should only have one OS update path. But if you prefer to close this, that's also OK by me.

Ultimately we will be replacing this bit with the ostree-native container bits, which should be a lot better than either, but that's still a ways away, which leaves a fair bit of time for any new bugs in the podman path to creep in and be hit only in the field.

@sinnykumari (Contributor)

> I think the problem isn't so much debugging it (though that was a bit tricky in this case) as the fact that a customer OS update hit it in the field. What if it turned out the podman fallback had been silently broken? Then it'd be much harder to work around; we'd need to ship them a custom MCD, etc.
>
> Hmmm...I think we don't even have an informing job for disconnected in Prow? That seems like a large oversight.

Right, we don't have good CI coverage for disconnected.

> So my argument to merge this is basically that disconnected is important and we should only have one OS update path. But if you prefer to close this, that's also OK by me.

+1

> Ultimately we will be replacing this bit with the ostree-native container bits, which should be a lot better than either, but that's still a ways away, which leaves a fair bit of time for any new bugs in the podman path to creep in and be hit only in the field.

Right, that makes sense considering the ostree-native container work is going to take a while. As mentioned earlier, using podman cp was a quick hack, so we need to think a little about any other impact or code changes needed before we merge this. I will get back to this either this week or once I am back from PTO.

openshift-ci bot (Contributor) commented Dec 21, 2021

@cgwalters: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-vsphere-upgrade | fb5a223 | link | | /test e2e-vsphere-upgrade |
| ci/prow/e2e-aws-workers-rhel8 | fb5a223 | link | | /test e2e-aws-workers-rhel8 |
| ci/prow/e2e-aws-workers-rhel7 | fb5a223 | link | | /test e2e-aws-workers-rhel7 |
| ci/prow/e2e-aws-single-node | fb5a223 | link | | /test e2e-aws-single-node |
| ci/prow/e2e-gcp-op-single-node | fb5a223 | link | | /test e2e-gcp-op-single-node |
| ci/prow/e2e-aws-disruptive | fb5a223 | link | | /test e2e-aws-disruptive |
| ci/prow/e2e-aws-upgrade-single-node | fb5a223 | link | | /test e2e-aws-upgrade-single-node |
| ci/prow/e2e-agnostic-upgrade | fb5a223 | link | true | /test e2e-agnostic-upgrade |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
