Create PoC for booting from in-cluster built image #2886

mkenigs · 2021-12-21T19:29:24Z

Add e2e test that

1. creates image stream and pushes build to that image stream
2. uses that build with rpm-ostree rebase
3. successfully reboots into that image

Closes https://issues.redhat.com/browse/MCO-127

openshift-ci · 2021-12-21T19:29:25Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

mkenigs · 2021-12-21T19:30:16Z

/cc @cgwalters

mkenigs · 2021-12-21T19:31:20Z

Right now I'm using environment variables for the kubeadmin password, which I'll fix once we set up authentication properly.

cgwalters

Nice work on this! I had been thinking of this e2e as basically a shell script running oc, but having it in Go is going to greatly help with migrating this code into the MCD and MCO, and with correctly handling state reconciliation, monitoring, etc.

cgwalters · 2021-12-22T16:03:56Z

test/e2e/layering_test.go

+	changesQueued := "Changes queued for next boot. Run \"systemctl reboot\" to start a reboot"
+	require.Contains(t, rebase, changesQueued)


I know this is just a test, but I would like the ability to change the English text output from rpm-ostree without breaking the MCO's test suite.

I think we can probably just drop this chunk. But if we do want to be explicit, we can use rpm-ostree status --json and check for a queued pending deployment. The MCD has code for this already. See also coreos/rpm-ostree#2389

Kinda feels like overkill but added something to call rpm-ostree status --json both here and after rollback
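
The structured check suggested above could look roughly like the following. This is a minimal sketch, not the MCD's or this PR's actual code: the `deployment` and `status` types and the `hasPendingDeployment` helper are hypothetical names, and the JSON field names (`deployments`, `booted`, `staged`) are assumptions about the shape of `rpm-ostree status --json`. It treats a change as queued when the first listed deployment is not the booted one.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deployment holds the subset of per-deployment fields this sketch reads.
// The JSON field names are assumptions about `rpm-ostree status --json`.
type deployment struct {
	Booted   bool   `json:"booted"`
	Staged   bool   `json:"staged"`
	Checksum string `json:"checksum"`
}

// status is the assumed top-level shape of the status output.
type status struct {
	Deployments []deployment `json:"deployments"`
}

// hasPendingDeployment (hypothetical helper) reports whether the first
// listed deployment is not the booted one, i.e. a change is queued for
// the next boot.
func hasPendingDeployment(statusJSON []byte) (bool, error) {
	var s status
	if err := json.Unmarshal(statusJSON, &s); err != nil {
		return false, fmt.Errorf("parsing rpm-ostree status: %w", err)
	}
	if len(s.Deployments) == 0 {
		return false, fmt.Errorf("no deployments in status output")
	}
	return !s.Deployments[0].Booted, nil
}

func main() {
	// Hand-written sample input, not captured from a real node.
	sample := []byte(`{"deployments":[
		{"booted":false,"staged":true,"checksum":"new"},
		{"booted":true,"staged":false,"checksum":"old"}]}`)
	pending, err := hasPendingDeployment(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("pending deployment:", pending)
}
```

Checking structured output this way keeps the test independent of the English text that rpm-ostree prints.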

cgwalters · 2022-01-18T19:42:07Z

@mkenigs do you want to rebase this? Why didn't we get it in before? CI flow problems?

mkenigs · 2022-01-18T19:55:31Z

I updated it to use the MCD service account but never tested that flow because I was waiting for the newer rpm-ostree

cgwalters · 2022-01-18T20:00:18Z

> I updated it to use the MCD service account but never tested that flow because I was waiting for the newer rpm-ostree

Ah got it, that should be available now

cgwalters · 2022-01-19T18:55:44Z

/test all

cgwalters · 2022-01-19T18:55:47Z

/approve

mkenigs · 2022-01-20T04:01:11Z

@cgwalters the test added by this PR is failing with:
error: Unknown option --authfile
I'm guessing CI still has the old rpm-ostree?

cgwalters · 2022-01-20T16:34:32Z

OK right. This is because while we plumbed the options up to the ostree(ext) CLI, we didn't plumb them into rpm-ostree rebase, which is a separate code path.

I think per discussion, what we really want is config files anyways. So I did ostreedev/ostree-rs-ext#213

With that, the MCO can just write /run/ostree/auth.json and that's it, no need to pass it on the CLI too.
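
Writing that file could be sketched as below. This assumes the file uses the containers-auth.json layout (an `auths` map keyed by registry, each entry holding a base64 `user:secret` string); the thread only says the MCO writes `/run/ostree/auth.json`, so treat the format as an assumption. `buildAuthJSON`, `writeAuthFile`, and the credentials are hypothetical, and the example writes to a temporary directory so it runs unprivileged.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// authEntry and authFile mirror the containers-auth.json layout, which this
// sketch assumes is what the auth file is expected to contain.
type authEntry struct {
	Auth string `json:"auth"`
}

type authFile struct {
	Auths map[string]authEntry `json:"auths"`
}

// buildAuthJSON (hypothetical helper) encodes one registry credential.
func buildAuthJSON(registry, user, token string) ([]byte, error) {
	cred := base64.StdEncoding.EncodeToString([]byte(user + ":" + token))
	return json.Marshal(authFile{Auths: map[string]authEntry{registry: {Auth: cred}}})
}

// writeAuthFile (hypothetical helper) creates the parent directory and writes
// the credential file readable only by its owner.
func writeAuthFile(path, registry, user, token string) error {
	data, err := buildAuthJSON(registry, user, token)
	if err != nil {
		return err
	}
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}

func main() {
	// On a node the target would be /run/ostree/auth.json; a temp dir is
	// used here so the sketch does not need root.
	dir, err := os.MkdirTemp("", "ostree-auth")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "ostree", "auth.json")
	registry := "image-registry.openshift-image-registry.svc:5000"
	// Placeholder credentials, not real values.
	if err := writeAuthFile(path, registry, "serviceaccount", "example-token"); err != nil {
		panic(err)
	}
	fmt.Println("wrote", path)
}
```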

cgwalters · 2022-01-20T17:53:55Z

@mkenigs can you apply this:

diff --git a/test/e2e/layering_test.go b/test/e2e/layering_test.go
index 81d5b5830..44215e8f8 100644
--- a/test/e2e/layering_test.go
+++ b/test/e2e/layering_test.go
@@ -28,7 +28,7 @@ const (
 	// See https://docs.rs/ostree-ext/0.5.1/ostree_ext/container/struct.OstreeImageReference.html
 	ostreeUnverifiedRegistry = "ostree-unverified-registry"
 	imageRegistry            = "image-registry.openshift-image-registry.svc:5000"
-	authfilePath             = "/etc/machine-config-daemon/authfile"
+	authfilePath             = "/run/ostree/auth.json"
 )
 
 type Deployments struct {
@@ -110,7 +110,7 @@ func TestBootInClusterImage(t *testing.T) {
 
 	// rpm-ostree rebase --experimental ostree-unverified-image:docker://image-registry.openshift-image-registry.svc.cluster.local:5000/openshift-machine-config-operator/test-boot-in-cluster-image-build
 	imageURL := fmt.Sprintf("%s:%s/%s/%s", ostreeUnverifiedRegistry, imageRegistry, constants.MCONamespace, imageStreamName)
-	helpers.ExecCmdOnNode(t, cs, infraNode, "chroot", "/rootfs", "rpm-ostree", "rebase", "--experimental", "--authfile", authfilePath, imageURL)
+	helpers.ExecCmdOnNode(t, cs, infraNode, "chroot", "/rootfs", "rpm-ostree", "rebase", "--experimental", imageURL)
 	// reboot
 	helpers.RebootAndWait(t, cs, infraNode)
 	// check that new image is used

?
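
For reference, a small sketch of the image reference that the `fmt.Sprintf` call in the diff builds from those constants. The namespace and image stream name passed in `main` are assumptions taken from the comment in the diff, since the values of `constants.MCONamespace` and `imageStreamName` are not shown in this thread; `imageReference` is a hypothetical helper.

```go
package main

import "fmt"

// Constants copied from the diff above.
const (
	ostreeUnverifiedRegistry = "ostree-unverified-registry"
	imageRegistry            = "image-registry.openshift-image-registry.svc:5000"
)

// imageReference (hypothetical helper) joins the transport prefix, registry,
// namespace, and image stream name using the same format string as the test.
func imageReference(namespace, imageStream string) string {
	return fmt.Sprintf("%s:%s/%s/%s", ostreeUnverifiedRegistry, imageRegistry, namespace, imageStream)
}

func main() {
	// Both arguments are assumed values, taken from the comment in the diff.
	fmt.Println(imageReference("openshift-machine-config-operator", "test-boot-in-cluster-image-build"))
}
```

Note that the result uses the `ostree-unverified-registry:` prefix and the short `svc:5000` host, which differs from the form shown in the code comment inside the diff.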

cgwalters · 2022-01-20T17:56:06Z

I've updated registry.ci.openshift.org/coreos/walters-rhcos-ostreecontainer-oldformat:latest with the code from ostreedev/ostree-rs-ext#213

mkenigs · 2022-01-20T19:04:05Z

Applied it! Should I go ahead and squash everything?

cgwalters · 2022-01-20T22:48:24Z

#2921 should fix the race in that e2e that is unrelated to this PR.

mkenigs · 2022-01-20T23:01:49Z

/test e2e-gcp-op

cgwalters · 2022-01-21T20:38:06Z

One node is still unschedulable, logs:

I0121 01:13:02.189032 1891 drain.go:44] Initiating uncordon on node (currently schedulable: false)
I0121 01:13:02.208130 1891 drain.go:62] RunCordonOrUncordon() succeeded but node is still not in uncordon state, retrying

cgwalters · 2022-01-21T21:04:30Z

We had a realtime chat on this. It's not clear to me whether the failure is really related, but it may be that the e2e test here needs to revert the node (via e.g. rpm-ostree rollback) back to the original image.

Another, better pattern that I think we should prototype further is using the machine API to spawn a new worker VM for each destructive test. (This would cost more money per PR, but would be more reliable.)

mkenigs · 2022-01-24T19:15:32Z

/test e2e-gcp-op

mkenigs · 2022-01-25T03:58:27Z

It looks like the failure for e2e-gcp-op might just be a timeout for the entire test suite:
panic: test timed out after 1h30m0s
That would also explain why the test itself passes.
Looks like this test alone is taking about 10m.

cgwalters · 2022-01-25T18:19:17Z

We can bump the test timeout; it won't be the first time. But we have a longer-term problem with the e2e tests: I think we need to parallelize them. I also think we should consider running them less often, since a ton of PRs to this repo have a low-to-zero chance of breaking our e2es. For example, all the PRs to the OVS scripts.

mkenigs · 2022-01-25T21:45:34Z

Do I need to do any of that before merging this? Or just bump the timeout for now? How would I do that?

cgwalters · 2022-01-25T21:52:05Z

Nah let's just try bumping the timeout, it looks like #2474

Add e2e test that
1. creates image stream and pushes build to that image stream
2. uses that build with rpm-ostree rebase
3. successfully reboots into that image

Closes https://issues.redhat.com/browse/MCO-127

Otherwise e2e tests fail with "panic: test timed out after 1h30m0s"

cgwalters · 2022-01-25T22:57:26Z

/lgtm
let's try it!

openshift-ci · 2022-01-25T22:58:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, mkenigs

Needs approval from an approver in each of these files:

~~OWNERS~~ [cgwalters]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2022-01-25T23:50:50Z

/retest-required