Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create PoC for booting from in-cluster built image #2886

Merged
merged 3 commits into from
Jan 26, 2022

Conversation

mkenigs
Copy link
Contributor

@mkenigs mkenigs commented Dec 21, 2021

Add e2e test that

  1. creates image stream and pushes build to that image stream
  2. uses that build with rpm-ostree rebase
  3. successfully reboots into that image

Closes https://issues.redhat.com/browse/MCO-127

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 21, 2021

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 21, 2021
@mkenigs
Copy link
Contributor Author

mkenigs commented Dec 21, 2021

/cc @cgwalters

@openshift-ci openshift-ci bot requested a review from cgwalters December 21, 2021 19:30
@mkenigs
Copy link
Contributor Author

mkenigs commented Dec 21, 2021

Right now I'm using env variables for kubeadm password, which I'll fix once we set up authentication properly

Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this! I had been thinking of this e2e as just a shell script basically running oc but having this in Go is going to greatly help migrating this code into the MCD and MCO and also correctly handle state reconciliation/monitoring etc.

test/e2e/layering_test.go Outdated Show resolved Hide resolved
test/e2e/layering_test.go Outdated Show resolved Hide resolved
test/e2e/layering_test.go Outdated Show resolved Hide resolved
test/e2e/layering_test.go Outdated Show resolved Hide resolved
Comment on lines 87 to 88
changesQueued := "Changes queued for next boot. Run \"systemctl reboot\" to start a reboot"
require.Contains(t, rebase, changesQueued)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is just a test, but I would like the ability to change the English text output from rpm-ostree without breaking the MCO's test suite.

I think we can probably just drop this chunk. But if we do want to be explicit, we can use rpm-ostree status --json and check for a queued pending deployment. The MCD has code for this already. See also coreos/rpm-ostree#2389

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda feels like overkill but added something to call rpm-ostree status --json both here and after rollback

test/e2e/layering_test.go Outdated Show resolved Hide resolved
test/e2e/layering_test.go Outdated Show resolved Hide resolved
test/helpers/utils.go Show resolved Hide resolved
@mkenigs mkenigs force-pushed the 127-poc branch 3 times, most recently from ce2ac1a to 2fd8ebd Compare December 23, 2021 16:28
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 17, 2022
@cgwalters
Copy link
Member

@mkenigs do you want to rebase this? Why didn't we get it in before? CI flow problems?

@mkenigs
Copy link
Contributor Author

mkenigs commented Jan 18, 2022

I updated it to use the MCD service account but never tested that flow because I was waiting for the newer rpm-ostree

@cgwalters
Copy link
Member

I updated it to use the MCD service account but never tested that flow because I was waiting for the newer rpm-ostree

Ah got it, that should be available now

@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 18, 2022
@cgwalters
Copy link
Member

/test all

@cgwalters
Copy link
Member

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 19, 2022
@mkenigs mkenigs marked this pull request as ready for review January 19, 2022 21:55
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 19, 2022
@mkenigs
Copy link
Contributor Author

mkenigs commented Jan 20, 2022

@cgwalters the test added by this PR is failing with:
error: Unknown option --authfile
I'm guessing CI still has the old rpm-ostree?

@cgwalters
Copy link
Member

OK right. This is because while we plumbed the options up to the ostree(ext) CLI, we didn't into rpm-ostree rebase which is a separate code path.

I think per discussion, what we really want is config files anyways. So I did
ostreedev/ostree-rs-ext#213

With that, the MCO can just write /run/ostree/auth.json and that's it, no need to pass it on the CLI too.

@cgwalters
Copy link
Member

@mkenigs can you apply this:

diff --git a/test/e2e/layering_test.go b/test/e2e/layering_test.go
index 81d5b5830..44215e8f8 100644
--- a/test/e2e/layering_test.go
+++ b/test/e2e/layering_test.go
@@ -28,7 +28,7 @@ const (
 	// See https://docs.rs/ostree-ext/0.5.1/ostree_ext/container/struct.OstreeImageReference.html
 	ostreeUnverifiedRegistry = "ostree-unverified-registry"
 	imageRegistry            = "image-registry.openshift-image-registry.svc:5000"
-	authfilePath             = "/etc/machine-config-daemon/authfile"
+	authfilePath             = "/run/ostree/auth.json"
 )
 
 type Deployments struct {
@@ -110,7 +110,7 @@ func TestBootInClusterImage(t *testing.T) {
 
 	// rpm-ostree rebase --experimental ostree-unverified-image:docker://image-registry.openshift-image-registry.svc.cluster.local:5000/openshift-machine-config-operator/test-boot-in-cluster-image-build
 	imageURL := fmt.Sprintf("%s:%s/%s/%s", ostreeUnverifiedRegistry, imageRegistry, constants.MCONamespace, imageStreamName)
-	helpers.ExecCmdOnNode(t, cs, infraNode, "chroot", "/rootfs", "rpm-ostree", "rebase", "--experimental", "--authfile", authfilePath, imageURL)
+	helpers.ExecCmdOnNode(t, cs, infraNode, "chroot", "/rootfs", "rpm-ostree", "rebase", "--experimental", imageURL)
 	// reboot
 	helpers.RebootAndWait(t, cs, infraNode)
 	// check that new image is used

?

@cgwalters
Copy link
Member

I've updated registry.ci.openshift.org/coreos/walters-rhcos-ostreecontainer-oldformat:latest with the code from ostreedev/ostree-rs-ext#213

@mkenigs
Copy link
Contributor Author

mkenigs commented Jan 20, 2022

Applied it! Should I go ahead and squash everything?

@cgwalters
Copy link
Member

#2921
should fix the race in that e2e that is unrelated to this PR.

@mkenigs
Copy link
Contributor Author

mkenigs commented Jan 20, 2022

/test e2e-gcp-op

@cgwalters
Copy link
Member

One node is still unschedulable, logs:

I0121 01:13:02.189032 1891 drain.go:44] Initiating uncordon on node (currently schedulable: false)
I0121 01:13:02.208130 1891 drain.go:62] RunCordonOrUncordon() succeeded but node is still not in uncordon state, retrying

@cgwalters
Copy link
Member

We had a realtime chat on this, it's not clear to me if the failure is really related, but it may be that we need to have the e2e test here revert the node (via e.g. rpm-ostree rollback) back to the original image.

Another better pattern I think we should prototype out more here is using machine API to spawn a new worker VM that is allocated per each destructive test. (This would cost more money per PR, but be more reliable)

@mkenigs
Copy link
Contributor Author

mkenigs commented Jan 24, 2022

/test e2e-gcp-op

@mkenigs
Copy link
Contributor Author

mkenigs commented Jan 25, 2022

It looks like the failure for e2e-gcp-op might just be a timeout for the entire test suite:
panic: test timed out after 1h30m0s
That would also explain why the test itself passes
Looks like this test is taking about 10m

@cgwalters
Copy link
Member

We can bump the test timeout; won't be the first time. But we have a longer term problem with the e2e tests - I think we need to parallelize them. And I also think we should consider not running them less often - a ton of PRs to this repo have low-to-zero chance to break our e2es. For example, all the PRs to the OVS scripts.

@mkenigs
Copy link
Contributor Author

mkenigs commented Jan 25, 2022

Do I need to do any of that before merging this? Or just bump the timeout for now? How would I do that?

@cgwalters
Copy link
Member

Nah let's just try bumping the timeout, it looks like #2474

Add e2e test that
1. creates image stream and pushes build to that image stream
2. uses that build with rpm-ostree rebase
3. successfully reboots into that image

Closes https://issues.redhat.com/browse/MCO-127
Otherwise e2e tests fail with "panic: test timed out after 1h30m0s"
@cgwalters
Copy link
Member

/lgtm
let's try it!

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 25, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 25, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, mkenigs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

4 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 26, 2022

@mkenigs: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-op-single-node 7eac5c1 link false /test e2e-gcp-op-single-node

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit ff2a39a into openshift:mcbs Jan 26, 2022
@mkenigs mkenigs deleted the 127-poc branch March 18, 2022 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. layering lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants