
adds openshift image derivation test #746

Conversation

@cheesesashimi (Member) commented Mar 14, 2022

This PR introduces the OpenShift in-cluster image derivation test: a new test that verifies in-cluster image derivation works as we intend. The test works as follows (a rough sketch follows the list):

  1. Grab the latest RHCOS image that was built just prior to test execution.
  2. Build a derived image from the aforementioned RHCOS image.
  3. Select a cluster node to apply the image to.
  4. Apply the new image using rpm-ostree.
  5. Reboot the cluster node.
  6. Verify that the cluster node came up successfully.
  7. Roll back applying the derived image.
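In rough Go pseudocode, the flow looks something like this. The helper names below are hypothetical stand-ins for illustration only; rollbackToOriginalImage is the only one that corresponds to a function that actually appears in this diff:

func TestBootInClusterImage(t *testing.T) {
  cs := framework.NewClientSet("") // assumed constructor for the test framework ClientSet used throughout this PR

  // 1-2. Build a derived image on top of the most recently built RHCOS image.
  derivedImage := buildDerivedImage(t, cs) // hypothetical helper

  // 3. Select a cluster node to apply the derived image to.
  node := pickTargetNode(t, cs) // hypothetical helper

  // 4-5. Rebase the node onto the derived image with rpm-ostree, then reboot it.
  applyDerivedImage(t, cs, node, derivedImage) // hypothetical helper
  rebootAndWait(t, cs, node)                   // hypothetical helper

  // 6. Verify that the node came back up running the derived image.
  assertNodeBootedImage(t, cs, node, derivedImage) // hypothetical helper

  // 7. Roll back to the original OS image (rollbackToOriginalImage appears later in this diff).
  require.Nil(t, rollbackToOriginalImage(t, cs, node))
  rebootAndWait(t, cs, node)
}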

This is a deceptively large PR because it introduces a bunch of vendored dependencies. For ease of review, I added the vendored dependencies in a commit by themselves. Additionally, since I wasn't sure what impact my changes would have on the coreos-assembler image, I opted for a multi-stage Dockerfile build so I could inject a pre-compiled version of the layering test suite into the coreos-assembler image.

The test itself originally came from openshift/machine-config-operator#2886.

@openshift-ci bot (Contributor) commented Mar 14, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 14, 2022
@openshift-ci openshift-ci bot requested review from ashcrow and mike-nguyen March 14, 2022 17:54
@cheesesashimi cheesesashimi force-pushed the zzlotnik/add-openshift-derivation-test branch from baabcb1 to ac1bedf Compare March 14, 2022 18:00
// will get cleaned up on reboot since file is in /run
writeToMCDContainer(t, cs, infraNode, path.Join("/rootfs", authfilePath), authfileData)

// rpm-ostree rebase --experimental ostree-unverified-image:docker://image-registry.openshift-image-registry.svc.cluster.local:5000/openshift-machine-config-operator/test-boot-in-cluster-image-build
@mkenigs Mar 14, 2022

Suggested change
// rpm-ostree rebase --experimental ostree-unverified-image:docker://image-registry.openshift-image-registry.svc.cluster.local:5000/openshift-machine-config-operator/test-boot-in-cluster-image-build
// rpm-ostree rebase --experimental ostree-unverified-image:image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/test-boot-in-cluster-image

or maybe just

Suggested change
// rpm-ostree rebase --experimental ostree-unverified-image:docker://image-registry.openshift-image-registry.svc.cluster.local:5000/openshift-machine-config-operator/test-boot-in-cluster-image-build
// rpm-ostree rebase --experimental imageURL

to avoid similar inconsistencies in the future

Ironically this inconsistency was a result of my own negligence 😆 🙈

Member Author

No worries! I'm cleaning this up as I go.

@jlebon (Member) commented Mar 14, 2022

Cool!

The test itself originally came from openshift/machine-config-operator#2886.

Is the test exactly the same? Instead of adding Go vendoring here, could we have the builder also check out the MCO repo and build the test from there?

@cheesesashimi (Member, Author) commented Mar 14, 2022

Is the test exactly the same? Instead of adding Go vendoring here, could we have the builder also check out the MCO repo and build the test from there?

The test currently lives in the MCBS branch, which was primarily used as a POC branch when we started experimenting with the CoreOS layering work. Unfortunately, that branch no longer has CI attached to it and is no longer maintained.

As for the test itself, I've made some local changes now that we know more about the overall CI setup, along with some refactoring and cleanup. As for the dependencies, I only vendored them because that seems to be the OpenShift / CoreOS convention; I wasn't sure if I could get away with just a go.mod / go.sum file.

I'll mention that the test itself does not directly test anything involving the MCO; we just have a couple of test libraries that I thought would be useful for this. I could probably duplicate a subset of those libraries here as a way of reducing the number of dependencies.

@cgwalters (Member)

So... this is definitely tricky. As of right now, this repository is only consumed by coreos-assembler, which already has a lot of Go code for tests. Adding new Go code here means there's another place to handle vendor bumps, etc.

Another balance we try to handle though is that we don't have a hard dependency for "coreos stuff" on OpenShift or even Kubernetes. (e.g. coreos-assembler can also be run via podman)

WDYT about having this in e.g. https://github.com/openshift/origin/tree/master/test/extended instead? I think we can wire things up so that the new test is only run on this repo to start - but it would mean we could also more easily run it as a periodic outside of PRs to this repo?

That said, there are advantages to having tests live with the repository. A big part of the reason the CoreOS external tests were created is that the tests can live inside the config repo (like this one), which (for FCOS) also versions the content.

I think I lean on balance towards adding this to the openshift tests, but it's by no means a strongly held opinion. Another counterbalance could be adding more tests in Go code here.

@cgwalters (Member)

Or, to say this another way: what has me hesitating a bit is that today the output of a build of this repository is a "cosa build", i.e. the oscontainer image and disk images.

Now we do have a Dockerfile, but that's mainly to "fake out" Prow and try to get "a build" of stuff together - we don't actually ship the result of that build.

But if we're building a test, I think there are definitely cases where we'd want to build that test outside of building the repo - which is how the origin tests work.

If we were to go down this path, perhaps it should be fedora-coreos-config that has a Go testing framework, and this would add RHCOS tests to it? That would be a new middle ground between "external tests are in shell script" and "cosa tests, which can be Go, but are lifecycled with cosa, not the OS".

Anyone else have opinions?

@jlebon (Member) commented Mar 15, 2022

Personally, I don't really mind Go code being used for tests in this repo or in f-c-c. We've discussed in the past that if we want to keep draining tests out of kola and into the config repos, we'd probably need "cosa build"-level testing there to make kola testiso work in that model, and I can see Go being a good fit for that and for sharing with cosa.

I really dislike the Go vendoring model though. Maybe if we go that way, we can find some way to use the vendoring from cosa and not ship any vendoring in the config repos (which might be what you're implying already with "lifecycled with cosa").

But regarding this test specifically: since this is a cluster-level test (which, unlike kola tests, isn't really something developers are expected to easily run locally, right?) and OpenShift CI falls over anyway if GitHub is down, could we just ship go.sum and go.mod? I think that'd make this much easier to swallow.

@cheesesashimi (Member, Author)

I'm torn between this belonging here or in openshift/origin, since this is a cluster-level test that exercises the interaction between RHCOS and various parts of OpenShift. I see benefits and drawbacks to both of those repos and I'm not particularly opinionated about it going in either place.

However, I'm not sure it belongs with coreos-assembler or even fedora-coreos-config, since those seem geared toward OS-specific tests that exercise and validate various aspects of CoreOS. But I do see the value of vendoring the dependencies for this test within either of those repos. That said, I share @jlebon's dislike of the Go vendor model and think we can get away with only the go.mod / go.sum files in this repo.

@cheesesashimi cheesesashimi force-pushed the zzlotnik/add-openshift-derivation-test branch from ac1bedf to 461694b Compare March 16, 2022 14:07
@cgwalters (Member)

/approve
OK, I'm good with this as is.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2022
@jlebon (Member) commented Mar 17, 2022

Also good with this as is.
/approve

Minor suggestion on layout: add a tests/go subdir and put the go.mod and go.sum files and the layering/ subdir in there.

@cheesesashimi (Member, Author)

Cool, thanks! Will make those changes and tidy up the test code. Will also update when this is ready for a final review.

@ashcrow ashcrow requested review from kikisdeliveryservice and removed request for ashcrow March 23, 2022 17:42
@cheesesashimi cheesesashimi force-pushed the zzlotnik/add-openshift-derivation-test branch from dc5bf39 to 0e0307d Compare March 29, 2022 20:17
@cheesesashimi cheesesashimi marked this pull request as ready for review March 29, 2022 20:20
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 29, 2022
@cheesesashimi (Member, Author)

This is now ready for a more in-depth review.

@openshift-ci openshift-ci bot requested review from aaradhak and gursewak1997 March 29, 2022 20:21
// If this moves from /run, make sure files get cleaned up
authfilePath = "/run/ostree/auth.json"
mcoNamespace = "openshift-machine-config-operator"
imagePullSpec = "registry.ci.openshift.org/rhcos-devel/rhel-coreos:%s"
Contributor

q: why %s here?

Member Author

This value gets replaced with either latest or the OCP release (e.g., 4.11) using https://github.com/openshift/os/pull/746/files#diff-58ffaa9005e1e047a4897d0a0d8a414238a5334c730449caa62fbd8fca7cb34dR36.

This occurs here: https://github.com/openshift/os/pull/746/files#diff-58ffaa9005e1e047a4897d0a0d8a414238a5334c730449caa62fbd8fca7cb34dR75.

My reasoning was that this could run on all of the release branches as well as master, so it should pull the appropriate image for that branch.
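To make the substitution concrete, here is a hedged sketch of how the %s slot might get filled. The RELEASE_VERSION env var name and the latest fallback are assumptions for illustration, not necessarily the exact logic in helpers_test.go:

// Hedged sketch only: derive the image tag from the CI environment,
// falling back to "latest". The env var name is an assumption.
func getBaseImageTag() string {
  if tag, ok := os.LookupEnv("RELEASE_VERSION"); ok && tag != "" {
    return tag
  }
  return "latest"
}

func baseImagePullSpec() string {
  // e.g., registry.ci.openshift.org/rhcos-devel/rhel-coreos:4.11
  return fmt.Sprintf(imagePullSpec, getBaseImageTag())
}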

Contributor

For some reason the link isn't working; what line is it? (Just for my own info, when you get a chance.)


tests/layering/helpers_test.go:75


// We haven't started the container yet
if *status.Started != true {
return false
Contributor

q: probably obvious but why a pointer?

Member Author

The Started field is a pointer, and one has to dereference it in order to check whether it's true or false. I probably could've done something like:

if status.Started != nil && *status.Started != true {
  return false
}

Contributor

I was thinking the same thing on that one :)

// get ImagePullSecret for the MCD service account and save to authfilePath on the node
// eventually we should use rest.InClusterConfig() instead of cs with kubeadmin
func putPullSecretOnNode(t *testing.T, cs *framework.ClientSet, node *corev1.Node) func() {
t.Logf("placing image pull secret on node")
Contributor

Suggested change
t.Logf("placing image pull secret on node")
t.Log("placing image pull secret on node")


return func() {
// Clean up the pull secret from the node if the node has not rebooted and we do not want to delete it.
t.Logf("cleaning up pull secret from node")
Contributor

Logf -> Log, or alternatively add "node: %s", node.Name (or something)

Member Author

I made that mistake a lot; sorry! I'll get them all fixed and make a note to tune my linter.

// Rolls back to the original OS image that was on the node.
func rollbackToOriginalImage(t *testing.T, cs *framework.ClientSet, node *corev1.Node) error {
// Apply the rollback to the previous image
t.Logf("rolling back to previous OS image")
@kikisdeliveryservice (Contributor) Mar 29, 2022

:) samesies as above (there are a couple more below as well)

t.Logf("rebooting %s", node.Name)

// Get an updated node object since this one may potentially be out-of-date
updatedNode, err := cs.Nodes().Get(context.TODO(), node.ObjectMeta.Name, metav1.GetOptions{})
Contributor

Suggested change
updatedNode, err := cs.Nodes().Get(context.TODO(), node.ObjectMeta.Name, metav1.GetOptions{})
updatedNode, err := cs.Nodes().Get(context.TODO(), node.Name, metav1.GetOptions{})

You shouldn't need ObjectMeta to get the name for a node of type corev1.Node; for example, in our repo: https://github.com/openshift/machine-config-operator/blob/9c6c2bfd7ed498bfbc296d530d1839bd6a177b0b/pkg/daemon/node.go#L24

Contributor

wait, you use node.Name right above here (line 309) too 😄

Member Author

Yeah, I'll go through and fix all of those. I'm not sure why I did that ¯\_(ツ)_/¯


startTime := time.Now()
err = wait.Poll(2*time.Second, 10*time.Minute, func() (bool, error) {
n, err := cs.Nodes().Get(context.TODO(), node.ObjectMeta.Name, metav1.GetOptions{})
Contributor

Same as above re: ObjectMeta.
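For context, here is a hedged sketch of what the full poll loop might look like once it uses node.Name and waits for the node to report Ready after the reboot; the readiness-condition check is an assumption for illustration, not necessarily the PR's exact logic:

startTime := time.Now()
err = wait.Poll(2*time.Second, 10*time.Minute, func() (bool, error) {
  n, err := cs.Nodes().Get(context.TODO(), node.Name, metav1.GetOptions{})
  if err != nil {
    // Tolerate transient API errors while the node reboots.
    return false, nil
  }
  for _, cond := range n.Status.Conditions {
    if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
      return true, nil
    }
  }
  return false, nil
})
require.Nil(t, err)
t.Logf("node %s became ready again after %s", node.Name, time.Since(startTime))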

@@ -0,0 +1,436 @@
package e2e_test
Member Author

I wanted to mention that I may eventually want to port this library into the MCO's testing library. With this in place, we could reduce our need to shell out to oc to execute arbitrary commands on a node.

Contributor

Good idea. Along those lines, can you add an OWNERS file to tests/layering?

return false
}

combined := string(cr.Stdout) + string(cr.Stderr)
Contributor

Do we need a space or colon or something here, i.e. ... + " " + string(...)?

Member Author

I don't think we do, but it couldn't hurt!

Member Author

I have a slightly better idea to do that, somewhat inspired by how os/exec does it.
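As a rough illustration of that direction (an assumption on my part, not the final code), combining the two streams with a separator could look something like this:

// Hedged sketch: join stdout and stderr with a newline between them, loosely
// modeled on how (*exec.Cmd).CombinedOutput funnels both streams into one buffer.
func combinedOutput(stdout, stderr []byte) string {
  var out strings.Builder
  out.Write(stdout)
  if len(stdout) > 0 && len(stderr) > 0 {
    out.WriteString("\n")
  }
  out.Write(stderr)
  return out.String()
}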

Comment on lines +35 to +43
type Deployments struct {
Booted bool `json:"booted"`
ContainerImageReference string `json:"container-image-reference"`
}

type Status struct {
Deployments []Deployments `json:"deployments"`
}


We really need to make this a library somewhere 😳
jkyros/machine-config-operator#1 (comment)


We also have a duplicated getRPMOStreeStatus
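For reference, a minimal sketch of what such a shared helper could look like, assuming the input is the JSON emitted by rpm-ostree status --json and using the Status / Deployments types shown above (illustrative only):

// Hedged sketch of a shared helper: unmarshal `rpm-ostree status --json`
// output and return the currently booted deployment, if any.
func getBootedDeployment(statusJSON []byte) (*Deployments, error) {
  var status Status
  if err := json.Unmarshal(statusJSON, &status); err != nil {
    return nil, fmt.Errorf("parsing rpm-ostree status JSON: %w", err)
  }
  for i := range status.Deployments {
    if status.Deployments[i].Booted {
      return &status.Deployments[i], nil
    }
  }
  return nil, fmt.Errorf("no booted deployment found")
}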


if k8sErrors.IsAlreadyExists(err) {
// We've already got an imagestream matching this name
imagestream, err := cs.ImageStreams(mcoNamespace).Get(context.TODO(), imageStreamName, metav1.GetOptions{})

Not sure how much it matters, but I've been asked before to do ctx := context.TODO() and then use ctx when calling context.TODO() repeatedly. You have that here and in a few other places.

Member Author

What you suggested is good practice and something I should have been doing. Will fix.

},
}

_, err := cs.ImageStreams(mcoNamespace).Create(context.TODO(), imageStreamConfig, metav1.CreateOptions{})

Might be my imagination but it feels like a slightly more common pattern to call Get first and then Create second if it doesn't exist

Member Author

I could see that being the case. If that's what's more common, I'll adopt that pattern instead.


I dunno if it actually is though, I bet @kikisdeliveryservice would know?

Contributor

I'm a little confused because I can't find this line anymore, but it depends on what you're doing (which I can't tell because I can't find the line).
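For what it's worth, a hedged sketch of the Get-then-Create ordering being discussed, reusing a single ctx per the earlier suggestion; the function shape and the imagev1 import alias are assumptions for illustration:

// Hedged sketch: only Create the imagestream when Get reports NotFound.
func ensureImageStream(cs *framework.ClientSet, is *imagev1.ImageStream) (*imagev1.ImageStream, error) {
  ctx := context.TODO()
  existing, err := cs.ImageStreams(mcoNamespace).Get(ctx, is.Name, metav1.GetOptions{})
  if err == nil {
    return existing, nil
  }
  if k8sErrors.IsNotFound(err) {
    return cs.ImageStreams(mcoNamespace).Create(ctx, is, metav1.CreateOptions{})
  }
  return nil, err
}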

}
}

baseImageBuildArg := fmt.Sprintf(imagePullSpec, getBaseImageTag())

How come you're grabbing this from an env var?

Member Author

This is intended to run in a CI context where this information is readily available via env vars.

Comment on lines 122 to 125
build, err := cs.BuildV1Interface.Builds(mcoNamespace).Get(context.TODO(), buildConfig.Name, metav1.GetOptions{})
if err != nil {
return fmt.Errorf("could not get build %s: %w", build.Name, err)
}
@mkenigs Mar 30, 2022

Does the Create call already return the build? So you could drop the Get?

Member Author

Sort of. I don't think it's a complete build object. In other words, if you didn't supply some metadata, what it returns may be missing that metadata. However, I think that may be fine for what I'm doing with it.

require.Nil(t, writeFileToNode(t, cs, node, authfileData, authfilePath))

return func() {
// Clean up the pull secret from the node if the node has not rebooted and we do not want to delete it.

Suggested change
// Clean up the pull secret from the node if the node has not rebooted and we do not want to delete it.
// Allow cleaning up the pull secret from the node if the node will not be rebooted or deleted

defer func() {
// The pull secret should be cleared from the node upon reboot since
// it is placed into the /run directory. However, if the node does
// not reboot and we're not deleting it, we should clean it up.

I got confused about whether it was the pull secret or the node, maybe

Suggested change
// not reboot and we're not deleting it, we should clean it up.
// not reboot or get deleted, we should clean it up.

@cheesesashimi cheesesashimi force-pushed the zzlotnik/add-openshift-derivation-test branch from 3a8c8ca to 7e24139 Compare March 30, 2022 21:39
- Be more consistent with logf and object fields
- Handle exit code 147 on reboots
- Clarify comments
- Build object cleanups
- Consistently combine stdout and stderr
- Nodecmd cleanups
- Refactored build derivation code
- Replace context.TODO
- Allow derivation repo control via env vars
- Show additional info about builds
@cheesesashimi cheesesashimi force-pushed the zzlotnik/add-openshift-derivation-test branch from 7e24139 to 1fd41a9 Compare March 30, 2022 21:41
@cheesesashimi (Member, Author)

After a bunch of refactoring, I think this is ready for a final review.

@cgwalters (Member) left a comment

Oops I started a review but forgot to click submit.
Nothing blocking though!

imageURL = ostreeUnverifiedRegistry + ":" + imageRegistry + "/" + mcoNamespace + "/" + imageStreamName
)

type Deployments struct {
Member

This is totally fine as is but we should do coreos/rpm-ostree#2389 at some point and share this between this test and the MCO.

}()

// We don't have any retries, so just run the command.
if runOpts.Retries <= 1 {
Member

In a chat with Jerry re: HyperShift, one thought I had is that we could use a Job (https://kubernetes.io/docs/concepts/workloads/controllers/job/) instead of a raw pod.

Member Author

I was thinking about that too. I do want to share what's in this file with the MCO test suite because doing so would allow us to completely eliminate shelling out to oc.
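To sketch the Job idea (an alternative, not what this PR does), the Job controller would then own retries via BackoffLimit; the names, image, and command below are placeholders:

// Hedged sketch: wrap the node command in a batchv1.Job instead of a raw pod,
// letting the Job controller handle retries. Creating and watching the Job is omitted.
func nodeCmdJob(node *corev1.Node, command []string) *batchv1.Job {
  backoff := int32(2)
  return &batchv1.Job{
    ObjectMeta: metav1.ObjectMeta{
      GenerateName: "layering-node-cmd-",
      Namespace:    mcoNamespace,
    },
    Spec: batchv1.JobSpec{
      BackoffLimit: &backoff,
      Template: corev1.PodTemplateSpec{
        Spec: corev1.PodSpec{
          RestartPolicy: corev1.RestartPolicyNever,
          NodeName:      node.Name,
          Containers: []corev1.Container{{
            Name:    "node-cmd",
            Image:   "placeholder-image-with-a-shell",
            Command: command,
          }},
        },
      },
    },
  }
}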

@cgwalters (Member)

Is it correct the state of things that this just adds the test, but nothing executes it today?

Are you thinking we'd run the test in the periodic?

Nothing stops us from having an opt-in
/test derivation
here too?

OK but actually, don't we need to wire this test up to "launch a cluster" bits?

@cgwalters (Member)

From my PoV though I'm good to /lgtm and iterate further after merge. Will leave open for a bit in case anyone else has comments.

@cheesesashimi (Member, Author)

Is it correct the state of things that this just adds the test, but nothing executes it today?

Yup. This just adds the test for now.

Are you thinking we'd run the test in the periodic?

That was my plan. I think it would also be good to run on PRs as you mention below this.

OK but actually, don't we need to wire this test up to "launch a cluster" bits?

Correct. That's being worked on in openshift/release#26912. The "magic" bit over there is the workflow: ipi-gcp line, which causes the OpenShift CI system to provision a cluster.

@kikisdeliveryservice (Contributor)

I can give this a final pass this afternoon

@kikisdeliveryservice (Contributor)

This looks great, Zack.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2022
@openshift-ci bot (Contributor) commented Mar 31, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, cheesesashimi, jlebon, kikisdeliveryservice

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot (Contributor) commented Apr 1, 2022

@cheesesashimi: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
