OCPBUGS-14272: Race condition in TestMCDRotatesCertsOnPausedPool #3718

djoshy · 2023-05-30T14:16:45Z

- What I did
Added a retry so the test will attempt a few times to match certs before declaring a fail

- Description for the changelog
daemon: added a retry to TestMCDRotatesCertsOnPausedPool

openshift-ci-robot · 2023-05-30T14:16:52Z

@djoshy: This pull request references Jira Issue OCPBUGS-14272, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.14.0) matches configured target version for branch (4.14.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did
Added a sleep so the test waits for all cert changes to propagate.

- Description for the changelog
daemon: added a sleep to TestMCDRotatesCertsOnPausedPool

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

djoshy · 2023-05-30T15:18:50Z

/hold cancel

yuqi-zhang · 2023-05-30T16:41:50Z

test/e2e/mcd_test.go

@@ -953,6 +953,9 @@ func TestMCDRotatesCertsOnPausedPool(t *testing.T) {
 	require.Nil(t, err)
 	t.Logf("Certificate stuck behind paused (as expected)")

+	// wait a few seconds for certs to rotate


So, to be more precise, the issue here isn't that the certs take 5 seconds to rotate. The issue here is that in the MCO, we have a 5 second "cooldown" period between generating new rendered configs and propagating them to pools, hardcoded into the controller.

For testing this particular rotation, the issue is that there are actually 2 certs that get rotated, and depending on timing, those could show up as separate rendered configs, which now I think about it, shouldn't affect this test, but maybe that same "cooldown" period is being applied to the controllerconfig?

The 5 second "hack" I proposed is to (mostly) eliminate the possibility of this so the config is more settled. Fundamentally though, I was hoping we could fix it via either:

having a "smarter" way to detect whether the rotation has finished (and/or fix WaitForMCDToSyncCert)

having a retry in the next steps, allowing transient failures to settle

figure out where in the sync the delay is happening so we understand if this is an issue or not

I'm also fine with this fix as a temporary measure if we want to revisit this when we finish MCO-499, but I think it would be good to have extra comments on why this is being done so we don't forget the context

Interesting, I was not aware of the 5 second cooldown - could you point me to where in the code this lives? I'd like to poke a bit more! I had assumed it was mostly arbitrary and we were just giving it more than "enough" time to complete the propagation. Definitely would like to do a permanent fix than something temporary! (:

Sure! So we have these updateDelay's scattered throughout the code, e.g. controller

machine-config-operator/pkg/controller/node/node_controller.go

Line 58 in 5843b7a

updateDelay = 5 * time.Second

and

machine-config-operator/pkg/daemon/daemon.go

Line 172 in 5843b7a

updateDelay = 5 * time.Second

Although, now I look at the code again, I think that might have been a bit of a red herring in our case, since the controllerconfig sync should not have been gated by 5 seconds, nor the controllerconfig sync in the MCD. If I had to guess, maybe a 1-2 second sleep would have been enough as well?

The 5 second delay in the machineconfigs help us see the diff more clearly but I wonder where in the cert rotation path we are delayed

After discussing with Jerry a bit more, we found the reason for our racy check to be the following:

prior to checking if the certs ondisk are a match to the configmap, we wait on the resourceVersion to change to the latest

the cert rotation could manifest as 6 resourceVersion changes(typically 3). We think this is dependent on how many config changes we generate & cluster state; and is not something we can easily control

if we end up matching the resourceVersion before it stabilizes to its final we'll end up with a failure during the comparison. This is the racy bit; its entirely possible we are not on the final resourceVersion when the condition passes here: https://github.com/openshift/machine-config-operator/blob/c50386e0700f4a7e3f322e3fb37000e9c3e5feb1/test/helpers/utils.go#LL358C4-L358C4

to work around this, we retry the cert match upto 3 times. This should be sufficient time for the certs to be settled

Here are some logs showing how we can stumble by checking against the wrong resourceVersion. In this case, we matched against 80392, instead of 80636

Daemon:

I0605 14:39:07.860107 2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80371 I0605 14:39:08.022898 2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80378 I0605 14:39:08.887493 2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80392 I0605 14:39:15.454745 2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80582 I0605 14:39:15.582593 2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80589 I0605 14:39:16.457232 2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80636 I0605 14:39:20.319356 2851 rpm-ostree.go:382] Running captured: rpm-ostree kargs I0605 14:39:20.579933 2851 daemon.go:655] Preflight config drift check successful (took 492.952407ms) I0605 14:39:20.584698 2851 config_drift_monitor.go:255] Config Drift Monitor has shut down

Test:

=== RUN TestMCDRotatesCertsOnPausedPool mcd_test.go:937: Pausing pool mcd_test.go:942: Paused mcd_test.go:945: Patching certificate mcd_test.go:948: Patched mcd_test.go:951: Waiting for rendered config to get stuck behind pause mcd_test.go:954: Certificate stuck behind paused (as expected) mcd_test.go:968: resource version synced: 80392 mcd_test.go:973: Error Trace: /home/djoshy/projects/machine-config-operator/test/e2e/mcd_test.go:973 Error: Not equal:

sergiordlr · 2023-05-31T14:51:49Z

Run the e2e test and passed

=== RUN TestMCDRotatesCertsOnPausedPool
mcd_test.go:937: Pausing pool
mcd_test.go:942: Paused
mcd_test.go:945: Patching certificate
mcd_test.go:948: Patched
mcd_test.go:951: Waiting for rendered config to get stuck behind pause
mcd_test.go:954: Certificate stuck behind paused (as expected)
mcd_test.go:974: The cert was properly rotated behind pause
mcd_test.go:983: Unpausing pool
mcd_test.go:986: Waiting for config to sync after unpause...
--- PASS: TestMCDRotatesCertsOnPausedPool (89.83s)

/label qe-approved

djoshy · 2023-06-05T16:00:25Z

/retest-required

sinnykumari

LGTM

jkyros

Hey, that makes sense, thanks David for fixing this!

test/e2e/mcd_test.go

djoshy · 2023-06-06T13:05:43Z

/retest-required

yuqi-zhang · 2023-06-06T15:41:01Z

/lgtm

Thanks! Just as a FYI, we've made hypershift blocking since it is blocking release payloads now. There is a known issue with the test that's being fixed in openshift/hypershift#2636

openshift-ci · 2023-06-06T15:43:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, jkyros, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jkyros,sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2023-06-06T15:50:40Z

/retest-required

Remaining retests: 0 against base HEAD c50386e and 2 for PR HEAD 26e266d in total

sergiordlr · 2023-06-06T17:27:17Z

I run again the test case after the change.

=== RUN   TestMCDRotatesCertsOnPausedPool
    mcd_test.go:937: Pausing pool
    mcd_test.go:942: Paused
    mcd_test.go:945: Patching certificate
    mcd_test.go:948: Patched
    mcd_test.go:951: Waiting for rendered config to get stuck behind pause
    mcd_test.go:954: Certificate stuck behind paused (as expected)
    mcd_test.go:978: The cert was properly rotated behind pause
    mcd_test.go:987: Unpausing pool
    mcd_test.go:990: Waiting for config to sync after unpause...
--- PASS: TestMCDRotatesCertsOnPausedPool (91.53s)

No problems found. The qe label is already added.

djoshy · 2023-06-06T18:15:35Z

/retest-required

openshift-ci-robot · 2023-06-06T20:54:14Z

/retest-required

Remaining retests: 0 against base HEAD dadba5b and 1 for PR HEAD 26e266d in total

djoshy · 2023-06-07T12:40:14Z

/retest-required

openshift-ci-robot · 2023-06-08T05:34:40Z

/retest-required

Remaining retests: 0 against base HEAD dadba5b and 2 for PR HEAD 26e266d in total

openshift-ci · 2023-06-08T07:11:38Z

@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-alibabacloud-ovn	`26e266d`	link	false	`/test e2e-alibabacloud-ovn`
ci/prow/okd-scos-e2e-aws-ovn	`26e266d`	link	false	`/test okd-scos-e2e-aws-ovn`
ci/prow/okd-scos-e2e-gcp-ovn-upgrade	`26e266d`	link	false	`/test okd-scos-e2e-gcp-ovn-upgrade`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

yuqi-zhang · 2023-06-08T14:25:55Z

This has no effect on Hypershift and helps other e2e test runs. Going to override

/override ci/prow/e2e-hypershift

openshift-ci · 2023-06-08T14:26:25Z

@yuqi-zhang: Overrode contexts on behalf of yuqi-zhang: ci/prow/e2e-hypershift

In response to this:

This has no effect on Hypershift and helps other e2e test runs. Going to override

/override ci/prow/e2e-hypershift

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2023-06-08T14:30:26Z

@djoshy: Jira Issue OCPBUGS-14272: All pull requests linked via external trackers have merged:

openshift/machine-config-operator#3718

Jira Issue OCPBUGS-14272 has been moved to the MODIFIED state.

In response to this:

- What I did
Added a retry so the test will attempt a few times to match certs before declaring a fail

- Description for the changelog
daemon: added a retry to TestMCDRotatesCertsOnPausedPool

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 30, 2023

openshift-ci bot requested review from sergiordlr, cgwalters and sinnykumari May 30, 2023 14:17

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 30, 2023

djoshy mentioned this pull request May 30, 2023

MCO-595: Remove MCO's pending config workflow #3700

Merged

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 30, 2023

yuqi-zhang reviewed May 30, 2023

View reviewed changes

djoshy marked this pull request as draft May 30, 2023 16:52

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 30, 2023

openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label May 31, 2023

djoshy force-pushed the cert-rotate-test-fix branch from 2a8b2a7 to faebd2e Compare June 5, 2023 15:56

sinnykumari approved these changes Jun 5, 2023

View reviewed changes

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2023

djoshy marked this pull request as ready for review June 5, 2023 20:18

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2023

openshift-ci bot requested review from jkyros and sinnykumari June 5, 2023 20:20

jkyros approved these changes Jun 5, 2023

View reviewed changes

yuqi-zhang reviewed Jun 5, 2023

View reviewed changes

test/e2e/mcd_test.go Show resolved Hide resolved

added a retry to TestMCDRotatesCertsOnPausedPool

26e266d

djoshy force-pushed the cert-rotate-test-fix branch from faebd2e to 26e266d Compare June 6, 2023 13:04

openshift-ci bot assigned yuqi-zhang Jun 6, 2023

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 6, 2023

sinnykumari mentioned this pull request Jun 8, 2023

OCPBUGS-14793: Allow userfaultfd syscall to be used by unprivileged users #3724

Merged

openshift-merge-robot merged commit 5c99674 into openshift:master Jun 8, 2023

djoshy deleted the cert-rotate-test-fix branch June 8, 2023 14:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OCPBUGS-14272: Race condition in TestMCDRotatesCertsOnPausedPool #3718

OCPBUGS-14272: Race condition in TestMCDRotatesCertsOnPausedPool #3718

djoshy commented May 30, 2023 •

edited

Loading

openshift-ci-robot commented May 30, 2023

djoshy commented May 30, 2023

yuqi-zhang May 30, 2023

djoshy May 30, 2023

yuqi-zhang May 30, 2023

djoshy Jun 5, 2023 •

edited

Loading

sergiordlr commented May 31, 2023

djoshy commented Jun 5, 2023

sinnykumari left a comment

jkyros left a comment

djoshy commented Jun 6, 2023

yuqi-zhang commented Jun 6, 2023

openshift-ci bot commented Jun 6, 2023

openshift-ci-robot commented Jun 6, 2023

sergiordlr commented Jun 6, 2023

djoshy commented Jun 6, 2023

openshift-ci-robot commented Jun 6, 2023

djoshy commented Jun 7, 2023

openshift-ci-robot commented Jun 8, 2023

openshift-ci bot commented Jun 8, 2023 •

edited

Loading

yuqi-zhang commented Jun 8, 2023

openshift-ci bot commented Jun 8, 2023

openshift-ci-robot commented Jun 8, 2023

OCPBUGS-14272: Race condition in TestMCDRotatesCertsOnPausedPool #3718

OCPBUGS-14272: Race condition in TestMCDRotatesCertsOnPausedPool #3718

Conversation

djoshy commented May 30, 2023 • edited Loading

openshift-ci-robot commented May 30, 2023

djoshy commented May 30, 2023

yuqi-zhang May 30, 2023

Choose a reason for hiding this comment

djoshy May 30, 2023

Choose a reason for hiding this comment

yuqi-zhang May 30, 2023

Choose a reason for hiding this comment

djoshy Jun 5, 2023 • edited Loading

Choose a reason for hiding this comment

sergiordlr commented May 31, 2023

djoshy commented Jun 5, 2023

sinnykumari left a comment

Choose a reason for hiding this comment

jkyros left a comment

Choose a reason for hiding this comment

djoshy commented Jun 6, 2023

yuqi-zhang commented Jun 6, 2023

openshift-ci bot commented Jun 6, 2023

openshift-ci-robot commented Jun 6, 2023

sergiordlr commented Jun 6, 2023

djoshy commented Jun 6, 2023

openshift-ci-robot commented Jun 6, 2023

djoshy commented Jun 7, 2023

openshift-ci-robot commented Jun 8, 2023

openshift-ci bot commented Jun 8, 2023 • edited Loading

yuqi-zhang commented Jun 8, 2023

openshift-ci bot commented Jun 8, 2023

openshift-ci-robot commented Jun 8, 2023

djoshy commented May 30, 2023 •

edited

Loading

djoshy Jun 5, 2023 •

edited

Loading

openshift-ci bot commented Jun 8, 2023 •

edited

Loading