Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-14272: Race condition in TestMCDRotatesCertsOnPausedPool #3718

Merged

Conversation

djoshy
Copy link
Contributor

@djoshy djoshy commented May 30, 2023

- What I did
Added a retry so the test will attempt a few times to match certs before declaring a fail

- Description for the changelog
daemon: added a retry to TestMCDRotatesCertsOnPausedPool

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 30, 2023
@openshift-ci-robot
Copy link
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-14272, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did
Added a sleep so the test waits for all cert changes to propagate.

- Description for the changelog
daemon: added a sleep to TestMCDRotatesCertsOnPausedPool

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 30, 2023
@djoshy
Copy link
Contributor Author

djoshy commented May 30, 2023

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 30, 2023
@@ -953,6 +953,9 @@ func TestMCDRotatesCertsOnPausedPool(t *testing.T) {
require.Nil(t, err)
t.Logf("Certificate stuck behind paused (as expected)")

// wait a few seconds for certs to rotate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, to be more precise, the issue here isn't that the certs take 5 seconds to rotate. The issue here is that in the MCO, we have a 5 second "cooldown" period between generating new rendered configs and propagating them to pools, hardcoded into the controller.

For testing this particular rotation, the issue is that there are actually 2 certs that get rotated, and depending on timing, those could show up as separate rendered configs, which now I think about it, shouldn't affect this test, but maybe that same "cooldown" period is being applied to the controllerconfig?

The 5 second "hack" I proposed is to (mostly) eliminate the possibility of this so the config is more settled. Fundamentally though, I was hoping we could fix it via either:

  1. having a "smarter" way to detect whether the rotation has finished (and/or fix WaitForMCDToSyncCert)
  2. having a retry in the next steps, allowing transient failures to settle
  3. figure out where in the sync the delay is happening so we understand if this is an issue or not

I'm also fine with this fix as a temporary measure if we want to revisit this when we finish MCO-499, but I think it would be good to have extra comments on why this is being done so we don't forget the context

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I was not aware of the 5 second cooldown - could you point me to where in the code this lives? I'd like to poke a bit more! I had assumed it was mostly arbitrary and we were just giving it more than "enough" time to complete the propagation. Definitely would like to do a permanent fix than something temporary! (:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! So we have these updateDelay's scattered throughout the code, e.g. controller

and
updateDelay = 5 * time.Second

Although, now I look at the code again, I think that might have been a bit of a red herring in our case, since the controllerconfig sync should not have been gated by 5 seconds, nor the controllerconfig sync in the MCD. If I had to guess, maybe a 1-2 second sleep would have been enough as well?

The 5 second delay in the machineconfigs help us see the diff more clearly but I wonder where in the cert rotation path we are delayed

Copy link
Contributor Author

@djoshy djoshy Jun 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing with Jerry a bit more, we found the reason for our racy check to be the following:

  • prior to checking if the certs ondisk are a match to the configmap, we wait on the resourceVersion to change to the latest
  • the cert rotation could manifest as 6 resourceVersion changes(typically 3). We think this is dependent on how many config changes we generate & cluster state; and is not something we can easily control
  • if we end up matching the resourceVersion before it stabilizes to its final we'll end up with a failure during the comparison. This is the racy bit; its entirely possible we are not on the final resourceVersion when the condition passes here: https://github.com/openshift/machine-config-operator/blob/c50386e0700f4a7e3f322e3fb37000e9c3e5feb1/test/helpers/utils.go#LL358C4-L358C4
  • to work around this, we retry the cert match upto 3 times. This should be sufficient time for the certs to be settled

Here are some logs showing how we can stumble by checking against the wrong resourceVersion. In this case, we matched against 80392, instead of 80636

Daemon:

I0605 14:39:07.860107    2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80371
I0605 14:39:08.022898    2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80378
I0605 14:39:08.887493    2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80392
I0605 14:39:15.454745    2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80582
I0605 14:39:15.582593    2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80589
I0605 14:39:16.457232    2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80636
I0605 14:39:20.319356    2851 rpm-ostree.go:382] Running captured: rpm-ostree kargs
I0605 14:39:20.579933    2851 daemon.go:655] Preflight config drift check successful (took 492.952407ms)
I0605 14:39:20.584698    2851 config_drift_monitor.go:255] Config Drift Monitor has shut down

Test:

 === RUN TestMCDRotatesCertsOnPausedPool 
mcd_test.go:937: Pausing pool 
mcd_test.go:942: Paused 
mcd_test.go:945: Patching certificate 
mcd_test.go:948: Patched 
mcd_test.go:951: Waiting for rendered config to get stuck behind pause 
mcd_test.go:954: Certificate stuck behind paused (as expected) 
mcd_test.go:968: resource version synced: 80392 
mcd_test.go:973: 
Error Trace: /home/djoshy/projects/machine-config-operator/test/e2e/mcd_test.go:973 
Error: Not equal:  

@djoshy djoshy marked this pull request as draft May 30, 2023 16:52
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 30, 2023
@sergiordlr
Copy link

Run the e2e test and passed

=== RUN TestMCDRotatesCertsOnPausedPool
mcd_test.go:937: Pausing pool
mcd_test.go:942: Paused
mcd_test.go:945: Patching certificate
mcd_test.go:948: Patched
mcd_test.go:951: Waiting for rendered config to get stuck behind pause
mcd_test.go:954: Certificate stuck behind paused (as expected)
mcd_test.go:974: The cert was properly rotated behind pause
mcd_test.go:983: Unpausing pool
mcd_test.go:986: Waiting for config to sync after unpause...
--- PASS: TestMCDRotatesCertsOnPausedPool (89.83s)

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label May 31, 2023
@djoshy djoshy force-pushed the cert-rotate-test-fix branch from 2a8b2a7 to faebd2e Compare June 5, 2023 15:56
@djoshy
Copy link
Contributor Author

djoshy commented Jun 5, 2023

/retest-required

Copy link
Contributor

@sinnykumari sinnykumari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2023
@djoshy djoshy marked this pull request as ready for review June 5, 2023 20:18
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2023
@openshift-ci openshift-ci bot requested review from jkyros and sinnykumari June 5, 2023 20:20
Copy link
Contributor

@jkyros jkyros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, that makes sense, thanks David for fixing this!

@djoshy djoshy force-pushed the cert-rotate-test-fix branch from faebd2e to 26e266d Compare June 6, 2023 13:04
@djoshy
Copy link
Contributor Author

djoshy commented Jun 6, 2023

/retest-required

@yuqi-zhang
Copy link
Contributor

/lgtm

Thanks! Just as a FYI, we've made hypershift blocking since it is blocking release payloads now. There is a known issue with the test that's being fixed in openshift/hypershift#2636

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 6, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, jkyros, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [jkyros,sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD c50386e and 2 for PR HEAD 26e266d in total

@sergiordlr
Copy link

I run again the test case after the change.

=== RUN   TestMCDRotatesCertsOnPausedPool
    mcd_test.go:937: Pausing pool
    mcd_test.go:942: Paused
    mcd_test.go:945: Patching certificate
    mcd_test.go:948: Patched
    mcd_test.go:951: Waiting for rendered config to get stuck behind pause
    mcd_test.go:954: Certificate stuck behind paused (as expected)
    mcd_test.go:978: The cert was properly rotated behind pause
    mcd_test.go:987: Unpausing pool
    mcd_test.go:990: Waiting for config to sync after unpause...
--- PASS: TestMCDRotatesCertsOnPausedPool (91.53s)

No problems found. The qe label is already added.

@djoshy
Copy link
Contributor Author

djoshy commented Jun 6, 2023

/retest-required

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD dadba5b and 1 for PR HEAD 26e266d in total

@djoshy
Copy link
Contributor Author

djoshy commented Jun 7, 2023

/retest-required

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD dadba5b and 2 for PR HEAD 26e266d in total

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 8, 2023

@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-alibabacloud-ovn 26e266d link false /test e2e-alibabacloud-ovn
ci/prow/okd-scos-e2e-aws-ovn 26e266d link false /test okd-scos-e2e-aws-ovn
ci/prow/okd-scos-e2e-gcp-ovn-upgrade 26e266d link false /test okd-scos-e2e-gcp-ovn-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@yuqi-zhang
Copy link
Contributor

This has no effect on Hypershift and helps other e2e test runs. Going to override

/override ci/prow/e2e-hypershift

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 8, 2023

@yuqi-zhang: Overrode contexts on behalf of yuqi-zhang: ci/prow/e2e-hypershift

In response to this:

This has no effect on Hypershift and helps other e2e test runs. Going to override

/override ci/prow/e2e-hypershift

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit 5c99674 into openshift:master Jun 8, 2023
@openshift-ci-robot
Copy link
Contributor

@djoshy: Jira Issue OCPBUGS-14272: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-14272 has been moved to the MODIFIED state.

In response to this:

- What I did
Added a retry so the test will attempt a few times to match certs before declaring a fail

- Description for the changelog
daemon: added a retry to TestMCDRotatesCertsOnPausedPool

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@djoshy djoshy deleted the cert-rotate-test-fix branch June 8, 2023 14:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants