-
Notifications
You must be signed in to change notification settings - Fork 413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-14272: Race condition in TestMCDRotatesCertsOnPausedPool #3718
OCPBUGS-14272: Race condition in TestMCDRotatesCertsOnPausedPool #3718
Conversation
@djoshy: This pull request references Jira Issue OCPBUGS-14272, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/hold cancel |
test/e2e/mcd_test.go
Outdated
@@ -953,6 +953,9 @@ func TestMCDRotatesCertsOnPausedPool(t *testing.T) { | |||
require.Nil(t, err) | |||
t.Logf("Certificate stuck behind paused (as expected)") | |||
|
|||
// wait a few seconds for certs to rotate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, to be more precise, the issue here isn't that the certs take 5 seconds to rotate. The issue here is that in the MCO, we have a 5 second "cooldown" period between generating new rendered configs and propagating them to pools, hardcoded into the controller.
For testing this particular rotation, the issue is that there are actually 2 certs that get rotated, and depending on timing, those could show up as separate rendered configs, which now I think about it, shouldn't affect this test, but maybe that same "cooldown" period is being applied to the controllerconfig?
The 5 second "hack" I proposed is to (mostly) eliminate the possibility of this so the config is more settled. Fundamentally though, I was hoping we could fix it via either:
- having a "smarter" way to detect whether the rotation has finished (and/or fix WaitForMCDToSyncCert)
- having a retry in the next steps, allowing transient failures to settle
- figure out where in the sync the delay is happening so we understand if this is an issue or not
I'm also fine with this fix as a temporary measure if we want to revisit this when we finish MCO-499, but I think it would be good to have extra comments on why this is being done so we don't forget the context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, I was not aware of the 5 second cooldown - could you point me to where in the code this lives? I'd like to poke a bit more! I had assumed it was mostly arbitrary and we were just giving it more than "enough" time to complete the propagation. Definitely would like to do a permanent fix than something temporary! (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! So we have these updateDelay's scattered throughout the code, e.g. controller
updateDelay = 5 * time.Second |
machine-config-operator/pkg/daemon/daemon.go
Line 172 in 5843b7a
updateDelay = 5 * time.Second |
Although, now I look at the code again, I think that might have been a bit of a red herring in our case, since the controllerconfig sync should not have been gated by 5 seconds, nor the controllerconfig sync in the MCD. If I had to guess, maybe a 1-2 second sleep would have been enough as well?
The 5 second delay in the machineconfigs help us see the diff more clearly but I wonder where in the cert rotation path we are delayed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussing with Jerry a bit more, we found the reason for our racy check to be the following:
- prior to checking if the certs ondisk are a match to the configmap, we wait on the resourceVersion to change to the latest
- the cert rotation could manifest as 6 resourceVersion changes(typically 3). We think this is dependent on how many config changes we generate & cluster state; and is not something we can easily control
- if we end up matching the resourceVersion before it stabilizes to its final we'll end up with a failure during the comparison. This is the racy bit; its entirely possible we are not on the final resourceVersion when the condition passes here: https://github.com/openshift/machine-config-operator/blob/c50386e0700f4a7e3f322e3fb37000e9c3e5feb1/test/helpers/utils.go#LL358C4-L358C4
- to work around this, we retry the cert match upto 3 times. This should be sufficient time for the certs to be settled
Here are some logs showing how we can stumble by checking against the wrong resourceVersion. In this case, we matched against 80392, instead of 80636
Daemon:
I0605 14:39:07.860107 2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80371
I0605 14:39:08.022898 2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80378
I0605 14:39:08.887493 2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80392
I0605 14:39:15.454745 2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80582
I0605 14:39:15.582593 2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80589
I0605 14:39:16.457232 2851 certificate_writer.go:117] Certificate was synced from controllerconfig resourceVersion 80636
I0605 14:39:20.319356 2851 rpm-ostree.go:382] Running captured: rpm-ostree kargs
I0605 14:39:20.579933 2851 daemon.go:655] Preflight config drift check successful (took 492.952407ms)
I0605 14:39:20.584698 2851 config_drift_monitor.go:255] Config Drift Monitor has shut down
Test:
=== RUN TestMCDRotatesCertsOnPausedPool
mcd_test.go:937: Pausing pool
mcd_test.go:942: Paused
mcd_test.go:945: Patching certificate
mcd_test.go:948: Patched
mcd_test.go:951: Waiting for rendered config to get stuck behind pause
mcd_test.go:954: Certificate stuck behind paused (as expected)
mcd_test.go:968: resource version synced: 80392
mcd_test.go:973:
Error Trace: /home/djoshy/projects/machine-config-operator/test/e2e/mcd_test.go:973
Error: Not equal:
Run the e2e test and passed === RUN TestMCDRotatesCertsOnPausedPool /label qe-approved |
2a8b2a7
to
faebd2e
Compare
/retest-required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, that makes sense, thanks David for fixing this!
faebd2e
to
26e266d
Compare
/retest-required |
/lgtm Thanks! Just as a FYI, we've made hypershift blocking since it is blocking release payloads now. There is a known issue with the test that's being fixed in openshift/hypershift#2636 |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: djoshy, jkyros, sinnykumari, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I run again the test case after the change.
No problems found. The qe label is already added. |
/retest-required |
/retest-required |
@djoshy: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
This has no effect on Hypershift and helps other e2e test runs. Going to override /override ci/prow/e2e-hypershift |
@yuqi-zhang: Overrode contexts on behalf of yuqi-zhang: ci/prow/e2e-hypershift In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@djoshy: Jira Issue OCPBUGS-14272: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-14272 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
- What I did
Added a retry so the test will attempt a few times to match certs before declaring a fail
- Description for the changelog
daemon: added a retry to TestMCDRotatesCertsOnPausedPool