
Validate that we're avoiding double reboots when m-o-c and the MCO change #1991

Closed
cgwalters opened this issue Aug 7, 2020 · 2 comments

@cgwalters (Member)

Splitting this out from #1946 (comment)

> Ugh and wait a second, this failed GCP run seems to have rebooted the masters twice - from this MCD log:
>
> I0806 20:59:03.361444 2297 update.go:1455] Starting update from rendered-worker-a3629c84fa68ef33ff7fa7b5c501041f to rendered-worker-b88d93e7e9a96f5d961386e6e811875a: &{osUpdate:false kargs:false fips:false passwd:false files:false units:true kernelType:false extensions:false}
>
> Are we potentially racing in the MCC...something like container runtime controller generating a MC after we've already resync'd the core configs?
>
> /me goes to diff the MCs

> Are we potentially racing in the MCC...something like container runtime controller generating a MC after we've already resync'd the core configs?

The problem there appears to be quite simple; in this test scenario we're changing both the MCO and machine-os-content.

  • New CVO started and installed updated configmap/machine-config-osimageurl
  • The old MCO rendered a new MC with that update, started a rollout to masters/workers
  • Then the new MCO took over, rolled out another config update with template changes encapsulated inside it
  • So we then upgraded again

I thought we had addressed this...but if so, it is lost in the dim spaces between neuron firings for me.

Update: I spot checked about 10 other e2e-gcp-upgrade jobs on this repo, and am not seeing this repeat. I think it's a real race, but perhaps rare. OTOH when it does occur it's rather bad for upgrade disruption.

If I'm wrong and this race was somehow introduced by my PR, at the least we need an e2e test that verifies we aren't double rebooting during an upgrade job (another test would be that there are at most two rendered machineconfigs per pool).

@cgwalters (Member, Author)

If this is a real race...avoiding it seems tricky. We basically want the MCO to defer a detected osimageurl change from the CVO if and only if an MCO upgrade is queued in that same upgrade run.

@cgwalters (Member, Author)

Wait, never mind - I think this was fixed by c7d3d9a, and I happened to trip over it in that job because I made the mistake of basing my test release images (overriding the MCO and machine-os-content) on the same underlying release image.
