
Validate that we're avoiding double reboots when m-o-c and the MCO change #1991

Closed
cgwalters opened this issue Aug 7, 2020 · 2 comments

@cgwalters (Member)

Splitting this out from #1946 (comment)

> Ugh and wait a second, this failed GCP run seems to have rebooted the masters twice - from this MCD log:
>
> I0806 20:59:03.361444 2297 update.go:1455] Starting update from rendered-worker-a3629c84fa68ef33ff7fa7b5c501041f to rendered-worker-b88d93e7e9a96f5d961386e6e811875a: &{osUpdate:false kargs:false fips:false passwd:false files:false units:true kernelType:false extensions:false}
>
> Are we potentially racing in the MCC...something like container runtime controller generating a MC after we've already resync'd the core configs?
>
> /me goes to diff the MCs

> Are we potentially racing in the MCC...something like container runtime controller generating a MC after we've already resync'd the core configs?

The problem there appears to be quite simple; in this test scenario we're changing both the MCO and machine-os-content.

  • New CVO started and installed updated configmap/machine-config-osimageurl
  • The old MCO rendered a new MC with that update, started a rollout to masters/workers
  • Then the new MCO took over, rolled out another config update with template changes encapsulated inside it
  • So we then upgraded again

I thought we had addressed this...but if so, it is lost in the dim spaces between neuron firings for me.

Update: I spot checked about 10 other e2e-gcp-upgrade jobs on this repo, and am not seeing this repeat. I think it's a real race, but perhaps rare. OTOH when it does occur it's rather bad for upgrade disruption.

If I'm wrong and this race was somehow introduced by my PR, at the least we need an e2e test that verifies we aren't double rebooting during an upgrade job (another test would be that there are at most two rendered machineconfigs per pool).

@cgwalters (Member, Author)

If this is a real race...avoiding it seems tricky. We basically want the MCO to defer a detected osimageurl change from the CVO if and only if an MCO upgrade is queued in that same upgrade run.

@cgwalters (Member, Author)

Wait, never mind - I think this was fixed by c7d3d9a, and I happened to trip over it in that job because I made the mistake of basing my test release images (overriding the MCO and machine-os-content) on the same underlying release image.
