Splitting this out from #1946 (comment)

Ugh, and wait a second: this failed GCP run seems to have rebooted the masters twice. From this MCD log:

I0806 20:59:03.361444 2297 update.go:1455] Starting update from rendered-worker-a3629c84fa68ef33ff7fa7b5c501041f to rendered-worker-b88d93e7e9a96f5d961386e6e811875a: &{osUpdate:false kargs:false fips:false passwd:false files:false units:true kernelType:false extensions:false}
Are we potentially racing in the MCC...something like container runtime controller generating a MC after we've already resync'd the core configs?
/me goes to diff the MCs
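For the record, a minimal sketch of what that diff could look like if you want it scripted rather than doing `oc get machineconfig <name> -o yaml` twice and `diff -u` by hand; the two names are the ones from the MCD log above, and it assumes KUBECONFIG points at the affected cluster:

```go
// mcdiff.go: dump the specs of two rendered MachineConfigs so they can be diffed.
// A sketch only; assumes KUBECONFIG points at the affected cluster.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/yaml"
)

var mcGVR = schema.GroupVersionResource{
	Group: "machineconfiguration.openshift.io", Version: "v1", Resource: "machineconfigs",
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// The two rendered configs named in the MCD log above.
	for _, name := range []string{
		"rendered-worker-a3629c84fa68ef33ff7fa7b5c501041f",
		"rendered-worker-b88d93e7e9a96f5d961386e6e811875a",
	} {
		mc, err := client.Resource(mcGVR).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		out, err := yaml.Marshal(mc.Object["spec"])
		if err != nil {
			panic(err)
		}
		// Write each spec to a file, then `diff -u` the two files by hand.
		if err := os.WriteFile(name+".yaml", out, 0o644); err != nil {
			panic(err)
		}
		fmt.Println("wrote", name+".yaml")
	}
}
```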
The problem there appears to be quite simple; in this test scenario we're changing both the MCO and machine-os-content.
New CVO started and installed updated configmap/machine-config-osimageurl
The old MCO rendered a new MC with that update, started a rollout to masters/workers
Then the new MCO took over, rolled out another config update with template changes encapsulated inside it
So we then upgraded again
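To spell out why that means two reboots: the render controller folds all of its inputs (the osimageurl from that configmap, the MCO's templates, any user MachineConfigs) into one merged config per pool and names it by a content hash, so two inputs landing at different times produce two distinct rendered-<pool>-<hash> targets and therefore two rollouts. Very roughly, hand-waving the MCO's real merge and hashing logic:

```go
// Sketch of the "two inputs arriving at different times => two rendered
// configs => two rollouts" failure mode. The struct, fields, and hashing
// here are illustrative, not the MCO's actual implementation.
package main

import (
	"crypto/sha256"
	"fmt"
)

// A stand-in for the inputs merged into a rendered MachineConfig for a pool.
type renderedInputs struct {
	osImageURL string // from configmap/machine-config-osimageurl (via the CVO)
	templates  string // from the MCO's built-in templates (change with the MCO image)
}

// renderedName mimics naming a rendered config by a hash of its content.
func renderedName(pool string, in renderedInputs) string {
	sum := sha256.Sum256([]byte(in.osImageURL + "\x00" + in.templates))
	return fmt.Sprintf("rendered-%s-%x", pool, sum[:16])
}

func main() {
	// t0: old MCO still running, but the new CVO has already bumped the osimageurl configmap.
	first := renderedName("worker", renderedInputs{osImageURL: "new-os", templates: "old-templates"})
	// t1: the new MCO takes over and regenerates with its own templates.
	second := renderedName("worker", renderedInputs{osImageURL: "new-os", templates: "new-templates"})
	fmt.Println(first)  // rollout #1 (reboot)
	fmt.Println(second) // rollout #2 (reboot), even though both came from a single upgrade
}
```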
I thought we had addressed this...but if so, it is lost in the dim spaces between neuron firings for me.
Update: I spot checked about 10 other e2e-gcp-upgrade jobs on this repo, and am not seeing this repeat. I think it's a real race, but perhaps rare. OTOH when it does occur it's rather bad for upgrade disruption.
If I'm wrong and this race was somehow introduced by my PR, then at the least we need an e2e test that verifies we aren't double rebooting during an upgrade job (another test would be asserting that there are at most two rendered machineconfigs per pool).
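A rough sketch of the second check (at most two rendered machineconfigs per pool across the upgrade: the pre-upgrade render plus exactly one new one); the threshold and the client bootstrapping are assumptions, and a real test would probably want to ignore renders that predate the upgrade:

```go
// e2e-style sketch: after an upgrade job, fail if any pool has accumulated
// more than two rendered MachineConfigs. Threshold and setup are assumptions.
package e2e_test

import (
	"context"
	"os"
	"strings"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

var mcGVR = schema.GroupVersionResource{
	Group: "machineconfiguration.openshift.io", Version: "v1", Resource: "machineconfigs",
}

func TestNoExtraRenderedConfigs(t *testing.T) {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		t.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		t.Fatal(err)
	}
	list, err := client.Resource(mcGVR).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		t.Fatal(err)
	}
	counts := map[string]int{}
	for _, mc := range list.Items {
		name := mc.GetName()
		if !strings.HasPrefix(name, "rendered-") {
			continue
		}
		// rendered-<pool>-<hash>: strip the prefix and the trailing hash to get the pool.
		base := strings.TrimPrefix(name, "rendered-")
		if i := strings.LastIndex(base, "-"); i > 0 {
			counts[base[:i]]++
		}
	}
	for pool, n := range counts {
		if n > 2 {
			t.Errorf("pool %q has %d rendered MachineConfigs; expected at most 2 across the upgrade", pool, n)
		}
	}
}
```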
If this is a real race...avoiding it seems tricky. We basically want the MCO to defer a detected osimageurl change from the CVO if and only if there is an MCO upgrade queued in that same upgrade run.
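Hand-waving wildly, the gating might look something like the sketch below: when the osimageurl configmap changes, check whether the release payload also carries a new MCO, and if so leave the render to the incoming operator so the whole upgrade lands as one rendered config and one reboot. Every name, field, and version string here is made up for illustration; the hard part is reliably knowing that the MCO update really is queued so the deferral can't strand the osimageurl change.

```go
// Illustrative only: defer reacting to an osimageurl bump when the same
// release payload is about to replace the MCO itself. All names and fields
// here are hypothetical, not actual MCO controller code.
package main

import "fmt"

type upgradeState struct {
	runningOperatorVersion string // MCO version currently running
	payloadOperatorVersion string // MCO version in the release payload the CVO is applying
	osImageURLChanged      bool   // configmap/machine-config-osimageurl differs from the last render
}

// shouldRenderNow returns false when the osimageurl change should be left for
// the incoming MCO, so the whole upgrade lands as a single rendered config/reboot.
func shouldRenderNow(s upgradeState) bool {
	if s.osImageURLChanged && s.payloadOperatorVersion != s.runningOperatorVersion {
		// An MCO update is queued in this same upgrade run: defer.
		return false
	}
	return true
}

func main() {
	// The failure mode above: the new osimageurl lands while the old MCO is still running.
	fmt.Println(shouldRenderNow(upgradeState{"4.5.0", "4.6.0", true})) // false: wait for the new MCO
	fmt.Println(shouldRenderNow(upgradeState{"4.6.0", "4.6.0", true})) // true: no MCO update pending
}
```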
Wait, never mind: I think this was fixed by c7d3d9a, and I happened to trip over it in that job because I made the mistake of basing my test release images (overriding the MCO and machine-os-content) on the same underlying release image.