e2e: cannot delete remaining Elemental cluster after uninstallation of operator #515

ldevulder · 2023-09-06T15:21:17Z

It happens on Elemental CI, for example: https://github.com/rancher/elemental/actions/runs/6068163183/job/16460773463.

How to reproduce:

Install Rancher Manager HEAD version (on top of K3s or RKE2, it doesn't matter)
Install elemental-operator Dev version
Deploy a 3-node Elemental cluster with the Dev ISO
Uninstall the Elemental operator with helm, first the operator chart and then the CRDs chart (it should be the same with the UI, but I did not test it):

$ helm uninstall -n cattle-elemental-system elemental-operator
release "elemental-operator" uninstalled
$ helm uninstall -n cattle-elemental-system elemental-operator-crds
release "elemental-operator-crds" uninstalled

The Elemental cluster is still visible in Rancher Manager (as expected). Try to delete it: the command hangs forever and the cluster is not deleted:

$ kubectl -n fleet-default delete cluster cluster-k3s
cluster.provisioning.cattle.io "cluster-k3s" deleted
[blocked...]

Status of the cluster in UI:

I saw that the MachineInventories are still present, but they stay in the Removing state forever.
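For anyone who wants to check this from the command line instead of the UI, here is a minimal sketch. It assumes the MachineInventories live in the fleet-default namespace (the same namespace as the cluster above) and that the resource is addressable as machineinventories.elemental.cattle.io; the function name is made up for illustration.

```shell
# List every MachineInventory with its deletion timestamp and finalizers.
# A resource stuck in "Removing" has a deletion timestamp set and at least
# one finalizer left.
# Assumption: namespace defaults to fleet-default; pass another one as $1.
list_inventory_finalizers() {
  local ns="${1:-fleet-default}"
  kubectl -n "$ns" get machineinventories.elemental.cattle.io \
    -o custom-columns='NAME:.metadata.name,DELETING_SINCE:.metadata.deletionTimestamp,FINALIZERS:.metadata.finalizers'
}
```

Calling `list_inventory_finalizers` with no argument queries fleet-default.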

Please note that it ONLY HAPPENS ON RANCHER MANAGER HEAD VERSION (2.7.7)! I don't have this issue on Rancher Manager Stable (2.7.6). I know that 2.7.7-dev includes some new stuff for CAPI (but I don't know what exactly).

ldevulder · 2023-09-07T07:18:18Z

Tested this morning: as a workaround, the operator can be reinstalled (crds + operator), after which the pending deletion completes. The operator can then be uninstalled again.

Even if it's better to remove all Elemental resources before uninstalling the operator, I think it's good to be able to remove the remaining resources after the uninstallation. And it worked before.

kkaempf · 2023-09-20T08:59:12Z

Does it still happen on Rancher HEAD (aka 2.8.0)?

Then we might need to open an issue on rancher/rancher 🤔

anmazzotti · 2023-09-26T09:00:35Z

The educated guess is that the MachineInventories still carry the machineinventory.elemental.cattle.io finalizer, but since the elemental-operator has been uninstalled already, nothing is going to remove these finalizers.

A manual workaround would be to either reinstall the elemental-operator and let it delete the resources, or manually delete the finalizer from all MachineInventories, for example with kubectl.
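As a rough sketch of the manual kubectl route, assuming the MachineInventories live in the fleet-default namespace and are addressable as machineinventories.elemental.cattle.io (the function name is hypothetical). Note that this clears all finalizers on each resource, not only the Elemental one, so it is only appropriate once the operator is gone for good.

```shell
# Clear the finalizers of every MachineInventory in a namespace so that
# pending deletions can complete without the operator running.
# Assumption: namespace defaults to fleet-default; pass another one as $1.
remove_inventory_finalizers() {
  local ns="${1:-fleet-default}"
  local name
  # "-o name" prints one TYPE/NAME entry per line, which "kubectl patch" accepts.
  kubectl -n "$ns" get machineinventories.elemental.cattle.io -o name |
    while read -r name; do
      # A JSON merge patch with null removes the finalizers field entirely.
      kubectl -n "$ns" patch "$name" --type merge \
        -p '{"metadata":{"finalizers":null}}'
    done
}
```

Run `remove_inventory_finalizers` with no argument to act on fleet-default.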

We have at least 2 ways to fix this:

1. Instruct Helm to delete all finalizers on uninstall.
2. Implement some OnShutdown function on the elemental-operator to delete all finalizers.

Option n.2 would be better since it does not rely on Helm; however, consider that this operation may take a long time.

ldevulder · 2023-10-02T08:19:53Z

After more tests I can confirm that on Rancher Manager HEAD version the issue mainly happens because the Machine objects are also still present when the operator is uninstalled (which sounds logical). But when the cluster is deleted, it is removed on the Stable version and not on HEAD. I'm not sure whether that is related to Elemental or to Rancher Manager directly. Either way, the MachineInventory objects are still there in both cases, and this is not good.

davidcassany · 2023-10-06T14:32:47Z

I have been thinking about it and I struggle to find a good solution.

Generally speaking, I don't consider it good practice to delete CRs on a helm uninstall elemental-operator call. That's also one of the motivations for having separate charts: I can uninstall and reinstall while some resources are kept and still present (I do that regularly for testing rebuilds).

I would expect resources to fully disappear with the second call, helm uninstall elemental-operator-crds. But then the finalizer problem kicks in. That collides with the notion of an OnShutdown in elemental-operator, since the operator is already gone at this stage.

The other problem with the OnShutdown strategy is that it would still require some sort of external signal for an uninstall shutdown (with the option to flag cleanup or not), so we can be sure it is only executed for uninstalls and not on pod restarts (a spurious unwanted deletion would be dramatic).

So my suggestion would be to actually have a cleanup command and apply it as a pre-uninstall step in the crds chart. I think it is absolutely safe to state that if one uninstalls the crds chart, the expectation is that any elemental resources, including resource definitions, are deleted.

davidcassany · 2023-10-25T13:36:51Z

@ldevulder note that with the change from #553, the workaround you implemented is no longer a workaround; it should be the way to go.

Now, if you try to reinstall while deletions are still pending because of machineinventory leftovers, it will just fail. I wonder if it would make sense to test the sequence:

install
create resettable machine inventories
uninstall
reinstall with failure
remove finalizers
reinstall

I think this is almost the current case, except that we are not validating the reinstall failure, and the finalizers removal is done as a parallel thread of the tests, while it should probably be part of the test sequence. What do you think? Does it make sense?

ldevulder · 2023-10-31T16:40:33Z

@davidcassany yes it could be implemented to validate that the re-installation is failing "as expected". I opened issue rancher/elemental#1075 to track this in CI.

ldevulder mentioned this issue Sep 11, 2023

ci: fix sequential/upgrade tests rancher/elemental#981

Closed

ldevulder added the kind/bug and kind/QA labels Sep 11, 2023

ldevulder added this to Elemental Sep 11, 2023

ldevulder moved this to 🗳️ To Do in Elemental Sep 11, 2023

ldevulder changed the title ~~Cannot delete remaining Elemental cluster after uninstallation of operator~~ e2e: cannot delete remaining Elemental cluster after uninstallation of operator Sep 11, 2023

kkaempf added kind/chore kind/regression and removed kind/chore labels Sep 12, 2023

kkaempf added this to the 2023-Q4-2.x.x milestone Sep 26, 2023

ldevulder mentioned this issue Sep 26, 2023

ci: add workaround for issue #515 rancher/elemental#1026

Merged

davidcassany self-assigned this Oct 6, 2023

davidcassany moved this from 🗳️ To Do to 🏃🏼‍♂️ In Progress in Elemental Oct 6, 2023

davidcassany mentioned this issue Oct 9, 2023

Adding the cleanup command to manage finalizers #536

Closed

davidcassany moved this from 🏃🏼‍♂️ In Progress to 👀 Needs review in Elemental Oct 11, 2023

davidcassany moved this from 👀 Needs review to 🏃🏼‍♂️ In Progress in Elemental Oct 24, 2023

davidcassany mentioned this issue Oct 24, 2023

Prevent installing if previous CRDs are pending to be removed #553

Merged

davidcassany closed this as completed in #553 Oct 24, 2023

github-project-automation bot moved this from 🏃🏼‍♂️ In Progress to ✅ Done in Elemental Oct 24, 2023

ldevulder mentioned this issue Oct 31, 2023

e2e: add a test to validate that the operator cannot be (re)install if resources are still present rancher/elemental#1075

Closed

davidcassany mentioned this issue Jun 26, 2024

Block reinstall if crds are still pending to be deleted #784

Merged
