Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

e2e: cannot delete remaining Elemental cluster after uninstallation of operator #515

Closed
ldevulder opened this issue Sep 6, 2023 · 7 comments · Fixed by #553
Closed

e2e: cannot delete remaining Elemental cluster after uninstallation of operator #515

ldevulder opened this issue Sep 6, 2023 · 7 comments · Fixed by #553
Assignees
Labels
Milestone

Comments

@ldevulder
Copy link
Contributor

It happens on Elemental CI, for example: https://github.com/rancher/elemental/actions/runs/6068163183/job/16460773463.

How to reproduce:

  • Install Rancher Manager HEAD version (ontop of K3s or RKE2, it doesn't matter)
  • Install elemental-operator Dev version
  • Deploy an Elemental 3 nodes cluster with Dev ISO
  • Uninstall Elemental operator with helm, first the operator then the CRDs chart (should be the same with the UI, not tested):
$ helm uninstall -n cattle-elemental-system elemental-operator
release "elemental-operator" uninstalled
$ helm uninstall -n cattle-elemental-system elemental-operator-crds
release "elemental-operator-crds" uninstalled
  • Elemental cluster is still seen on Rancher Manager (as expected), try to delete it but the command stuck forever and the cluster is not deleted:
$ kubectl -n fleet-default delete cluster cluster-k3s
cluster.provisioning.cattle.io "cluster-k3s" deleted
[blocked...]
  • Status of the cluster in UI:
    Screenshot from 2023-09-06 17-17-02

I saw that MachineInventories are still present but in Removing state forever:
Screenshot from 2023-09-06 16-18-57

Please note that it ONLY HAPPEN ON RANCHER MANAGER HEAD VERSION (2.7.7)!. I don't have this issue in Rancher Manager Stable (2.7.6). I know that 2.7.7-dev includes some new stuff for CAPI (but I don't know what exactly).

@ldevulder
Copy link
Contributor Author

Tested this morning: as a workaround the operator can be reinstalled (crds+operator) and the deletion is finished. Operator can be uninstalled then.

Even if it's better to remove all Elemental resources before uninstalling the operator I think it's good to be able to remove remaining resources after the uninstallation, And it worked before.

@ldevulder ldevulder added kind/bug Something isn't working kind/QA labels Sep 11, 2023
@ldevulder ldevulder moved this to 🗳️ To Do in Elemental Sep 11, 2023
@ldevulder ldevulder changed the title Cannot delete remaining Elemental cluster after uninstallation of operator e2e: cannot delete remaining Elemental cluster after uninstallation of operator Sep 11, 2023
@kkaempf
Copy link
Contributor

kkaempf commented Sep 20, 2023

Does it still happen on Rancher HEAD (aka 2.8.0) ?

Then we might need to open an issue on rancher/rancher 🤔

@anmazzotti
Copy link
Contributor

The educated guess is that the MachineInventories still carry the machineinventory.elemental.cattle.io finalizer, but since the elemental-operator has been uninstalled already, nothing is going to remove these finalizers.

A manual workaround would be to either reinstall the elemental-operator and let it delete the resources, or manually delete the finalizer from all MachineInventories, for example with kubectl.

We have at least 2 ways to fix this:

  1. Instruct Helm to delete all finalizers on uninstall
  2. Implement some OnShutdown function on the elemental-operator to delete all finalizers.

Option n.2 would be better since it does not rely on Helm, however consider this operation may take some long time.

@ldevulder
Copy link
Contributor Author

After more tests I can confirm that on Rancher Manager HEAD version the issue mainly happens because the Machine objects are still present too when the operator is uninstalled (which sounds logical) but when the cluster is delete it is removed in Stable version but not on HEAD. Not sure if it's related to Elemental or Rancher Manager directly. But anyway the MachineInventory objects are still here in both cases and this is not good.

@davidcassany davidcassany self-assigned this Oct 6, 2023
@davidcassany davidcassany moved this from 🗳️ To Do to 🏃🏼‍♂️ In Progress in Elemental Oct 6, 2023
@davidcassany
Copy link
Contributor

davidcassany commented Oct 6, 2023

I have been thinking about it and I struggle to find a good solution.

Generally speaking I consider not a good practice to delete CRs on a helm uninstall elemental-operator call, that's also one of the motivations of having separate charts, so I can uninstall, reinstall and some resources might be kept and still present (I do that regularly for testing rebuilds).

I would expect resources to fully disappear with the second call helm uninstall elemental-operator-crds. But then the finalizer problem kicks in. That collides the notion of OnShutdown in elemental-operator, this is already gone at this stage.

The other problem of the OnShutdown strategy is that it would still require some sort of external signal for uninstall shutdown (having the option to flag cleanup or not) so we can be sure it is only executed for uninstalls and not on pod restarts (some spurious unwanted deletion would be dramatic).

So my suggestion would be to actually have a cleanup command and apply it as a pre-uninstall step in crds chart. I think it is absolutely safe to state that if one uninstalls the crds chart the expectation is that any elemental resources including resource definitions are deleted.

@davidcassany davidcassany moved this from 🏃🏼‍♂️ In Progress to 👀 Needs review in Elemental Oct 11, 2023
@davidcassany davidcassany moved this from 👀 Needs review to 🏃🏼‍♂️ In Progress in Elemental Oct 24, 2023
@github-project-automation github-project-automation bot moved this from 🏃🏼‍♂️ In Progress to ✅ Done in Elemental Oct 24, 2023
@davidcassany
Copy link
Contributor

@ldevulder note witht he change from #553 the work around you implemented is no longer a workaround and it should be the way to go.

Now if trying to reinstall with pending deletions due to machineinventory leftovers are there it will just fail. I wonder if it would make sense testing the sequence:

  1. install
  2. create resetable machine inventories
  3. uninstall
  4. reinstall with failure
  5. remove finalizers
  6. reinstall

I think is almost the current case, just that we are not validating the reinstall failure and the finalizers removal is done as a parallel thread of the tests, while probably it should be part of the test sequence. What you think? does it make sense?

@ldevulder
Copy link
Contributor Author

ldevulder commented Oct 31, 2023

@davidcassany yes it could be implemented to validate that the re-installation is failing "as expected". I opened issue rancher/elemental#1075 to track this in CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment