
Elemental process is OOM killed during OS upgrade #865

Closed
ldevulder opened this issue Jun 8, 2023 · 6 comments
Labels
kind/bug Something isn't working

Comments

@ldevulder
Contributor

ldevulder commented Jun 8, 2023

What steps did you take and what happened:

  • Install Rancher Manager Stable (v2.7.4)
  • Install Dev version of Elemental operator
  • Provision a simple K3s cluster with the Elemental Stable ISO; one node is enough, just set the RAM size to 3GB (not more!)
  • Wait for the cluster to be in Active state
  • Trigger an OS upgrade with Rancher Manager
  • The OOM killer triggers on the node (killing the elemental process)
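For reference, a minimal sketch to confirm the OOM kill from the kernel log on the affected node. Assumptions: a systemd-based node and that the killed process is literally named "elemental"; the log message format is the kernel's usual OOM-killer line and may vary by kernel version.

```shell
# Sketch (assumptions: systemd-based node, killed process named "elemental").
# The kernel typically logs an OOM kill as:
#   "Out of memory: Killed process <pid> (<name>) total-vm:...kB ..."
OOM_PATTERN='Out of memory: Killed process .* \(elemental\)'

# Search the kernel ring buffer via the journal, falling back to dmesg.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k --no-pager | grep -E "$OOM_PATTERN" || true
else
    dmesg 2>/dev/null | grep -E "$OOM_PATTERN" || true
fi
```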

What did you expect to happen:
The upgrade to complete without any issue. As nothing is deployed on the cluster, it is very strange that 3GB is not enough to perform an OS upgrade.

Anything else you would like to add:
I tried directly with the elemental upgrade command but I wasn't able to reproduce the issue. It happens only when the OS upgrade is triggered through Rancher Manager.

With K3s v1.25.7+k3s1 the behavior is a little different: I don't always get an OOM kill, but some processes dump core.
EDIT: after re-configuring my lab from scratch I can confirm that I'm able to get an OOM kill even with v1.25.7+k3s1.

In the Elemental docs there is nothing about a minimum RAM value. Checking K3s, I found 512MB as the minimum and 1GB recommended; for SLE Micro I found 1GB as the minimum. So 1GB should in theory be enough, but it clearly is not.
The CI used 3GB without any issue until recently. During my manual tests I was able to run an OS upgrade on a node with 4GB without any issue, but if I create a cluster of 3 nodes with 4GB of RAM each I still see sporadic OOM kills (the same with 6GB of RAM, though less often in that case).

So, even if the documented minimum values are, in my opinion, too small, it seems that we have some unusual memory issue here.
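To pin down where the memory goes, one option is to sample the resident set size of the upgrade process while it runs. A minimal sketch, assuming the process is literally named "elemental" (an assumption to adapt to the real process name):

```shell
# Sketch: sample process memory so growth can be tracked during an upgrade.
# Assumption: the upgrade process is named "elemental" (adapt as needed).

# rss_kb PID -> prints the resident set size of PID in kB
rss_kb() {
    ps -o rss= -p "$1" | tr -d ' '
}

# Example polling loop (run on the node while the upgrade is in progress):
# while pid=$(pgrep -x -o elemental); do
#     echo "$(date +%T) pid=$pid rss=$(rss_kb "$pid")kB"
#     sleep 5
# done
```

Logging samples like this would show whether the process grows steadily (a leak or buffering issue) or spikes suddenly before the kill.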

Environment:

  • Elemental release version (use cat /etc/os-release): Dev for operator and Stable for ISO
  • Rancher version: v2.7.4
  • Kubernetes version (use kubectl version): v1.24.10+k3s1 and v1.25.7+k3s1
  • Cloud provider or hardware configuration: N/A

Please find attached some logs I was able to catch (not easy after an OOM kill).
elemental_oom_kill_2gb_ram.log
elemental_oom_kill_3gb_ram.log
node-55392308-8b8f-4a04-b3b9-6dc1eed55689-2023-06-08T084512Z.tar.gz

@ldevulder ldevulder moved this to 🗳️ To Do in Elemental Jun 8, 2023
@kkaempf kkaempf added the kind/bug Something isn't working label Jun 13, 2023
@davidcassany
Contributor

I have done a couple of tests: a downgrade from staging to stable and then an upgrade from stable to staging. Both of them worked smoothly with 2GiB of memory.

However, when trying to upgrade from staging to Dev I saw the issue. Hence I'd say this is a problem only happening on Dev (upgrading to Dev; downgrading from Dev to staging I bet is functional). At a glance the issue looks like it points to the elemental client, hence there might be a regression with the latest changes in elemental-cli. IIRC the biggest change there is related to the rsync wrapper, so that is where I'd start looking. It would probably be worth tracking the exact command being executed during an upgrade on older versions and checking the differences.
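One hedged way to capture that exact command: trace execve() calls during the upgrade and filter for the rsync invocation, so the command lines can be diffed across elemental-cli versions. Assumptions here: strace is installed on the node and "elemental upgrade" is the entry point being traced; both are placeholders to adapt.

```shell
# Sketch (assumptions: strace available on the node; "elemental upgrade"
# is the traced entry point -- adapt to however Rancher triggers it).
# On the node, log every execve() during an upgrade:
#   strace -f -e trace=execve -s 512 -o /tmp/elemental-exec.log elemental upgrade

# Then extract the spawned rsync command lines from the strace log,
# to diff against the same trace taken on an older release:
extract_rsync() {
    grep 'execve(' "$1" | grep rsync
}
```

Comparing the extracted rsync command lines between versions would show whether the wrapper changed flags (e.g. options that buffer file lists in memory).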

In any case, I'd say the good news is that this does not affect staging and stable. It seems to be a regression on Dev, and according to the changes we made I'd say it is a regression in elemental-cli.

@davidcassany
Contributor

Pending a final confirmation, but this should already be fixed in Dev by rancher/elemental-toolkit#1789

@ldevulder was this consistently (or nearly consistently) failing in Dev tests? Can we rely on automated tests to verify this is fixed?

@ldevulder
Contributor Author

@ldevulder was this consistently (or nearly consistently) failing in Dev tests? Can we rely on automated tests to verify this is fixed?

The current automated tests work around the issue by using 8GB of RAM, but I can easily test it in my lab with 2 or 3GB as soon as the fix is integrated in the Dev image.

@davidcassany
Contributor

This should already be integrated in Dev. I'd say that if a manual test with 2GB passes, then we should revert the workaround and close this bug.

@ldevulder
Contributor Author

OK, I will quickly check it manually after lunch.

@ldevulder
Contributor Author

After manual tests I can confirm that the PR mentioned above fixes the issue.

@github-project-automation github-project-automation bot moved this from 🗳️ To Do to ✅ Done in Elemental Jun 16, 2023