
Elemental process is OOM killed during OS upgrade #865

Closed
ldevulder opened this issue Jun 8, 2023 · 6 comments
Labels
kind/bug Something isn't working

Comments

@ldevulder
Contributor

ldevulder commented Jun 8, 2023

What steps did you take and what happened:

  • Install Rancher Manager Stable (v2.7.4)
  • Install Dev version of Elemental operator
  • Provision a simple K3s cluster with the Elemental Stable ISO; one node is enough, just set the RAM size to 3GB (not more!)
  • Wait for the cluster to be in Active state
  • Trigger an OS upgrade with Rancher Manager
  • The OOM killer triggers on the node (killing the elemental process)
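For reference, a minimal sketch to confirm the OOM kill from the kernel log on the affected node. Assumptions: a systemd-based node and that the killed process is literally named "elemental"; the log message format is the kernel's usual OOM-killer line and may vary by kernel version.

```shell
# Sketch (assumptions: systemd-based node, killed process named "elemental").
# The kernel typically logs an OOM kill as:
#   "Out of memory: Killed process <pid> (<name>) total-vm:...kB ..."
OOM_PATTERN='Out of memory: Killed process .* \(elemental\)'

# Search the kernel ring buffer via the journal, falling back to dmesg.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k --no-pager | grep -E "$OOM_PATTERN" || true
else
    dmesg 2>/dev/null | grep -E "$OOM_PATTERN" || true
fi
```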

What did you expect to happen:
The upgrade to complete without any issue. As nothing is deployed on the cluster, it is very strange that 3GB is not enough to perform an OS upgrade.

Anything else you would like to add:
I tried directly with the elemental upgrade command but I wasn't able to reproduce the issue. It happens only when the OS upgrade is triggered through Rancher Manager.

With K3s v1.25.7+k3s1 the behavior is a little different: I don't always get an OOM kill, but some processes dump core.
EDIT: after re-configuring my lab from scratch I can confirm that I'm able to get an OOM kill even with v1.25.7+k3s1.

In the Elemental docs there is nothing about a minimum RAM value. Checking K3s, I found 512MB as the minimum and 1GB recommended; for SLE Micro I found 1GB as the minimum. So 1GB should in theory be enough, but it clearly is not.
The CI used 3GB without any issue until recently. During my manual tests I was able to run an OS upgrade on a node with 4GB without any issue, but if I create a cluster of 3 nodes with 4GB of RAM each I still see sporadic OOM kills (the same with 6GB of RAM, though less often in that case).

So, even if the documented minimum values are, in my opinion, too small, it seems that we have some unusual memory issue here.
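To pin down where the memory goes, one option is to sample the resident set size of the upgrade process while it runs. A minimal sketch, assuming the process is literally named "elemental" (an assumption to adapt to the real process name):

```shell
# Sketch: sample process memory so growth can be tracked during an upgrade.
# Assumption: the upgrade process is named "elemental" (adapt as needed).

# rss_kb PID -> prints the resident set size of PID in kB
rss_kb() {
    ps -o rss= -p "$1" | tr -d ' '
}

# Example polling loop (run on the node while the upgrade is in progress):
# while pid=$(pgrep -x -o elemental); do
#     echo "$(date +%T) pid=$pid rss=$(rss_kb "$pid")kB"
#     sleep 5
# done
```

Logging samples like this would show whether the process grows steadily (a leak or buffering issue) or spikes suddenly before the kill.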

Environment:

  • Elemental release version (use cat /etc/os-release): Dev for operator and Stable for ISO
  • Rancher version: v2.7.4
  • Kubernetes version (use kubectl version): v1.24.10+k3s1 and v1.25.7+k3s1
  • Cloud provider or hardware configuration: N/A

Please find attached some logs I was able to catch (not easy after an OOM kill).
elemental_oom_kill_2gb_ram.log
elemental_oom_kill_3gb_ram.log
node-55392308-8b8f-4a04-b3b9-6dc1eed55689-2023-06-08T084512Z.tar.gz

@ldevulder ldevulder moved this to 🗳️ To Do in Elemental Jun 8, 2023
@kkaempf kkaempf added the kind/bug Something isn't working label Jun 13, 2023
@davidcassany
Contributor

I have done a couple of tests: a downgrade from staging to stable and then an upgrade from stable to staging. Both of them worked smoothly with 2GiB of memory.

However, when trying to upgrade from staging to Dev I saw the issue. Hence I'd say this is a problem only happening on Dev (upgrading to Dev; downgrading from Dev to staging I bet is functional). At a glance the issue looks like it points to the elemental client, hence there might be a regression with the latest changes in elemental-cli. IIRC the biggest change there is related to the rsync wrapper, so that is where I'd start looking. It would probably be worth tracking the exact command being executed during an upgrade on older versions and checking the differences.
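One hedged way to capture that exact command: trace execve() calls during the upgrade and filter for the rsync invocation, so the command lines can be diffed across elemental-cli versions. Assumptions here: strace is installed on the node and "elemental upgrade" is the entry point being traced; both are placeholders to adapt.

```shell
# Sketch (assumptions: strace available on the node; "elemental upgrade"
# is the traced entry point -- adapt to however Rancher triggers it).
# On the node, log every execve() during an upgrade:
#   strace -f -e trace=execve -s 512 -o /tmp/elemental-exec.log elemental upgrade

# Then extract the spawned rsync command lines from the strace log,
# to diff against the same trace taken on an older release:
extract_rsync() {
    grep 'execve(' "$1" | grep rsync
}
```

Comparing the extracted rsync command lines between versions would show whether the wrapper changed flags (e.g. options that buffer file lists in memory).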

In any case, I'd say the good news is that this does not affect staging and stable. It seems to be a regression on Dev, and according to the changes we made I'd say it is a regression in elemental-cli.

@davidcassany
Contributor

Pending a final confirmation, but this should already be fixed in Dev by rancher/elemental-toolkit#1789

@ldevulder was this consistently (or nearly consistently) failing in Dev tests? Can we rely on automated tests to verify this is fixed?

@ldevulder
Contributor Author

@ldevulder was this consistently (or nearly consistently) failing in Dev tests? Can we rely on automated tests to verify this is fixed?

The current automated tests work around the issue by using 8GB of RAM, but I can easily test it in my lab with 2 or 3GB as soon as the fix is integrated in the Dev image.

@davidcassany
Contributor

This should already be integrated in Dev. I'd say that if a manual test with 2GB passes, then we should revert the workaround and close this bug.

@ldevulder
Contributor Author

OK, I will quickly check it manually after lunch.

@ldevulder
Contributor Author

After manual tests I can confirm that the PR mentioned above fixes the issue.

@github-project-automation github-project-automation bot moved this from 🗳️ To Do to ✅ Done in Elemental Jun 16, 2023