KEP-2400: Graduate swap to Beta 1 #3957
Conversation
harche commented Apr 17, 2023 (edited)
- One-line PR description: Graduate swap to Beta 1
- Issue link: Node memory swap support #2400
- Other comments:
Skipping CI for Draft Pull Request.
Force-pushed from 31c6e9c to f205be1
- For each container, divide its memory request by the `TotalMemory`. The result will be the proportion of requested memory for that container within the pod. Let's call this value `RequestedMemoryProportion`.

3. **Calculate the total swap memory available:**
- Determine the total swap memory available in the system. If the total swap memory is equal to the total physical memory, you can skip this step. Otherwise, calculate the proportion of available swap memory compared to the total physical memory. Let's call this value `SwapMemoryProportion`.
nit:

What do you think about dropping "If the total swap memory is equal to the total physical memory, you can skip this step"? Although unlikely in practice, in theory the available swap memory might also be bigger than the available memory on the node. Perhaps something like "divide the available swap memory by the total physical memory" is more generalized and complete, and covers all cases.
#### Example 2 - Configured swap memory on the host is NOT equal to the physical memory

If the total physical memory is 4 GB and the total swap memory available is 2 GB, we can adjust example 1 using the same approach:
nit:

What do you think about making this example a bit more realistic? Maybe we can show total memory of 400G and swap size of 30G? Not sure it's important though.
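To illustrate the generalized step 3 discussed above, here is a minimal Python sketch (hypothetical helper names, not kubelet code) that always divides available swap by total physical memory, which also covers the theoretical case where swap exceeds RAM. The 400G/30G figures are the reviewer's suggested sizing:

```python
# Hypothetical sketch of the generalized "SwapMemoryProportion" from step 3:
# always divide the available swap by the total physical memory, instead of
# special-casing swap == RAM. Covers swap > RAM as well.

def swap_memory_proportion(total_swap_gib: float, total_physical_gib: float) -> float:
    return total_swap_gib / total_physical_gib

print(swap_memory_proportion(2, 4))     # example 2 figures: prints 0.5
print(swap_memory_proportion(30, 400))  # reviewer's more realistic sizing: prints 0.075
```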
…and/or workloads in a number of different scenarios. Since swap provisioning is out of scope of this proposal, this enhancement poses low risk to Kubernetes clusters that will not enable swap.

Enabling swap on a system without encryption poses a security risk, as critical information, such as Kubernetes secrets, may be swapped out to the disk. If an unauthorized individual gains access to the disk, they could potentially obtain these secrets. To mitigate this risk, it is recommended to use encrypted swap. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general Linux configuration concern and should be addressed at that level. Nevertheless, it is essential to provide documentation that warns users of this potential issue, ensuring they are aware of the potential security implications and can take appropriate steps to safeguard their system.
> it is a general Linux configuration concern

As it's not a Linux-specific issue, maybe we can go with "it is a general OS configuration concern" instead?
- Better understand relationship of swap with memory QoS in cgroup v2 (particularly `memory.high` usage).
> Better understand relationship of swap with memory QoS in cgroup v2 (particularly `memory.high` usage).

Do we want to remove this line? Is there something missing regarding that?
### Enable Swap Support only for Burstable QoS Pods

Before enabling swap support through the pod API, it is crucial to build confidence in this feature by carefully assessing its impact on workloads and Kubernetes. As an initial step, we propose enabling swap support for Burstable QoS Pods by automatically calculating the appropriate swap values, rather than allowing users to input these values manually.

Swap access is granted only for pods of Burstrable QoS. Guaranteed QoS pods are usually higher-priority pods, therefore we want to avoid swap's performance penalty for them. Best-Effort pods, on the contrary, are low-priority pods that are the first to be killed during node pressures. In addition, they're unpredictable, therefore it's hard to assess how much swap memory is a reasonable amount to allocate for them.
Nit: Burstrable typo
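The QoS gating described in the quoted section can be sketched as follows (a hypothetical illustration, not kubelet code; the function name is invented for this example):

```python
# Illustrative sketch of the proposed QoS gating for swap access:
# only Burstable pods get swap. Guaranteed pods avoid swap's performance
# penalty, and Best-Effort pods are too unpredictable to size swap for.

def swap_allowed(qos_class: str) -> bool:
    return qos_class == "Burstable"

for qos in ("Guaranteed", "Burstable", "BestEffort"):
    print(qos, "->", swap_allowed(qos))
```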
2. **Determine the swap proportion for each container:**
- For each container, divide its memory request by the `TotalMemory`. The result will be the proportion of requested memory for that container within the pod. Let's call this value `RequestedMemoryProportion`.

3. **Calculate the total swap memory available:**
Do we set any aside for the system (kernel + systemd + kubelet + runtime, etc.)?
kubernetes/kubernetes#105271
This should be based on the allocatable swap after it is supported in system-reserved.
/assign
- Add swap memory to the Kubelet stats api.
- Determine a set of metrics for node QoS in order to evaluate the performance of nodes with and without swap enabled.
- Better understand relationship of swap with memory QoS in cgroup v2
For cgroup v2, `memory.swap.high` may be helpful at the node level. We can set it at the node level using a factor, like the other KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos does for memory.
Yeah, makes sense to align with v2 QoS work.
I have added a section on using `memory.swap.high`: https://github.com/kubernetes/enhancements/pull/3957/files#diff-32d9cbd2b6817965ecee751dfe43b0d91269aee738a5799c9c9a68d04a329ec6R194
I don't think that using swap throttling is the right approach here. There are significant differences between `memory.high` and `memory.swap.high`, and they are used in a completely different manner.

`memory.high` is described as "the main mechanism to control memory usage" and "when hit, it throttles allocations by forcing them into direct reclaim to work off the excess" [1]. However, `memory.swap.high` is described as [1]:

> Swap usage throttle limit. If a cgroup's swap usage exceeds this limit, all its further allocations will be throttled to allow userspace to implement custom out-of-memory procedures. This limit marks a point of no return for the cgroup. It is NOT designed to manage the amount of swapping a workload does during regular operation. Compare to memory.swap.max, which prohibits swapping past a set amount, but lets the cgroup continue unimpeded as long as other memory can be reclaimed. Healthy workloads are not expected to reach this limit.

In addition, the Linux kernel commit message that brought `memory.swap.high` says [2] that "that is not to say that the knob itself is equivalent to memory.high" and that "slowing misbehaving tasks down gradually allows user space oom killers or other protection mechanisms to react".

Since we don't implement any user-space OOM killers, IIUC reaching the `memory.swap.high` limit would freeze the processes in the cgroup forever. This makes sense, as swap memory isn't being reclaimed as often as RAM, and it's much more expensive to do so. If processes in a cgroup reach the `memory.swap.max` boundary, however, they would have to keep allocated memory in RAM; if `memory.high` is defined, as the KEP suggests, the processes would then reach a point where they are throttled and forced to reclaim memory to continue their operation.

The bottom line is that processes that are out of memory should reclaim RAM, not swap memory.

[1] https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v2.rst
[2] https://lore.kernel.org/linux-mm/20200602044952.g4tpmtOiL%25akpm@linux-foundation.org/
> Slowing misbehaving tasks down gradually allows user space oom killers or other protection mechanisms to react

IMO this is a good thing. We have seen so many issues in the past where the workload would end up consuming the memory so fast that the kernel won't get adequate time to react. In this scenario, the entire system would just freeze and the node would eventually transition into NOT_READY state in k8s.
The two lines I'm most concerned about, and don't fully understand, are:

> marks a point of no return for the cgroup.

What does "point of no return" actually mean here? Does it mean that a user-space procedure needs to resume the cgroup's execution or else it won't ever be able to allocate memory?

> It is NOT designed to manage the amount of swapping a workload does during regular operation.

What are "regular operations"? It worries me that this description is very different from the description for `memory.high`, which is said to be "the main mechanism to control memory usage".
- Container A: (20 GB) / (30 GB) = 2/3
- Container B: (10 GB) / (30 GB) = 1/3

Step 3: Calculate the total swap memory available: Since the total swap memory (2 GB) is not equal to the total physical memory (4 GB), the `SwapMemoryProportion` will be 2 GB / 4 GB = 1/2.
If swap can be reserved in `--system-reserved` of kubelet, this should be based on that input.
There are a few concerns about how it is implemented in kubernetes/kubernetes#105271 (comment); however, I have added a section in the KEP on how we can effectively set aside some swap space for system critical daemons.
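The proportional calculation walked through in the steps above can be sketched in a few lines of Python (a hypothetical illustration of the math only, not kubelet code; function and variable names are invented):

```python
# Sketch of the per-container swap limit calculation described above:
# limit = RequestedMemoryProportion * available swap,
# where RequestedMemoryProportion = memory request / total physical memory.

GiB = 1024 ** 3

def container_swap_limit(memory_request: int,
                         total_physical_memory: int,
                         total_swap: int) -> int:
    # Step 2: proportion of this container's request vs. total memory
    requested_memory_proportion = memory_request / total_physical_memory
    # Step 3: scale by the swap actually available on the node
    return int(requested_memory_proportion * total_swap)

# Example 2 figures: 4 GiB of RAM, 2 GiB of swap.
# A container requesting 1 GiB is granted 0.5 GiB of swap.
print(container_swap_limit(1 * GiB, 4 * GiB, 2 * GiB))  # prints 536870912
```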
Force-pushed from 68e4f60 to f775c0a
- Remove support for setting unlimited amount of swap (including the [swapBehavior](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-MemorySwapConfiguration) flag of the Kubelet), as any workload with aggressive memory allocation can bring down a node if there are no limits on swap usage.

#### Beta 2
- Add support for controlling swap consumption at the container level [via cgroups] in Pod API in [container resources](https://github.com/kubernetes/kubernetes/blob/94a15929cf13354fdf3747cb266d511154f8c97b/staging/src/k8s.io/api/core/v1/types.go#L2443). More specifically, add a new [ResourceName](https://github.com/kubernetes/kubernetes/blob/94a15929cf13354fdf3747cb266d511154f8c97b/staging/src/k8s.io/api/core/v1/types.go#L5522) `swap`. This will make sure we stay consistent with other resources like `cpu`, `memory`, etc.
I would drop this since we don't really want to commit to adding this to the pod API yet. We can evaluate that and then decide.
Done 👍
Force-pushed from ff1c064 to 1d4dc62
@SergeyKanzhelev Could you kindly have another look? KEP freeze is within 1 day. I think we basically agree on everything; there are only two minor [1][2] things left to discuss. Since we basically already agree, can we try to merge this in time?
- Improve coverage for appropriate scenarios in testgrid.
- Remove support for setting unlimited amount of swap (including the [swapBehavior](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-MemorySwapConfiguration) flag of the Kubelet), as any workload with aggressive memory allocation can bring down a node if there are no limits on swap usage.
I don't quite follow how unlimited swap will bring the node down compared to a many-GB limit. Unlimited swap is useful for scenarios where the node is being used essentially for one big Pod doing some heavy processing that requires memory mapping of a large size without a lot of random access. This scenario still works with limiting as proposed, proportional to the node limit, but I would question the reasoning suggested here.
Indeed, your point is valid. I will revise the KEP to eliminate that concern. Should we encounter any issues regarding unlimited swap during our testing phase, we will promptly investigate and address them as needed.
Additionally, an end user may decide to disable swap completely for a Pod in beta 1 by making it guaranteed. This way, there will be no swap enabled for the corresponding containers and there will be no information exposure risks.
How can we guarantee that system reserved can be set to zero so no system processes will ever use swap? Can we add this as an explicit goal?
With @pacoxu's PR we can reserve some swap space for the system, but reserving 0 swap for the system does not mean that the system cannot use swap; it's just that there won't be a swap space that's dedicated to it.

> no system processes will ever use swap

What do you mean by "system processes"? Do you mean non-Kubernetes processes? If so, I'm not sure how we can guarantee that unless we can mess with the node's non-Kubernetes cgroup hierarchies, and I assume that we can't do that.

Do you refer to Kubernetes system non-containerized processes? If so, do they live under `/sys/fs/cgroup/kubepods`? If so, that's a problem, since this is the top cgroup hierarchy for Kubernetes, and limiting swap at that level would affect all of the containers.

In the comment below, you wrote:

> configure k8s so system daemonsets are not swapped

So here we are talking about Kubernetes containers. How can "system" daemonsets be identified? Why is it wrong to allocate swap to them just like any other container?
@SergeyKanzhelev we can set `memory.swap.max` to `0` on the system reserved cgroup (in our case that usually happens to be `/sys/fs/cgroup/system.slice`), and if we set `memory.swap.max` to `max` on the burstable pods slice (which in our case usually happens to be `/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice`) we can make sure that no system processes will ever use swap.

As long as the system reserved cgroup is set up so that it doesn't end up becoming the parent of the cgroups where burstable pods are running, this arrangement will work fine. If someone configures their cgroups so that the system cgroup is a parent of the burstable pods cgroup, this will silently fail, because a child cgroup cannot have `memory.swap.max` set to `max` while the parent is set to `0`.
We can add a check in kubelet during startup to verify that burstable pods cgroup is not a child of system reserved cgroup. That should take care of it.
I will update the KEP with this info.
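The cgroup arrangement described above can be sketched as follows (an illustration only, not kubelet code: the paths are the typical defaults mentioned in the comment, and the helper just builds the (path, value) pairs since writing them requires root on a real cgroup v2 host):

```python
# Sketch of the proposed cgroup v2 swap arrangement:
# swap disabled for the system slice, unrestricted for the burstable pods slice.
# Paths are the typical defaults from the discussion above.

SYSTEM_SLICE = "/sys/fs/cgroup/system.slice"
BURSTABLE_SLICE = "/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice"

def swap_settings() -> dict:
    return {
        f"{SYSTEM_SLICE}/memory.swap.max": "0",       # system daemons never swap
        f"{BURSTABLE_SLICE}/memory.swap.max": "max",  # burstable pods may swap
    }

for path, value in swap_settings().items():
    print(path, "->", value)
```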
Additionally end user may decide to disable swap completely for a Pod in beta 1 by making it guaranteed. This way, there will be no swap enabled for the corresponding containers and there will be no information exposure risks.
Suggested change:

Before: Additionally end user may decide to disable swap completely for a Pod in beta 1 by making it guaranteed. This way, there will be no swap enabled for the corresponding containers and there will be no information exposure risks.

After: Additionally end user may decide to disable swap completely for a Pod or a container in beta 1 by making Pod guaranteed or set request == limit for a container. This way, there will be no swap enabled for the corresponding containers and there will be no information exposure risks.
Updated. Thanks.
Overall lgtm. Let's add a note to make sure we can configure k8s so system daemonsets are not swapped; this will address the security concerns. Also, I'm not sure about the reasoning to disable unlimited swap. I left a comment about it. If this is indeed a problem, I'm fine with removing this option.
Signed-off-by: Harshal Patil <harpatil@redhat.com> Co-authored-by: Itamar Holder <iholder@redhat.com>
Thanks @SergeyKanzhelev for your review. I have added a note to make sure system daemons aren't swapped, as well as dropped the point about removing the support for unlimited swap.
To guarantee that system daemons are not swapped, the kubelet must configure the `memory.swap.max` setting to `0` within the system reserved cgroup. Moreover, to make sure that burstable pods are able to utilize swap space, the kubelet should verify that the cgroup associated with burstable pods is not nested under the cgroup designated for system reserved.
In other words: `kubeletconfig.CgroupRoot` can't be a child of `kubeletconfig.SystemReservedCgroups`.

As an implementation note: this should probably be added to the kubelet config documentation.
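The startup verification discussed above amounts to a cgroup path ancestry check. A minimal sketch (hypothetical helper, not the kubelet implementation) under the assumption that nesting can be detected from the cgroupfs paths alone:

```python
# Sketch of the proposed kubelet startup check: refuse to start if the
# burstable pods cgroup is nested under the system reserved cgroup, since a
# child cannot have memory.swap.max=max while its parent is set to 0.
import os

def is_nested_under(child: str, parent: str) -> bool:
    child = os.path.normpath(child)
    parent = os.path.normpath(parent)
    return child == parent or child.startswith(parent + os.sep)

print(is_nested_under("/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice",
                      "/sys/fs/cgroup/system.slice"))  # False: safe layout
print(is_nested_under("/sys/fs/cgroup/system.slice/kubepods-burstable.slice",
                      "/sys/fs/cgroup/system.slice"))  # True: kubelet should refuse
```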
/lgtm
/approve
thank you!
Thanks @SergeyKanzhelev /approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: harche, mrunalp, SergeyKanzhelev The full list of commands accepted by this bot can be found here. The pull request process is described here