
KEP-2400: Update swap KEP for 1.23 beta #2858

Merged · 4 commits · Sep 8, 2021
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-node/2400.yaml
@@ -1,3 +1,5 @@
kep-number: 2400
alpha:
  approver: "@deads2k"
beta:
  approver: "@deads2k"
72 changes: 68 additions & 4 deletions keps/sig-node/2400-node-swap/README.md
@@ -401,8 +401,14 @@ For alpha:
and further development efforts.
- Focus should be on supported user stories as listed above.

Once this data is available, additional test plans should be added for the next
phase of graduation.
For beta:

- Add e2e tests that exercise all available swap configurations via the CRI.
- Add e2e tests that verify pod-level control of swap utilization.
- Add e2e tests that verify swap performance with pods using a tmpfs (see the example pod sketched after this list).
- Verify new system-reserved settings for swap memory.
- Verify MemoryPressure behaviour with swap enabled and document any changes
for configuring eviction.
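
For the tmpfs scenario above, a minimal test pod might look like the sketch below (illustrative only; an `emptyDir` volume with `medium: Memory` is tmpfs-backed, so data written to it is the memory that could end up in swap):

```yaml
# Hypothetical e2e test pod: writes into a tmpfs-backed emptyDir volume so
# that swap behaviour for tmpfs pages can be observed.
apiVersion: v1
kind: Pod
metadata:
  name: swap-tmpfs-test        # illustrative name
spec:
  containers:
  - name: writer
    image: busybox             # assumed test image
    command: ["sh", "-c", "dd if=/dev/zero of=/cache/fill bs=1M count=256 && sleep 3600"]
    volumeMounts:
    - mountPath: /cache
      name: cache
  volumes:
  - name: cache
    emptyDir:
      medium: Memory           # tmpfs-backed volume
```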

### Graduation Criteria

@@ -416,8 +422,6 @@ phase of graduation.

#### Beta

_(Tentative.)_

- Add support for controlling swap consumption at the pod level [via cgroups].
- Handle usage of swap during container restart boundaries for writes to tmpfs
(which may require pod cgroup change beyond what container runtime will do at
@@ -426,6 +430,7 @@ _(Tentative.)_
detects on the host.
- Consider introducing new configuration modes for swap, such as a node-wide
swap limit for workloads.
- Add swap memory to the Kubelet stats api.
@ehashman (Member, Author) commented on Sep 8, 2021:

@derekwaynecarr suggested this addition for beta.

@dashpole WDYT?

See also #2858 (comment)

A Contributor replied:

That makes sense to me. I don't remember what swap metrics are available for cgroups v1 vs v2, but I'd want to at least be able to tell if swap is causing problems at the node level. It may also be nice to have swap metrics at the pod/container level if the available metrics can tell me if swapping is hurting my application's performance in some way.

A Member replied:

@dashpole cgroup v2 has `memory.swap.current`, which exists on non-root cgroups and reports the total amount of swap currently being used by the cgroup and its descendants.

It appears to already be supported in cAdvisor; we just did not include it in the k8s API for MemoryStats because running with swap on was not yet supported.

see: https://github.com/google/cadvisor/blob/ef7e64f9efab1257e297d7af339e94bb016cf221/container/libcontainer/handler.go#L800

The Contributor replied:

Looks like cgroups v1 has an equivalent, so we should be all set in that regard.

- Determine a set of metrics for node QoS in order to evaluate the performance
of nodes with and without swap enabled.
- Better understand relationship of swap with memory QoS in cgroup v2
@@ -437,6 +442,8 @@ _(Tentative.)_

#### GA

_(Tentative.)_

- Test a wide variety of scenarios that may be affected by swap support.
- Remove feature flag.

@@ -587,13 +594,30 @@ Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
-->

If a new node with swap memory fails to come online, it will not impact any
running components.
A Contributor commented on lines +597 to +598:

If someone does an in-place upgrade on a node (stopping kubelet, starting a new kubelet on the same server), can that fail? How?

If it could fail, then an upgrade might, for example, take out the nodes where the control plane ought to be running as static pods.

@ehashman (Member, Author) replied:

The in-place upgrade would not fail unless swap access was added while the node was still online. Turning swap on and off at runtime isn't normally considered best practice for a production environment; I'd expect a node to be reimaged and rebooted, but I can mention it.


It is possible that if a cluster administrator adds swap memory to an already
running node, and then performs an in-place upgrade, the new kubelet could fail
to start unless the configuration was modified to tolerate swap. However, we
would expect that if a cluster admin is adding swap to the node, they will also
update the kubelet's configuration to not fail with swap present.

Generally, it is considered best practice to add a swap memory partition at
node image/boot time and not provision it dynamically after a kubelet is
already running and reporting Ready on a node.
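
As an illustration only, a kubelet configured to tolerate and use swap on such a node might carry settings along these lines (a sketch based on the fields discussed in this KEP; values are operator-specific):

```yaml
# Sketch of a KubeletConfiguration for a node that has swap provisioned.
# failOnSwap, the NodeSwap feature gate, and memorySwap.swapBehavior are the
# knobs referenced in this KEP; the values shown are illustrative.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failOnSwap: false            # do not refuse to start when swap is present
featureGates:
  NodeSwap: true             # feature gate for swap support (defaulted on in beta)
memorySwap:
  swapBehavior: LimitedSwap  # empty string is equivalent to LimitedSwap
```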

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

Workload churn or performance degradations on nodes. The metrics will be
application/use-case specific, but we can provide some suggestions, based on
the stability metrics identified earlier.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
@@ -602,12 +626,17 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

N/A because swap support lacks a runtime upgrade/downgrade path; kubelet must
be restarted with or without swap support.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

<!--
Even if applying deprecation policies, they may still surprise some users.
-->

No.

### Monitoring Requirements

<!--
@@ -622,12 +651,26 @@ checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

KubeletConfiguration has set `failOnSwap: false`.
A Contributor commented:

Can I tell two nodes apart via the Kubernetes API?

- one has `failOnSwap: false` and `memorySwap` set to `swapBehavior: LimitedSwap`, with the NodeSwap feature gate enabled
- another has `failOnSwap: false` and the NodeSwap feature gate disabled

If so, I'd mention how to distinguish them.

@ehashman (Member, Author) replied:

I'm not sure if that is bubbled up to the API Server. The purpose of this question is for beta, when the feature gate is defaulted on, so you can't rely on it being turned on as a sign that the feature is in use. We might be able to check if swapBehavior is explicitly set, but empty string is equivalent to LimitedSwap.

Realistically, this KEP iterates on the existing unsupported configuration with failOnSwap: false. Because it was previously unsupported, I am assuming here that a production environment would not have it set if it were not using this feature.

The Contributor replied:

Once this is beta we should assume that people have this feature gate set to a value of their choice. For alpha it was different: you needed to be a little braver to try it, and most clusters run with the default, i.e. feature enabled.

The switch from unsupported to “mostly supported, but it's still beta” is why I'm asking about observability.


The Prometheus `node_exporter` will also export stats on swap memory utilization.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
Pick one more of these and delete the rest.
-->

TBD. We will determine a set of metrics as a requirement for beta graduation.
We will need more production data; there is not a single metric or set of
metrics that can be used to generally quantify node performance.

This section is to be updated before the feature can be marked as graduated; it will be worked on during 1.23 development.

We will also add swap memory utilization to the Kubelet stats API, to provide a means of monitoring this beyond cAdvisor Prometheus stats.

- [ ] Metrics
  - Metric name:
  - [Optional] Aggregation method:
@@ -647,13 +690,17 @@ high level (needs more precise definitions) those may be things like:
- 99,9% of /health requests per day finish with 200 code
-->

N/A

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->

N/A

### Dependencies

<!--
@@ -784,6 +831,8 @@ details). For now, we leave it here.

###### How does this feature react if the API server and/or etcd is unavailable?

No change. Feature is specific to individual nodes.

###### What are other known failure modes?

<!--
Expand All @@ -799,8 +848,23 @@ For each of them, fill in the following information by copying the below templat
- Testing: Are there any tests for failure mode? If not, describe why.
-->


Individual nodes with swap memory enabled may experience performance
degradations under load. This could potentially cause a cascading failure on
nodes without swap: if nodes with swap fail Ready checks, workloads may be
rescheduled en masse.

Thus, cluster administrators should be careful while enabling swap. To minimize
disruption, you may want to taint nodes with swap available to protect against
this problem. Taints will ensure that workloads which tolerate swap will not
spill onto nodes without swap under load.
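
To make the tainting approach concrete, here is a sketch (the taint key `example.com/swap` and all names are hypothetical; in practice the taint would typically be applied by provisioning tooling or `kubectl taint`):

```yaml
# Illustrative only: nodes provisioned with swap carry a NoSchedule taint,
# so only workloads that explicitly tolerate swap are scheduled onto them.
apiVersion: v1
kind: Node
metadata:
  name: swap-node-1                # hypothetical node name
spec:
  taints:
  - key: example.com/swap          # hypothetical taint key
    value: "true"
    effect: NoSchedule
---
apiVersion: v1
kind: Pod
metadata:
  name: swap-tolerant-workload     # hypothetical pod name
spec:
  tolerations:
  - key: example.com/swap
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
```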

###### What steps should be taken if SLOs are not being met to determine the problem?

It is suggested that, if nodes with swap memory enabled cause performance or stability degradations, those nodes be cordoned, drained, and replaced with nodes that do not use swap memory.

## Implementation History

- **2015-04-24:** Discussed in [#7294](https://github.com/kubernetes/kubernetes/issues/7294).
4 changes: 2 additions & 2 deletions keps/sig-node/2400-node-swap/kep.yaml
@@ -20,12 +20,12 @@ prr-approvers:
- "@deads2k"

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.22"
latest-milestone: "v1.23"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone: