- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Required changes for a WATCH request with the SendInitialEvents=true
- Required changes for a WATCH request with the RV set to the last observed value (RV > 0)
- Provide a fix for the long-standing issue kubernetes/kubernetes#59848
- Replacing standard List request with WatchList mechanism for client-go's List method.
- Test Plan
- Graduation Criteria
- Upgrade / Downgrade Strategy
- Version Skew Strategy
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Appendix
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The kube-apiserver is vulnerable to memory explosion. The issue is apparent in larger clusters, where only a few LIST requests might cause serious disruption. Uncontrolled and unbounded memory consumption of the servers affects not only clusters that operate in HA mode but also other programs that share the same machine. In this KEP we propose a solution to this issue.
Today informers are the primary source of LIST requests. The LIST is used to get a consistent snapshot of data to build up a client-side in-memory cache. The primary issue with LIST requests is unpredictable memory consumption. The actual usage depends on many factors like the page size, applied filters (e.g. label selectors), query parameters, and sizes of individual objects. See the Appendix section for more details on potential sources of LIST requests and their impact on memory. In extreme cases, the server can allocate hundreds of megabytes per request.
To better visualize the issue, consider the graph above. It shows the memory usage of an API server during a test (see the manual test section for more details). We can see that increasing the number of informers drastically increases the memory consumption of the server. Moreover, around 16:40 we lost the server after running 16 informers. During an investigation, we realized that the server allocates a lot of memory for handling LIST requests. In short, it needs to bring data from the database, unmarshal it, do some conversions, and prepare the final response for the client. The bottom line is around O(5*the_response_from_etcd) of temporary memory consumption. Neither priority and fairness nor the Golang garbage collector is able to protect the system from exhausting memory.
A situation like that is dangerous in two ways. First, as we saw, it can slow down, if not fully stop, an API server that has received such requests. Second, a sudden and uncontrolled spike in memory consumption will likely put pressure on the node itself. This might lead to thrashing, starving, and finally losing other processes running on the same node, including the kubelet. Stopping the kubelet has serious consequences as it leads to workload disruption and a much bigger blast radius. Note that in that scenario even clusters in an HA setup are affected.
Worse, in rare cases (see the Appendix section for more), recovery of large clusters, with correspondingly many kubelets and hence informers for pods, secrets, and configmaps, can lead to a very expensive storm of LISTs.
- protect kube-apiserver and its node against list-based OOM attacks
- considerably reduce (temporary) memory footprint of LISTs, down from O(watchers*page-size*object-size*5) to O(watchers*constant), constant around 2 MB.
Example: 512 watches of 400 MB of data: 512*500*2MB*5 = 2.5 TB ↘ 2 GB, racing with Golang GC to free this temporary memory before being OOM'ed.
- reduce etcd load by serving from watch cache
- get a replacement for paginated lists from watch-cache, which is not feasible without major investment
- enforce consistency in the sense of freshness of the returned list
- be backward compatible with new client -> old server
- fix the long-standing "stale reads from the cache" issue, kubernetes/kubernetes#59848
- get rid of list or list pagination
- rewrite the list storage stack to allow streaming; rather, use the existing streaming infrastructure (watches).
In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use streaming from the watch-cache instead of paging from etcd. Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see Appendix section for more details on how informers operate today). The primary idea is to use standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs. This would allow us to keep memory allocations constant. The server is bounded by the maximum allowed size of an object of 1.5 MB in etcd (note that the same object in memory can be much bigger, even by an order of magnitude) plus a few additional allocations, that will be explained later in this document. The rough idea/plan is as follows:
- step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
- step 2: upon receiving the request from an informer, compute the RV at which the result should be returned (possibly contacting etcd if consistent read was requested). It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary and ensures we will meet the consistency requirements of the request.
- step 2a: send all objects currently stored in memory for the given resource type.
- step 2b: propagate any updates that might have happened meanwhile until the watch cache catches up to the latest RV received in step 2.
- step 2c: send a bookmark event to the informer with the given RV.
- step 3: listen for further events using the request from step 1.
Note: the proposed watch-list semantics (without the bookmark event and without the consistency guarantee) are already followed by the kube-apiserver for RV="0" watches. The mode is not used in informers today but is supported by every kube-apiserver for legacy, compatibility reasons. A watch started with RV="0" may return stale data. It is possible for the watch to start at a much older resource version than the client has previously observed, particularly in high availability configurations, due to partitions or stale caches.
Note 2: informers need consistent lists to avoid time travel when initializing after a restart or when switching to another HA instance of kube-apiserver with an outdated/lagging watch cache. See the following issue for more details.
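To make the request shape concrete, below is a minimal, illustrative client-side sketch of opening such a stream with a standard client-go clientset; the query parameters follow the ListOptions extension described later in this document, and the kubeconfig wiring and the choice of secrets are assumptions for the example only.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: a kubeconfig in the default location; any rest.Config works.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	sendInitialEvents := true
	// ResourceVersion "" + NotOlderThan asks for a consistent snapshot first,
	// streamed as individual events, followed by the special bookmark.
	w, err := clientset.CoreV1().Secrets("default").Watch(context.TODO(), metav1.ListOptions{
		ResourceVersion:      "",
		ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
		SendInitialEvents:    &sendInitialEvents,
		AllowWatchBookmarks:  true,
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		fmt.Printf("%s %v\n", event.Type, event.Object.GetObjectKind().GroupVersionKind())
	}
}
```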
The following sequence diagram depicts the steps needed to complete the proposed feature. A high-level overview of each step is provided in the table that immediately follows the diagram, while a detailed description of each required step is given further down in this section.
Step | Description |
---|---|
1. | The reflector establishes a WATCH request with the watch cache. |
2. | If needed, the watch cache contacts etcd for the most up-to-date ResourceVersion. |
2a. | The watch cache starts streaming initial data it already has in memory. |
2b. | The watch cache waits until it has observed data up to the RV received in step 2, streaming any new data to the reflector immediately. |
2c. | The watch cache has observed the RV (from step 2) and sends a bookmark event with the given RV to the reflector. |
3. | The reflector replaces its internal store with collected items, updates its internal resourceVersion to the one obtained from the bookmark event. |
3a. | The reflector uses the WATCH request from step 1 for further progress notifications. |
Step 1: On initialization the reflector gets a snapshot of data from the server by passing RV="" (= unset value) to ensure freshness and setting resourceVersionMatch=NotOlderThan and sendInitialEvents=true. We do that only during the initial ListAndWatch call. Each event (ADD, UPDATE, DELETE) except the BOOKMARK event received from the server is collected. Passing resourceVersion="" tells the cacher it has to guarantee that the cache is at least as up to date as a LIST executed at the same time.
Note: This ensures that returned data is consistent, served from etcd via a quorum read and prevents "going back in time".
Note 2: Watch cache currently doesn't have the feature of supporting resourceVersion="" and thus is vulnerable to stale reads, see kubernetes/kubernetes#59848 for more details.
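The sketch below, which is illustrative only and not the actual reflector code, shows the collection loop from step 1 against the generic watch.Interface: every event except BOOKMARK is collected, and the bookmark carrying the end-of-initial-events annotation (proposed later in this document) signals that the snapshot is complete.

```go
package watchlist

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
)

// collectInitialState drains the watch stream until the special bookmark that
// marks the end of the initial events, returning the collected objects and the
// resourceVersion the reflector should sync its store to.
func collectInitialState(w watch.Interface) ([]runtime.Object, string, error) {
	var items []runtime.Object
	for event := range w.ResultChan() {
		switch event.Type {
		case watch.Added, watch.Modified:
			items = append(items, event.Object)
		case watch.Deleted:
			// A real reflector applies deletions to its store; omitted here
			// to keep the sketch short.
		case watch.Bookmark:
			acc, err := meta.Accessor(event.Object)
			if err != nil {
				return nil, "", err
			}
			if acc.GetAnnotations()["k8s.io/initial-events-end"] == "true" {
				// The snapshot is complete; the same connection is kept open
				// for further watch events (step 3).
				return items, acc.GetResourceVersion(), nil
			}
		case watch.Error:
			return nil, "", apierrors.FromObject(event.Object)
		}
	}
	return nil, "", fmt.Errorf("watch closed before the initial events ended")
}
```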
Step 2: Right after receiving a request from the reflector, the cacher gets the current resourceVersion (aka bookmarkAfterResourceVersion) directly from etcd. It is used to make sure the cacher is up to date (has seen data stored in etcd) and to let the reflector know it has seen all initial data. There are ways to do that cheaply, e.g. we could issue a count request against the datastore. Next, the cacher creates a new cacheWatcher (implements watch.Interface), passing the given bookmarkAfterResourceVersion, and gets initial data from the watchCache. After sending the initial data the cacheWatcher starts listening on an input channel for new events, including a bookmark event. At some point, the cacher will receive an event with a resourceVersion equal to or greater than the bookmarkAfterResourceVersion. It will be propagated to the cacheWatcher and then back to the reflector as a BOOKMARK event.
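As an illustration of the cheap revision lookup mentioned above, the hedged sketch below issues a count-only range request with the etcd v3 client; the response header carries the current revision without transferring any values. The endpoints and key prefix are assumptions for the example, not the actual kube-apiserver wiring.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// currentRevision returns the most recent etcd revision (a candidate for
// bookmarkAfterResourceVersion) using a count-only request, so no key/value
// payload is read or transferred.
func currentRevision(ctx context.Context, cli *clientv3.Client, prefix string) (int64, error) {
	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix(), clientv3.WithCountOnly())
	if err != nil {
		return 0, err
	}
	return resp.Header.Revision, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	rev, err := currentRevision(context.TODO(), cli, "/registry/secrets/")
	if err != nil {
		panic(err)
	}
	fmt.Println("bookmarkAfterResourceVersion:", rev)
}
```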
Step 2a: Where does the initial data come from?
During construction, the cacher creates the reflector and the watchCache. Since the watchCache implements the Store interface it is used by the reflector to store all data it has received from etcd.
Step 2b: What happens when new events are received while the cacheWatcher is sending initial data?
The cacher maintains a list of all current watchers (cacheWatcher) and a separate goroutine (dispatchEvents) for delivering new events to the watchers. New events are added via the cacheWatcher.nonblockingAdd method, which adds an event to the cacheWatcher.input channel. The cacheWatcher.input is a buffered channel and has a different size for different resources (10 or 1000). Since the cacheWatcher starts processing the cacheWatcher.input channel only after sending all initial events, it might block once its buffered channel fills up. In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event. All watchers that have failed to deliver the event will be closed.
Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.
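The non-blocking delivery described above essentially boils down to a select with a default branch; the simplified sketch below illustrates the idea and does not mirror the real cacheWatcher implementation.

```go
package watchlist

import "k8s.io/apimachinery/pkg/watch"

// cacheWatcherSketch is a simplified stand-in for the real cacheWatcher.
type cacheWatcherSketch struct {
	input chan watch.Event // buffered; e.g. 10 or 1000 depending on the resource
}

// nonblockingAdd returns false instead of blocking when the buffer is full;
// the caller can then park the watcher on a blockedWatchers list and retry
// once all non-blocked watchers have received the event.
func (c *cacheWatcherSketch) nonblockingAdd(ev watch.Event) bool {
	select {
	case c.input <- ev:
		return true
	default:
		return false
	}
}
```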
For an alpha version, we will delay closing the watch request until all data is sent to the client. We expect this to behave well even in heavily loaded clusters. To increase confidence in the approach, we will collect metrics for measuring how far the cache is behind the expected RV, what's the average buffer size, and a counter for closed watch requests due to an overfull buffer.
For a beta version, we have further options if they turn out to be necessary:
- comparing the bookmarkAfterResourceVersion (from Step 2) with the current RV the watchCache is on
and waiting until the difference between the RVs is < 1000 (the buffer size). We could do that even before sending the initial events.
If the difference is greater than that it seems there is no need to go on since the buffer could be filled before we will receive an event with the expected RV.
Assuming all updates would be for the resource the watch request was opened for (which seems unlikely).
In case the watchCache was unable to catch up to the bookmarkAfterResourceVersion for some timeout value hard-close (ends the current connection by tearing down the current TCP connection with the client) the current connection so that client re-connects to a different API server with most-up to date cache.
Taking into account the baseline etcd performance numbers waiting for 10 seconds will allow us to receive ~5K events, assuming ~500 QPS throughput (see https://etcd.io/docs/v3.4/op-guide/performance/)
Once we are past this step (we know the difference is smaller) and the buffer fills up we:
-
case-1: won’t close the connection immediately if the bookmark event with the expected RV exists in the buffer. In that case, we will deliver the initial events, any other events we have received which RVs are <= bookmarkAfterResourceVersion, and finally the bookmark event, and only then we will soft-close (simply ends the current connection without tearing down the TCP connection) the current connection. An informer will reconnect with the RV from the bookmark event. Note that any new event received was ignored since the buffer was full.
-
case-2: soft-close the connection if the bookmark event with the expected RV for some reason doesn't exist in the buffer. An informer will reconnect arriving at the step that compares the RVs first.
-
- make the buffer dynamic - especially when the difference between RVs is > than 1000
- inject new events directly to the initial list, i.e. to have the initial list loop consume the channel directly and avoid to wait for the whole initial list being processed before
- cap the size (cannot allocate more than X MB of memory) of the buffer
- maybe even apply some compression techniques to the buffer (for example by only storing a low-memory shallow reference and take the actual objects for the event from the store)
Note: The RV is effectively a global counter that is incremented every time an object is updated. This imposes a global order of events. It is equivalent to a LIST followed by a WATCH request.
Note 2: Currently, there is a timeout for LIST requests of 60s. That means a slow reflector might fail synchronization as well and would have to re-establish the connection.
Step 2c: How are bookmarks delivered to the cacheWatcher?
First of all, the primary purpose of bookmark events is to deliver the current resourceVersion to watchers, continuously, even without regular events happening. There are two sources of resourceVersions. The first one is regular events that contain RVs alongside objects. The second one is a special type of etcd event called progressNotification, delivering the most up-to-date revision at a given interval to the kube-apiserver only. As already mentioned in 2a, the watchCache is driven by the reflector. Every event will eventually be propagated from the watchCache to the cacher.processEvent method. For simplicity, we can assume that the processEvent method will simply update the resourceVersion maintained by the cacher.
At regular intervals, the cacher checks expired watchers and tries to deliver a bookmark event. As of today, the interval is set to 1 second. The bookmark event contains an empty object and the current resourceVersion. By default, a cacheWatcher expires roughly every 1 minute.
Initially, the expiry interval will be decreased to 1 second in this feature's code path. This helps us deliver a bookmark event that is >= bookmarkAfterResourceVersion much faster. After that, the interval will be put back to the previous value.
Note: Since we get a notification from etcd every 5 seconds and we try to deliver a bookmark every 1 second, the maximum delay a reflector will have to wait after receiving the initial data seems to be 6 seconds (assuming a small dataset). That worst case is unlikely in practice, since we might receive the bookmarkAfterResourceVersion even before handling the initial data, and sending the data itself takes some time as well.
Step 3: After receiving a BOOKMARK event the reflector is considered to be synchronized. It replaces its internal store with the collected items (syncWith) and reuses the current connection for getting further events.
Extend the ListOptions struct with the following field:
type ListOptions struct {
...
// `sendInitialEvents=true` may be set together with `watch=true`.
// In that case, the watch stream will begin with synthetic events to
// produce the current state of objects in the collection. Once all such
// events have been sent, a synthetic "Bookmark" event will be sent.
// The bookmark will report the ResourceVersion (RV) corresponding to the
// set of objects, and be marked with `"k8s.io/initial-events-end": "true"` annotation.
// Afterwards, the watch stream will proceed as usual, sending watch events
// corresponding to changes (subsequent to the RV) to objects watched.
//
// When `sendInitialEvents` option is set, we require `resourceVersionMatch`
// option to also be set. The semantic of the watch request is as following:
// - `resourceVersionMatch` = NotOlderThan
// is interpreted as "data at least as new as the provided `resourceVersion`"
// and the bookmark event is sent when the state is synced
// to a `resourceVersion` at least as fresh as the one provided by the ListOptions.
// If `resourceVersion` is unset, this is interpreted as "consistent read" and the
// bookmark event is sent when the state is synced at least to the moment
// when request started being processed.
// - `resourceVersionMatch` set to any other value or unset
// Invalid error is returned.
//
// Defaults to true if `resourceVersion=""` or `resourceVersion="0"` (for backward
// compatibility reasons) and to false otherwise.
SendInitialEvents *bool
}
The watch bookmark marking the end of initial events stream will have a dedicated annotation:
"k8s.io/initial-events-end": "true"
(the exact name is subject to change during API review). It will allow clients to precisely figure out when the initial stream of events is finished.
It's worth noting that explicitly setting SendInitialEvents to false with ResourceVersion="0" will result in not sending initial events, which makes the option work exactly the same across every potential resource version passed as a parameter.
- Avoid DeepCopying of initial data
The watchCache has an important optimization of wrapping objects into a cachingObject. Given that objects aren't usually modified (since selfLink has been disabled) and the fact that there might be multiple watchers interested in receiving an event, wrapping allows us to serialize an object only once. The watchCache maintains two internal data structures. The first one is called the store and is driven by the reflector. It essentially mirrors the content stored in etcd. It is used to serve LIST requests. The second one is called the cache, which represents a sliding window of recent events received from the reflector. It is effectively used to serve WATCH requests from a given RV.
By design cachingObjects are stored only in the cache. As described in Step 2, the cacheWatcher gets initial data from the watchCache. The latter, in turn, gets data straight from the store. That means initial data is not wrapped into cachingObject and hence not subject to this existing optimization.
Before sending objects any further, the cacheWatcher does a DeepCopy of every object that has not been wrapped into a cachingObject. Making a copy of every object is both CPU and memory intensive. It is a serious issue that needs to be addressed.
- Reduce the number of allocations in the WatchServer
The WatchServer is largely responsible for streaming data received from the storage layer (in our case from the cacher) back to clients. It turns out that sending a single event per consumer requires 4 memory allocations, visualized in the following image. Two of them deserve special attention, namely allocations 1 and 3, because they won't reuse memory and rely on the GC for cleanup. In other words, the more events we need to send, the more (temporary) memory will be used. In contrast, the other two allocations are already optimized as they reuse memory instead of creating new buffers for every single event. For better utilization, a similar technique of reusing memory could be used to save precious RAM and scale the system even further.
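As an illustration of the buffer-reuse technique suggested above, the hedged sketch below uses a sync.Pool; the real WatchServer and its encoders are more involved, so treat this purely as a sketch of the pattern rather than the actual implementation.

```go
package watchserver

import (
	"bytes"
	"sync"
)

// bufferPool hands out reusable buffers so per-event temporary allocations
// are returned to the pool instead of being left for the garbage collector.
var bufferPool = sync.Pool{
	New: func() interface{} { return &bytes.Buffer{} },
}

// encodeAndSend encodes a single event into a pooled buffer and hands the
// bytes to send; encode and send are placeholders for the real serializer
// and the framed HTTP writer.
func encodeAndSend(encode func(*bytes.Buffer) error, send func([]byte) error) error {
	buf := bufferPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // keep the allocated capacity for the next event
		bufferPool.Put(buf)
	}()
	if err := encode(buf); err != nil {
		return err
	}
	return send(buf.Bytes())
}
```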
For the past few years, we have seen many clusters suffering from this issue. Sadly, our only possible recommendation was to ask customers to reduce the cluster in size, since adding more memory would in most cases not fix the issue. Recall from the motivation section that just a few requests can allocate gigabytes of data in a fraction of a second.
In order to reproduce the issue, we executed the following manual test; it is the simplest and cheapest way of putting yourself into customers' shoes. The reproducer creates a namespace with 400 secrets, each containing 1 MB of data.
Next, it uses informers to get all secrets in the cluster.
The rough estimate is that a single informer will have to bring at least 400MB from the datastore to get all secrets.
The result: 16 informers were able to take down the test cluster.
We have prepared the following PR kubernetes/kubernetes#106477, which is almost identical to the proposed solution; it just differs in a few details.
The following image depicts the results we obtained after running the synthetic test described above.
First of all, it is worth mentioning that the PR was deployed onto the same cluster so that we could ensure an identical setup (CPU, Memory) between the tests.
The graph tells us a few things.
Firstly, the proposed solution is at least 100 times better than the current state.
Around 12:05 we started the test with 1024 informers, all eventually synced without any errors.
Moreover during that time the server was stable and responsive.
That particular test ended around 12:30. That means it needed ~25 minutes to bring ~400 GB of data across the network!
Impressive achievement.
Secondly, it tells us that memory allocation is not yet proportional to the number of informers. Given the size of individual objects of 1MB and the actual number of informers, we should allocate roughly around 2GB of RAM.
We managed to get and analyze the memory profile that showed a few additional allocations inside the watch server.
At this point, it is worth mentioning that the results were achieved with only the first optimization applied.
We expect the system will scale even better with the second optimization as it will put significantly less pressure on the GC.
In that case, no additional changes are required. We stick to existing semantics. That is we start a watch at an exact resource version. The watch events are for all changes after the provided resource version. This is safe because the client is assumed to already have the initial state at the starting resource version since the client provided the resource version.
Provide a fix for the long-standing issue kubernetes/kubernetes#59848
The issue is still open mainly because informers default to resourceVersion="0" for their initial LIST requests. This is problematic because the initial LIST requests served from the watch cache might return data that are arbitrarily delayed. This in turn could make clients connected to that server read old data and undo recent work that has been done.
To make consistent reads from cache for LIST requests and thus prevent "going back in time" we propose to use the same technique for ensuring the cache is not stale as described in the previous section.
In that case we are going to change informers to pass resourceVersion="0" and resourceVersionMatch=MostRecent for their initial LIST requests. Then on the server side we:
- get the current revision from etcd.
- use the existing waitUntilFreshAndBlock function to wait for the watch to catch up to the revision requested in the previous step.
- reject the request if waitUntilFreshAndBlock times out, thus forcing informers to retry.
- otherwise, construct the final list and send it back to the client.
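A rough, hedged sketch of that server-side flow is shown below; watchCacheSketch, currentEtcdRevision, snapshot, and the timeout value are hypothetical stand-ins chosen for the example, not the actual kube-apiserver code.

```go
package cacher

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
)

// watchCacheSketch and currentEtcdRevision are hypothetical stand-ins for the
// real watch cache and the etcd revision lookup; they only exist to make the
// flow below compile.
type watchCacheSketch struct{}

func (c *watchCacheSketch) waitUntilFreshAndBlock(ctx context.Context, rv int64) error { return nil }
func (c *watchCacheSketch) snapshot() ([]runtime.Object, string)                       { return nil, "" }
func currentEtcdRevision(ctx context.Context) (int64, error)                           { return 0, nil }

func listFromCacheConsistently(ctx context.Context, cache *watchCacheSketch) ([]runtime.Object, string, error) {
	// 1. Get the current revision from etcd (quorum read).
	rv, err := currentEtcdRevision(ctx)
	if err != nil {
		return nil, "", err
	}

	// 2. Block until the watch cache has observed that revision.
	waitCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()
	if err := cache.waitUntilFreshAndBlock(waitCtx, rv); err != nil {
		// 3. Timing out rejects the request, forcing the informer to retry.
		return nil, "", apierrors.NewTimeoutError("watch cache is not fresh enough", 1)
	}

	// 4. Otherwise, build the final list from the cache and send it back.
	items, cacheRV := cache.snapshot()
	return items, cacheRV, nil
}
```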
Replacing the underlying implementation of the List method for client-go based clients (like typed or dynamic client) with the WatchList mechanism requires ensuring that the data returned by both the standard List request and the new WatchList mechanism remains identical. The challenge is that WatchList no longer retrieves the entire list from the server at once but only receives individual items, which forces us to "manually" reconstruct the list object on the client side.
To correctly construct the list object on the client side, we need the ListKind information. However, simply reconstructing the list object based on this data is not enough. In the case of a standard List request, the server's response (a versioned list) is processed through a chain of decoders, which can potentially modify the resulting list object. A good example is the WithoutVersionDecoder, which removes the GVK information from the list object. Thus the "manually" constructed list object may not be consistent with the transformations applied by the decoders, leading to differences.
To ensure full compatibility, the server must provide a versioned empty list in the format requested by the client (e.g., protobuf representation). We don't know how the client's decoder behaves for different encodings, i.e., whether the decoder actually supports the encoding we intend to use for reconstruction. Therefore, to ensure maximal compatibility, we will ensure that the encoding used for the reconstruction of the list matches the format that the client originally requested. This guarantees that the returned list object can be correctly decoded by the client, preserving the actual encoding format as intended.
The proposed solution is to add a new annotation (k8s.io/initial-events-list-blueprint) to the object returned in the bookmark event (the bookmark event is sent when the state is synced and marks the end of the WatchList stream). This annotation will store an empty, versioned list encoded as a Base64 string. It will be added to the same object/place where the k8s.io/initial-events-end annotation is added.
When the client receives such a bookmark, it will base64-decode the empty list and pass it to the decoder chain. Only after a successful response from the decoders will the list be populated with data received from subsequent watch events and returned.
For example:
GET /api/v1/namespaces/test/pods?watch=1&sendInitialEvents=true&allowWatchBookmarks=true&resourceVersion=&resourceVersionMatch=NotOlderThan
---
200 OK
Transfer-Encoding: chunked
Content-Type: application/json
{
"type": "ADDED",
"object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "8467", "name": "foo"}, ...}
}
{
"type": "ADDED",
"object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "5726", "name": "bar"}, ...}
}
{
"type":"BOOKMARK",
"object":{"kind":"Pod","apiVersion":"v1","metadata":{"resourceVersion":"13519","annotations":{"k8s.io/initial-events-end":"true","k8s.io/initial-events-embedded-list":"eyJraW5kIjoiUG9kTGlzdCIsImFwaVZlcnNpb24iOiJ2MSIsIm1ldGFkYXRhIjp7fSwiaXRlbXMiOm51bGx9Cg=="}} ...}
}
...
<followed by regular watch stream starting>
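The hedged sketch below shows how a client could act on such a bookmark: base64-decode the blueprint annotation, run it through the client's decoder chain, and only then populate the decoded list with the previously collected items. The annotation key is taken from the example above and, as noted, may change during API review; reconstructList is a hypothetical helper, not an existing client-go function.

```go
package watchlist

import (
	"encoding/base64"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// reconstructList rebuilds the versioned list object from the bookmark's
// blueprint annotation and fills it with the items gathered from the
// preceding watch events.
func reconstructList(bookmark metav1.Object, decoder runtime.Decoder, items []runtime.Object, listRV string) (runtime.Object, error) {
	// 1. Base64-decode the empty, versioned list carried by the bookmark.
	raw, err := base64.StdEncoding.DecodeString(bookmark.GetAnnotations()["k8s.io/initial-events-embedded-list"])
	if err != nil {
		return nil, err
	}

	// 2. Pass it through the client's decoder chain so any transformations
	//    (e.g. dropping GVK information) are applied exactly as for a
	//    regular LIST response.
	emptyList, _, err := decoder.Decode(raw, nil, nil)
	if err != nil {
		return nil, err
	}

	// 3. Populate the decoded list with the collected items and stamp it
	//    with the resourceVersion from the bookmark.
	if err := meta.SetList(emptyList, items); err != nil {
		return nil, err
	}
	listMeta, err := meta.ListAccessor(emptyList)
	if err != nil {
		return nil, err
	}
	listMeta.SetResourceVersion(listRV)
	return emptyList, nil
}
```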
Alternatives
We could modify the type of the object passed in the last bookmark event to include the list. This approach would require changes to the reflector, as it would need to recognize the new object type in the bookmark event. However, this could potentially break other clients that are not expecting a different object in the bookmark event.
Another option would be to issue an empty list request to the API server to receive a list response from the client. This approach would involve modifying client-go and implementing some form of caching mechanism, possibly with invalidation policies. Non-client-go clients that want to use this new feature would need to rebuild this mechanism as well.
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
- k8s.io/apiserver/pkg/storage/cacher: 02/02/2023 - 74.7%
- k8s.io/client-go/tools/cache/reflector: 02/02/2023 - 88.6%
- For alpha, tests asserting fallback mechanism for reflector will be added.
- For alpha, tests exercising this feature will be added.
- The feature is implemented behind the WatchList feature flag
- Initial e2e tests completed and enabled
- Scalability/Performance tests confirm gains of this feature
- Add support for watchlist to APF
- Metrics are added to the kube-apiserver (see the monitoring-requirements section for more details)
- Implement SendInitialEvents for watch requests in the etcd storage implementation
- The feature is enabled for kube-apiserver and kube-controller-manager
- The generic feature gate mechanism is implemented in client-go. It will be used to enable a new functionality for reflectors/informers.
- Implement a consistency check detector that will compare data received through a new watchlist request with data obtained through a standard list request. The detector will be added to the reflector and activated when an environment variable is set. The environment variable will be set for all jobs run in the Kube CI.
- Update the client-go generated List function to use the WatchList mechanism to retrieve data when the feature gate has been enabled and the ListOptions are satisfied. This change must be applied to the typed, dynamic and metadata clients.
- Implement a mechanism for automatically detecting the etcd configuration, i.e. whether it is safe to use the RequestWatchProgress API call or whether the experimental-watch-progress-notify-interval flag has been set. Knowledge of the etcd configuration will be used to automatically disable the streaming feature.
- Use WatchProgressRequester to request progress notifications directly from etcd. This mechanism was developed in Consistent Reads from Cache KEP and will reduce the overall latency for watchlist requests.
- The watchlist call, which serves as a drop-in replacement for list calls in client libraries, must properly set the kind and apiVersion fields. These fields are important for the correct decoding of the objects. See also: kubernetes/kubernetes#126191
- Switch the storage/cacher to use streaming directly from etcd (this will also allow us to remove the reflector.UseWatchList field).
- Make list calls expensive in APF. Once all supported releases have the streaming list enabled by default (client-go, control plane components) and the feature itself is locked to its default value, we can increase the cost of regular list requests in APF. This ensures that the fallback mechanism, which switches back to the standard list when streaming has issues, will not be affected.
Our immediate idea to ensure backward compatibility between new clients and old servers would be to return a 401 response in old Kubernetes releases (via backports).
This approach however would limit the maximum supported version skew to just a few previous releases, and would also force customers to update to the latest minor versions.
Therefore we propose to make use of the already existing "resourceVersionMatch" LIST option.
WATCH requests with that option set will be immediately rejected with a 403 (Forbidden) response by previous servers.
In that case, new clients will fall back to the previous mode (ListAndWatch).
New servers will allow for WATCH requests to have "resourceVersionMatch=MostRecent" set.
Existing clients will be forward and backward compatible and won't require any changes since the server will preserve the old behavior (ListAndWatch).
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: WatchList
  - Components depending on the feature gate:
    - kube-apiserver
  - Feature gate name: WatchListClient (the actual name might be different because it hasn't been added yet)
  - Components depending on the feature gate:
    - kube-controller-manager via client-go library
- Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?
No, because users must enable the feature on the client side (client-go).
Yes, by disabling the WatchList feature gate for kube-apiserver. In this case kube-apiserver will reject WATCH requests with the new query parameter, forcing informers to fall back to the previous mode.
Yes, by disabling the WatchListClient feature gate for kube-controller-manager. In this case informers will follow standard LIST/WATCH semantics.
Note that for safety reasons, reflectors/informers will always fallback to a regular LIST operation regardless of the error that occurred.
The expected behavior of the feature will be restored.
Yes. There is an integration test that verifies the fallback mechanism of the reflector when interacting with servers that have the WatchList feature enabled/disabled.
The feature does not have a direct impact on rollout/rollback.
However, faulty behavior of a feature can result in incorrect functioning of components that rely on that feature. For the Beta version, we plan to enable it exclusively for kube-controller-manager. The main issues can arise during the initial informer synchronization, which may result in controller failures.
Furthermore, if data consistency issues arise, such as missing data, the controllers simply do not consider the missing data.
- apiserver_terminated_watchers_total - a large number of terminated watchers might indicate synchronization issues, for example a client-side error where we're not getting data from the server, or a server-side error where the buffer is getting cluttered.
- apiserver_request_duration_seconds_bucket - in general, a large number of "short" watch requests can indicate synchronization issues.
- apiserver_watch_list_duration_seconds - the absence of this metric may indicate that the client did not receive the special bookmark. The issue here could be that the server never sent it due to an error or didn't even receive it from the database.
- apiserver_watch_list_duration_seconds - long synchronization times may indicate that the server is lagging behind etcd, for example not receiving progress notifications from the database frequently enough.
- apiserver_watch_cache_lag - tells how far behind the server is compared to the database. Significant discrepancies affect the times for full data synchronization.
A good metric can also be the number of kube-controller-manager restarts, which may indicate issues with informer synchronization.
Upgrade->downgrade->upgrade testing was done manually using the following steps:
Build and run Kubernetes from the master branch using Kind.
kind build node-image --arch "arm64"
kind create cluster --image kindest/node:latest
kubectl get no
NAME STATUS ROLES AGE VERSION
kind-control-plane Ready control-plane 26s v1.29.0-alpha.1.47+f8571dabf79717
Check if the kube-apiserver (aka kas) has recorded the watchlist latency metric.
kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"
# HELP apiserver_watch_list_duration_seconds [ALPHA] Response latency distribution in seconds for watch list requests broken by group, version, resource and scope.
# TYPE apiserver_watch_list_duration_seconds histogram
…
apiserver_watch_list_duration_seconds_bucket{group="",resource="configmaps",scope="cluster",version="v1",le="6"} 1
Disable the WatchList feature gate for the kas by editing the static pod manifest directly.
docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-apiserver.yaml
and pass - --feature-gates=WatchList=false to the kas container.
Check if the kas has not recorded the watchlist latency metric.
kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"
Check if the kube-controller-manager (aka kcm) is running.
kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
…
kube-controller-manager-kind-control-plane 1/1 Running 1 (44s ago) 3m28s
Check if informers used by the kcm fell back to standard LIST/WATCH semantics.
kubectl logs -n kube-system kube-controller-manager-kind-control-plane | grep -e "watch-list"
W1002 09:11:40.656641 1 reflector.go:340] The watch-list feature is not supported by the server, falling back to the previous LIST/WATCH semantics
…
Disable the WatchListClient feature gate for the kcm by editing the static pod manifest directly.
docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-controller-manager.yaml
and pass - --feature-gates=WatchListClient=false to the kcm container.
Check if the kcm is running.
kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
…
kube-controller-manager-kind-control-plane 1/1 Running 0 12s
Check if the kas has not recorded the watchlist latency metric.
kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"
Check if there are no traces of informers for the kcm falling back to standard LIST/WATCH semantics.
kubectl logs -n kube-system kube-controller-manager-kind-control-plane | grep -e "watch-list"
Enable the WatchList feature gate for the kas by editing the static pod manifest directly.
docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-apiserver.yaml
and remove - --feature-gates=WatchList=false from the kas container.
Check if the kcm is running.
kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
…
kube-controller-manager-kind-control-plane 1/1 Running 1 (22s ago) 86s
Check if the kas has not recorded the watchlist latency metric.
kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"
Enable the WatchListClient feature gate for the kcm by editing the static pod manifest directly.
docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-controller-manager.yaml
and remove - --feature-gates=WatchListClient=false from the kcm container.
Check if the kcm is running.
kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
…
kube-controller-manager-kind-control-plane 1/1 Running 0 13s
Check if the kas has recorded the watchlist latency metric.
kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"
# HELP apiserver_watch_list_duration_seconds [ALPHA] Response latency distribution in seconds for watch list requests broken by group, version, resource and scope.
# TYPE apiserver_watch_list_duration_seconds histogram
…
apiserver_watch_list_duration_seconds_bucket{group="",resource="configmaps",scope="cluster",version="v1",le="6"} 1
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
If the apiserver_watch_list_duration_seconds metric has some data, then this feature is in use.
Assuming that historical data is available then comparing the number of LIST and WATCH requests to the server will tell whether the feature was enabled. When this feature is enabled, the number of LIST requests will be smaller. The difference primarily arises from switching informers to a new mode of operation.
Checking whether the WatchListClient FeatureGate has been set for the given component.
Knowing the username for a component, the audit logs could be examined to see whether sendInitialEvents=true has been set in the requestURI for that user.
Scanning the component's logs for the phrase "Reflector WatchList". For requests lasting more than 10 seconds, traces will be reported.
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
None have been defined yet.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name: apiserver_terminated_watchers_total (counter, already defined, needs to be updated (by an attribute) so that we count closed watch requests due to an overfull buffer in the new mode)
- Metric name: apiserver_watch_list_duration_seconds (histogram, measures latency of watch-list requests)
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
No.
No. On the contrary, the number of requests originating from informers will be reduced by half, from 2 (LIST/WATCH) to just 1 (WATCH).
No.
No
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
On the contrary. It will decrease the memory usage of kube-apiservers needed to handle "list" requests.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
On the contrary. It will decrease the memory usage required for master nodes.
When the kube-apiserver is unavailable then this feature will also be unavailable.
When etcd is unavailable, requests attempting to retrieve the most recent state of the cluster will fail.
- kube-controller-manager is unable to start.
  - Detection: How can it be detected via metrics? Examine the prometheus up time series, the pod status, or the number of restarts.
  - Mitigations: What can be done to stop the bleeding, especially for already running user workloads? Disable the feature by passing WatchListClient=false to the --feature-gates command line flag.
  - Diagnostics: What are the useful log messages and their required logging levels that could help debug the issue? N/A
  - Testing: Are there any tests for failure mode? If not, describe why. Yes, if kube-controller-manager is unable to start then a lot of existing e2e tests will fail.
No SLOs have been defined for this feature yet.
The KEP was proposed on 2022-01-14
N/A
- We could tune the cost function used by the priority and fairness feature. There are at least a few issues with this approach. The first is that we would have to come up with a cost estimation function that can approximate the temporary memory consumption. This might be challenging since we don't know the true cost of the entire list upfront as object sizes can vastly differ throughout the keyspace (imagine some namespaces with giant secrets, some with small secrets). The second issue, assuming we could estimate it, is that we would have to throttle the server to handle just a few requests at a given time as the estimate would likely be uniform over resource type or other coarse dimensions.
- We could attempt to define a function that would prevent the server from allocating more memory than a given threshold.
A function like that would require measuring memory usage in real-time. Things we evaluated:
- runtime.ReadMemStats gives us accurate measurement but at the same time is very expensive. It requires STW (stop-the-world) which is equivalent to stopping all running goroutines. Running with 100ms frequency would block the runtime 10 times per second.
- reading from proc would probably increase the CPU usage (polling) and would add some delay (propagation time from the kernel about current memory usage). Since the spike might be very sudden (milliseconds) it doesn’t seem to be a viable option.
- there seems to be no other API provided by golang runtime that would allow for gathering memory stats in real-time other than runtime.ReadMemStats
- using cgroup notification API is efficient (epoll) and near real-time but it seems to be limited in functionality. We could be notified about crossing previously defined memory thresholds but we would still need to calculate available(free) memory on a node.
- We could allow for paginated LIST requests to be served directly from the watch cache. This approach has a few advantages. Primarily it doesn't require changing informers, no version skew issues. At the same time, it also presents a few challenges. The most concerning is that it would actually not solve the issue. It seems it would also lead to (temporary) memory consumption because we would need to allocate space for the entire response (LIST), keep it in memory until the whole response has been sent to the client (which can be up to 60s) and this could be O({2,3}*the-size-of-the-page).
A LIST request can be satisfied from two places, largely depending on used query parameters:
- by default directly from etcd. In such cases, the memory demand might be extensive, exceeding the full response size from the data store many times.
- from the watch cache if explicitly requested by setting the ResourceVersion param of the list (e.g. ResourceVersion="0"). This is actually how most client-go-based controllers prime their caches due to performance reasons. The memory usage will be much lower than in the first case. However, it is not perfect as we still need space to store serialized objects and to hold the full response until it is sent.
The following steps depict a flow of how client-go-based informers work today.
- on startup: informers issue a LIST RV="0" request with pagination, which due to performance reasons translates to a full (pagination is ignored) LIST from the watch cache.
- repeated until ResourceExpired 410: establish a WATCH request with an RV from the previous step. Each received event updates the last-known RV. On disconnect, it repeats in this step until “IsResourceExpired” (410) error is returned.
- on resumption: establish a new LIST request to the watch cache with RV="last-known-from-step2" (step1) and then another WATCH request.
- after compaction (410): we set RV="" and get a snapshot via quorum read from etcd in chunks and go back to step 2
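For reference, the illustrative sketch below shows a typical client-go informer setup whose internal reflector performs exactly this flow; the kubeconfig wiring and the choice of secrets are assumptions for the example.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// The ListWatch below is what the reflector drives: an initial LIST
	// (RV="0", served from the watch cache) followed by a WATCH.
	lw := cache.NewListWatchFromClient(clientset.CoreV1().RESTClient(), "secrets", metav1.NamespaceAll, fields.Everything())
	informer := cache.NewSharedIndexInformer(lw, &corev1.Secret{}, 10*time.Minute, cache.Indexers{})

	stop := make(chan struct{})
	defer close(stop)
	go informer.Run(stop)

	if !cache.WaitForCacheSync(stop, informer.HasSynced) {
		panic("cache did not sync")
	}
	fmt.Println("synced secrets:", len(informer.GetStore().List()))
}
```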
In rare cases, an informer might connect to an API server whose watch cache hasn't been fully synchronized (after kube-apiserver restart). In that case its flow will be slightly different.
- on startup: informers issue a LIST RV="0" request with pagination, which effectively equals a paginated LIST RV="", i.e. it gets a consistent snapshot of data directly from etcd (quorum read) in chunks (pagination).
- repeated until ResourceExpired 410: they establish a WATCH request with an RV from the previous step. Each received event updates the last-known RV. On disconnect, it repeats in this step until “IsResourceExpired” (410) error is returned.
- on resumption: establish a paginated LIST RV="last-known-from-step2" request (step1) and then another WATCH request.
- after compaction (410): we set RV="" and get a snapshot via quorum read from etcd in chunks and go back to step 2
N/A