
KEP-3157: allow informers for getting a stream of data instead of chunking.

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The kube-apiserver is vulnerable to memory explosion. The issue is apparent in larger clusters, where only a few LIST requests might cause serious disruption. Uncontrolled and unbounded memory consumption of the servers affects not only clusters that operate in an HA mode but also other programs that share the same machine. In this KEP we propose a solution to this issue.

Motivation

Today informers are the primary source of LIST requests. The LIST is used to get a consistent snapshot of data to build up a client-side in-memory cache. The primary issue with LIST requests is unpredictable memory consumption. The actual usage depends on many factors like the page size, applied filters (e.g. label selectors), query parameters, and sizes of individual objects. See the Appendix section for more details on potential sources of LIST requests and their impact on memory. In extreme cases, the server can allocate hundreds of megabytes per request.

memory consumption during the synthetic test

To better visualize the issue, let's consider the above graph. It shows the memory usage of an API server during a test (see the manual test section for more details). We can see that increasing the number of informers drastically increases the memory consumption of the server. Moreover, around 16:40 we lost the server after running 16 informers. During an investigation, we realized that the server allocates a lot of memory for handling LIST requests. In short, it needs to bring data from the database, unmarshal it, do some conversions, and prepare the final response for the client. The bottom line is around O(5 * the_response_from_etcd) of temporary memory consumption. Neither priority and fairness nor the Golang garbage collection is able to protect the system from exhausting memory.

A situation like that is dangerous twofold. First, as we saw, it can slow down, if not fully stop, an API server that has received the requests. Secondly, a sudden and uncontrolled spike in memory consumption will likely put pressure on the node itself. This might lead to thrashing, starving, and finally losing other processes running on the same node, including the kubelet. Losing the kubelet is serious, as it leads to workload disruption and a much bigger blast radius. Note that in that scenario even clusters in an HA setup are affected.

Worse, in rare cases (see the Appendix section for more), recovery of large clusters, which have many kubelets and hence many informers for pods, secrets, and configmaps, can lead to a very expensive storm of LISTs.

Goals

  • protect kube-apiserver and its node against list-based OOM attacks
  • considerably reduce the (temporary) memory footprint of LISTs, down from O(watchers * page-size * object-size * 5) to O(watchers * constant), with the constant around 2 MB.

Example: 512 watches of 400 MB of data: 512 * 500 * 2 MB * 5 = 2.5 TB ↘ 2 GB.

Today the server is racing with the Golang GC to free this temporary memory before being OOM'ed.

  • reduce etcd load by serving from watch cache
  • get a replacement for paginated lists served from the watch cache (true pagination from the watch cache is not feasible without major investment)
  • enforce consistency in the sense of freshness of the returned list
  • be backward compatible with new client -> old server
  • fix the long-standing "stale reads from the cache" issue, kubernetes/kubernetes#59848

Non-Goals

  • get rid of list or list pagination
  • rewrite the list storage stack to allow streaming; instead, we reuse the existing streaming infrastructure (watches).

Proposal

In order to lower memory consumption while getting a list of data and make it more predictable, we propose to use streaming from the watch cache instead of paging from etcd. Initially, the proposed changes will be applied to informers as they are usually the heaviest users of LIST requests (see the Appendix section for more details on how informers operate today). The primary idea is to use the standard WATCH request mechanics for getting a stream of individual objects, but to use it for LISTs. This would allow us to keep memory allocations constant: the server is bounded by the maximum allowed object size in etcd of 1.5 MB (note that the same object in memory can be much bigger, even by an order of magnitude), plus a few additional allocations that will be explained later in this document. The rough idea/plan is as follows:

  • step 1: change the informers to establish a WATCH request with a new query parameter instead of a LIST request.
  • step 2: upon receiving the request from an informer, compute the RV at which the result should be returned (possibly contacting etcd if a consistent read was requested). It will be used to make sure the watch cache has seen objects up to the received RV. This step is necessary to meet the consistency requirements of the request.
  • step 2a: send all objects currently stored in memory for the given resource type.
  • step 2b: propagate any updates that might have happened meanwhile until the watch cache catches up to the latest RV received in step 2.
  • step 2c: send a bookmark event to the informer with the given RV.
  • step 3: listen for further events using the request from step 1.

Note: the kube-apiserver already follows the proposed watch-list semantics (without the bookmark event and without the consistency guarantee) for RV="0" watches. This mode is not used by informers today but is supported by every kube-apiserver for legacy/compatibility reasons. A watch started with RV="0" may return stale data. It is possible for the watch to start at a much older resource version than the client has previously observed, particularly in high-availability configurations, due to partitions or stale caches.

Note 2: informers need consistent lists to avoid time-travel, e.g. when initializing after a restart, or in case of switching to another HA instance of kube-apiserver with an outdated/lagging watch cache. See the following issue for more details.

Risks and Mitigations

Design Details

Required changes for a WATCH request with SendInitialEvents=true

The following sequence diagram depicts the steps needed to complete the proposed feature. A high-level overview of each step is given in the table that immediately follows the diagram, and a detailed description of each required step is provided further down in this section.

flow between an informer and the watch cache

Step Description
1. The reflector establishes a WATCH request with the watch cache.
2. If needed, the watch cache contacts etcd for the most up-to-date ResourceVersion.
2a. The watch cache starts streaming initial data it already has in memory.
2b. The watch cache waits until it has observed data up to the RV received in step 2, streaming all new data (if any) to the reflector immediately.
2c. The watch cache has observed the RV (from step 2) and sends a bookmark event with the given RV to the reflector.
3. The reflector replaces its internal store with the collected items and updates its internal resourceVersion to the one obtained from the bookmark event.
3a. The reflector uses the WATCH request from step 1 for further progress notifications.

Step 1: On initialization the reflector gets a snapshot of data from the server by passing RV="" (= unset value) to ensure freshness and setting resourceVersionMatch=NotOlderThan and sendInitialEvents=true. We do that only during the initial ListAndWatch call. Each event (ADD, UPDATE, DELETE) except the BOOKMARK event received from the server is collected. Passing resourceVersion="" tells the cacher it has to guarantee that the cache is at least as up to date as a LIST executed at the same time.

Note: This ensures that returned data is consistent, served from etcd via a quorum read and prevents "going back in time".

Note 2: Watch cache currently doesn't have the feature of supporting resourceVersion="" and thus is vulnerable to stale reads, see kubernetes/kubernetes#59848 for more details.
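
For illustration, here is a minimal client-go sketch of the request an informer would issue in Step 1. It assumes a client-go/apimachinery version that already ships the sendInitialEvents field described later in the API changes section; error handling is reduced to panics.

// Sketch only: the watch-list request an informer issues in Step 1.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	sendInitialEvents := true
	opts := metav1.ListOptions{
		// RV="" (unset) asks for a consistent view, at least as fresh as a
		// quorum LIST issued at the same time.
		ResourceVersion:      "",
		ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
		SendInitialEvents:    &sendInitialEvents,
		AllowWatchBookmarks:  true,
	}
	w, err := client.CoreV1().Secrets("default").Watch(context.Background(), opts)
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	// ADDED events carry the initial state; the BOOKMARK event marks its end.
	for ev := range w.ResultChan() {
		fmt.Println(ev.Type)
	}
}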

Step 2: Right after receiving a request from the reflector, the cacher gets the current resourceVersion (aka bookmarkAfterResourceVersion) directly from etcd. It is used to make sure the cacher is up to date (has seen data stored in etcd) and to let the reflector know it has seen all initial data. There are ways to do that cheaply, e.g. we could issue a count request against the datastore. Next, the cacher creates a new cacheWatcher (implementing watch.Interface), passing the given bookmarkAfterResourceVersion, and gets initial data from the watchCache. After sending the initial data the cacheWatcher starts listening on an input channel for new events, including a bookmark event. At some point, the cacher will receive an event with a resourceVersion equal to or greater than the bookmarkAfterResourceVersion. It will be propagated to the cacheWatcher and then back to the reflector as a BOOKMARK event.

watch cache with the proposed changes
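
As an illustration of the "cheap count request" idea, the sketch below obtains the current revision from etcd with a count-only range request; the key prefix and the function name are assumptions, not the actual storage-layer code.

// Sketch: obtaining the current etcd revision cheaply with a count-only
// range request. Any read response header carries the cluster's current
// revision; the key prefix below is just an example.
package sketch

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func currentRevision(ctx context.Context, c *clientv3.Client) (int64, error) {
	resp, err := c.Get(ctx, "/registry/", clientv3.WithPrefix(), clientv3.WithCountOnly())
	if err != nil {
		return 0, err
	}
	// This revision becomes the bookmarkAfterResourceVersion from Step 2.
	return resp.Header.Revision, nil
}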

Step 2a: Where does the initial data come from?

During construction, the cacher creates the reflector and the watchCache. Since the watchCache implements the Store interface it is used by the reflector to store all data it has received from etcd.

Step 2b: What happens when new events are received while the cacheWatcher is sending initial data?

The cacher maintains a list of all current watchers (cacheWatcher) and a separate goroutine (dispatchEvents) for delivering new events to the watchers. New events are added via the cacheWatcher.nonblockingAdd method, which adds an event to the cacheWatcher.input channel. The cacheWatcher.input is a buffered channel whose size differs between resources (10 or 1000). Since the cacheWatcher starts processing the cacheWatcher.input channel only after sending all initial events, it might block once its buffered channel fills up. In that case, it will be added to the list of blockedWatchers and will be given another chance to deliver an event after all nonblocking watchers have sent the event. All watchers that have failed to deliver the event will be closed.
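
The non-blocking delivery described above boils down to a buffered channel and a select with a default branch. The following is a simplified sketch, not the actual cacheWatcher code from k8s.io/apiserver.

// Simplified sketch of the non-blocking delivery pattern described above;
// the real cacheWatcher has more state.
package sketch

import "k8s.io/apimachinery/pkg/watch"

type cacheWatcher struct {
	// Buffered channel; 10 or 1000 depending on the resource.
	input chan watch.Event
}

// nonblockingAdd tries to enqueue an event without blocking. It returns
// false when the buffer is full; the cacher then moves the watcher to the
// blockedWatchers list, retries once after the non-blocked watchers have
// been served, and closes it if delivery still fails.
func (w *cacheWatcher) nonblockingAdd(ev watch.Event) bool {
	select {
	case w.input <- ev:
		return true
	default:
		return false
	}
}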

Closing the watchers would make the clients retry the requests and download the entire dataset again even though they might have received a complete list before.

For an alpha version, we will delay closing the watch request until all data is sent to the client. We expect this to behave well even in heavily loaded clusters. To increase confidence in the approach, we will collect metrics for measuring how far the cache is behind the expected RV, what's the average buffer size, and a counter for closed watch requests due to an overfull buffer.

For a beta version, we have further options if they turn out to be necessary:

  1. comparing the bookmarkAfterResourceVersion (from Step 2) with the current RV the watchCache is at and waiting until the difference between the RVs is < 1000 (the buffer size). We could do that even before sending the initial events. If the difference is greater than that, there seems to be no point in continuing, since the buffer could fill up before we receive an event with the expected RV (that assumes all updates are for the resource the watch request was opened for, which seems unlikely). In case the watchCache is unable to catch up to the bookmarkAfterResourceVersion within some timeout, we hard-close the current connection (end it by tearing down the TCP connection with the client) so that the client re-connects to a different API server with a more up-to-date cache. Taking into account the baseline etcd performance numbers, waiting for 10 seconds allows us to receive ~5K events, assuming ~500 QPS throughput (see https://etcd.io/docs/v3.4/op-guide/performance/). A sketch of this check is given after this list. Once we are past this step (we know the difference is smaller) and the buffer fills up, we:
    • case-1: won't close the connection immediately if the bookmark event with the expected RV exists in the buffer. In that case, we will deliver the initial events, any other events we have received whose RVs are <= bookmarkAfterResourceVersion, and finally the bookmark event, and only then we will soft-close (simply end the current connection without tearing down the TCP connection) the connection. An informer will reconnect with the RV from the bookmark event. Note that any new event received in the meantime is ignored since the buffer was full.

    • case-2: soft-close the connection if the bookmark event with the expected RV for some reason doesn't exist in the buffer. An informer will reconnect, arriving first at the step that compares the RVs.

  2. make the buffer dynamic - especially when the difference between RVs is > than 1000
  3. inject new events directly into the initial list, i.e. have the initial list loop consume the channel directly and avoid waiting for the whole initial list to be processed first
  4. cap the size (cannot allocate more than X MB of memory) of the buffer
  5. maybe even apply some compression techniques to the buffer (for example, by only storing a low-memory shallow reference and taking the actual objects for the event from the store)
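
For illustration, here is a sketch of the RV-gap check from option 1 above; the helper functions are assumptions passed in as parameters, and the real cacher code will differ.

// Sketch of option 1: wait until the watch cache is within bufferSize
// revisions of bookmarkAfterResourceVersion, or give up after a timeout
// so the client can reconnect to an API server with a fresher cache.
package sketch

import (
	"context"
	"fmt"
	"time"
)

func waitForSmallRVGap(
	ctx context.Context,
	cacheRV func() uint64, // current RV of the watch cache (assumed helper)
	bookmarkAfterRV uint64,
	bufferSize uint64, // e.g. 1000, the cacheWatcher buffer size
	timeout time.Duration, // e.g. 10s ~ 5K events at ~500 QPS
) error {
	deadline := time.Now().Add(timeout)
	for {
		if cacheRV()+bufferSize >= bookmarkAfterRV {
			return nil // the buffer can absorb the remaining gap
		}
		if time.Now().After(deadline) {
			// hard-close: the caller tears down the TCP connection.
			return fmt.Errorf("watch cache %d revisions behind RV %d after %s",
				bookmarkAfterRV-cacheRV(), bookmarkAfterRV, timeout)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
}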

Note: The RV is effectively a global counter that is incremented every time an object is updated. This imposes a global order of events. It is equivalent to a LIST followed by a WATCH request.

Note 2: Currently, there is a timeout for LIST requests of 60s. That means a slow reflector might fail synchronization as well and would have to re-establish the connection.

Step 2c: How are bookmarks delivered to the cacheWatcher?

First of all, the primary purpose of bookmark events is to deliver the current resourceVersion to watchers, continuously, even when no regular events are happening. There are two sources of resourceVersions. The first one is regular events, which carry RVs along with objects. The second one is a special type of etcd event called progressNotification, which delivers the most up-to-date revision at a given interval, visible only to the kube-apiserver. As already mentioned in Step 2a, the watchCache is driven by the reflector. Every event will eventually be propagated from the watchCache to the cacher.processEvent method. For simplicity, we can assume that the processEvent method simply updates the resourceVersion maintained by the cacher.

At regular intervals, the cacher checks for expired watchers and tries to deliver a bookmark event. As of today, the interval is set to 1 second. The bookmark event contains an empty object and the current resourceVersion. By default, a cacheWatcher expires roughly every 1 minute.

Initially, the expiry interval will be decreased to 1 second in this feature's code path. This helps us deliver a bookmark event that is >= bookmarkAfterResourceVersion much faster. After that, the interval will be set back to the previous value.
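
A sketch of the periodic bookmark delivery described above follows; the type and helper names are assumptions rather than the actual cacher internals.

// Sketch of the periodic bookmark-delivery loop. bookmarkTarget stands in
// for a cacheWatcher that accepts non-blocking event delivery.
package sketch

import "time"

type bookmarkTarget interface {
	// nonblockingAddBookmark delivers an empty object carrying the given RV.
	nonblockingAddBookmark(resourceVersion uint64) bool
}

func deliverBookmarks(stopCh <-chan struct{}, expiredWatchers func() []bookmarkTarget, currentRV func() uint64) {
	// 1 second in this feature's code path, so that a bookmark with
	// RV >= bookmarkAfterResourceVersion reaches the reflector quickly.
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			rv := currentRV()
			for _, w := range expiredWatchers() {
				w.nonblockingAddBookmark(rv)
			}
		case <-stopCh:
			return
		}
	}
}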

Note: Since we get a notification from etcd every 5 seconds and we try to deliver a bookmark every 1 second, the maximum delay a reflector will have to wait after receiving the initial data appears to be 6 seconds (assuming a small dataset). Hitting this maximum is unlikely in practice, since we might get the bookmarkAfterResourceVersion even before handling the initial data, and sending the data itself takes some time as well.

Step 3: After receiving a BOOKMARK event the reflector is considered to be synchronized. It replaces its internal store with the collected items (syncWith) and reuses the current connection for getting further events.
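
A simplified sketch of the reflector side of Steps 1-3 follows; it is not the actual k8s.io/client-go code, and the annotation name is the one proposed in the API changes section below.

// Sketch: collect initial events until the bookmark that carries the
// "k8s.io/initial-events-end" annotation, then hand the items to the
// store (syncWith) and keep consuming the same watch.
package sketch

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
)

func collectInitialState(w watch.Interface) (items []runtime.Object, rv string, err error) {
	for ev := range w.ResultChan() {
		switch ev.Type {
		case watch.Added, watch.Modified, watch.Deleted:
			items = append(items, ev.Object)
		case watch.Bookmark:
			m, err := meta.Accessor(ev.Object)
			if err != nil {
				return nil, "", err
			}
			if m.GetAnnotations()["k8s.io/initial-events-end"] == "true" {
				// Initial state is complete; the caller replaces its store
				// and keeps reading further events from the same watch.
				return items, m.GetResourceVersion(), nil
			}
		case watch.Error:
			return nil, "", fmt.Errorf("watch error: %v", ev.Object)
		}
	}
	return nil, "", fmt.Errorf("watch closed before the initial events ended")
}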

API changes

Extend the ListOptions struct with the following field:

type ListOptions struct {
    ...

    // `sendInitialEvents=true` may be set together with `watch=true`.
    // In that case, the watch stream will begin with synthetic events to
    // produce the current state of objects in the collection. Once all such
    // events have been sent, a synthetic "Bookmark" event will be sent.
    // The bookmark will report the ResourceVersion (RV) corresponding to the
    // set of objects, and be marked with the `"k8s.io/initial-events-end": "true"` annotation.
    // Afterwards, the watch stream will proceed as usual, sending watch events
    // corresponding to changes (subsequent to the RV) to objects watched.
    //
    // When the `sendInitialEvents` option is set, we require the `resourceVersionMatch`
    // option to also be set. The semantics of the watch request are as follows:
    // - `resourceVersionMatch` = NotOlderThan
    //   is interpreted as "data at least as new as the provided `resourceVersion`"
    //   and the bookmark event is sent when the state is synced
    //   to a `resourceVersion` at least as fresh as the one provided by the ListOptions.
    //   If `resourceVersion` is unset, this is interpreted as "consistent read" and the
    //   bookmark event is sent when the state is synced at least to the moment
    //   when the request started being processed.
    // - `resourceVersionMatch` set to any other value or unset
    //   Invalid error is returned.
    //
    // Defaults to true if `resourceVersion=""` or `resourceVersion="0"` (for backward
    // compatibility reasons) and to false otherwise.
    SendInitialEvents *bool
}

The watch bookmark marking the end of initial events stream will have a dedicated annotation:

"k8s.io/initial-events-end": "true"

(the exact name is subject to change during API review). It will allow clients to precisely figure out when the initial stream of events is finished.

It's worth noting that explicitly setting SendInitialEvents to false with ResourceVersion="0" will result in not sending initial events, which makes the option work exactly the same across every potential resource version passed as a parameter.

Important optimisations

  1. Avoid DeepCopying of initial data

    The watchCache has an important optimization of wrapping objects into a cachingObject: given that objects aren't usually modified (since selfLink has been disabled) and that there might be multiple watchers interested in receiving an event, wrapping allows us to serialize an object only once.

    removing deep copying from the watch cache

    The watchCache maintains two internal data structures. The first one is called the store and is driven by the reflector. It essentially mirrors the content stored in etcd and is used to serve LIST requests. The second one is called the cache and represents a sliding window of recent events received from the reflector. It is effectively used to serve WATCH requests from a given RV.

    By design, cachingObjects are stored only in the cache. As described in Step 2, the cacheWatcher gets initial data from the watchCache, which, in turn, gets data straight from the store. That means the initial data is not wrapped into cachingObject and hence is not subject to this existing optimization.

    Before sending objects any further, the cacheWatcher does a DeepCopy of every object that has not been wrapped into a cachingObject. Making a copy of every object is both CPU and memory intensive. It is a serious issue that needs to be addressed.

  2. Reduce the number of allocations in the WatchServer

    The WatchServer is largely responsible for streaming data received from the storage layer (in our case from the cacher) back to clients. It turns out that sending a single event per consumer requires 4 memory allocations, visualized in the following image. Two of them deserve special attention, namely allocations 1 and 3, because they don't reuse memory and rely on the GC for cleanup. In other words, the more events we need to send, the more (temporary) memory will be used. In contrast, the other two allocations are already optimized, as they reuse memory instead of creating new buffers for every single event. A similar technique of reusing memory could be applied to the remaining allocations to save precious RAM and scale the system even further (see the sketch below).

    memory allocations in the watch server per object
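
One common Go technique for this kind of memory reuse is a pool of encode buffers; the following is a minimal sketch under that assumption, not the actual WatchServer code.

// Sketch: reuse per-event serialization buffers via sync.Pool instead of
// allocating a fresh buffer for every event and leaving it to the GC.
package sketch

import (
	"bytes"
	"io"
	"sync"
)

var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// writeEvent serializes an already-encoded payload into a pooled buffer,
// writes it to the client, and returns the buffer to the pool.
func writeEvent(dst io.Writer, payload []byte) error {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()
	buf.Write(payload) // stands in for the real per-event encoding step
	_, err := dst.Write(buf.Bytes())
	return err
}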

Manual testing without the changes in place

For the past few years, we have seen many clusters suffering from the issue. Sadly, our only possible recommendation was to ask customers to reduce the cluster in size, since adding more memory in most cases would not fix the issue. Recall from the motivation section that just a few requests can allocate gigabytes of data in a fraction of a second.

In order to reproduce the issue, we executed the following manual test; it is the simplest and cheapest way of putting ourselves into customers' shoes. The reproducer creates a namespace with 400 secrets, each containing 1 MB of data. Next, it uses informers to get all secrets in the cluster. The rough estimate is that a single informer will have to bring at least 400 MB from the datastore to get all secrets.

The result: 16 informers were able to take down the test cluster.

Results with WATCH-LIST

We have prepared the following PR kubernetes/kubernetes#106477, which is almost identical to the proposed solution; it differs only in a few details. The following image depicts the results we obtained after running the synthetic test described in the previous section.

memory consumption during the synthetic test with the proposed changes

First of all, it is worth mentioning that the PR was deployed onto the same cluster so that we could ensure an identical setup (CPU, memory) between the tests. The graph tells us a few things.

Firstly, the proposed solution is at least 100 times better than the current state. Around 12:05 we started the test with 1024 informers, all eventually synced without any errors. Moreover during that time the server was stable and responsive. That particular test ended around 12:30. That means it needed ~25 minutes to bring ~400 GB of data across the network! Impressive achievement.

Secondly, it tells us that memory allocation is not yet proportional to the number of informers! Given the size of individual objects of 1 MB and the actual number of informers, we should allocate roughly around 2 GB of RAM. We managed to get and analyze a memory profile, which showed a few additional allocations inside the watch server. At this point, it is worth mentioning that the results were achieved with only the first optimization applied. We expect the system will scale even better with the second optimization, as it will put significantly less pressure on the GC.

Required changes for a WATCH request with the RV set to the last observed value (RV > 0)

In that case, no additional changes are required. We stick to existing semantics. That is we start a watch at an exact resource version. The watch events are for all changes after the provided resource version. This is safe because the client is assumed to already have the initial state at the starting resource version since the client provided the resource version.

Provide a fix for the long-standing issue kubernetes/kubernetes#59848

The issue is still open mainly because informers default to resourceVersion="0" for their initial LIST requests. This is problematic because the initial LIST requests served from the watch cache might return data that are arbitrarily delayed. This in turn could make clients connected to that server read old data and undo recent work that has been done.

To make consistent reads from cache for LIST requests and thus prevent "going back in time" we propose to use the same technique for ensuring the cache is not stale as described in the previous section.

In that case we are going to change informers to pass resourceVersion="0" and resourceVersionMatch=MostRecent for their initial LIST requests. Then, on the server side, we:

  1. get the current revision from etcd.
  2. use the existing waitUntilFreshAndBlock function to wait for the watch to catch up to the revision requested in the previous step.
  3. reject the request if waitUntilFreshAndBlock times out, thus forcing informers to retry.
  4. otherwise, construct the final list and send it back to the client, as sketched below.
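
A sketch of this server-side flow is shown below; the helper signatures are assumptions standing in for the real cacher internals, including the existing waitUntilFreshAndBlock function.

// Sketch of the consistent-LIST-from-cache flow; helpers are passed in
// as parameters because the real cacher internals differ.
package sketch

import (
	"context"
	"fmt"
)

func consistentListFromCache(
	ctx context.Context,
	currentEtcdRevision func(context.Context) (uint64, error), // step 1
	waitUntilFreshAndBlock func(context.Context, uint64) error, // step 2
	listFromStore func(uint64) (interface{}, error),            // step 4
) (interface{}, error) {
	rev, err := currentEtcdRevision(ctx)
	if err != nil {
		return nil, err
	}
	if err := waitUntilFreshAndBlock(ctx, rev); err != nil {
		// step 3: reject so the informer retries, possibly against a
		// kube-apiserver with a fresher watch cache.
		return nil, fmt.Errorf("watch cache not fresh at revision %d: %w", rev, err)
	}
	return listFromStore(rev)
}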

Replacing standard List request with WatchList mechanism for client-go's List method.

Replacing the underlying implementation of the List method for client-go based clients (like typed or dynamic client) with the WatchList mechanism requires ensuring that the data returned by both the standard List request and the new WatchList mechanism remains identical. The challenge is that WatchList no longer retrieves the entire list from the server at once but only receives individual items, which forces us to "manually" reconstruct the list object on the client side.

To correctly construct the list object on the client side, we need the ListKind information. However, simply reconstructing the list object based on this data is not enough. In the case of a standard List request, the server's response (a versioned list) is processed through a chain of decoders, which can potentially modify the resulting list object. A good example is the WithoutVersionDecoder, which removes the GVK information from the list object. Thus the "manually" constructed list object may not be consistent with the transformations applied by the decoders, leading to differences.

To ensure full compatibility, the server must provide a versioned empty list in the format requested by the client (e.g., protobuf representation). We don't know how the client's decoder behaves for different encodings, i.e., whether the decoder actually supports the encoding we intend to use for reconstruction. Therefore, to ensure maximal compatibility, we will ensure that the encoding used for the reconstruction of the list matches the format that the client originally requested. This guarantees that the returned list object can be correctly decoded by the client, preserving the actual encoding format as intended.

The proposed solution is to add a new annotation (k8s.io/initial-events-list-blueprint) to the object returned in the bookmark event (The bookmark event is sent when the state is synced and marks the end of WatchList stream). This annotation will store an empty, versioned list encoded as a Base64 string. This annotation will be added to the same object/place the k8s.io/initial-events-end annotation is added.

When the client receives such a bookmark, it will base64-decode the empty list and pass it to the decoder chain. Only after a successful response from the decoders will the list be populated with the data received from the watch events and returned.

For example:

GET /api/v1/namespaces/test/pods?watch=1&sendInitialEvents=true&allowWatchBookmarks=true&resourceVersion=&resourceVersionMatch=NotOlderThan
---
200 OK
Transfer-Encoding: chunked
Content-Type: application/json

{
  "type": "ADDED",
  "object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "8467", "name": "foo"}, ...}
}
{
  "type": "ADDED",
  "object": {"kind": "Pod", "apiVersion": "v1", "metadata":     {"resourceVersion": "5726", "name": "bar"}, ...}
}
{
"type":"BOOKMARK",
"object":{"kind":"Pod","apiVersion":"v1","metadata":{"resourceVersion":"13519","annotations":{"k8s.io/initial-events-end":"true","k8s.io/initial-events-embedded-list":"eyJraW5kIjoiUG9kTGlzdCIsImFwaVZlcnNpb24iOiJ2MSIsIm1ldGFkYXRhIjp7fSwiaXRlbXMiOm51bGx9Cg=="}} ...}
}
...
<followed by regular watch stream starting>
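
On the client side, reconstructing the list from such a bookmark could look roughly like the sketch below. The annotation key follows the proposal above (and may change during API review), and the helper is illustrative rather than the actual client-go implementation.

// Sketch: turn the bookmark's blueprint annotation into a typed list and
// populate it with the items gathered from the initial events.
package sketch

import (
	"encoding/base64"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

func listFromBookmark(bookmark metav1.Object, decoder runtime.Decoder, items []runtime.Object) (runtime.Object, error) {
	raw, err := base64.StdEncoding.DecodeString(
		bookmark.GetAnnotations()["k8s.io/initial-events-list-blueprint"])
	if err != nil {
		return nil, err
	}
	// Run the empty, versioned list through the client's decoder chain so it
	// is shaped exactly as a regular LIST response would be.
	list, _, err := decoder.Decode(raw, nil, nil)
	if err != nil {
		return nil, err
	}
	// Fill in the items received before the bookmark and stamp the list with
	// the bookmark's resourceVersion.
	if err := meta.SetList(list, items); err != nil {
		return nil, err
	}
	acc, err := meta.ListAccessor(list)
	if err != nil {
		return nil, err
	}
	acc.SetResourceVersion(bookmark.GetResourceVersion())
	return list, nil
}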

Alternatives

We could modify the type of the object passed in the last bookmark event to include the list. This approach would require changes to the reflector, as it would need to recognize the new object type in the bookmark event. However, this could potentially break other clients that are not expecting a different object in the bookmark event.

Another option would be for the client to issue an empty list request to the API server in order to receive a list response. This approach would involve modifying client-go and implementing some form of caching mechanism, possibly with invalidation policies. Non-client-go clients that want to use this new feature would need to rebuild this mechanism as well.

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • k8s.io/apiserver/pkg/storage/cacher: 02/02/2023 - 74.7%
  • k8s.io/client-go/tools/cache/reflector: 02/02/2023 - 88.6%
Integration tests
  • For alpha, tests asserting fallback mechanism for reflector will be added.
e2e tests
  • For alpha, tests exercising this feature will be added.

Graduation Criteria

Alpha

  • The feature is implemented behind the WatchList feature flag
  • Initial e2e tests completed and enabled
  • Scalability/Performance tests confirm gains of this feature
  • Add support for watchlist to APF

Beta

  • Metrics are added to the kube-apiserver (see the monitoring-requirements section for more details)
  • Implement SendInitialEvents for watch requests in the etcd storage implementation
  • The feature is enabled for kube-apiserver and kube-controller-manager
  • The generic feature gate mechanism is implemented in client-go. It will be used to enable a new functionality for reflectors/informers.
  • Implement a consistency check detector that will compare data received through a new watchlist request with data obtained through a standard list request. The detector will be added to the reflector and activated when an environment variable is set. The environment variable will be set for all jobs run in the Kube CI.
  • Update the client-go generated List function to use the WatchList mechanism when the feature gate has been enabled and the ListOptions are satisfied. This change must be applied to the typed, dynamic and metadata clients.
  • Implement a mechanism for automatically detecting the etcd configuration, i.e. whether it is safe to use the RequestWatchProgress API call or whether the experimental-watch-progress-notify-interval flag has been set. Knowledge of the etcd configuration will be used to automatically disable the streaming feature.
  • Use WatchProgressRequester to request progress notifications directly from etcd. This mechanism was developed in Consistent Reads from Cache KEP and will reduce the overall latency for watchlist requests.
  • The watchlist call, which serves as a drop-in replacement for list calls in client libraries, must properly set the kind and apiVersion fields. These fields are important for the correct decoding of the objects. See also: kubernetes/kubernetes#126191

GA

  • Switch the storage/cacher to use streaming directly from etcd (This will also allow us to remove the reflector.UseWatchList field).

Post-GA

  • Make list calls expensive in APF. Once all supported releases have the streaming list enabled by default (client-go, control plane components) and the feature itself is locked to its default value, we can increase the cost of regular list requests in APF. This ensures that the fallback mechanism, which switches back to the standard list when streaming has issues, will not be affected.

Upgrade / Downgrade Strategy

Version Skew Strategy

Our immediate idea for ensuring backward compatibility between new clients and old servers was to return a 401 response in old Kubernetes releases (via backports). This approach, however, would limit the maximum supported version skew to just a few previous releases and would also force customers to update to the latest minor versions.

Therefore we propose to make use of the already existing "resourceVersionMatch" LIST option. WATCH requests with that option set will be immediately rejected with a 403 (Forbidden) response by previous servers. In that case, new clients will fall back to the previous mode (ListAndWatch). New servers will allow for WATCH requests to have "resourceVersionMatch=MostRecent" set.

Existing clients will be forward and backward compatible and won't require any changes since the server will preserve the old behavior (ListAndWatch).

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: WatchList
    • Components depending on the feature gate:
    • kube-apiserver
    • Feature gate name: WatchListClient (the actual name might be different because it hasn't been added yet)
    • Components depending on the feature gate:
      • kube-controller-manager via client-go library
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane?
    • Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?

No, because users must enable the feature on the client side (client-go).

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, by disabling the WatchList FeatureGate for kube-apiserver. In this case kube-apiserver will reject WATCH requests with the new query parameter, forcing informers to fall back to the previous mode.

Yes, by disabling the WatchListClient FeatureGate for kube-controller-manager. In this case informers will follow standard LIST/WATCH semantics.

Note that for safety reasons, reflectors/informers will always fall back to a regular LIST operation regardless of the error that occurred.

What happens if we reenable the feature if it was previously rolled back?

The expected behavior of the feature will be restored.

Are there any tests for feature enablement/disablement?

Yes. There is an integration test that verifies the fallback mechanism of the reflector when interacting with servers that have the WatchList feature enabled/disabled.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Feature does not have a direct impact on rollout/rollback.

However, faulty behavior of a feature can result in incorrect functioning of components that rely on that feature. For the Beta version, we plan to enable it exclusively for kube-controller-manager. The main issues can arise during the initial informer synchronization, which may result in controller failures.

Furthermore, if data consistency issues arise, such as missing data, the controllers simply do not consider the missing data.

What specific metrics should inform a rollback?

apiserver_terminated_watchers_total - a large number of terminated watchers might indicate synchronization issues, for example a client-side error where we're not getting data from the server, or a server-side error where the buffer is filling up.

apiserver_request_duration_seconds_bucket - in general, a large number of "short" watch requests can indicate synchronization issues.

apiserver_watch_list_duration_seconds - the absence of this metric may indicate that the client did not receive a special bookmark. The issue here could be that the server never sent it due to an error or didn't even receive it from the database.

apiserver_watch_list_duration_seconds - long synchronization times may indicate that the server is lagging behind etcd, for example not receiving progress notifications from the database frequently enough.

apiserver_watch_cache_lag - tells how far behind the server is compared to the database. Significant discrepancies affect the times for full data synchronization.

A useful signal is also the number of kube-controller-manager restarts, which may indicate issues with informer synchronization.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Upgrade->downgrade->upgrade testing was done manually using the following steps:

Build and run Kubernetes from the master branch using Kind.

kind build node-image --arch "arm64"

kind create cluster --image kindest/node:latest

kubectl get no
NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   26s   v1.29.0-alpha.1.47+f8571dabf79717

Check if the kube-apiserver(aka kas) has recorded the watchlist latency metric.

kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"
# HELP apiserver_watch_list_duration_seconds [ALPHA] Response latency distribution in seconds for watch list requests broken by group, version, resource and scope.
# TYPE apiserver_watch_list_duration_seconds histogram
…
apiserver_watch_list_duration_seconds_bucket{group="",resource="configmaps",scope="cluster",version="v1",le="6"} 1

Disable the WatchList feature gate for the kas by editing the static pod manifest directly.

docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-apiserver.yaml 

and pass - --feature-gates=WatchList=false to the kas container.

Check if the kas has not recorded the watchlist latency metric.

kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"

Check if the kube-controller-manager (aka kcm) is running.

kubectl get po -n kube-system
NAME                                         READY   STATUS    RESTARTS      AGE
…
kube-controller-manager-kind-control-plane   1/1     Running   1 (44s ago)   3m28s

Check if informers used by the kcm fell back to standard LIST/WATCH semantics.

kubectl logs -n kube-system kube-controller-manager-kind-control-plane | grep -e "watch-list"
W1002 09:11:40.656641       1 reflector.go:340] The watch-list feature is not supported by the server, falling back to the previous LIST/WATCH semantics
…

Disable the WatchListClient feature gate for the kcm by editing the static pod manifest directly.

docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-controller-manager.yaml

and pass - --feature-gates=WatchListClient=false to the kcm container.

Check if kcm is running.

kubectl get po -n kube-system
NAME                                         READY   STATUS    RESTARTS        AGE
…
kube-controller-manager-kind-control-plane   1/1     Running   0               12s

Check if the kas has not recorded the watchlist latency metric.

kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"

Check if there are no traces of informers for kcm falling back to standard LIST/WATCH semantics.

kubectl logs -n kube-system kube-controller-manager-kind-control-plane | grep -e "watch-list"

Enable the WatchList feature gate for the kas by editing the static pod manifest directly.

docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-apiserver.yaml 

and remove - --feature-gates=WatchList=false from the kas container.

Check if kcm is running.

kubectl get po -n kube-system
NAME                                         READY   STATUS             RESTARTS      AGE
…
kube-controller-manager-kind-control-plane   1/1     Running            1 (22s ago)   86s

Check if the kas has not recorded the watchlist latency metric.

kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"

Enable the WatchListClient feature gate for the kcm by editing the static pod manifest directly.

docker exec -ti kind-control-plane bash
vim /etc/kubernetes/manifests/kube-controller-manager.yaml

and remove - --feature-gates=WatchListClient=false from the kcm container.

Check if kcm is running.

kubectl get po -n kube-system
NAME                                         READY   STATUS    RESTARTS      AGE
…
kube-controller-manager-kind-control-plane   1/1     Running   0             13s

Check if the kas has recorded the watchlist latency metric.

kubectl get --raw '/metrics' | grep "apiserver_watch_list_duration_seconds"
# HELP apiserver_watch_list_duration_seconds [ALPHA] Response latency distribution in seconds for watch list requests broken by group, version, resource and scope.
# TYPE apiserver_watch_list_duration_seconds histogram
…
apiserver_watch_list_duration_seconds_bucket{group="",resource="configmaps",scope="cluster",version="v1",le="6"} 1
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

If the apiserver_watch_list_duration_seconds metric has some data, then this feature is in use.

How can someone using this feature know that it is working for their instance?

Assuming that historical data is available, comparing the number of LIST and WATCH requests to the server will tell whether the feature has been enabled. When this feature is enabled, the number of LIST requests will be smaller. The difference primarily arises from switching informers to the new mode of operation.

Checking whether WatchListClient FeatureGate has been set for the given component.

Knowing the username for a component, the audit logs could be examined to see whether sendInitialEvents=true in the requestURI has been set for that user.

Scanning the component's logs for the phrase Reflector WatchList. For requests lasting more than 10 seconds, traces will be reported.

  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?

None have been defined yet.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name: apiserver_terminated_watchers_total (counter, already defined, needs to be updated (by an attribute) so that we count closed watch requests due to an overfull buffer in the new mode)
    • Metric name: apiserver_watch_list_duration_seconds (histogram, measures latency of watch-list requests)
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No. On the contrary: the number of requests originating from informers will be reduced by half, from 2 (LIST/WATCH) to just 1 (WATCH).

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

On the contrary. It will decrease the memory usage of kube-apiservers needed to handle "list" requests.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

On the contrary. It will decrease the memory usage required for master nodes.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

When the kube-apiserver is unavailable, this feature will also be unavailable.

When etcd is unavailable, requests attempting to retrieve the most recent state of the cluster will fail.

What are other known failure modes?
  • kube-controller-manager is unable to start.
    • Detection: How can it be detected via metrics? Examine the Prometheus up time series, the pod status, or the number of restarts.
    • Mitigations: What can be done to stop the bleeding, especially for already running user workloads? Disable the feature by passing WatchListClient=false to the --feature-gates command line flag.
    • Diagnostics: What are the useful log messages and their required logging levels that could help debug the issue? N/A
    • Testing: Are there any tests for failure mode? If not, describe why. Yes; if kube-controller-manager is unable to start then a lot of existing e2e tests will fail.
What steps should be taken if SLOs are not being met to determine the problem?

No SLOs have been defined for this feature yet.

Implementation History

The KEP was proposed on 2022-01-14

Drawbacks

N/A

Alternatives

  1. We could tune the cost function used by the priority and fairness feature. There are at least a few issues with this approach. The first is that we would have to come up with a cost estimation function that can approximate the temporary memory consumption. This might be challenging since we don't know the true cost of the entire list upfront, as object sizes can vastly differ throughout the keyspace (imagine some namespaces with giant secrets, some with small secrets). The second issue, assuming we could estimate it, is that we would have to throttle the server to handle just a few requests at a given time, as the estimate would likely be uniform over resource type or other coarse dimensions.
  2. We could attempt to define a function that would prevent the server from allocating more memory than a given threshold. A function like that would require measuring memory usage in real-time. Things we evaluated:
    • runtime.ReadMemStats gives us accurate measurement but at the same time is very expensive. It requires STW (stop-the-world) which is equivalent to stopping all running goroutines. Running with 100ms frequency would block the runtime 10 times per second.
    • reading from proc would probably increase the CPU usage (polling) and would add some delay (propagation time from the kernel about current memory usage). Since the spike might be very sudden (milliseconds) it doesn’t seem to be a viable option.
    • there seems to be no other API provided by the Go runtime that would allow gathering memory stats in real time other than runtime.ReadMemStats
    • using the cgroup notification API is efficient (epoll) and near real-time, but it seems to be limited in functionality. We could be notified about crossing previously defined memory thresholds, but we would still need to calculate the available (free) memory on a node.
  3. We could allow paginated LIST requests to be served directly from the watch cache. This approach has a few advantages. Primarily, it doesn't require changing informers, so there are no version skew issues. At the same time, it also presents a few challenges. The most concerning is that it would not actually solve the issue: it would still lead to (temporary) memory consumption because we would need to allocate space for the entire response (LIST) and keep it in memory until the whole response has been sent to the client (which can take up to 60s), and this could be O({2,3} * the-size-of-the-page).

Appendix

Sources of LIST request

A LIST request can be satisfied from two places, largely depending on used query parameters:

  1. by default, directly from etcd. In such cases, the memory demand might be extensive, exceeding the full response size from the data store many times over.
  2. from the watch cache, if explicitly requested by setting the ResourceVersion param of the list (e.g. ResourceVersion="0"). This is how most client-go-based controllers actually prime their caches, for performance reasons. The memory usage will be much lower than in the first case. However, it is not perfect, as we still need space to store serialized objects and to hold the full response until it is sent.

Steps followed by informers

The following steps depict a flow of how client-go-based informers work today.

  1. on startup: informers issue a LIST RV="0" request with pagination, which due to performance reasons translates to a full (pagination is ignored) LIST from the watch cache.
  2. repeated until ResourceExpired 410: establish a WATCH request with the RV from the previous step. Each received event updates the last-known RV. On disconnect, this step is repeated until an “IsResourceExpired” (410) error is returned.
  3. on resumption: establish a new LIST request to the watch cache with RV="last-known-from-step2" (step 1) and then another WATCH request.
  4. after compaction (410): set RV="" to get a snapshot via a quorum read from etcd in chunks and go back to step 2

In rare cases, an informer might connect to an API server whose watch cache hasn't been fully synchronized (after kube-apiserver restart). In that case its flow will be slightly different.

  1. on startup: informers issue a LIST RV="0" request with pagination, which effectively equals a paginated LIST RV="", i.e. it gets a consistent snapshot of data directly from etcd (quorum read) in chunks (pagination).
  2. repeated until ResourceExpired 410: they establish a WATCH request with the RV from the previous step. Each received event updates the last-known RV. On disconnect, this step is repeated until an “IsResourceExpired” (410) error is returned.
  3. on resumption: establish a paginated LIST RV="last-known-from-step2" request (step 1) and then another WATCH request.
  4. after compaction (410): set RV="" to get a snapshot via a quorum read from etcd in chunks and go back to step 2

Infrastructure Needed (Optional)

N/A