Add an API concepts document and describe terminology and API chunking #6540
Merged
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,230 @@ | ||
--- | ||
title: Kubernetes API Concepts | ||
approvers: | ||
- bgrant0607 | ||
- smarterclayton | ||
- lavalamp | ||
- liggitt | ||
--- | ||
|
||
{% capture overview %} | ||
This page describes common concepts in the Kubernetes API. | ||
{% endcapture %} | ||
|
||
{% capture body %} | ||
The Kubernetes API is a resource-based (RESTful) programmatic interface provided via HTTP. It supports retrieving, creating, | |
updating, and deleting primary resources via the standard HTTP verbs (POST, PUT, PATCH, DELETE, GET), includes additional subresources for many objects that allow fine-grained authorization (such as binding a pod to a node), and can accept and serve those resources in different representations for convenience or efficiency. It also supports efficient change notifications on resources via "watches" and consistent lists to allow other components to effectively cache and synchronize the state of resources. | |
|
||
## Standard API terminology | ||
|
||
Most Kubernetes API resource types are "objects" - they represent a concrete instance of a concept on the cluster, like a pod or namespace. A smaller number of API resource types are "virtual" - they often represent operations rather than objects, such as a permission check (use a POST with a JSON-encoded body of `SubjectAccessReview` to the `subjectaccessreviews` resource). All objects will have a unique name to allow idempotent creation and retrieval, but virtual resource types may not have unique names if they are not retrievable or do not rely on idempotency. | ||
|
||
Kubernetes generally leverages standard RESTful terminology to describe the API concepts: | ||
|
||
* A **resource type** is the name used in the URL (`pods`, `namespaces`, `services`) | ||
* All resource types have a concrete representation in JSON (their object schema) which is called a **kind** | ||
* A list of instances of a resource type is known as a **collection** | ||
* A single instance of the resource type is called a **resource** | ||
|
||
All resource types are either scoped by the cluster (`/apis/GROUP/VERSION/*`) or to a namespace (`/apis/GROUP/VERSION/namespaces/NAMESPACE/*`). A namespace-scoped resource type will be deleted when its namespace is deleted and access to that resource type is controlled by authorization checks on the namespace scope. The following paths are used to retrieve collections and resources: | ||
|
||
* Cluster-scoped resources: | ||
* `GET /apis/GROUP/VERSION/RESOURCETYPE` - return the collection of resources of the resource type | ||
* `GET /apis/GROUP/VERSION/RESOURCETYPE/NAME` - return the resource with NAME under the resource type | ||
* Namespace-scoped resources: | ||
* `GET /apis/GROUP/VERSION/RESOURCETYPE` - return the collection of all instances of the resource type across all namespaces | ||
* `GET /apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE` - return collection of all instances of the resource type in NAMESPACE | ||
* `GET /apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE/NAME` - return the instance of the resource type with NAME in NAMESPACE | ||
|
||
Since a namespace is a cluster-scoped resource type, you can retrieve the list of all namespaces with `GET /api/v1/namespaces` and details about a particular namespace with `GET /api/v1/namespaces/NAME`. | ||
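The path rules above can be sketched as a small helper. This is an illustration only: the function name `resource_path` and its parameters are hypothetical, and it assumes that an empty group selects the core `/api/VERSION` prefix, as in the `/api/v1/namespaces` example.

```python
def resource_path(version, resource_type, group="", namespace=None, name=None):
    """Build the URL path for a collection or a single resource.

    Assumption: an empty group means the core API, served under /api/VERSION;
    every other group is served under /apis/GROUP/VERSION.
    """
    base = f"/apis/{group}/{version}" if group else f"/api/{version}"
    parts = [base]
    if namespace is not None:
        # Namespace-scoped request: insert the namespace segment.
        parts += ["namespaces", namespace]
    parts.append(resource_type)
    if name is not None:
        # A name selects a single resource instead of the collection.
        parts.append(name)
    return "/".join(parts)
```

Calling it without `namespace` on a namespace-scoped resource type yields the path that lists instances across all namespaces.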
|
||
Almost all object resource types support the standard HTTP verbs - GET, POST, PUT, PATCH, and DELETE. Kubernetes uses the term **list** to describe returning a collection of resources to distinguish from retrieving a single resource which is usually called a **get**. | ||
|
||
Some resource types will have one or more sub-resources, represented as sub paths below the resource: | ||
|
||
* Cluster-scoped subresource: `GET /apis/GROUP/VERSION/RESOURCETYPE/NAME/SUBRESOURCE` | ||
* Namespace-scoped subresource: `GET /apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE/NAME/SUBRESOURCE` | ||
|
||
The verbs supported for each subresource will differ depending on the object - see the API documentation for more information. It is not possible to access sub-resources across multiple resources - generally a new virtual resource type would be used if that becomes necessary. | |
|
||
|
||
## Efficient detection of changes | ||
|
||
To enable clients to build a model of the current state of a cluster, all Kubernetes object resource types are required to support consistent lists and an incremental change notification feed called a **watch**. Every Kubernetes object has a `resourceVersion` field representing the version of that resource as stored in the underlying database. When retrieving a collection of resources (either namespace or cluster scoped), the response from the server will contain a `resourceVersion` value that can be used to initiate a watch against the server. The server will return all changes (creates, deletes, and updates) that occur after the supplied `resourceVersion`. This allows a client to fetch the current state and then watch for changes without missing any updates. If the client watch is disconnected they can restart a new watch from the last returned `resourceVersion`, or perform a new collection request and begin again. | ||
|
||
For example: | ||
|
||
1. List all of the pods in a given namespace. | ||
|
||
GET /api/v1/namespaces/test/pods | ||
--- | ||
200 OK | ||
Content-Type: application/json | ||
{ | ||
"kind": "PodList", | ||
"apiVersion": "v1", | ||
"metadata": {"resourceVersion":"10245"}, | ||
"items": [...] | ||
} | ||
|
||
2. Starting from resource version 10245, receive notifications of any creates, deletes, or updates as individual JSON objects. | ||
|
||
GET /api/v1/namespaces/test/pods?watch=1&resourceVersion=10245 | ||
--- | ||
200 OK | ||
Transfer-Encoding: chunked | ||
Content-Type: application/json | ||
{ | ||
"type": "ADDED", | ||
"object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "10596", ...}, ...} | ||
} | ||
{ | ||
"type": "MODIFIED", | ||
"object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "11020", ...}, ...} | ||
} | ||
... | ||
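A client reading the chunked watch response has to split it back into individual events. The following is a minimal sketch, assuming the body arrives as text containing back-to-back JSON objects, optionally separated by whitespace; the function name `decode_watch_events` is hypothetical.

```python
import json


def decode_watch_events(buffer):
    """Split a text buffer into complete watch events.

    Returns (events, remainder), where remainder is the unparsed tail that
    should be prepended to the next chunk read from the connection.
    """
    decoder = json.JSONDecoder()
    events = []
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(buffer) and buffer[pos].isspace():
            pos += 1
        if pos >= len(buffer):
            break
        try:
            event, end = decoder.raw_decode(buffer, pos)
        except json.JSONDecodeError:
            # Treated as an incomplete trailing object: wait for more data.
            break
        events.append(event)
        pos = end
    return events, buffer[pos:]
```

Note that this sketch cannot tell a truncated object from a malformed one; a real client would also bound the size of the remainder.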
|
||
A given Kubernetes server will only preserve a historical list of changes for a limited time. Older clusters using etcd2 preserve a maximum of 1000 changes. Newer clusters using etcd3 preserve changes in the last 5 minutes by default. When a requested watch operation fails because the historical version of that resource is not available, clients must handle the case by recognizing the status code `410 Gone`, clearing their local cache, performing a list operation, and starting the watch from the `resourceVersion` returned by that new list operation. Most client libraries offer some form of standard tool for this logic. (In Go this is called a `Reflector` and is located in the `k8s.io/client-go/tools/cache` package.) | |
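The list-then-watch recovery logic described above can be sketched as follows. This is not how `Reflector` is implemented; it is a simplified illustration in which `list_fn`, `watch_fn`, the `Gone` exception, and the dict-based cache are all hypothetical stand-ins for a real HTTP client. The `resourceVersion` is treated as an opaque string: the code only remembers the last one received.

```python
class Gone(Exception):
    """Hypothetical stand-in for an HTTP 410 Gone response."""


def sync(list_fn, watch_fn, cache, max_watches=10):
    """Keep `cache` (name -> object) in step with the server.

    list_fn() returns (items_by_name, resource_version).
    watch_fn(resource_version) yields events and may raise Gone.
    """
    resource_version = None
    for _ in range(max_watches):
        if resource_version is None:
            # Initial sync, or history was lost: clear the cache and relist.
            items, resource_version = list_fn()
            cache.clear()
            cache.update(items)
        try:
            for event in watch_fn(resource_version):
                obj = event["object"]
                name = obj["metadata"]["name"]
                if event["type"] == "DELETED":
                    cache.pop(name, None)
                else:
                    cache[name] = obj
                # Remember the last version seen; never compare versions.
                resource_version = obj["metadata"]["resourceVersion"]
        except Gone:
            resource_version = None
    return resource_version
```

When a watch stream simply ends, the next iteration restarts the watch from the last version seen; only a `Gone` forces a full relist.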
|
||
## Retrieving large result sets in chunks | |
|
||
On large clusters, retrieving the collection of some resource types may result in very large responses that can impact the server and client. For instance, a cluster may have tens of thousands of pods, each of which is 1-2 KB of encoded JSON. Retrieving all pods across all namespaces may result in a very large response (10-20 MB) and consume a large amount of server resources. Starting in Kubernetes 1.9 the server supports the ability to break a single large collection request into many smaller chunks while preserving the consistency of the total request. Each chunk can be returned sequentially, which both reduces the size of each response and allows user-oriented clients to display results incrementally to improve responsiveness. | |
|
||
To retrieve a single list in chunks, two new parameters `limit` and `continue` are supported on collection requests and a new field `continue` is returned from all list operations in the list `metadata` field. A client should specify the maximum results they wish to receive in each chunk with `limit` and the server will return up to `limit` resources in the result and include a `continue` value if there are more resources in the collection. The client can then pass this `continue` value to the server on the next request to instruct the server to return the next chunk of results. By continuing until the server returns an empty `continue` value the client can consume the full set of results. | ||
|
||
Like a watch operation, a `continue` token will expire after a short amount of time (by default 5 minutes) and return a `410 Gone` if more results cannot be returned. In this case, the client will need to start from the beginning or omit the `limit` parameter. | ||
|
||
For example, if there are 1,253 pods on the cluster and the client wants to receive chunks of 500 pods at a time, they would request those chunks as follows: | ||
|
||
1. List all of the pods on a cluster, retrieving up to 500 pods each time. | ||
|
||
GET /api/v1/pods?limit=500 | ||
--- | ||
200 OK | ||
Content-Type: application/json | ||
{ | ||
"kind": "PodList", | ||
"apiVersion": "v1", | ||
"metadata": { | ||
"resourceVersion":"10245", | ||
"continue": "ENCODED_CONTINUE_TOKEN", | ||
... | ||
}, | ||
"items": [...] // returns pods 1-500 | ||
} | ||
|
||
2. Continue the previous call, retrieving the next set of 500 pods. | ||
|
||
GET /api/v1/pods?limit=500&continue=ENCODED_CONTINUE_TOKEN | ||
--- | ||
200 OK | ||
Content-Type: application/json | ||
{ | ||
"kind": "PodList", | ||
"apiVersion": "v1", | ||
"metadata": { | ||
"resourceVersion":"10245", | ||
"continue": "ENCODED_CONTINUE_TOKEN_2", | ||
... | ||
}, | ||
"items": [...] // returns pods 501-1000 | ||
} | ||
|
||
3. Continue the previous call, retrieving the last 253 pods. | ||
|
||
GET /api/v1/pods?limit=500&continue=ENCODED_CONTINUE_TOKEN_2 | ||
--- | ||
200 OK | ||
Content-Type: application/json | ||
{ | ||
"kind": "PodList", | ||
"apiVersion": "v1", | ||
"metadata": { | ||
"resourceVersion":"10245", | ||
"continue": "", // continue token is empty because we have reached the end of the list | ||
... | ||
}, | ||
"items": [...] // returns pods 1001-1253 | ||
} | ||
|
||
Note that the `resourceVersion` of the list remains constant across each request, indicating the server is showing us a consistent snapshot of the pods. Pods that are created, updated, or deleted after version `10245` would not be shown unless the user makes a list request without the `continue` token. This allows clients to break large requests into smaller chunks and then perform a watch operation on the full set without missing any updates. | ||
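The three requests above follow a simple loop: pass `limit`, read `metadata.continue`, and repeat until it comes back empty. A minimal sketch, assuming a hypothetical `fetch_page(limit, continue_token)` callable that performs the HTTP request and returns the decoded list object:

```python
def list_all(fetch_page, limit):
    """Collect a full list by following continue tokens.

    Returns (items, resource_version); the resource version is taken from
    the final page and is the one to start a subsequent watch from.
    """
    items = []
    token = ""
    while True:
        page = fetch_page(limit=limit, continue_token=token)
        items.extend(page["items"])
        token = page["metadata"].get("continue", "")
        if not token:
            # An empty or missing token means the end of the list.
            return items, page["metadata"]["resourceVersion"]
```

The sketch does not handle an expired token; a real client would catch the `410 Gone` response and start again from the beginning.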
|
||
|
||
## Alternate representations of resources | ||
|
||
By default Kubernetes returns objects serialized to JSON with content type `application/json`. However, clients may request the more efficient Protobuf representation of these objects for better performance at scale. The Kubernetes API implements standard HTTP content type negotiation: passing an `Accept` header with a `GET` call requests that the server return objects in the provided content type, while sending an object in Protobuf to the server for a `PUT` or `POST` call requires setting the `Content-Type` header. The server will return a `Content-Type` header if the requested format is supported, or the `406 Not Acceptable` error if an invalid content type is provided. | |
|
||
See the API documentation for a list of supported content types for each API. | ||
|
||
For example: | ||
|
||
1. List all of the pods on a cluster in Protobuf format. | ||
|
||
GET /api/v1/pods | ||
Accept: application/vnd.kubernetes.protobuf | ||
--- | ||
200 OK | ||
Content-Type: application/vnd.kubernetes.protobuf | ||
... binary encoded PodList object | ||
|
||
2. Create a pod by sending Protobuf encoded data to the server, but request a response in JSON. | ||
|
||
POST /api/v1/namespaces/test/pods | ||
Content-Type: application/vnd.kubernetes.protobuf | ||
Accept: application/json | ||
... binary encoded Pod object | ||
--- | ||
200 OK | ||
Content-Type: application/json | ||
{ | ||
"kind": "Pod", | ||
"apiVersion": "v1", | ||
... | ||
} | ||
|
||
Not all API resource types will support Protobuf, specifically those defined via Custom Resource Definitions or those that are API extensions. Clients that must work against all resource types should specify multiple content types in their `Accept` header to support fallback to JSON: | ||
|
||
``` | ||
Accept: application/vnd.kubernetes.protobuf, application/json | ||
``` | ||
|
||
|
||
### Protobuf encoding | ||
|
||
Kubernetes uses an envelope wrapper to encode Protobuf responses. That wrapper starts with a 4-byte magic number to help identify content on disk or in etcd as Protobuf (as opposed to JSON), and then is followed by a Protobuf encoded wrapper message, which describes the encoding and type of the underlying object and then contains the object. | |
|
||
The wrapper format is: | ||
|
||
``` | ||
A four byte magic number prefix: | ||
Bytes 0-3: "k8s\x00" [0x6b, 0x38, 0x73, 0x00] | ||
|
||
An encoded Protobuf message with the following IDL: | ||
message Unknown { | ||
// typeMeta should have the string values for "kind" and "apiVersion" as set on the JSON object | ||
optional TypeMeta typeMeta = 1; | ||
|
||
// raw will hold the complete serialized object in protobuf. See the protobuf definitions in the client libraries for a given kind. | ||
optional bytes raw = 2; | ||
|
||
// contentEncoding is encoding used for the raw data. Unspecified means no encoding. | ||
optional string contentEncoding = 3; | ||
|
||
// contentType is the serialization method used to serialize 'raw'. Unspecified means application/vnd.kubernetes.protobuf and is usually | ||
// omitted. | ||
optional string contentType = 4; | ||
} | ||
|
||
message TypeMeta { | ||
// apiVersion is the group/version for this type | ||
optional string apiVersion = 1; | ||
// kind is the name of the object schema. A protobuf definition should exist for this object. | ||
optional string kind = 2; | ||
} | ||
``` | ||
|
||
Clients that receive a response in `application/vnd.kubernetes.protobuf` that does not match the expected prefix should reject the response, as future versions may need to alter the serialization format in an incompatible way and will do so by changing the prefix. | ||
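To make the envelope concrete, the sketch below checks the magic prefix and then reads the `Unknown` wrapper by hand, without a generated Protobuf class. It relies on the fact that every field in the IDL above is a string, bytes, or embedded message, all of which use the length-delimited wire type. The function names are hypothetical, and a real client would normally use generated Protobuf code instead.

```python
MAGIC = b"k8s\x00"


def _read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def _parse_fields(buf):
    """Return {field_number: payload} for a message of length-delimited fields."""
    fields = {}
    pos = 0
    while pos < len(buf):
        tag, pos = _read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type != 2:
            raise ValueError(f"unexpected wire type {wire_type}")
        length, pos = _read_varint(buf, pos)
        fields[field_number] = buf[pos:pos + length]
        pos += length
    return fields


def decode_envelope(data):
    """Validate the prefix and unpack the Unknown wrapper message."""
    if not data.startswith(MAGIC):
        # Reject: a different prefix signals an incompatible format.
        raise ValueError("response does not start with the k8s magic number")
    unknown = _parse_fields(data[len(MAGIC):])
    type_meta = _parse_fields(unknown.get(1, b""))
    return {
        "apiVersion": type_meta.get(1, b"").decode("utf-8"),
        "kind": type_meta.get(2, b"").decode("utf-8"),
        "raw": unknown.get(2, b""),
        "contentEncoding": unknown.get(3, b"").decode("utf-8"),
        "contentType": unknown.get(4, b"").decode("utf-8"),
    }
```

The `raw` bytes returned here are still a serialized object of the named kind and need that kind's own Protobuf definition to decode further.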
|
||
{% endcapture %} | ||
|
||
{% include templates/concept.md %} |
Ooh, is this official? 📜 👑
We've bumped into how to restart watch from same point in ruby kubeclient (ManageIQ/kubeclient#275) and I see python client too (kubernetes-client/python#124)...
So far all I read said `resourceVersion` is opaque and it wasn't clear they come from same timeline for different objects and their collection... Unlike List that returns both collection resourceVersion and individual objects' versions, during watch you only see updates to individual objects' versions.
I see experimentally that the collection's resourceVersion as returned by List is de-facto same for all collections, increments on any change to object in any collection, and equals resourceVersion of that last changed object, so this would work.
So, can clients assume collection resourceVersion >= max(obj.resourceVersion for obj in collection)?
Can clients assume watching collection from max(obj.resourceVersion seen during previous watch) will yield same point watch stopped? (as long as that history is not lost...)
More things that would be great if docs explicitly allowed or disallowed:
Or only know that last string seen, over single watch, is semantically latest?
Yes.
You can always attempt to restart from the last resourceVersion you received. If the api server no longer has enough history to let you start from that point, it will return a 410 error and you'll have to relist to get a new fresh resourceVersion
That is correct. You should not make any assumptions about how two resourceVersions relate to each other.
Things you can do:
Things you cannot do:
No
You should not take the max... that involves interpreting the resourceVersion. You should treat it as opaque and remember the most recent one received
@smarterclayton maybe you can incorporate some of the below, or do we have a canonical place for this to be documented to link to?
@cben, answers below.
Per collection, this happens to be true today. However the only operation on resourceVersions that we're promising will work in the indefinite future is that of equality (==). One can imagine data storage techniques where resourceVersions aren't linear or aren't numeric. Please don't bake assumptions about resourceVersions into clients, that's not future proof.
Do not use max; use last seen. Then, yes, that is the way it is intended to function.
Allowed.
NO. Empirically it works... until it doesn't. The cluster administrator can choose to move resource types between different backends, which would cause this to stop being true. The default setup for Events already does this.
NO.
Probably not safe depending on what you want to do, in that there could be things you haven't yet seen over the watch.
Thanks! <3
Right, for watch on single resource this was always clear.
Just to emphasize, I'm asking about watch of a collection, and restarting it from the last version of an individual resource within that collection — which is all you get as you watch.
@lavalamp hello, a quick question: so today, we can say for EntityA, if it has bigger resourceVersion, it is newer (due to etcd's modifiedIndex).
So for the future, will there always be some way to say which version of the EntityA is newer? Because we should have such attribute. I would guess that many applications have this need.
that is not guaranteed by the API
there is creationTimestamp on the object, but other than that, no.
@liggitt would it be possible to add something like that? It's quite common e.g. for public clouds, that entity has something comparable (e.g. updated_on timestamp)
So I wonder why we don't have it here? Is there something blocking it? Or is it just because nobody had the usecase?
Our usecase is we want to process the data in parallel, without something comparable, we are forced to do everything in 1 process (which is quite bad when we have envs with 100k of containers, or other entities)
But even for single process, if we combine the data from API and watches, the data from watches can temporarily give us old data, that we want to throw away, but we can't because we don't know what is newer.
combining data from two streams that differ in time seems problematic. Typically, controllers are driven by watch alone, and if updates to the API hit a conflict error, that means the resource was updated in the meantime, and the controller can simply wait for the next event to arrive via the watch stream
the workqueue used by most of the kubernetes controllers allows parallel processing of watch events... seeing how that is structured might be helpful (https://github.com/kubernetes/client-go/blob/master/examples/workqueue/main.go#L131-L133)
@liggitt Thank you, I'll try to check it out, although my Go skills are not great :-)
In detail what we want to do:
We fetch and save k8s data for various purposes (reporting, chargeback, etc.) into our PostgreSQL DB.
Given we have e.g. 100k pods and the app was down for a while (therefore the watches history is missing), we'd like to fetch it and start watches at the same time. So we get fresh changes immediately, while we are getting the full inventory in the background (saving it all in parallel into our DB)
Fetching 100k pods and associated objects and storing them takes quite some time, so if we need to do this sequentially, the user needs to wait to get fresh changes (in bigger envs it's like 0.5h or more of processing time). Also the sequential processing needs more orchestration and that makes it more complex.
There is no way of saving the data in parallel, if we can't compare the version of entities. So what would you advise here? I am not sure if kubernetes controllers are solving the same issue as described ^ ?
So API and watches are taking the data from different sources? So if we would put e.g. the timestamp to k8s db for pods, the watches would not see it? Only the API query? Or what is the reason we can't have a comparable attribute?