docs: add overview and migrate existing to github #1805
base: develop

@@ -7,10 +7,45 @@ document.

Basic workflow starting from registration is as follows (a conceptual sketch of this flow follows the list):

1. csi-node-driver-registrar retrieves information about the csi plugin (mayastor) using the csi identity service.
2. csi-node-driver-registrar registers the csi plugin with kubelet, passing the plugin's csi endpoint as a parameter.
3. kubelet uses the csi identity and node services to retrieve information about the plugin (including the plugin's ID string).
4. kubelet creates a custom resource (CR) "csi node info" for the CSI plugin.
5. kubelet issues publish/unpublish and stage/unstage volume requests to the CSI plugin when mounting the volume.
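
The following is a rough, illustrative Rust sketch of that handshake. The real interaction happens over the CSI gRPC Identity/Node services on a unix socket; all of the types, names, and paths below are made up for illustration only.

```rust
// All types here are illustrative stand-ins for the CSI gRPC messages.
#[derive(Clone)]
struct PluginInfo {
    name: String,     // the plugin's ID string reported by the identity service
    endpoint: String, // the csi endpoint the registrar hands to kubelet
}

// Step 1: the node-driver-registrar queries the plugin's identity service.
fn get_plugin_info() -> PluginInfo {
    PluginInfo {
        name: "io.openebs.csi-mayastor".to_string(), // example plugin ID
        endpoint: "/csi/csi.sock".to_string(),       // example socket path
    }
}

// Steps 2-4: kubelet records the plugin and creates the "csi node info" CR.
fn register_with_kubelet(info: &PluginInfo, registry: &mut Vec<PluginInfo>) {
    println!("kubelet: registering '{}' at '{}'", info.name, info.endpoint);
    registry.push(info.clone());
}

fn main() {
    let mut kubelet_registry = Vec::new();
    let info = get_plugin_info();
    register_with_kubelet(&info, &mut kubelet_registry);
    // Step 5: kubelet later sends stage/publish volume requests to `info.endpoint`.
}
```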

The registration of the storage nodes (i/o engines) with the control plane is handled
by a gRPC service which is independent of the CSI plugin.

<br>

```mermaid
graph LR;
    PublicApi{"Public<br>API"}
    CO[["Container<br>Orchestrator"]]

    subgraph "Mayastor Control-Plane"
        Rest["Rest"]
        InternalApi["Internal<br>API"]
        InternalServices["Agents"]
    end

    subgraph "Mayastor Data-Plane"
        IO_Node_1["Node 1"]
    end

    subgraph "Mayastor CSI"
        Controller["Controller<br>Plugin"]
        Node_1["Node<br>Plugin"]
    end

    %% Connections
    CO -.-> Node_1
    CO -.-> Controller
    Controller -->|REST/http| PublicApi
    PublicApi -.-> Rest
    Rest -->|gRPC| InternalApi
    InternalApi -.->|gRPC| InternalServices
    Node_1 <--> PublicApi
    Node_1 -.->|NVMe-oF| IO_Node_1
    IO_Node_1 <-->|gRPC| InternalServices
```

> **Reviewer** (on the `Node_1 <--> PublicApi` link): What does this csi-node to REST public API link represent? Where do we do that?
>
> **Author:** It represents exactly that; we've had to do that for one of the latest fixes on v2.7.2, unfortunately, as the CSI volume context becomes stale and it's immutable so we can't modify it.

@@ -0,0 +1,171 @@

# Control Plane Behaviour

This document describes the types of behaviour that the control plane will exhibit under various situations. By
providing a high-level view it is hoped that the reader will be able to more easily reason about the control plane. \
<br>

## REST API Idempotency

Idempotency is a term that is used a lot but is often misconstrued. The following definition is taken from
the [Mozilla Glossary](https://developer.mozilla.org/en-US/docs/Glossary/Idempotent):

> An [HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP) method is **idempotent** if an identical request can be
> made once or several times in a row with the same effect while leaving the server in the same state. In other words,
> an idempotent method should not have any side-effects (except for keeping statistics). Implemented correctly, the `GET`,
> `HEAD`, `PUT`, and `DELETE` methods are idempotent, but not the `POST` method.
> All [safe](https://developer.mozilla.org/en-US/docs/Glossary/Safe) methods are also ***idempotent***.

OK, so making multiple identical requests should produce the same result ***without side effects***. Great, so does the
return value for each request have to be the same? The article goes on to say:

> To be idempotent, only the actual back-end state of the server is considered, the status code returned by each request
> may differ: the first call of a `DELETE` will likely return a `200`, while successive ones will likely return a `404`.

The control plane will behave exactly as described above. If, for example, multiple `create volume` calls are made for
the same volume, the first will return success (`HTTP 200` code) while subsequent calls will return a failure status
code (`HTTP 409` code) indicating that the resource already exists. \
<br>
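
To make that concrete, here is a hedged sketch of a create handler behaving this way. `VolumeStore` and the status mapping are made up for this example and are not the control plane's actual types.

```rust
use std::collections::HashMap;

// Hypothetical in-memory view of existing volumes, keyed by uuid.
struct VolumeStore {
    volumes: HashMap<String, u64>, // uuid -> size in bytes
}

// Status codes mirroring the behaviour described above.
enum Status {
    Ok,       // HTTP 200: volume created
    Conflict, // HTTP 409: resource already exists
}

impl VolumeStore {
    // Repeating an identical create leaves the back-end state unchanged:
    // the first call creates the volume, later identical calls report a
    // conflict without any further side effects.
    fn create_volume(&mut self, uuid: &str, size: u64) -> Status {
        if self.volumes.contains_key(uuid) {
            return Status::Conflict;
        }
        self.volumes.insert(uuid.to_string(), size);
        Status::Ok
    }
}

fn main() {
    let mut store = VolumeStore { volumes: HashMap::new() };
    assert!(matches!(store.create_volume("vol-1", 10 << 20), Status::Ok));
    assert!(matches!(store.create_volume("vol-1", 10 << 20), Status::Conflict));
}
```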

## Handling Failures

There are various ways in which the control plane could fail to satisfy a `REST` request:

- Control plane dies in the middle of an operation.
- Control plane fails to update the persistent store.
- A gRPC request to Mayastor fails to complete successfully. \
<br>

Regardless of the type of failure, the control plane has to decide what it should do:

1. Fail the operation back to the caller but leave any created resources alone.

2. Fail the operation back to the caller but destroy any created resources.

3. Act like kubernetes and keep retrying in the hope that it will eventually succeed. \
<br>

Approach 3 is discounted. If we never responded to the caller it would eventually time out and probably retry itself.
This would likely present even more issues/complexity in the control plane.

So the decision becomes: should we destroy resources that have already been created as part of the operation? \
<br>

### Keep Created Resources

Preventing the control plane from having to unwind operations is convenient as it keeps the implementation simple. A
separate asynchronous process could then periodically scan for unused resources and destroy them.

There is a potential issue with the above described approach. If an operation fails, it would be reasonable to assume
that the user would retry it. Is it possible for this subsequent request to fail as a result of the existing unused
resources lingering (i.e. because they have not yet been destroyed)? If so, this would hamper any retry logic
implemented in the upper layers.

### Destroy Created Resources

This is the optimal approach. For any given operation, failure results in newly created resources being destroyed. The
responsibility lies with the control plane to track which resources have been created and to destroy them in the event
of a failure.

However, what happens if destruction of a resource fails? It is possible for the control plane to retry the operation,
but at some point it will have to give up. In effect the control plane will do its best, but it cannot provide any
guarantee. So does this mean that these resources are permanently leaked? Not necessarily. As in
the [Keep Created Resources](#keep-created-resources) section, there could be a separate process which destroys unused
resources. \
<br>
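
A minimal sketch of this "best effort destroy, then give up" behaviour follows. The retry bound and the `destroy` callback are illustrative rather than the control plane's actual API.

```rust
// Try to destroy a resource a bounded number of times; if every attempt
// fails, report the failure so a background cleanup process can pick the
// resource up later instead of blocking the caller forever.
fn destroy_with_retries<F>(mut destroy: F, max_attempts: u32) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut last_err = String::new();
    for attempt in 1..=max_attempts {
        match destroy() {
            Ok(()) => return Ok(()),
            Err(e) => {
                last_err = format!("attempt {attempt}/{max_attempts} failed: {e}");
            }
        }
    }
    // Give up: the resource is left for an asynchronous garbage collector.
    Err(last_err)
}

fn main() {
    // Simulate a destroy call that always fails, e.g. because the node is down.
    let result = destroy_with_retries(|| Err("node unreachable".to_string()), 3);
    assert!(result.is_err());
}
```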

## Use of the Persistent Store

For a control plane to be effective it must maintain information about the system it is interacting with and take
decisions accordingly. An in-memory registry is used to store such information.

Because the registry is stored in memory, it is volatile - meaning all information is lost if the service is restarted.
As a consequence, critical information must be backed up to a highly available persistent store (for more detailed
information see [persistent-store.md](./persistent-store.md)).

The types of data that need persisting broadly fall into 3 categories:

1. Desired state

2. Actual state

3. Control plane specific information \
<br>

### Desired State

This is the declarative specification of a resource provided by the user. As an example, the user may request a new
volume with the following requirements:

- Replica count of 3

- Size

- Preferred nodes

- Number of nexuses

Once the user has provided these constraints, the expectation is that the control plane should create a resource that
meets the specification. How the control plane achieves this is of no concern to the user; a sketch of such a specification follows.
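
For illustration, such a declarative specification could be captured along these lines. The field names are made up for this example and are not the control plane's actual API types.

```rust
/// Illustrative declarative specification for a volume, as supplied by the user.
struct VolumeSpec {
    /// Number of data copies to maintain.
    replica_count: u8,
    /// Requested size, in bytes.
    size_bytes: u64,
    /// Nodes the user would prefer replicas to be placed on (may be empty).
    preferred_nodes: Vec<String>,
    /// Number of nexuses (volume targets) to create for the volume.
    nexus_count: u8,
}

fn main() {
    // The control plane must either satisfy this spec in full or fail the request.
    let spec = VolumeSpec {
        replica_count: 3,
        size_bytes: 10 * 1024 * 1024 * 1024,
        preferred_nodes: vec!["node-1".into(), "node-2".into()],
        nexus_count: 1,
    };
    println!(
        "{} replica(s) of {} bytes on {:?} with {} nexus(es)",
        spec.replica_count, spec.size_bytes, spec.preferred_nodes, spec.nexus_count
    );
}
```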

So what happens if the control plane is unable to meet these requirements? The operation is failed. This prevents any
ambiguity. If an operation succeeds, the requirements have been met and the user has exactly what they asked for. If the
operation fails, the requirements couldn't be met. In this case the control plane should provide an appropriate means of
diagnosing the issue, i.e. a log message.

What happens to resources created before the operation failed? This is dependent on the chosen failure strategy
outlined in [Handling Failures](#handling-failures).

### Actual State

This is the runtime state of the system as provided by Mayastor. Whenever this changes, the control plane must reconcile
this state against the desired state to ensure that we are still meeting the user's requirements. If not, the control
plane will take action to try to rectify this.
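
As an illustration, a reconcile decision for replica count alone might look like the following sketch. The types are made up; the real reconcilers cover many more properties.

```rust
// Made-up snapshot types: the desired replica count versus what is
// actually observed running in the data plane.
struct Desired { replicas: usize }
struct Actual { healthy_replicas: usize }

enum Action {
    Nothing,
    CreateReplicas(usize),
    RemoveReplicas(usize),
}

// Compare the actual state against the desired state and decide what to do.
fn reconcile(desired: &Desired, actual: &Actual) -> Action {
    if actual.healthy_replicas < desired.replicas {
        Action::CreateReplicas(desired.replicas - actual.healthy_replicas)
    } else if actual.healthy_replicas > desired.replicas {
        Action::RemoveReplicas(actual.healthy_replicas - desired.replicas)
    } else {
        Action::Nothing
    }
}

fn main() {
    // One replica has failed: the control plane should create a new one.
    match reconcile(&Desired { replicas: 3 }, &Actual { healthy_replicas: 2 }) {
        Action::CreateReplicas(n) => println!("create {n} replica(s)"),
        Action::RemoveReplicas(n) => println!("remove {n} replica(s)"),
        Action::Nothing => println!("in sync"),
    }
}
```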

Whenever a user makes a request for state information, it is this state that is returned. (Note: if necessary, an API
may be provided which also returns the desired state.) \
<br>

## Control Plane Information

This information is required to aid the control plane across restarts. It is used to store the state of a resource
independently of the desired or actual state.

The following sequence is followed when creating a resource:

1. Add the resource specification to the store with a state of “creating”

2. Create the resource

3. Mark the state of the resource as “complete”

If the control plane then crashes mid-operation, on restart it can query the state of each resource. Any resources not in
the “complete” state can then be destroyed, as they will be remnants of a failed operation. The expectation here is that
the user will reissue the operation if they wish to (this pattern is sketched below).
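
A hedged sketch of this "write intent first, then act" pattern follows. The store and spec types are made up for illustration; the real control plane persists these specs in its persistent store.

```rust
use std::collections::HashMap;

// Illustrative lifecycle state persisted alongside each resource spec.
#[derive(PartialEq)]
enum SpecState {
    Creating,
    Complete,
}

// Stand-in for the persistent store: resource name -> lifecycle state.
struct Store {
    specs: HashMap<String, SpecState>,
}

impl Store {
    // Steps 1 and 3: record the intent before the work, mark it complete after.
    fn put(&mut self, name: &str, state: SpecState) {
        self.specs.insert(name.to_string(), state);
    }

    // On restart: anything not marked "complete" is a remnant of a failed
    // operation and can safely be destroyed.
    fn incomplete(&self) -> Vec<&String> {
        self.specs
            .iter()
            .filter(|(_, state)| **state != SpecState::Complete)
            .map(|(name, _)| name)
            .collect()
    }
}

fn main() {
    let mut store = Store { specs: HashMap::new() };

    store.put("volume-1", SpecState::Creating); // 1. persist the intent
    // 2. create the resource ... suppose the control plane crashes here,
    //    so step 3 (marking the spec "complete") never happens.

    // On restart the leftover is easy to identify and clean up.
    println!("incomplete specs: {:?}", store.incomplete());
}
```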

Likewise, deleting a resource will look like:

1. Mark the resource as “deleting” in the store

2. Delete the resource

3. Remove the resource from the store

For complex operations like creating a volume, all resources that make up the volume will be marked as “creating”. Only
when all resources have been successfully created will their corresponding states be changed to “complete”. This will
look something like:

1. Add the volume specification to the store with a state of “creating”

2. Add the nexus specifications to the store with a state of “creating”

3. Add the replica specifications to the store with a state of “creating”

4. Create the replicas

5. Create the nexus

6. Mark the replica states as “complete”

7. Mark the nexus states as “complete”

8. Mark the volume state as “complete”

@@ -0,0 +1,47 @@

# DiskPool Custom Resource for K8s

The DiskPool operator is a [K8s] specific component which manages pools in a K8s environment. \
Simplistically, it drives pools across the various states listed below.

In [K8s], mayastor pools are represented as [Custom Resources][k8s-cr], which are an extension on top of the existing [K8s API][k8s-api]. \
This allows users to declaratively create a [diskpool], and mayastor will not only eventually create the corresponding mayastor pool but will
also ensure that it gets re-imported after pod restarts, node restarts, crashes, etc...

> **NOTE**: mayastor pool (msp) has been renamed to diskpool (dsp)

## DiskPool States

> *NOTE*:
> Non-exhaustive enums could have additional variants added in the future. Therefore, when matching against variants of non-exhaustive enums, an extra
> wildcard arm must be added to account for future variants (see the sketch after the list of states below).

- Creating \
The pool is a new OR missing resource, and it has not been created or imported yet. The pool spec ***MAY*** be present but ***DOES NOT*** have a status field.

- Created \
The pool has been created on the designated i/o engine node by the control-plane.

- Terminating \
A deletion request has been issued by the user. The pool will eventually be deleted by the control-plane and eventually the DiskPool Custom Resource will also get removed from the K8s API.

- Error (*Deprecated*) \
The attempt to transition to the next state has exceeded the maximum number of retries. The retry counts are implemented using an exponential back-off, which by default is set to 10. Once the error state is entered, reconciliation stops. Only external events (a new resource version) will trigger a new attempt.
  > NOTE: this state has been deprecated since API version **v1beta1**
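
To illustrate the note above, here is a hedged Rust sketch of matching on a non-exhaustive state enum. The enum is a stand-in for illustration rather than the operator's actual generated CR types.

```rust
// Illustrative stand-in for the diskpool CR state; marked non-exhaustive so
// that new variants can be added in future API versions without breaking
// consumers.
#[non_exhaustive]
#[allow(dead_code)] // `Error` is deprecated and only reached via the wildcard arm
enum PoolState {
    Creating,
    Created,
    Terminating,
    Error, // deprecated since v1beta1
}

fn describe(state: &PoolState) -> &'static str {
    match state {
        PoolState::Creating => "pool not yet created or imported",
        PoolState::Created => "pool created on the designated io-engine node",
        PoolState::Terminating => "deletion requested; waiting for removal",
        // Wildcard arm: accounts for the deprecated `Error` state and for any
        // variants added to the non-exhaustive enum in future versions.
        _ => "deprecated or unknown state",
    }
}

fn main() {
    println!("{}", describe(&PoolState::Creating));
}
```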

## Reconciler actions

The operator responds to two types of events:

- Scheduled \
When, for example, we try to submit a new PUT request for a pool. On failure (e.g., a network error) we will reschedule the operation after 5 seconds (sketched after this section).

- CRD updates \
When the CRD is changed, the resource version changes. This triggers a new reconcile loop. This process is typically known as “watching.”

During these transitions, the operator also emits events to K8s, which can be obtained with kubectl. This gives visibility into the states and their transitions.
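
A hedged sketch of the scheduled-retry idea follows. The types and the 5-second requeue shown here are illustrative; the real operator is built on a Kubernetes controller runtime rather than these hand-rolled types.

```rust
use std::time::Duration;

// Illustrative outcome of one reconcile pass.
enum ReconcileOutcome {
    Done,              // resource is in the desired state
    Requeue(Duration), // transient failure (e.g. network): try again later
}

// One reconcile attempt; `submit_put` stands in for the real call that
// creates or updates the pool via the control plane's API.
fn reconcile(submit_put: impl Fn() -> Result<(), String>) -> ReconcileOutcome {
    match submit_put() {
        Ok(()) => ReconcileOutcome::Done,
        // On failure, reschedule the operation instead of giving up.
        Err(_) => ReconcileOutcome::Requeue(Duration::from_secs(5)),
    }
}

fn main() {
    // Simulate a transient network failure on the first attempt.
    match reconcile(|| Err("connection refused".to_string())) {
        ReconcileOutcome::Requeue(after) => println!("requeue after {after:?}"),
        ReconcileOutcome::Done => println!("pool reconciled"),
    }
}
```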

[K8s]: https://kubernetes.io/
[k8s-cr]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
[k8s-api]: https://kubernetes.io/docs/concepts/overview/kubernetes-api/
[diskpool]: https://openebs.io/docs/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/rs-configuration

> **Reviewer:** Shall we be showing linkages for control-plane agents and csi-node plugin? The ha-node-agent to csi-node one, for example.
>
> **Author:** Wanted this to be CSI only; will add a more complete diagram in another section later on.