diff --git a/README.md b/README.md index 08cec6295..04918949a 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,6 @@ [![Community Meetings](https://img.shields.io/badge/Community-Meetings-blue)](https://us05web.zoom.us/j/87535654586?pwd=CigbXigJPn38USc6Vuzt7qSVFoO79X.1) [![built with nix](https://builtwithnix.org/badge.svg)](https://builtwithnix.org) - ## Table of contents --- @@ -23,7 +22,7 @@ - [Frequently asked questions](/doc/FAQ.md)

-Mayastor is a cloud-native declarative data plane written in Rust. +Mayastor is a cloud-native declarative data plane written in Rust. Our goal is to abstract storage resources and their differences through the data plane such that users only need to supply the what and do not have to worry about the how so that individual teams stay in control. @@ -53,24 +52,30 @@ The official user documentation for the Mayastor Project is published at: [OpenE ## Overview +![OpenEBS Mayastor](./doc/img/overview.drawio.png) + At a high-level, Mayastor consists of two major components. ### **Control plane:** -- A microservices patterned control plane, centered around a core agent which publically exposes a RESTful API. +- A microservices patterned control plane, centered around a core agent and a RESTful API. This is extended by a dedicated operator responsible for managing the life cycle of "Disk Pools" (an abstraction for devices supplying the cluster with persistent backing storage) and a CSI compliant - external provisioner (controller). - Source code for the control plane components is located in its [own repository](https://github.com/openebs/mayastor-control-plane) + external provisioner (controller). \ -- A daemonset _mayastor-csi_ plugin which implements the identity and node grpc services from CSI protocol. + Source code for the control plane components is located in the [controller repository](https://github.com/openebs/mayastor-control-plane). \ + The helm chart as well as other k8s specific extensions (ex: kubectl-plugin) are located in the [extensions repository](https://github.com/openebs/mayastor-extensions). + +- CSI plugins: + - A daemonset _csi-node_ plugin which implements the identity and node services. + - A deployment _csi-controller_ plugin which implements the identity and controller services. ### **Data plane:** -- Each node you wish to use for storage or storage services will have to run an IO Engine daemonset. 
Mayastor itself has - two major components: the Nexus and a local storage component. +- Each node you wish to use for storage or storage services will have to run an I/O Engine instance. The Mayastor data-plane (i/o engine) itself has + two major components: the volume target (nexus) and local storage pools, which can be carved out into logical volumes (replicas) that can in turn be shared with other i/o engines via NVMe-oF. -## Nexus +## Volume Target / Nexus

The Nexus is responsible for attaching to your storage resources and making it available to the host that is @@ -89,7 +94,7 @@ they way we do things. Moreover, due to hardware [changes](https://searchstorage we in fact are forced to think about it. Based on storage URIs the Nexus knows how to connect to the resources and will make these resources available as -a single device to a protocol standard protocol. These storage URIs are generated automatically by MOAC and it keeps +a single device to a protocol standard protocol. These storage URIs are managed by the control-plane and it keeps track of what resources belong to what Nexus instance and subsequently to what PVC. You can also directly use the nexus from within your application code. For example: @@ -138,7 +143,7 @@ buf.as_slice().into_iter().map(|b| assert_eq!(b, 0xff)).for_each(drop);

We think this can help a lot of database projects as well, where they typically have all the smarts in their database engine -and they want the most simple (but fast) storage device. For a more elaborate example see some of the tests in mayastor/tests. +and they want the most simple (but fast) storage device. For a more elaborate example see some of the tests in io-engine/tests. To communicate with the children, the Nexus uses industry standard protocols. The Nexus supports direct access to local storage and remote storage using NVMe-oF TCP. Another advantage of the implementation is that if you were to remove @@ -159,8 +164,8 @@ What model fits best for you? You get to decide!

If you do not have a storage system, and just have local storage, i.e block devices attached to your system, we can consume these and make a "storage system" out of these local devices such that -you can leverage features like snapshots, clones, thin provisioning, and the likes. Our K8s tutorial does that under -the water today. Currently, we are working on exporting your local storage implicitly when needed, such that you can +you can leverage features like snapshots, clones, thin provisioning, and the likes. Our K8s deployment does that under +the water. Currently, we are working on exporting your local storage implicitly when needed, such that you can share storage between nodes. This means that your application, when re-scheduled, can still connect to your local storage except for the fact that it is not local anymore. @@ -192,12 +197,8 @@ In following example of a client session is assumed that mayastor has been started and is running: ``` -$ dd if=/dev/zero of=/tmp/disk bs=1024 count=102400 -102400+0 records in -102400+0 records out -104857600 bytes (105 MB, 100 MiB) copied, 0.235195 s, 446 MB/s -$ sudo losetup /dev/loop8 /tmp/disk -$ io-engine-client pool create tpool /dev/loop8 +$ fallocate -l 100M /tmp/disk.img +$ io-engine-client pool create tpool aio:///tmp/disk.img $ io-engine-client pool list NAME STATE CAPACITY USED DISKS tpool 0 96.0 MiB 0 B tpool @@ -232,5 +233,4 @@ Unless you explicitly state otherwise, any contribution intentionally submitted inclusion in Mayastor by you, as defined in the Apache-2.0 license, licensed as above, without any additional terms or conditions. - [![FOSSA Status](https://app.fossa.com/api/projects/custom%2B162%2Fgithub.com%2Fopenebs%2Fmayastor.svg?type=large&issueType=license)](https://app.fossa.com/projects/custom%2B162%2Fgithub.com%2Fopenebs%2Fmayastor?ref=badge_large&issueType=license) diff --git a/doc/csi.md b/doc/csi.md index 0caf99b84..0b63aed42 100644 --- a/doc/csi.md +++ b/doc/csi.md @@ -7,10 +7,45 @@ document. 
Basic workflow starting from registration is as follows: 1. csi-node-driver-registrar retrieves information about csi plugin (mayastor) using csi identity service. -1. csi-node-driver-registrar registers csi plugin with kubelet passing plugin's csi endpoint as parameter. -1. kubelet uses csi identity and node services to retrieve information about the plugin (including plugin's ID string). -1. kubelet creates a custom resource (CR) "csi node info" for the CSI plugin. -1. kubelet issues requests to publish/unpublish and stage/unstage volume to the CSI plugin when mounting the volume. +2. csi-node-driver-registrar registers csi plugin with kubelet passing plugin's csi endpoint as parameter. +3. kubelet uses csi identity and node services to retrieve information about the plugin (including plugin's ID string). +4. kubelet creates a custom resource (CR) "csi node info" for the CSI plugin. +5. kubelet issues requests to publish/unpublish and stage/unstage volume to the CSI plugin when mounting the volume. -The registration of mayastor storage nodes with control plane (moac) is handled -by a separate protocol using NATS message bus that is independent on CSI plugin. +The registration of the storage nodes (i/o engines) with the control plane is handled +by a gRPC service which is independent of the CSI plugin. + +
+ +```mermaid +graph LR +; + PublicApi{"Public
API"} + CO[["Container
Orchestrator"]] + + subgraph "Mayastor Control-Plane" + Rest["Rest"] + InternalApi["Internal
API"] + InternalServices["Agents"] + end + + subgraph "Mayastor Data-Plane" + IO_Node_1["Node 1"] + end + + subgraph "Mayastor CSI" + Controller["Controller
Plugin"] + Node_1["Node
Plugin"] + end + +%% Connections + CO -.-> Node_1 + CO -.-> Controller + Controller -->|REST/http| PublicApi + PublicApi -.-> Rest + Rest -->|gRPC| InternalApi + InternalApi -.->|gRPC| InternalServices + Node_1 <--> PublicApi + Node_1 -.->|NVMe-oF| IO_Node_1 + IO_Node_1 <-->|gRPC| InternalServices +``` diff --git a/doc/design/control-plane-behaviour.md b/doc/design/control-plane-behaviour.md new file mode 100644 index 000000000..759c5c775 --- /dev/null +++ b/doc/design/control-plane-behaviour.md @@ -0,0 +1,171 @@ +# Control Plane Behaviour + +This document describes the types of behaviour that the control plane will exhibit under various situations. By +providing a high-level view it is hoped that the reader will be able to more easily reason about the control plane. \ +
+ +## REST API Idempotency + +Idempotency is a term used a lot but which is often misconstrued. The following definition is taken from +the [Mozilla Glossary](https://developer.mozilla.org/en-US/docs/Glossary/Idempotent): + +> An [HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP) method is **idempotent** if an identical request can be +> made once or several times in a row with the same effect while leaving the server in the same state. In other words, +> an idempotent method should not have any side-effects (except for keeping statistics). Implemented correctly, the `GET`, +`HEAD`,`PUT`, and `DELETE` methods are idempotent, but not the `POST` method. +> All [safe](https://developer.mozilla.org/en-US/docs/Glossary/Safe) methods are also ***idempotent***. + +OK, so making multiple identical requests should produce the same result ***without side effects***. Great, so does the +return value for each request have to be the same? The article goes on to say: + +> To be idempotent, only the actual back-end state of the server is considered, the status code returned by each request +> may differ: the first call of a `DELETE` will likely return a `200`, while successive ones will likely return a`404`. + +The control plane will behave exactly as described above. If, for example, multiple `create volume` calls are made for +the same volume, the first will return success (`HTTP 200` code) while subsequent calls will return a failure status +code (`HTTP 409` code) indicating that the resource already exists. \ +
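The behaviour described above can be illustrated with a minimal sketch. `VolumeRegistry`, `Status` and the method names are hypothetical and are not the control plane's actual types; the sketch only assumes an in-memory map standing in for the server state. Repeating a request leaves the state unchanged, while the returned status code differs between the first and subsequent calls.

```rust
use std::collections::HashMap;

/// HTTP-like status codes used by this sketch (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Status {
    Ok = 200,
    NotFound = 404,
    Conflict = 409,
}

/// Hypothetical in-memory stand-in for the control plane's volume registry.
#[derive(Default)]
pub struct VolumeRegistry {
    /// Volume id mapped to its size in bytes.
    volumes: HashMap<String, u64>,
}

impl VolumeRegistry {
    /// Create a volume. A repeated call never changes the stored state,
    /// it only changes the status code which is returned.
    pub fn create_volume(&mut self, id: &str, size: u64) -> Status {
        if self.volumes.contains_key(id) {
            return Status::Conflict;
        }
        self.volumes.insert(id.to_string(), size);
        Status::Ok
    }

    /// Delete a volume. The first call removes it, later calls find nothing.
    pub fn delete_volume(&mut self, id: &str) -> Status {
        match self.volumes.remove(id) {
            Some(_) => Status::Ok,
            None => Status::NotFound,
        }
    }

    /// Number of volumes currently known to the registry.
    pub fn len(&self) -> usize {
        self.volumes.len()
    }
}
```

Calling `create_volume` twice with the same id yields `Status::Ok` followed by `Status::Conflict`, and the registry holds exactly one volume in both cases.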
+ +## Handling Failures + +There are various ways in which the control plane could fail to satisfy a `REST` request: + +- Control plane dies in the middle of an operation. +- Control plane fails to update the persistent store. +- A gRPC request to Mayastor fails to complete successfully. \ +
+ +Regardless of the type of failure, the control plane has to decide what it should do: + +1. Fail the operation back to the caller but leave any created resources alone. + +2. Fail the operation back to the caller but destroy any created resources. + +3. Act like Kubernetes and keep retrying in the hope that it will eventually succeed. \ +
+ +Approach 3 is discounted. If we never responded to the caller, it would eventually time out and probably retry by itself. +This would likely present even more issues/complexity in the control plane. + +So the decision becomes: should we destroy resources that have already been created as part of the operation? \ +
+ +### Keep Created Resources + +Preventing the control plane from having to unwind operations is convenient as it keeps the implementation simple. A +separate asynchronous process could then periodically scan for unused resources and destroy them. + +There is a potential issue with the above described approach. If an operation fails, it would be reasonable to assume +that the user would retry it. Is it possible for this subsequent request to fail as a result of the existing unused +resources lingering (i.e. because they have not yet been destroyed)? If so, this would hamper any retry logic +implemented in the upper layers. + +### Destroy Created Resources + +This is the optimal approach. For any given operation, failure results in newly created resources being destroyed. The +responsibility lies with the control plane tracking which resources have been created and destroying them in the event +of a failure. + +However, what happens if destruction of a resource fails? It is possible for the control plane to retry the operation +but at some point it will have to give up. In effect the control plane will do its best, but it cannot provide any +guarantee. So does this mean that these resources are permanently leaked? Not necessarily. Like in +the [Keep Created Resources](#keep-created-resources) section, there could be a separate process which destroys unused +resources. \ +
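A minimal sketch of the "destroy created resources" approach follows. All names (`Resource`, `Failed`, `run_with_unwind`) are hypothetical and chosen for illustration only; the create and destroy callbacks stand in for whatever calls the control plane would really make. On failure, already created resources are destroyed in reverse order on a best-effort basis, and anything that could not be destroyed is reported so that a separate clean-up process can pick it up later.

```rust
/// Hypothetical resource kinds created as part of one operation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Resource {
    Replica(String),
    Nexus(String),
}

/// Outcome of a failed operation after the unwind was attempted.
#[derive(Debug, PartialEq, Eq)]
pub struct Failed {
    /// Why the operation failed.
    pub reason: String,
    /// Resources whose destruction also failed; left for a later clean-up.
    pub leaked: Vec<Resource>,
}

/// Create every resource in `steps`; if one creation fails, try to destroy
/// everything created so far and report what could not be destroyed.
pub fn run_with_unwind<C, D>(
    steps: &[Resource],
    mut create: C,
    mut destroy: D,
) -> Result<Vec<Resource>, Failed>
where
    C: FnMut(&Resource) -> Result<(), String>,
    D: FnMut(&Resource) -> Result<(), String>,
{
    let mut created = Vec::new();
    for step in steps {
        match create(step) {
            Ok(()) => created.push(step.clone()),
            Err(reason) => {
                // Destroy in reverse creation order, best effort only.
                let leaked = created
                    .into_iter()
                    .rev()
                    .filter(|r| destroy(r).is_err())
                    .collect();
                return Err(Failed { reason, leaked });
            }
        }
    }
    Ok(created)
}
```

The `leaked` list makes the "no guarantee" case explicit: the caller still gets a failure, and the leftovers are known rather than silently lost.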
+ +## Use of the Persistent Store + +For a control plane to be effective it must maintain information about the system it is interacting with and make +decisions accordingly. An in-memory registry is used to store such information. + +Because the registry is stored in memory, it is volatile, meaning all information is lost if the service is restarted. +As a consequence, critical information must be backed up to a highly available persistent store (for more detailed +information see [persistent-store.md](./persistent-store.md)). + +The types of data that need persisting broadly fall into 3 categories: + +1. Desired state + +2. Actual state + +3. Control plane specific information \ +
+ +### Desired State + +This is the declarative specification of a resource provided by the user. As an example, the user may request a new +volume with the following requirements: + +- Replica count of 3 + +- Size + +- Preferred nodes + +- Number of nexuses + +Once the user has provided these constraints, the expectation is that the control plane should create a resource that +meets the specification. How the control plane achieves this is of no concern to the user. + +So what happens if the control plane is unable to meet these requirements? The operation fails. This prevents any +ambiguity. If an operation succeeds, the requirements have been met and the user has exactly what they asked for. If the +operation fails, the requirements couldn’t be met. In this case the control plane should provide an appropriate means of +diagnosing the issue, e.g. a log message. + +What happens to resources created before the operation failed? This will be dependent on the chosen failure strategy +outlined in [Handling Failures](#handling-failures). + +### Actual State + +This is the runtime state of the system as provided by Mayastor. Whenever this changes, the control plane must reconcile +this state against the desired state to ensure that we are still meeting the user's requirements. If not, the control +plane will take action to try to rectify this. + +Whenever a user makes a request for state information, it will be this state that is returned (Note: If necessary an API +may be provided which returns the desired state also). \ +
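The comparison between desired and actual state can be sketched as below. `VolumeSpec`, `VolumeState`, `Action` and `reconcile` are hypothetical names, and the sketch assumes the replica count is the only property being reconciled; a real reconciler would consider far more than that.

```rust
use std::cmp::Ordering;

/// Desired state: what the user asked for (hypothetical type).
pub struct VolumeSpec {
    pub replicas: usize,
}

/// Actual state: what the data plane currently reports (hypothetical type).
pub struct VolumeState {
    pub healthy_replicas: usize,
}

/// Corrective action needed to bring the actual state in line with the spec.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    None,
    AddReplicas(usize),
    RemoveReplicas(usize),
}

/// Compare the actual state against the desired state and decide what to do.
pub fn reconcile(spec: &VolumeSpec, state: &VolumeState) -> Action {
    match state.healthy_replicas.cmp(&spec.replicas) {
        Ordering::Less => Action::AddReplicas(spec.replicas - state.healthy_replicas),
        Ordering::Greater => Action::RemoveReplicas(state.healthy_replicas - spec.replicas),
        Ordering::Equal => Action::None,
    }
}
```

For a spec with 3 replicas and a state reporting 2 healthy replicas, `reconcile` returns `Action::AddReplicas(1)`.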
+ +## Control Plane Information + +This information is required to aid the control plane across restarts. It will be used to store the state of a resource +independent of the desired or actual state. + +The following sequence will be followed when creating a resource: + +1. Add resource specification to the store with a state of “creating” + +2. Create the resource + +3. Mark the state of the resource as “complete” + +If the control plane then crashes mid-operation, on restart it can query the state of each resource. Any resource not in +the “complete” state can then be destroyed as they will be remnants of a failed operation. The expectation here will be +that the user will reissue the operation if they wish to. + +Likewise, deleting a resource will look like: + +1. Mark resources as “deleting” in the store + +2. Delete the resource + +3. Remove the resource from the store. + +For complex operations like creating a volume, all resources that make up the volume will be marked as “creating”. Only +when all resources have been successfully created will their corresponding states be changed to “complete”. This will +look something like: + +1. Add volume specification to the store with a state of “creating” + +2. Add nexus specifications to the store with a state of “creating” + +3. Add replica specifications to the store with a state of “creating” + +4. Create replicas + +5. Create nexus + +6. Mark replica states as “complete” + +7. Mark nexus states as “complete” + +8. Mark volume state as “complete” diff --git a/doc/design/control-plane.md b/doc/design/control-plane.md new file mode 100644 index 000000000..fe60f41f3 --- /dev/null +++ b/doc/design/control-plane.md @@ -0,0 +1,480 @@ +# Mayastor Control Plane + +This provides a high-level design description of the control plane and its main components. It does not, for example, in detail, explain how a replica is retired. 
+ +## Background + +The current control implementation started as _"just a [CSI] driver"_ that would provision volumes based on dynamic provisioning requests. The intent was to integrate this [CSI] driver within the `OpenEBS` control plane. As things progressed, it turned out that the control plane which we wanted to integrate into had few control hooks to integrate with. + +As a result, more complex functionality was introduced into [Mayastor itself (the data plane or io-engine)][Mayastor] and [MOAC] (the `CSI` driver). The increasing complexity of `MOAC`, with the implicit dependency on [K8s], made it apparent that we needed to split this functionality out into Mayastor's own specific control plane. + +At the same time, however, we figured out how far the stateless approach in [K8s] could be married with the inherently stateful world of [CAS]. + +We have concluded that we could not implement everything directly using the same existing primitives. However, **we can leverage the same patterns**. + +> "What [K8s] is to (stateless compute) we are to storage." + +We can leverage the majority and implement the specifics elsewhere. A side effect of this is that the control plane is not [K8s] dependent. + +## High-level overview + +The control plane is our locus of control. It is responsible for what happens to volumes as external events, planned or unexpected, occur. The control plane is extensible through agents. By default, several agents are provided and are part of the core services. + +At a high level, the architecture is depicted below. Core and scheduler are so-called agents. Agents implement a function that varies from inserting new specifications to reconciling the desired state. 
+ +```mermaid +graph TD; + LB["Clients"] + CSIController["CSI Controller"] + REST["REST OpenAPI"] + + subgraph Agents["Core Agents"] + HA["HA Cluster"] + Watcher + Core + end + + subgraph StorageNode + subgraph DataPlane["I/O Engine"] + RBAC + Nexus + Pools + Replicas + end + + subgraph DataPlaneAgent["Data Plane Agent"] + CSINode["CSI Node"] + HANode["HA Node Agent"] + end + end + + subgraph AppNode + subgraph DataPlaneAgent_2["Data Plane Agent"] + CSINode_2["CSI Node"] + HANode_2["HA Node Agent"] + end + end + + CSIController --> REST + LB --> REST + REST --> Agents + Agents --> DataPlane + RBAC -.-> Nexus + RBAC -.-> Replicas + Nexus --> Replicas + Replicas -.-> Pools + Agents --> DataPlaneAgent + Agents --> DataPlaneAgent_2 +``` + +Default functionality provided by the control plane through several agents is: + +- Provisioning of volumes according to specification (spec) + +- Ensuring that as external events take place, the desired state (spec) is reconciled + +- Recreates objects (pools, volumes, shares) after a restart of a data plane instance + +- Provides an OpenAPI v3 REST service to allow for customization. + +- Replica replacement + +- Garbage collection + +- CSI driver + +- CRD operator for the interaction with k8s to create pools + +
+ +### Some key points + +- The control plane is designed to be scalable. That is to say, multiple control planes can operate on the same objects, where the control plane guarantees mutual exclusion. This is achieved by applying distributed locks and/or leader elections. This currently is in a “should work” state. However, it is perhaps more practical to use namespacing where each control plane operates on a cluster-ID. + + > _**NOTE**_: support for multiple control planes in a single cluster has been put on hold until further notice + +- The control plane does not take part in the IO path, except when there is a dynamic reconfiguration event. If the control plane cannot be accessed during such an event, the NVMe controller will remain frozen. The time we allow ourselves to retry operations during such an event is determined by the NVMe IO timeout and the controller loss timeout values. + +- The control plane uses well-known, existing technologies as its building blocks. Most notable technologies applied: + + - etcd v3 only. Versions 1 & 2 are not supported and will not get support + + - Written in Rust + + - gRPC + +- We need at least three control nodes, where five is preferred. +The control plane is extensible by adding and removing agents where each agent complements the control plane in some way. +Example: the `HA *` agents allow for volume target failover by reconnecting the initiator to a replacement target. + +
+ +## Persistent Store (KVstore for configuration data) + +The Mayastor Control Plane requires a persistent store for storing information that it can use to make intelligent decisions. \ +A key-value store has been selected as the appropriate type of store. \ +[etcd] is very well known in the industry and provides the strong consistency models required for the control plane. + +> _NOTE_: [etcd] is also a fundamental component of Kubernetes itself. + +Throughout the control plane and data plane design, [etcd] is considered the source of truth. + +Some things to keep in mind when considering a persistent store implementation: + +- **Performance** + - Paxos/Raft consensus is inherently latency-sensitive. Moreover, the KV is memory-mapped, meaning that it suffers greatly from random IO. + - As per their own docs, `etcd is designed to reliably store infrequently updated data…` + - Fortunately, NVMe does not suffer from this; however, it is reasonable to assume that some users will use rotational devices. + - This limitation is acceptable for the control plane as, by design, we shouldn’t be storing information at anywhere near the limits of etcd. + +- **Role-Based Access** + - Who is allowed to list what? Due to the linear keyspace, it is important to address this by using key prefixes. + +- **Queries** + - Range-based queries are encouraged. There is no analogue of tables in KVs. + +- **Notifications** + - Being notified of changes can be very useful to drive on-change reconciler events. + +
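To illustrate why prefixes matter in a linear keyspace, here is a small sketch that uses an ordered in-memory map as a stand-in for the KV store. It does not use the etcd client API, and both the `list_prefix` helper and the `/cluster-a/volume/...` key layout are assumptions made for illustration only. Because keys are sorted, everything sharing a prefix is contiguous, so a "table-like" listing becomes a single range scan.

```rust
use std::collections::BTreeMap;

/// Return all entries whose key starts with `prefix`.
/// The ordered map stands in for the linear keyspace of a KV store.
pub fn list_prefix<'a>(
    store: &'a BTreeMap<String, String>,
    prefix: &str,
) -> Vec<(&'a str, &'a str)> {
    store
        // Start scanning at the first key that is >= the prefix.
        .range(prefix.to_string()..)
        // Keys are sorted, so the first non-matching key ends the scan.
        .take_while(|(k, _)| k.starts_with(prefix))
        .map(|(k, v)| (k.as_str(), v.as_str()))
        .collect()
}
```

With such a layout, access rules can also be expressed per prefix, for example by granting a component access to one cluster's volume keys only.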
+ +### Persistent Information + +There are two categories of information that the control plane wishes to store: + +1. Configuration + - Specification for volumes, pools, etc + - Policies for various scheduling logic + - etc + +2. System state + - Volume states + - Node states + - Pool states + - etc + +#### System State + +The control plane requires visibility of the state of the system in order to make autonomous decisions. For example, should a volume transition from a +healthy state to a degraded state, the control plane could inspect the state of its children and optionally (based on the policy) replace any that are +unhealthy. + +Additionally, this state information would be useful for implementing an early warning system. If any resource (volume, node, pool) changed state, any + etcd watchers would be notified. We could then potentially have a service which watches for state changes and notifies the upper layers (i.e. operators) + that an error has occurred. + +##### Note + +Although initially planned, the system state is not currently persisted in [etcd] as the initial use-case for watchers could be fulfilled +by making use of an internal in-memory cache of objects, thus moving this problem further down the line. \ +Even though etcd is only used for configuration, we've had users with etcd-related performance issues, which would no doubt get exacerbated +even further if we also start placing the _**state in etcd**_. And so this will require very careful _**design**_ and _**consideration**_. + +## Control plane agents + +Agents perform a specific function and concern themselves with a particular problem. There are several agents. The provisioning of a volume (say) involves + pipelining between different agents. Each agent receives a request and produces a response, and the response _MAY_ be the input for a subsequent request. + +Agents can either be internal within the binary or be implemented as separate processes (containers). + +
+ +```mermaid +sequenceDiagram + Actor User + participant REST + + participant Core + participant PStor as PStor (etcd) + participant Scheduler + participant Pool + Participant Replica + + User ->> REST: Put Create + REST ->> Core: Create Request + Core ->> PStor: Insert Spec + PStor ->> Core:  + Core ->> REST:  + REST ->> User: 200 Ok + + alt Core Agent currently handles this + Scheduler -->> PStor: Watch Specs + Scheduler ->> Pool: Pools(s) select + Pool -->> Scheduler:  + Scheduler ->> Replica: Create + Replica -->> Scheduler:  + Scheduler -->> Core:  + Core ->> PStor: Update status + end +``` + +
+ +> _**NOTE**_: As things stand today, the Core agent has taken the role of reconciler and scheduler. + +
+ +## Reconcilers + +Reconcilers implement the logic that drives the desired state to the actual state. In principle it's the same model as the operator framework provided by K8s, however as mentioned, it's tailored towards storage rather than stateless containers. + +Currently, reconcilers are implemented for pools, replicas, nexuses, volumes, nodes and etcd. When a volume enters the degraded state, it is notified of this event and will reconcile as a result of it. The exact heuristics for picking a new replica is likely to be subjective to user preferences. As such, volume objects as stored with the control plane will have fields to control this behaviour. + +```rust +#[async_trait::async_trait] +trait Reconciler { + /// Run the reconcile logic for this resource. + async fn reconcile(&mut self, context: &PollContext) -> PollResult; +} + +#[async_trait::async_trait] +trait GarbageCollect { + /// Run the `GarbageCollect` reconciler. + /// The default implementation calls all garbage collection methods. + async fn garbage_collect(&mut self, context: &PollContext) -> PollResult { + squash_results(vec![ + self.disown_orphaned(context).await, + self.disown_unused(context).await, + self.destroy_deleting(context).await, + self.destroy_orphaned(context).await, + self.disown_invalid(context).await, + ]) + } + + /// Destroy resources which are in the deleting phase. + /// A resource goes into the deleting phase when we start to delete it and stay in this + /// state until we successfully delete it. + async fn destroy_deleting(&mut self, context: &PollContext) -> PollResult; + + /// Destroy resources which have been orphaned. + /// A resource becomes orphaned when all its owners have disowned it and at that point + /// it is no longer needed and may be destroyed. + async fn destroy_orphaned(&mut self, context: &PollContext) -> PollResult; + + /// Disown resources which are no longer needed by their owners. 
+ async fn disown_unused(&mut self, context: &PollContext) -> PollResult; + /// Disown resources whose owners are no longer in existence. + /// This may happen as a result of a bug or manual edit of the persistent store (etcd). + async fn disown_orphaned(&mut self, context: &PollContext) -> PollResult; + /// Disown resources which have questionable existence, for example non reservable replicas. + async fn disown_invalid(&mut self, context: &PollContext) -> PollResult; + /// Reclaim unused capacity - for example an expanded but unused replica, which may + /// happen as part of a failed volume expand operation. + async fn reclaim_space(&mut self, _context: &PollContext) -> PollResult { + PollResult::Ok(PollerState::Idle) + } +} + +#[async_trait::async_trait] +trait ReCreate { + /// Recreate the state according to the specification. + /// This is required when an io-engine instance crashes/restarts as it always starts with no + /// state. + /// This is because it's the control-plane's job to recreate the state since it has the + /// overview of the whole system. + async fn recreate_state(&mut self, context: &PollContext) -> PollResult; +} +``` + +## Data-Plane Agent + +The data plane agent is the trojan horse. It runs on all nodes that want to consume storage provided by Mayastor. +It implements the CSI node specifications, but it will also offer the ability to register it as a service to the control plane. +This provides us with the ability to manipulate the storage topology on the node(s) to control, for example, various aspects of asymmetric namespace +access. + +> _**NOTE**_: the data-plane agent doesn't exist as its own entity per se today, rather we have the csi-node plugin and the agent-ha-node which perform +> the role of what was to become the data-plane agent. 
+ +Consider the following scenario: + +Given: A node(W) is connected to a mayastor NVMe controller on the node(1) + +When: Node(1) needs to be taken out of service + +Then: A new NVMe controller on node(2) that provides access to the same replicas needs to be added to the node(W) + +This can only be achieved if the control plane can provision a new Nexus and then dynamically add a new path to the node. + +```mermaid +graph TD; + subgraph 1 + AppNode_1["App Node"] ==> Node_1["Node 1"] + Node_1 --> Replicas_1[("Replicas")] + style Node_1 fill:#00C853 + end + + subgraph 2 + AppNode_2["App Node"] -.-> Node_2["Node 1"] + Node_2 --> Replicas_2[("Replicas")] + Node_N["Node 2"] --> Replicas_2[("Replicas")] + style Node_2 fill:#D50000 + end + + subgraph 3 + AppNode_3["App Node"] -.-> Node_3["Node 1"] + AppNode_3["App Node"] --> Node_N2 + Node_3 --> Replicas_3[("Replicas")] + Node_N2["Node 2"] --> Replicas_3[("Replicas")] + style Node_3 fill:#D50000 + style Node_N2 fill:#00C853 + end + + subgraph 4 + Node_4["Node 1"] + AppNode_4["App Node"] ==> Node_N3 + Node_N3["Node 2"] --> Replicas_4[("Replicas")] + style Node_4 fill:#D50000 + style Node_N3 fill:#00C853 + end +``` + +The above picture depicts the sequence of steps. The steps are taken by the control plane but executed by the agent. +The value add is not the ANA feature itself, but rather what you do with it. + +## NATS & Fault management + +We used to use NATS as a message bus within mayastor as a whole, but have since switched to gRPC for p2p communications. \ +We will continue to use NATS for async notifications. Async in the sense that we send a message, but we do NOT wait for a reply. This mechanism does not + do any form of "consensus," retries, and the like. Information transported over NATS will typically be error telemetry that is used to diagnose problems. No work has started yet on this subject. 
+ +At a high level, error detectors are placed in the parts of the code where it makes sense; for example, consider the following: + +```rust +fn handle_failure( + &mut self, + child: &dyn BlockDevice, + status: IoCompletionStatus, +) { + // We have experienced a failure on one of the child devices. We need to + // ensure we do not submit more IOs to this child. We do not + // need to tell other cores about this because + // they will experience the same errors on their own channels, and + // handle it on their own. + // + // We differentiate between errors in the submission and completion. + // When we have a completion error, it typically means that the + // child has lost the connection to the nexus. In order for + // outstanding IO to complete, the IO's to that child must be aborted. + // The abortion is implicit when removing the device. + if matches!( + status, + IoCompletionStatus::NvmeError( + NvmeCommandStatus::GenericCommandStatus( + GenericStatusCode::InvalidOpcode + ) + ) + ) { + debug!( + "Device {} experienced invalid opcode error: retiring skipped", + child.device_name() + ); + return; + } + let retry = matches!( + status, + IoCompletionStatus::NvmeError( + NvmeCommandStatus::GenericCommandStatus( + GenericStatusCode::AbortedSubmissionQueueDeleted + ) + ) + ); +} +``` + +In the above snippet, we do not handle any errors other than aborted submissions, and we silently ignore invalid opcodes. If, for example, we experience a class of + error, we would emit an error report. Example classes are: + +```text +err.io.nvme.media.* = {} +err.io.nvme.transport.* = {} +err.io.nexus.* = {} +``` + +Subscribers to these events will keep track of payloads and apply corrective actions. In its most simplistic form, it results in a model where one can +define, per error class, an action that needs to be taken. This error handling can be applied to IO but also to agents. + +The content of the event can vary, containing some general metadata fields, as well as event specific information. 
+Example of the event message capsule: + +```protobuf +// Event Message +message EventMessage { + // Event category + EventCategory category = 1; + // Event action + EventAction action = 2; + // Target id for the category against which action is performed + string target = 3; + // Event meta data + EventMeta metadata = 4; +} +``` + +An up to date API of the event format can be fetched + [here](https://github.com/openebs/mayastor-dependencies/blob/develop/apis/events/protobuf/v1/event.proto). + +## Distributed Tracing + +Tracing means different things at different levels. In this case, we are referring to tracing across component boundaries. + +Tracing is implemented using OpenTelemetry and, by default, we provide a subscriber for Jaeger. From Jaeger, the information can be +forwarded to Elasticsearch, Cassandra, Kafka, or other backends. In order to achieve full tracing support, all the gRPC requests and replies should add +HTTP headers such that we can easily tie them together in whatever tooling is used. This is standard practice but requires a significant amount of work. +The key reason is to ensure that all requests and responses pass along the headers, from REST to the scheduling pipeline. + +We also need to support several types of transport and serialization mechanisms. For example, HTTP/1.1 REST requests to HTTP/2 gRPC requests to + a KV store operation to etcd. For this, we will use [Tower]. \ +[Tower] provides a not-so-easy-to-use abstraction of Request to Response mapping. + +```rust +pub trait Service<Request> { + /// Responses given by the service. + type Response; + /// Errors produced by the service. + type Error; + /// The future response value. + type Future: Future<Output = Result<Self::Response, Self::Error>>; + /// Returns `Poll::Ready(Ok(()))` when the service is able to process requests. + fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>>; + /// Implementations are permitted to panic if `call` is invoked without + /// obtaining `Poll::Ready(Ok(()))` from `poll_ready`. 
+ fn call(&mut self, req: Request) -> Self::Future; +} +``` + +The provided services can then be layered with additional functions that add the required metadata as the request propagates through the system. + +```rust +pub trait Layer<S> { + /// The service for which we want to insert a new layer + type Service; + /// the implementation of the layer itself + fn layer(&self, inner: S) -> Self::Service; +} +``` + +An example where a `REST` client sets the open tracing key/values on the request before it is sent: + +```rust +let layer = TraceLayer::new_for_http().make_span_with(|request: &Request<_>| { + tracing::debug_span!( + "HTTP", + http.method = %request.method(), + http.url = %request.uri(), + http.status_code = tracing::field::Empty, + // otel is a mandatory key/value + otel.name = %format!("HTTP {}", request.method()), + otel.kind = %SpanKind::Client, + otel.status_code = tracing::field::Empty, + ) +}); +``` + +[MOAC]: https://github.com/openebs/moac +[K8s]: https://kubernetes.io/ +[CSI]: https://github.com/container-storage-interface/spec +[Mayastor]: ./mayastor.md +[CAS]: https://openebs.io/docs/2.12.x/concepts/cas +[Tower]: https://docs.rs/tower/latest/tower/ +[etcd]: https://etcd.io/ diff --git a/doc/design/k8s/diskpool-cr.md b/doc/design/k8s/diskpool-cr.md new file mode 100644 index 000000000..6ed8fb06c --- /dev/null +++ b/doc/design/k8s/diskpool-cr.md @@ -0,0 +1,47 @@ +# DiskPool Custom Resource for K8s + +The DiskPool operator is a [K8s] specific component which manages pools in a K8s environment. \ +Simplistically, it drives pools across the various states listed below. + +In [K8s], mayastor pools are represented as [Custom Resources][k8s-cr], which is an extension on top of the existing [K8s API][k8s-api]. \ +This allows users to declaratively create a [diskpool], and mayastor will not only eventually create the corresponding mayastor pool but will +also ensure that it gets re-imported after pod restarts, node restarts, crashes, etc... 
+ +> **NOTE**: mayastor pool (msp) has been renamed to diskpool (dsp) + +## DiskPool States + +> *NOTE* +> Non-exhaustive enums could have additional variants added in the future. Therefore, when matching against variants of non-exhaustive enums, an extra +> wildcard arm must be added to account for future variants. + +- Creating \ +The pool is a new OR missing resource, and it has not been created or imported yet. The pool spec ***MAY*** be present but ***DOES NOT*** have a status field. + +- Created \ +The pool has been created in the designated i/o engine node by the control-plane. + +- Terminating \ +A deletion request has been issued by the user. The pool will eventually be deleted by the control-plane and eventually the DiskPool Custom Resource will also get removed from the K8s API. + +- Error (*Deprecated*) \ +The attempt to transition to the next state has exceeded the maximum number of retries. The retry counts are implemented using an exponential back-off, which by default is set to 10. Once the error state is entered, reconciliation stops. Only external events (a new resource version) will trigger a new attempt. + > NOTE: this State has been deprecated since API version **v1beta1** + +## Reconciler actions + +The operator responds to two types of events: + +- Scheduled \ +When, for example, we try to submit a new PUT request for a pool. On failure (i.e., network) we will reschedule the operation after 5 seconds. + +- CRD updates \ +When the CRD is changed, the resource version is changed. This will trigger a new reconcile loop. This process is typically known as “watching.” + +- Observability \ +During the transition, the operator will emit events to K8s, which can be obtained by kubectl. This gives visibility into the state and its transitions. 
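
The wildcard-arm note under "DiskPool States" can be illustrated with a minimal sketch. The `PoolState` enum and the `describe` function are hypothetical names that mirror the states listed above; they are not the operator's actual types, which may differ.

```rust
/// Hypothetical mirror of the DiskPool states listed above.
/// `#[non_exhaustive]` signals that more variants may be added later.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PoolState {
    Creating,
    Created,
    Terminating,
    /// Deprecated since API version v1beta1.
    Error,
}

/// Maps a state to a short description (wording is illustrative only).
pub fn describe(state: PoolState) -> &'static str {
    match state {
        PoolState::Creating => "not created or imported yet",
        PoolState::Created => "created on the io-engine node",
        PoolState::Terminating => "deletion requested",
        // Wildcard arm: covers `Error` today and any variant added later,
        // so callers keep compiling when the enum grows.
        _ => "unknown or unsupported state",
    }
}
```

Calling `describe(PoolState::Error)` falls through to the wildcard arm, which is the same path a future, not-yet-known variant would take.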
+ +[K8s]: https://kubernetes.io/ +[k8s-cr]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/ +[k8s-api]: https://kubernetes.io/docs/concepts/overview/kubernetes-api/ +[diskpool]: https://openebs.io/docs/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/rs-configuration diff --git a/doc/design/k8s/kubectl-plugin.md b/doc/design/k8s/kubectl-plugin.md new file mode 100644 index 000000000..5529a9b0e --- /dev/null +++ b/doc/design/k8s/kubectl-plugin.md @@ -0,0 +1,179 @@ +# Kubectl Plugin + +## Overview + +The kubectl-mayastor plugin follows the instructions outlined in +the [K8s] [official documentation](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/). + +The name of the plugin binary dictates how it is used. From the documentation: +> For example, a plugin named `kubectl-foo` provides a command `kubectl foo`. + +In our case the name of the binary is specified in the Cargo.toml file as `kubectl-mayastor`, therefore the command is +`kubectl mayastor`. + +This document outlines all workflows and interactions between the plugin, the Mayastor control plane, and [K8s]. +It provides a high-level overview of the plugin's general operation, the features it currently supports, and how + these features integrate with the APIs. + +This is the general flow of the request to generate an output from the plugin: + +1. The flow starts with the CLI command, to be entered from console. + +2. The respective command is supposed to hit the specific API endpoint dedicated for that purpose. + +3. The API request is then forwarded to the Core Agent of the Control Plane. + +4. Core Agent is responsible for the further propagation of the request based on its METHOD and purpose. + +5. A GET request would not bring in any change in spec or state, it would get the needed information from registry and + return it as a response to the request. + +6. 
A PUT request would bring a change in the spec, and thus a synchronous action would be performed by mayastor. + And updated spec and state would thus be returned as a response. + +> ***NOTE***: A command might have targets other than the Core Agent, and it might not even be sent to the +> control-plane, example: could be sent to a K8s endpoint. + +For a list of commands you can refer to the +docs [here](https://github.com/openebs/mayastor-extensions/blob/HEAD/k8s/plugin/README.md#usage). + +## Command Line Interface + +Some goals for the kubectl-mayastor plugin are: + +- Provide an intuitive and user-friendly CLI for Mayastor. +- Function in similar ways to existing Kubernetes CLI tools. +- Support common Mayastor operations. + +> **NOTE**: There are many principles for a good CLI. An interesting set of guidelines can be +> seen [here](https://clig.dev/) for example. + +All the plugin commands are verb based, providing the user with a similar experience to +the official [kubectl](https://kubernetes.io/docs/reference/kubectl/#operations). + +All the plugin commands and their arguments are defined using a very powerful cli library: [clap]. +Some of these features are: + +- define every command and their arguments in a type-safe way +- add default values for any argument +- custom long and short (single letter) argument names +- parse any argument with a powerful value parser +- add custom or well-defined possible values for an argument +- define conflicts between arguments +- define requirements between arguments +- flatten arguments for code encapsulation +- many more! + +Each command can be output in either `tabled`, `JSON` or `YAML` format. +The `tabled` format is mainly useful for human usage where the others allow for integration with tools (ex: jq, yq) which +can capture, parse and filter. + +Each command (and sub-commands) accepts the `--help | -h` argument, which documents the operation and the supported +arguments. 
+ +> **NOTE**: Not all commands and their arguments are as well documented as we'd wish, and any help improving this would +> be very welcome! \ +> We can also consider auto-generating CLI documenting as markdown. + +## Connection to the K8s Cluster + +Exactly like the K8s kubectl, the kubectl-mayastor plugin runs on the users' system whereas mayastor is running in the K8s cluster. +A mechanism is then required in order to bridge this gap and allow the plugin to talk to the mayastor services running in the cluster. + +The plugin currently supports 2 distinct modes: + +1. Kube ApiServer Proxy +2. Port Forwarding + +### Kube ApiServer Proxy + +It's built-in to the K8s apiserver and allows a user outside of the cluster to connect via the apipserver to a clusterIp which would otherwise +be unreachable. +It proxies using HTTPS and it's capable of doing load balancing for service endpoints. + +```mermaid +graph LR + subgraph Control Plane + APIServer["Api Server"] + end + + subgraph Worker Nodes + Pod_1["pod"] + Pod_2["pod"] + Pod_3["pod"] + SLB["Service
LB"] + end + + %% These don't display on GitHub :( + %%Internet() + %%User() + + User ==> |"kubectl"| APIServer + User -.- |proxied| Pod_1 + APIServer -.-> |"kubectl"| Pod_1 + Internet --> SLB + SLB --> Pod_1 + SLB --> Pod_2 + SLB --> Pod_3 +``` + +Above we highlight the difference between this approach and a load balancer service which exposes the IP externally. +You can try this out yourself with the [kubectl-plugin][kubectl-proxy]. + +### Port Forwarding + +K8s provides [Port Forwarding](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/) to access +applications in a cluster. +This works by forwarding local ports to the cluster. + +You can try this out yourself with the [kubectl-plugin][kubectl-port-forward]. + +> *NOTE*: kubectl port-forward is currently implemented for TCP ports only. + +
+ +## Distribution + +We distribute the plugin in similar ways to what's recommended by the kubectl plugin docs: + +1. Krew \ + [Krew] offers a cross-platform way to package and distribute your plugins. This way, you use a single packaging format + for all target platforms (Linux, Windows, macOS etc) and deliver updates to your users. \ + Krew also maintains a plugin index so that other people can discover your plugin and install it. +2. "Naked" binary packaged in a tarball \ + This is available as a [GitHub] release asset for the specific version: \ + `vX.Y.Z: https://github.com/openebs/mayastor/releases/download/v$X.$Y.$Z/kubectl-mayastor-$platform.tar.gz` \ + For example, the x86_64 plugin for v2.7.3 can be + retrieved [here](https://github.com/openebs/mayastor/releases/download/v2.7.3/kubectl-mayastor-x86_64-linux-musl.tar.gz). +3. Source code \ + You can download the source code for the released version and build it yourself. \ + You can check the build docs for reference [here](../../build-all.md#building). + +## Supported Platforms + +Although the mayastor installation is only officially supported for Linux x86_64 at the time of writing, the plugin +actually supports a wider range of platforms. \ +This is because although most production K8s clusters are running Linux x86_64, users and admins may interact with the +clusters from a wider range of platforms. 
+ +- [x] Linux + - [x] x86_64 + - [x] aarch64 +- [x] MacOs + - [x] x86_64 + - [x] aarch64 +- [ ] Windows + - [x] x86_64 + - [ ] aarch64 + +[K8s]: https://kubernetes.io/ + +[clap]: https://docs.rs/clap/latest/clap/ + +[GitHub]: https://github.com/openebs/mayastor + +[Krew]: https://krew.sigs.k8s.io/ + +[kubectl-proxy]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#proxy + +[kubectl-port-forward]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#port-forward diff --git a/doc/lvm.md b/doc/design/lvm.md similarity index 86% rename from doc/lvm.md rename to doc/design/lvm.md index 879415c6c..0ff2e3a42 100644 --- a/doc/lvm.md +++ b/doc/design/lvm.md @@ -50,24 +50,25 @@ thin provisioning) within Mayastor.\ Users can resize volumes online. Snapshots are managed transparently. -Features ---- - -- [x] Pool Operations - - [x] Create - - [x] Destroy - - [x] Import - - [x] List -- [x] Replica Operations - - [x] Create - - [x] Destroy - - [x] Share/Unshare - - [x] Resize - - [x] List -- [ ] Thin Provisioning -- [ ] Snapshots -- [ ] Clones -- [ ] RAIDx +## Features + +- [ ] I/O Engine (data-plane) + - [x] Pool Operations + - [x] Create + - [x] Destroy + - [x] Import + - [x] List + - [x] Replica Operations + - [x] Create + - [x] Destroy + - [x] Share/Unshare + - [x] Resize + - [x] List + - [ ] Thin Provisioning + - [ ] Snapshots + - [ ] Clones + - [ ] RAIDx +- [ ] Control-Plane ### Limitation @@ -97,9 +98,9 @@ graph TD; end subgraph Physical Volumes - PV_1 --> VG_1["Volume Group - VG 1"] - PV_2 --> VG_1 - PV_3 --> VG_2["Volume Group - VG 2"] + PV_1["PV 1"] --> VG_1["Vol Group 1"] + PV_2["PV 2"] --> VG_1 + PV_3["PV 3"] --> VG_2["Vol Group 2"] end subgraph Node1 diff --git a/doc/design/mayastor.md b/doc/design/mayastor.md new file mode 100644 index 000000000..c4486f2ad --- /dev/null +++ b/doc/design/mayastor.md @@ -0,0 +1,366 @@ +# Mayastor I/O Engine + +Here we explain how things work in the mayastor data-plane, particularly how it 
interfaces with `xPDK`. It discusses the +deep internals of mayastor before going into the implementation of the `Nexus`. \ +The goal is not to ensure that everyone fully understands the inner workings of mayastor but for those who would +like to understand it in more detail can use it to get started. + +Contributions to these documents are very much welcome, of course, the better we can explain it to ourselves, the better +we can explain it to our users! + +Our code, as well as [SPDK], is in a high state of flux. For example, the thread library did not exist when we started +to use [SPDK], so keep this in mind. + +## Table of Contents + +- [Memory](#memory) + - [What if we are not using NVMe devices?](#what-if-we-are-not-using-nvme-devices) +- [Lord of the rings](#lord-of-the-rings) +- [Cores](#cores) +- [Reactor](#reactor) +- [Mthreads](#mthreads) +- [IO channels](#io-channels) +- [Passing block devices to mayastor](#passing-block-devices-to-mayastor) +- [Userspace IO](#userspace-io) +- [VF-IO](#vf-io) +- [Acknowledgments](#acknowledgments) + +## Memory + +The first fundamental understanding that requires some background information at best is how `xPDK` uses/manages memory. +During the start, we allocate memory from huge pages. This is not ideal from a "run everywhere" deployment, but it is +fundamental for achieving high performance for several reasons: + +The huge pages result in less [TLB] misses that increase performance significantly. We are not unique in using these. In +fact, the first use of cases for huge pages is found in the databases world. These DBs typically hold a huge amount of +memory, and if you know upfront that you are going to do so, it's going to be more efficient to have 2MB pages than 4KB +pages. + +An undocumented feature of huge pages is that they can/are be pinned in memory. This is required if you want to [DMA] +from userspace buffers to HW. Why? 
Well – if you write code that says write this range of memory (defined in [SGL]) and +the data is moved to a different location by the memory management system, you would get… Garbage. As we deal (not +always) with NVMe userspace drivers, we want to [DMA] buffers straight into the device. Without huge pages, this would not +be possible. + +IO buffers and message queues are pre-allocated during startup. The huge pages are mapped into +a list of regions, from which all allocations are made. IOW, we have within the system our own memory allocator. +All the IO's are, for the most part, pre-allocated, which means that during the actual IO path, no allocations are +happening at all. This can be seen within mayastor when you create a new `DmaBuf` struct; it does not call `Box` or +`libc::malloc()`. The `Drop` implementation does not `free()` the memory but rather puts the buffer back on the unused/free list. + +The above illustrates what was described previously: 22 million [TLB] misses with 4KB pages versus 0 with 2MB pages. This immediately shows +the benefit of using huge pages in terms of performance, but remember that it is not only about performance: +they are also required to be able to do [DMA] transfers from memory to the NVMe device. + +### What if we are not using NVMe devices? + +When we are not using NVMe devices, we would, in theory, not need the huge pages for [DMA] but only for performance. +For cases where the performance requirements are not very high, this would be fine. However, transparent switching +to/from huge pages when needed is a significant amount of work. Setting up the requirements of the huge +pages is not hard but inconvenient at best. More so, as k8s does not handle them very well right now. + +## Lord of the rings + +As with most, if not all, parallel systems, shared state is a problem. If you use locks over the shared state, then the +parallelism level will be limited by the "hotness" of the shared state. 
Fortunately, there are lockless algorithms that +allow for `lockless` approaches that are less expensive than, e.g., a `Mutex`. They are less expensive, not zero, as +they use atomic operations, which are more expensive than non-atomic operations. One such algorithm we use is a `lockless +ring buffer` – the implementation of these buffers is out of scope, but you can find more information here that details +a design for a ring buffer but is not used with `xPDK`. + +As mentioned in the memory section, we pre-allocate all memory we need for the IO path during startup. These pre +allocations are put in so-called pools, and you can take and give to/from the pool – without holding locks, as these +pools are implemented using these lockless ring buffers. Needless to say, you don't want to constantly put/take from the +pool because even though atomic, there is an inherent cost to using atomics. + +![4k vs 2M TLB Misses](../img/4kVS2m-tlb-misses.png) + +The above picture illustrates the layout where we start from the huge pages, where on top, we have several APIs to +allocate (malloc) from those huge pages. In turn, this API is used to create a pool of pre-allocated objects of +different sizes and are put in the pool. Each pool is identified with a different name. + +Using these pools, we can create a smaller subset of lockless pools and assign them per core. Or, phrased differently, a +CPU local cache of elements taken out of the pool via the put/get API. Once taken out of the pool, no other CPU access +those objects, and we do not need to lock them once local to ourselves. The contract here that we need to adhere too +though, is that what is local to us should stay local, IOW we as programmers should ensure that we don't reference an +object between different CPUs. + +## Cores + +Deep within the `xPDK` library, a bootstrapping code handles the claiming of the huge pages and sets up several threads +of +execution on a per-core basis. 
The library knows how many cores to use based on a so-called core mask. Let us assume we +have a 4 core CPU. When we start mayastor with a core mask of 0x1, only the first core (core 0) will be +"bootstrapped." If we were to supply 0x3, then core 0 and core 1 will be used (0x3 == 0b0011), and so forth. But what +actually happens? If we leave the memory allocations out of it, not all that much! + +Using the core mask, we, just like any other application, use OS threads. However, what is different is that in the case +of `mask=0x3`, a thread will be created, and through OS-specific system calls, we tell the OS that this thread may only +execute on core 1. In mayastor, this is handled within the `core::env.rs` file. Once the thread is started – it will wait +to receive a single function to execute. If that function completes, the created thread will return, just as with any +other thread. No magic here. + +With 0x3, it means we need to create one additional thread because when we start the program, we already have at least +one thread. The additional threads we create are called "remote threads," and in our `core::env.rs` file, we have a +function called `launch_remote()`. So all we really do is, based on the core mask, create one thread fewer than the number of cores in the mask, "pin" each +to its core, and execute the launch remote function on each remote core. + +The master core will do some other work (e.g., start gRPC) and eventually call a function similar to +`launch_remote()` – that is, a function that returns when completed. + +The question might be: why? Why would you not have the OS decide what core is best to execute on? Is that not what an OS +is supposed to do? Typically, yes; however, there are other things to consider (NUMA) but also the fact that if we keep +the thread on the same CPU, we avoid context switch overheads (i.e., the OS moving us from core N to core M). This, in +turn, reduces [TLB] misses and all the things related to it. In short, it's the locality principle all over again. 
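
The core-mask handling described above can be sketched as follows. This is a minimal illustration, not mayastor's code: `cores_from_mask` and `remote_cores` are hypothetical helpers, and the sketch assumes that the first core in the mask is the one the already-running main thread uses.

```rust
/// Returns the core indices selected by `mask`: bit `n` set means core `n`.
pub fn cores_from_mask(mask: u64) -> Vec<u32> {
    (0..u64::BITS)
        .filter(|&bit| mask & (1u64 << bit) != 0)
        .collect()
}

/// Cores that need an extra ("remote") thread. Assumption: the first
/// selected core is served by the main thread, so it is skipped here.
pub fn remote_cores(mask: u64) -> Vec<u32> {
    cores_from_mask(mask).into_iter().skip(1).collect()
}
```

With these helpers, `cores_from_mask(0x3)` yields `[0, 1]` and `remote_cores(0x3)` yields `[1]`: one additional pinned thread next to the main one.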
+ +For optimal performance, we also need to tell the operating system to pin us to that core and not schedule anything else +on it! This seems like an ideal job for k8s, but unfortunately, it can't, so we have to configure the system to boot +with an option called `isolcpus`. But it's not required; performance would be impaired. + +## Reactor + +So what is launch local, or remote for that matter, supposed to do? Well, it would need to run in a loop; otherwise, the +program would exit right away. So on each of these cores, we have one data structure called a reactor. The reactor is +the main data structure that we use to keep track of things that we need to do, or, for example, our entry point to shut +down when some hits ctrl+c. + +This reactor calls `poll()` in a loop. Poll what? Network connections and yet again, another set of rings. We will go +into more detail later, but for now, it's sufficiently accurate. + +The main thread is responsible for creating the reactors. How many? – the same as the number as value as the core mask. +In mayastor, this looks like this: + +```rust +self.initialize_eal(); + +info!( + "Total number of cores available: {}", + Cores::count().into_iter().count() +); + +// setup our signal handlers +self.install_signal_handlers().unwrap(); + +// allocate a Reactor per core +Reactors::init(); + +// launch the remote cores if any. note that during init these have to +// be running as during setup cross call will take place. +Cores::count() + .into_iter() + .for_each(|c| Reactors::launch_remote(c).unwrap()); +``` + +The last lines start the remote reactors and, as mentioned, call poll. The main thread will go off and do some other +things but eventually will also join the game and start calling poll. 
As a result, what we end up with is a set of +threads, which are pinned to a specific core – running in a loop doing nothing else but reading from and writing to network sockets +and calling functions that are placed within, as mentioned, another set of rings. To understand what rings, we have to +introduce a new concept called "threads." Huh?! We already talked about `threads` did we not? Well, if you think we use +poor naming schemes that can confuse people pretty badly, [SPDK] is no different; [SPDK] has its own notion of threads. +In mayastor, these things are called `mthreads` (Mayastor `threads`). + +## Mthreads + +To make things confusing, this part is about so-called "threads." But not the ***threads*** you are used to, rather, +[SPDK] threads. These threads are a subset of a msg pool and a subset of all socket connections for a particular core. +To reiterate, we already established that a reactor is a per-core structure that is our entry point for housekeeping, if +you will. \ +If we look into the code, we can see that the reactor has several fields but the most important is the `Vec<Mthread>` +field. + +```rust +struct Reactor { + // the core number we run on + core: u32, + // units of work that belong to this reactor + threads: RefCell<Vec<Mthread>>, + // the current state of the reactor + state: AtomicCell<ReactorState>, +} + +impl Reactor { + /// poll the mthreads for any incoming work + fn poll(&self) { + self.threads.borrow().iter().for_each(|t| { + t.poll(); + }); + } +} +``` + +```mermaid +graph TD + subgraph Core + MsgPool(["Per Core
Msg Pool"]) + PThread>"PThread"] + end + + subgraph "spdk_thread MThread" + Messages("Messages") + Poll_Group["Poll Group"] + Poll_Fn[["Poll Fn"]] + Sockets{"Sockets"} + end + +%% Connections + MsgPool <-.-> Messages + Messages --- Messages + Messages <==> Poll_Group +``` + +The above picture with the including code snippet, hopefully, clears it up somewhat. The reactor structure (per core) +keeps track of a set of `Mthreads`, which are broken down into: + +1. messages: These are functions to be called based on packets read or written to/from the network connections or + explicitly put there by internal functions calls or RPC calls. All these events have the same layout + +2. poll_groups: a set of sockets that are polled every time we poll the thread to read/write data to/from the network + +3. poll_fn: functions that are called constantly, within a specific interval. + +So how many mthreads do we have? Well, as many as you want, but no more than strictly needed. For each reactor, we, for +example, create a thread to handle NVMF connections on that core. We could argue that this mthread is the nvmf thread +for that core. All that core does is handle nvmf work. Similarly, we create one for iSCSI. The idea is that you can +strictly control what core does what by controlling where a thread is started. + +This then implies that each core, independently of other cores, can do storage IO which gets us the linear scalability +we need to achieve these low latency values. However, there is one more thing to consider; we now have this +shared-nothing, lockless model such that every core, in effect, can do whatever it wants to do with the device +underneath. But surely, there has to be some synchronisation, right? For example, Lets say we want to "pause" the device +not to accept any IO? We would need to send each core a message to tell each thread on that core that might be doing IO +to that core, and it needs to, well, pause. 
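
The "pause every core" scenario above can be sketched with plain OS threads and channels. This is only an analogy under stated assumptions: real reactors poll lockless rings instead of blocking on `std::sync::mpsc`, and `CoreMsg` and `pause_all` are hypothetical names, not mayastor or SPDK APIs.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical control message delivered to every per-core worker.
pub enum CoreMsg {
    Pause,
    Shutdown,
}

/// Spawns `cores` workers, sends each one a `Pause`, and returns the
/// sorted ids of the workers that acknowledged it.
pub fn pause_all(cores: usize) -> Vec<usize> {
    let (ack_tx, ack_rx) = mpsc::channel::<usize>();
    let mut senders = Vec::new();
    let mut handles = Vec::new();

    for core in 0..cores {
        let (tx, rx) = mpsc::channel::<CoreMsg>();
        let ack_tx = ack_tx.clone();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            // Stand-in for a reactor loop: handle messages until shutdown.
            for msg in rx {
                match msg {
                    CoreMsg::Pause => ack_tx.send(core).unwrap(),
                    CoreMsg::Shutdown => break,
                }
            }
        }));
    }
    drop(ack_tx);

    // Broadcast the request, then wait for exactly one ack per core.
    for tx in &senders {
        tx.send(CoreMsg::Pause).unwrap();
    }
    let mut acked: Vec<usize> = ack_rx.iter().take(cores).collect();
    acked.sort_unstable();

    // Stop the workers and wait for them to exit.
    for tx in &senders {
        tx.send(CoreMsg::Shutdown).unwrap();
    }
    for handle in handles {
        handle.join().unwrap();
    }
    acked
}
```

The caller only continues once every worker has acknowledged the message, which is the "get called back when all receivers processed it" behaviour the text is after.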
+ +You can perhaps imagine that this might not be a single-incident situation and that "pause" is just a single type of +operation that each core would need to do. Other scenarios could be to open the device or close it, etc. To +make this a bit easier to deal with, these common patterns are abstracted in so-called io channels. These channels can +be compared to go channels, where you can send messages on them and get called back when all receivers have processed the +message. + +## IO channels + +When you open a file in a programming language of your choice, apart from semantics, for the most part, it will look +roughly as: + +```C +#include <fcntl.h> + +int main(void) { + int my_file; + if ((my_file = open("path/to/file", O_RDONLY)) < 0) { /* error */ } else { /* use file */ } + return 0; +} +``` + +The variable my_file is called a file descriptor, and within mayastor/spdk, this is no different. When you want to +open a block device, you get back a descriptor. However, unlike the "normal" situation, within mayastor, the descriptor +cannot be used to do IO directly. Instead, given a descriptor, you must get an "io channel" to the device the descriptor is referencing. + +```rust +// normal +read(desc, &buf, sizeof(buf)); + +// mayastor +let channel = desc.get_channel(); +read(desc, channel, &buf, sizeof(buf)); +``` + +This is because we need a way to get exclusive access to a device within mayastor. Normally we have the operating +system to handle this for us, but we need to handle this ourselves in userspace. To achieve the parallelism, we then use +a per-core IO channel that we create for that descriptor. Additionally, these channels can be used to "execute something +on each mthread" when we need to change the state of the device/descriptor, like, for example, closing it. + +This is done by deep DPDK internals that are not really relevant, but it boils down to the fact that each block device +has a list of channels, each of which must have an `Mthread` associated with it. 
(by the design of the whole thing). Using this +information, we can call a function on each thread that has an io channel to our device and have it (for example) close +the channel. + +```mermaid +block-beta + columns 3 + Reactor_1("Reactor") + Reactor_2("Reactor") + Reactor_3("Reactor") + MThread_1("MThread") + MThread_2("MThread") + MThread_3("MThread") + Channel_3["Channel"] + Channel_2["Channel"] + Channel_1["Channel"] + space space space + space QPairs[/"QPairs"\] space + space NVMeDev["NVMeDev"] space + space space space + ChannelFE["Channel For Each"]:3 + QPairs --> Channel_1 + QPairs --> Channel_2 + QPairs --> Channel_3 + ChannelFE --> Channel_3 + Channel_1 --> Channel_2 + Channel_2 --> Channel_3 + Channel_1 --> ChannelFE +``` + +The flow is depicted within the above figure. We call channel_for_each and return when the function has been executed on +each of the cores that have a (reference) to channel the device we wish to operate on. Another use case for this is, for +example, when we do a rebuild operation. We want to tell each core to LOCK a certain range of the device to avoid +writing to it while we are rebuilding it. + +## Passing block devices to mayastor + +Mayastor has support for several different ways to access or emulate block devices. This can come in handy for several +reasons, but for +**production use cases, we only support devices accessed through [io_uring][io-uring] and `PCI`e devices**. +Originally we planned that you could use all your devices of your choice in any way you want. This creates too much +confusion and a too-wide test matrix. Using this approach, however, we can serve all cases we need except for the direct +remote iSCSI or nvmf targets. The block devices passed to mayastor are used to store replicas. + +To access the `PCI` devices from userspace, more setup is required, and we typically don't talk about that too much as +[io_uring][io-uring], for the most part, will be fast enough. 
Once you are dealing with Optane devices that can do a +million IOPS per device, using userspace `PCI` IO becomes more appealing. + +Making use of `PCI` devices in user space is certainly not new. In fact, it has been used within the embedded Linux +space for many years, and it's also a foundation for things like `PCI passthrough` in the virtualization space. + +Devices in mayastor are abstracted using URIs, so to use `/dev/path/to/disk` we can write: +`uring:///dev/path/to/disk`. + +## Userspace IO + +Userspace I/O is the first way to achieve this model. The kernel module driver attached to the device is unloaded, and +then the UIO driver is attached to the device. Put differently, the NVMe driver, which is +loaded by default, is replaced by the UIO driver. + +```mermaid +block-beta + columns 3 + mayastor:2 user>"user space"] + sysfs /dev/uio interface>"interface"] + UIO["UIO Driver"]:2 kernel>"kernel space"] +``` + +## VF-IO + +A similar interface for doing userspace IO is [VF-IO][VFIO]. The only difference is that, like with memory, there is an +MMU ([IOMMU]) that ensures that there is some protection, so that a VM (for example) does not accidentally write into +the same `PCI` device and create havoc. + +Once the machine is configured to use either UIO or VFIO, the `PCI` address of the NVMe device can be used to create a +"pool" using, for example, `pci:///000:0067.00`. + +
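
The device URIs mentioned above (`uring:///…`, `pci:///…`) can be told apart by their scheme. The following is a minimal sketch of that idea using only the standard library; `DeviceUri` and `parse_device_uri` are hypothetical names, and the real parsing in mayastor is likely to be richer (query parameters, validation, more schemes).

```rust
/// Hypothetical classification of a device URI by its scheme.
#[derive(Debug, PartialEq)]
pub enum DeviceUri<'a> {
    /// Device accessed through io_uring; holds the path part of the URI.
    Uring(&'a str),
    /// Userspace PCI device; holds the raw part after `pci://`.
    Pci(&'a str),
    /// Any other scheme, kept as-is for the caller to handle.
    Other { scheme: &'a str, rest: &'a str },
}

/// Splits `scheme://rest`; returns `None` when no scheme is present.
pub fn parse_device_uri(uri: &str) -> Option<DeviceUri<'_>> {
    let (scheme, rest) = uri.split_once("://")?;
    Some(match scheme {
        "uring" => DeviceUri::Uring(rest),
        "pci" => DeviceUri::Pci(rest),
        _ => DeviceUri::Other { scheme, rest },
    })
}
```

For `uring:///dev/path/to/disk` this returns `DeviceUri::Uring("/dev/path/to/disk")`, while a bare path such as `/dev/sda` has no scheme and yields `None`.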
+ +## Acknowledgments + +This document was originally written by Jeffry and has now been converted to GitHub markdown. + +[SPDK]: https://spdk.io/ + +[TLB]: https://wiki.osdev.org/TLB + +[DMA]: https://en.wikipedia.org/wiki/Direct_memory_access + +[SGL]: https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing) + +[io-uring]: https://man7.org/linux/man-pages/man7/io_uring.7.html + +[VFIO]: https://docs.kernel.org/driver-api/vfio.html + +[IOMMU]: https://en.wikipedia.org/wiki/Input%E2%80%93output_memory_management_unit diff --git a/doc/design/public-api.md b/doc/design/public-api.md new file mode 100644 index 000000000..39d94b4ad --- /dev/null +++ b/doc/design/public-api.md @@ -0,0 +1,30 @@ +# Mayastor Public API + +Mayastor exposes a public API from its [REST] service. +This is a [RESTful][REST] API which can be leveraged by consumers external to mayastor (e.g. users or 3rd party tools) as well as by +mayastor components which are part of the control-plane. + +## OpenAPI + +The mayastor public API is defined using the [OpenAPI] specification, which has many benefits: + +1. Standardized: OpenAPI allows us to define an API in a standard way that is widely used in the industry. + +2. Integration: As a standard, it's easy to integrate with other systems, tools, and platforms (anyone can write a + plugin for it!). + +3. Automation: Auto-generate the server and client libraries, reducing manual effort and the potential for errors. + +4. Documentation: Each method and type is documented, which makes the API easier to understand. + +5. Tooling: There's an abundance of tools and libraries which support the OpenAPI spec, making it easier to develop, + test, and deploy. 
+The spec is +available [here](https://raw.githubusercontent.com/openebs/mayastor-control-plane/HEAD/control-plane/rest/openapi-specs/v0_api_spec.yaml), +and you can interact with it using one of the many ready-made +tools [here](https://editor.swagger.io/?url=https://raw.githubusercontent.com/openebs/mayastor-control-plane/HEAD/control-plane/rest/openapi-specs/v0_api_spec.yaml). + +[OpenAPI]: https://www.openapis.org/what-is-openapi + +[REST]: https://en.wikipedia.org/wiki/REST diff --git a/doc/design/rest-authentication.md b/doc/design/rest-authentication.md new file mode 100644 index 000000000..485e7a494 --- /dev/null +++ b/doc/design/rest-authentication.md @@ -0,0 +1,115 @@ +# REST Authentication + +## References + +- https://auth0.com/blog/build-an-api-in-rust-with-jwt-authentication-using-actix-web/ +- https://jwt.io/ +- https://russelldavies.github.io/jwk-creator/ +- https://blog.logrocket.com/how-to-secure-a-rest-api-using-jwt-7efd83e71432/ +- https://blog.logrocket.com/jwt-authentication-in-rust/ + +## Overview + +The [REST API][REST] provides a means of controlling Mayastor. It allows the consumer of the API to perform operations +such as creation and deletion of pools, replicas, nexus and volumes. + +It is important to secure the [REST] API to prevent access by unauthorised personnel. This is achieved through the use +of +[JSON Web Tokens (JWT)][JWT] which are sent with every [REST] request. + +Upon receipt of a request the [REST] server extracts the [JWT] and verifies its authenticity. If authentic, the request +is +allowed to proceed; otherwise the request is rejected with an [HTTP] `401` Unauthorized error. + +## JSON Web Token (JWT) + +Definition taken from here: + +> JSON Web Token ([JWT]) is an open standard ([RFC 7519][JWT]) that defines a compact and self-contained way for +> securely transmitting information between parties as a JSON object. \ +> This information can be verified and trusted because it is digitally signed. 
\ +> [JWT]s can be signed using a secret (with the [HMAC] algorithm) or a public/private key pair using [RSA] or +> [ECDSA]. + +The [REST] server expects the [JWT] to be signed with a private key and for the public key to be accessible as +a [JSON Web Key (JWK)][JWK]. + +The JWK is used to authenticate the [JWT] by checking that it was indeed signed by the corresponding private key. + +The [JWT] comprises three parts, each separated by a full stop: + +`header.payload.signature` + +Each of the above parts is a [Base64-URL] encoded string. + +## JSON Web Key (JWK) + +Definition taken from here: + +> A [JSON] Web Key ([JWK]) is a JavaScript Object Notation ([JSON - RFC 7159][JSON]) data structure that represents a +> cryptographic key. + +An example of the [JWK] structure is shown below: + +```json +{ + "kty": "RSA", + "n": "tTtUE2YgN2te7Hd29BZxeGjmagg0Ch9zvDIlHRjl7Y6Y9Gankign24dOXFC0t_3XzylySG0w56YkAgZPbu-7NRUbjE8ev5gFEBVfHgXmPvFKwPSkCtZG94Kx-lK_BZ4oOieLSoqSSsCdm6Mr5q57odkWghnXXohmRgKVgrg2OS1fUcw5l2AYljierf2vsFDGU6DU1PqeKiDrflsu8CFxDBAkVdUJCZH5BJcUMhjK41FCyYImtEb13eXRIr46rwxOGjwj6Szthd-sZIDDP_VVBJ3bGNk80buaWYQnojtllseNBg9pGCTBtYHB-kd-NNm2rwPWQLjmcY1ym9LtJmrQCXvA4EUgsG7qBNj1dl2NHcG03eEoJBejQ5xwTNgQZ6311lXuKByP5gkiLctCtwn1wGTJpjbLKo8xReNdKgFqrIOT1mC76oZpT3AsWlVH60H4aVTthuYEBCJgBQh5Bh6y44ANGcybj-q7sOOtuWi96sXNOCLczEbqKYpeuckYp1LP", + "e": "AQAB", + "alg": "RS256", + "use": "sig" +} +``` + +The meanings of these keys (as defined in [RFC 7517][JWK]) are: + +| Key Name | Meaning | Purpose | +|:---------|:------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| +| kty | Key Type | Denotes the cryptographic algorithm family used | +| n | Modulus | The modulus used by the public key | +| e | Exponent | The exponent used by the public key | +| alg | The algorithm used | This corresponds to the algorithm used to sign/encrypt the [JWT] | +| use | Public Key Use | Can take one of two values sig or enc. sig indicates the public key should be used only for signature verification, whereas enc denotes that it is used for encrypting the data | + +
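
The three-part token layout described above can be sketched with the standard library alone. This is only an illustration of the string structure and performs no Base64-URL decoding or signature verification; `JwtParts`, `split_jwt` and `signed_message` are hypothetical names and are not taken from the REST server or the jsonwebtoken crate.

```rust
/// The three dot-separated, Base64-URL encoded parts of a signed JWT.
#[derive(Debug, PartialEq)]
struct JwtParts<'a> {
    header: &'a str,
    payload: &'a str,
    signature: &'a str,
}

/// Splits a token into exactly three non-empty parts, or returns `None`.
/// Assumption: this sketch rejects empty parts, including an empty signature.
fn split_jwt(token: &str) -> Option<JwtParts<'_>> {
    let mut parts = token.split('.');
    let header = parts.next()?;
    let payload = parts.next()?;
    let signature = parts.next()?;
    // A fourth part or any empty part means the token is malformed here.
    if parts.next().is_some() || [header, payload, signature].iter().any(|p| p.is_empty()) {
        return None;
    }
    Some(JwtParts { header, payload, signature })
}

/// Returns everything before the last dot, i.e. the encoded header and
/// payload that the signature is computed over.
fn signed_message(token: &str) -> Option<&str> {
    token.rsplit_once('.').map(|(message, _signature)| message)
}
```

For a token of the form `aaa.bbb.ccc`, `split_jwt` returns the three parts and `signed_message` returns `aaa.bbb`; the placeholder strings stand in for real Base64-URL data.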
+ +## REST Server Authentication + +### Prerequisites + +1. The [JWT] is included in the [HTTP] Authorization Request Header +2. The [JWK], used for signature verification, is accessible + +### Process + +The [REST] server makes use of the [jsonwebtoken] crate to perform [JWT] authentication. + +Upon receipt of a [REST] request the [JWT] is extracted from the header and split into two parts: + +1. message (comprising the header and payload) +2. signature + +This is passed to the jsonwebtoken crate along with the decoding key and algorithm extracted from the [JWK]. + +If authentication succeeds the [REST] request is permitted to continue. If authentication fails, the [REST] request is +rejected with an [HTTP] `401` Unauthorized error. + +[REST]: https://en.wikipedia.org/wiki/REST + +[JWT]: https://datatracker.ietf.org/doc/html/rfc7519 + +[JWK]: https://datatracker.ietf.org/doc/html/rfc7517 + +[HTTP]: https://developer.mozilla.org/en-US/docs/Web/HTTP + +[Base64-URL]: https://base64.guru/standards/base64url + +[HMAC]: https://datatracker.ietf.org/doc/html/rfc2104 + +[RSA]: https://en.wikipedia.org/wiki/RSA_(cryptosystem) + +[ECDSA]: https://en.wikipedia.org/wiki/Elliptic_Curve_Digital_Signature_Algorithm + +[JSON]: https://datatracker.ietf.org/doc/html/rfc7159 + +[jsonwebtoken]: https://github.com/Keats/jsonwebtoken diff --git a/doc/img/4kVS2m-tlb-misses.png b/doc/img/4kVS2m-tlb-misses.png new file mode 100644 index 000000000..2b5b6f9c6 Binary files /dev/null and b/doc/img/4kVS2m-tlb-misses.png differ diff --git a/doc/img/overview.drawio.png b/doc/img/overview.drawio.png new file mode 100644 index 000000000..96c752714 Binary files /dev/null and b/doc/img/overview.drawio.png differ diff --git a/io-engine/src/bdev/malloc.rs b/io-engine/src/bdev/malloc.rs index f6e8fcfb5..7695f5ba4 100644 --- a/io-engine/src/bdev/malloc.rs +++ b/io-engine/src/bdev/malloc.rs @@ -76,17 +76,40 @@ impl TryFrom<&Url> for Malloc { 512 }; - let size: u32 = if let Some(value) = 
parameters.remove("size_mb") { - value.parse().context(bdev_api::IntParamParseFailed { + let size_mb: Option<u64> = if let Some(value) = parameters.remove("size_mb") { + Some(value.parse().context(bdev_api::IntParamParseFailed { uri: uri.to_string(), parameter: String::from("size_mb"), value: value.clone(), - })? + })?) } else { - 0 + None }; - let num_blocks: u32 = if let Some(value) = parameters.remove("num_blocks") { + let size_b: Option<u64> = if let Some(value) = parameters.remove("size") { + Some( + byte_unit::Byte::parse_str(&value, true) + .map_err(|error| BdevError::InvalidUri { + uri: uri.to_string(), + message: format!("'size' is invalid: {error}"), + })? + .as_u64(), + ) + } else { + None + }; + + let size = match (size_mb, size_b) { + (Some(_), Some(_)) => Err(BdevError::InvalidUri { + uri: uri.to_string(), + message: "Can't specify both size and size_mb".to_string(), + }), + (Some(size_mb), None) => Ok(size_mb * 1024 * 1024), + (None, Some(size)) => Ok(size), + (None, None) => Ok(0), + }?; + + let num_blocks: u64 = if let Some(value) = parameters.remove("num_blocks") { value.parse().context(bdev_api::IntParamParseFailed { uri: uri.to_string(), parameter: String::from("num_blocks"), @@ -133,8 +156,8 @@ impl TryFrom<&Url> for Malloc { num_blocks: if num_blocks != 0 { num_blocks } else { - (size << 20) / blk_size - } as u64, + size / (blk_size as u64) + }, blk_size, uuid, resizing, diff --git a/io-engine/src/bdev/null_bdev.rs b/io-engine/src/bdev/null_bdev.rs index 71e29de7f..b9b3146ac 100644 --- a/io-engine/src/bdev/null_bdev.rs +++ b/io-engine/src/bdev/null_bdev.rs @@ -63,16 +63,39 @@ impl TryFrom<&Url> for Null { }); } - let size: u64 = if let Some(value) = parameters.remove("size_mb") { - value.parse().context(bdev_api::IntParamParseFailed { + let size_mb: Option<u64> = if let Some(value) = parameters.remove("size_mb") { + Some(value.parse().context(bdev_api::IntParamParseFailed { uri: uri.to_string(), parameter: String::from("size_mb"), value: value.clone(), - })? 
+ })?) } else { - 0 + None + }; + + let size_b: Option<u64> = if let Some(value) = parameters.remove("size") { + Some( + byte_unit::Byte::parse_str(&value, true) + .map_err(|error| BdevError::InvalidUri { + uri: uri.to_string(), + message: format!("'size' is invalid: {error}"), + })? + .as_u64(), + ) + } else { + None }; + let size = match (size_mb, size_b) { + (Some(_), Some(_)) => Err(BdevError::InvalidUri { + uri: uri.to_string(), + message: "Can't specify both size and size_mb".to_string(), + }), + (Some(size_mb), None) => Ok(size_mb * 1024 * 1024), + (None, Some(size)) => Ok(size), + (None, None) => Ok(0), + }?; + let num_blocks: u64 = if let Some(value) = parameters.remove("num_blocks") { value.parse().context(bdev_api::IntParamParseFailed { uri: uri.to_string(), @@ -86,8 +109,9 @@ impl TryFrom<&Url> for Null { if size != 0 && num_blocks != 0 { return Err(BdevError::InvalidUri { uri: uri.to_string(), - message: "conflicting parameters num_blocks and size_mb are mutually exclusive" - .to_string(), + message: + "conflicting parameters num_blocks and size/size_mb are mutually exclusive" + .to_string(), }); } @@ -104,7 +128,7 @@ impl TryFrom<&Url> for Null { num_blocks: if num_blocks != 0 { num_blocks } else { - (size << 20) / (blk_size as u64) + size / (blk_size as u64) }, blk_size, uuid: uuid.or_else(|| Some(Uuid::new_v4())),
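
The size handling introduced by the hunks above can be checked with a standalone sketch that mirrors the same rules using plain `String` errors instead of `BdevError`. The function names `resolve_size` and `resolve_num_blocks` are hypothetical and do not exist in the io-engine; the sketch also assumes `blk_size` is non-zero.

```rust
/// Resolves the device size in bytes from the mutually exclusive
/// `size_mb` (MiB) and `size` (bytes) parameters; 0 means "not given".
fn resolve_size(size_mb: Option<u64>, size_b: Option<u64>) -> Result<u64, String> {
    match (size_mb, size_b) {
        (Some(_), Some(_)) => Err("Can't specify both size and size_mb".to_string()),
        (Some(mb), None) => Ok(mb * 1024 * 1024),
        (None, Some(bytes)) => Ok(bytes),
        (None, None) => Ok(0),
    }
}

/// Derives the block count: an explicit `num_blocks` wins, otherwise the
/// byte size is divided by the block size. Assumes `blk_size != 0`.
fn resolve_num_blocks(size: u64, num_blocks: u64, blk_size: u32) -> Result<u64, String> {
    if size != 0 && num_blocks != 0 {
        return Err("num_blocks and size/size_mb are mutually exclusive".to_string());
    }
    Ok(if num_blocks != 0 {
        num_blocks
    } else {
        size / (blk_size as u64)
    })
}
```

As a worked example, `size_mb=64` resolves to 64 * 1024 * 1024 = 67108864 bytes, which with a 512-byte block size gives 67108864 / 512 = 131072 blocks.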