docs: Revise README.md for Kaito inference (#581)
This PR revises the README.md for Kaito inference with more details about
APIs, usage, etc. A workload section is added to explain how we
construct the pod for inference with adapters. The troubleshooting
section is left empty for now.

This change also fixes the wrong sku name in the tuning README.md.

---------

Signed-off-by: Fei Guo <vrgf2003@gmail.com>
Co-authored-by: Ishaan Sehgal <ishaanforthewin@gmail.com>
Fei-Guo and ishaansehgal99 authored Aug 27, 2024
1 parent bf5bc95 commit ab3851a
Showing 3 changed files with 56 additions and 26 deletions.
Binary file added docs/img/kaito-inference-adapter.png
78 changes: 54 additions & 24 deletions docs/inference/README.md
@@ -1,19 +1,10 @@
# Kaito Inference Workspace API
# Kaito Inference

This guide provides instructions on how to use the Kaito Inference Workspace API for basic model serving and serving with LoRA adapters.
This document presents how to use the Kaito `workspace` Custom Resource Definition (CRD) for model serving and serving with LoRA adapters.

## Getting Started
## Usage

To use the Kaito Inference Workspace API, you need to define a Workspace custom resource (CR). Below are examples of how to define the CR and its various components.

## Example Workspace Definitions
Here are three examples of using the API to define a workspace for inferencing different models:

Example 1: Inferencing [`phi-3-mini`](../../examples/inference/kaito_workspace_phi_3.yaml)

Example 2: Inferencing [`falcon-7b`](../../examples/inference/kaito_workspace_falcon_7b.yaml) without adapters

Example 3: Inferencing `falcon-7b` with adapters
The basic usage for inference is simple. Users just need to specify the GPU SKU used for inference in the `resource` spec and one of the Kaito-supported model names in the `inference` spec of the `workspace` custom resource. For example:

```yaml
apiVersion: kaito.sh/v1alpha1
@@ -28,14 +19,30 @@ resource:
inference:
  preset:
    name: "falcon-7b"
  adapters:
    - source:
        name: "falcon-7b-adapter"
        image: "<YOUR_IMAGE>"
      strength: "0.2"
```
Multiple adapters can be added:
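For orientation, a complete minimal `workspace` for basic inference could look like the following sketch. The metadata name, instance type, and label values here are illustrative assumptions, not prescribed values.

```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b
resource:
  instanceType: "Standard_NC12s_v3"   # GPU SKU used to provision the node
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  preset:
    name: "falcon-7b"
```
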
If a user runs Kaito in an on-premises Kubernetes cluster where GPU SKUs are unavailable, the GPU nodes can be pre-configured. The user should ensure that the corresponding vendor-specific GPU plugin is installed successfully on every prepared node, i.e., the node status should report a non-zero GPU resource in the allocatable field. For example:
```
$ kubectl get node $NODE_NAME -o json | jq .status.allocatable
{
  "cpu": "XXXX",
  "ephemeral-storage": "YYYY",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "ZZZZ",
  "nvidia.com/gpu": "1",
  "pods": "100"
}
```

Next, the user needs to add the node names to the `preferredNodes` field in the `resource` spec. As a result, the Kaito controller will skip the steps for GPU node provisioning and use the prepared nodes to run the inference workload, as illustrated in the sketch after the note below.
> [!IMPORTANT]
> The node objects of the preferred nodes need to contain the same matching labels as specified in the `resource` spec. Otherwise, the Kaito controller would not recognize them.
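
As a sketch of this setup, a `resource` spec using prepared nodes might look like the following; the node names and label values are assumptions for illustration, and the labels must also be present on the node objects themselves.

```yaml
resource:
  labelSelector:
    matchLabels:
      apps: falcon-7b        # must also be set on the prepared node objects
  preferredNodes:
    - gpu-node-0
    - gpu-node-1
```
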
### Inference with LoRA adapters

Kaito also supports running the inference workload with LoRA adapters produced by [model fine-tuning jobs](../tuning/README.md). Users can specify one or more adapters in the `adapters` field of the `inference` spec. For example,

```yaml
apiVersion: kaito.sh/v1alpha1
@@ -55,10 +62,33 @@ inference:
name: "falcon-7b-adapter"
image: "<YOUR_IMAGE>"
strength: "0.2"
- source:
name: "additional-source"
image: "<YOUR_ADDITIONAL_IMAGE>"
strength: "0.5"
```
Currently, only images are supported as adapter sources. The `strength` field specifies the multiplier applied to the adapter weights relative to the raw model weights.

**Note:** When building a container image for an existing adapter, ensure all adapter files are copied to the **/data** directory inside the container.
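
As an illustration of that note, one way to package adapter files into an image is sketched below; the local directory, base image, and registry/tag are all placeholder assumptions.

```
# Hypothetical layout: ./falcon-7b-adapter/ holds the adapter files from a fine-tuning job
cat > Dockerfile <<'EOF'
FROM busybox
COPY ./falcon-7b-adapter/ /data/
EOF
docker build -t <YOUR_REGISTRY>/falcon-7b-adapter:v1 .
docker push <YOUR_REGISTRY>/falcon-7b-adapter:v1
```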

For detailed `InferenceSpec` API definitions, refer to the [documentation](https://github.com/Azure/kaito/blob/2ccc93daf9d5385649f3f219ff131ee7c9c47f3e/api/v1alpha1/workspace_types.go#L75).


# Inference workload

Depending on whether the specified model supports distributed inference, the Kaito controller uses either a Kubernetes **apps.deployment** workload (by default) or a Kubernetes **apps.statefulset** workload (if the model supports distributed inference) to manage the inference service, which is exposed through a ClusterIP Kubernetes `service`.
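
To see what was created for a given workspace, the generated objects can be inspected with standard kubectl commands; the sketch below assumes the workload and service inherit the workspace name `workspace-falcon-7b`.

```
kubectl get deployment,statefulset -A | grep workspace-falcon-7b
kubectl get svc workspace-falcon-7b
```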

When adapters are specified in the `inference` spec, the Kaito controller adds an init container for each adapter in addition to the main container. The pod structure is shown in Figure 1.

<div align="left">
<img src="../img/kaito-inference-adapter.png" width=40% title="Kaito inference adapter" alt="Kaito inference adapter">
</div>

If an image is specified as the adapter source, the corresponding init container uses that image as its container image. These init containers ensure all adapter data is available locally before the inference service starts. The main container uses a supported model image, launching the [inference_api.py](https://github.com/Azure/kaito/presets/inference/text-generation/inference_api.py) script.

All containers share local volumes by mounting the same `EmptyDir` volumes, avoiding file copies between containers.
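
Conceptually, the pod generated for a single adapter resembles the following sketch. This is not the exact spec the controller produces; the container names, image references, and mount path are assumptions used only to show the init container and shared `emptyDir` arrangement.

```yaml
spec:
  initContainers:
    - name: falcon-7b-adapter         # one init container per adapter
      image: <YOUR_IMAGE>             # the adapter image from the workspace spec
      volumeMounts:
        - name: adapter-volume
          mountPath: /mnt/adapter     # illustrative path; adapter data is staged here
  containers:
    - name: falcon-7b                 # main container running the preset model image
      image: <PRESET_MODEL_IMAGE>
      volumeMounts:
        - name: adapter-volume
          mountPath: /mnt/adapter
  volumes:
    - name: adapter-volume
      emptyDir: {}                    # shared scratch volume, no cross-container copies
```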

## Workload update

To update the `adapters` field in the `inference` spec, users can modify the `workspace` custom resource. The Kaito controller will apply the changes, triggering a workload deployment update. This will recreate the inference service pod, resulting in a brief service downtime. Once the new adapters are merged with the raw model weights and loaded into GPU memory, the service will resume.
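
In practice, the update is a plain edit of the custom resource, for example with one of the following commands; the workspace name and manifest file are assumptions.

```
kubectl edit workspace workspace-falcon-7b
# or re-apply an updated manifest containing the new adapters list
kubectl apply -f workspace-falcon-7b.yaml
```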


# Troubleshooting

Currently, only images are supported as adapter sources, with a default strength of "1.0".
TBD
4 changes: 2 additions & 2 deletions docs/tuning/README.md
@@ -17,7 +17,7 @@ kind: Workspace
metadata:
  name: workspace-tuning-falcon
resource:
  instanceType: "Standard_NC6s_v3"
  instanceType: "Standard_NC24ads_A100_v4"
  labelSelector:
    matchLabels:
      app: tuning-falcon
@@ -35,7 +35,7 @@ tuning:

```

The detailed `TuningSpec`API definitions can be found [here](https://github.com/Azure/kaito/blob/2ccc93daf9d5385649f3f219ff131ee7c9c47f3e/api/v1alpha1/workspace_types.go#L145).
The detailed `TuningSpec` API definitions can be found [here](https://github.com/Azure/kaito/blob/2ccc93daf9d5385649f3f219ff131ee7c9c47f3e/api/v1alpha1/workspace_types.go#L145).

### Tuning configurations
Kaito provides default tuning configurations for different tuning methods. They are managed by Kubernetes configmaps.
