[docs] Polish feature examples and user guidance (#686)
* Fix the list display issue

* Update the quick start tutorials and model manifest

1. use create instead of apply
2. use ClusterIP for model deployment
3. update the tutorial to expose the external IP in an elegant way

* Polish the gateway plugin docs

* Update the lora samples and guidance

* Update distributed kv cache example

* Update multi-node inference examples

Signed-off-by: Varun Gupta <varungup90@gmail.com>
Jeffwan authored and varungup90 committed Feb 20, 2025
1 parent 9aacdd6 commit 63e8fe8
Showing 20 changed files with 462 additions and 140 deletions.
Binary file removed docs/source/assets/images/httproute.png
Binary file removed docs/source/assets/images/lora-describe-status.png
58 changes: 23 additions & 35 deletions docs/source/features/distributed-kv-cache.rst
@@ -24,44 +24,20 @@ Example
.. note::
We use a customized version of `vineyard <https://v6d.io/>`_ as the backend for distributed KV cache and an internal version of vLLM integrated with distributed KV cache support to showcase the usage. We are working with the vLLM community to upstream the distributed KV cache API and plugin.

After launching AIBrix's AI Runtime, we can use the following yaml to deploy a distributed KV cache cluster:

.. code-block:: yaml
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: KVCache
metadata:
name: aibrix-model-deepseek-coder-33b-kvcache
namespace: default
annotations:
kvcache.orchestration.aibrix.ai/node-affinity-gpu-type: NVIDIA-L20
kvcache.orchestration.aibrix.ai/pod-affinity-workload: aibrix-model-deepseek-coder-33b-instruct
spec:
replicas: 1
service:
type: ClusterIP
port: 9600
cacheSpec:
image: aibrix/vineyardd:20241120
imagePullPolicy: IfNotPresent
.. note::
``kvcache.orchestration.aibrix.ai/pod-affinity-workload`` MUST match the ``metadata.name`` of the inference service deployment below

After deployment, we can see all the components by using the ``kubectl get pods -n aibrix-system`` command:

.. code-block:: RST
NAME                                                     READY   STATUS    RESTARTS   AGE
aibrix-model-deepseek-coder-7b-kvcache-596965997-p86cx   1/1     Running   0          2m
aibrix-model-deepseek-coder-7b-kvcache-etcd-0            1/1     Running   0          2m
After all components are running, we can use the following yaml to deploy the inference service:

.. literalinclude:: ../../../samples/kvcache/deployment.yaml
:language: yaml


.. note::
* ``metadata.name`` MUST match ``kvcache.orchestration.aibrix.ai/pod-affinity-workload`` in the KV cache deployment
* We need to include the Unix domain socket used by the distributed KV cache as a volume in the inference service pod (i.e., ``kvcache-socket`` in the example above)
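
Assuming the sample manifest above is saved locally (the path below mirrors the repository layout), it can be created directly; this is a sketch, so adjust the path as needed:

.. code-block:: bash
# run from the repository root; adjust the path if your copy lives elsewhere
kubectl create -f samples/kvcache/deployment.yaml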
@@ -73,13 +49,25 @@ Now let's use ``kubectl get pods`` command to ensure the inference service is ru

.. code-block:: RST
NAME                                                       READY   STATUS    RESTARTS   AGE
download-model                                             1/1     Running   0          12m
aibrix-model-deepseek-coder-7b-instruct-6b885ffd8b-2kfjv   2/2     Running   0          4m
After launching AIBrix's deployment, we can use the following yaml to deploy a distributed KV cache cluster:

.. literalinclude:: ../../../samples/kvcache/kvcache.yaml
:language: yaml
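
The KV cache cluster can be created the same way; a sketch assuming the repository layout:

.. code-block:: bash
# run from the repository root; adjust the path if your copy lives elsewhere
kubectl create -f samples/kvcache/kvcache.yaml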

.. note::

1. ``kvcache.orchestration.aibrix.ai/pod-affinity-workload`` MUST match the ``metadata.name`` of the inference service deployment below.
2. ``kvcache.orchestration.aibrix.ai/node-affinity-gpu-type`` is unnecessary unless you deploy the model across different GPU types.

Once the inference service is running, let's set up port forwarding so that we can test the service locally:

* Run ``kubectl get svc -n envoy-gateway-system`` to get the name of the Envoy Gateway service.

.. code-block:: RST
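
With the service name from the command above, a port-forward is one way to reach the gateway locally. The sketch below assumes the service name shown in the gateway plugin docs and an arbitrary local port:

.. code-block:: bash
# substitute the service name returned by the command above; 8888 is arbitrary
kubectl -n envoy-gateway-system port-forward svc/envoy-aibrix-system-aibrix-eg-903790dc 8888:80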
@@ -101,7 +89,7 @@ Now, let's test the service:
-H "Content-Type: application/json" \
-H "Authorization: XXXXXXXXXXXXXXXXXXXXXXXX" \
-d '{
"model": "deepseek-coder-33b-instruct",
"model": "deepseek-coder-7b-instruct",
"messages": [{"role": "user", "content": "Created container vllm-openai"}],
"temperature": 0.7
}'
@@ -133,7 +121,7 @@ and its output would be:
"id": "chat-60f0247aa9294f8abb61e8f24c1503c2",
"object": "chat.completion",
"created": 1738281009,
"model": "deepseek-coder-33b-instruct",
"model": "deepseek-coder-7b-instruct",
"choices": [
{
"index": 0,
@@ -160,4 +148,4 @@ Distributed KV cache metrics can be viewed in the AIBrix Engine Dashboard. The fo
.. figure:: ../assets/images/aibrix-dist-kv-cache-dashboard.png
:alt: distributed-kv-cache-dashboard
:width: 100%
:align: center
117 changes: 104 additions & 13 deletions docs/source/features/gateway-plugins.rst
@@ -18,32 +18,123 @@ First, get the external IP and port of the Envoy proxy to access the gateway.

.. code-block:: bash
kubectl -n envoy-gateway-system get service
NAME                                     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                                   AGE
envoy-aibrix-system-aibrix-eg-903790dc   LoadBalancer   10.96.239.246   101.18.0.4    80:32079/TCP                              10d
envoy-gateway                            ClusterIP      10.96.166.226   <none>        18000/TCP,18001/TCP,18002/TCP,19001/TCP   10d
When a model or LoRA adapter is deployed, its controller creates an HTTPRoute object, which the gateway dynamically discovers and uses to forward incoming user requests. Make sure to verify that the HTTPRoute status is ``Accepted``.

.. code-block:: bash
$ kubectl get httproute -A
NAMESPACE NAME HOSTNAMES AGE
aibrix-system aibrix-reserved-router 17m # reserved router
aibrix-system deepseek-r1-distill-llama-8b-router 14m # created for each model deployment
....
.. code-block:: bash
$ kubectl describe httproute deepseek-r1-distill-llama-8b-router -n aibrix-system
Name: deepseek-r1-distill-llama-8b-router
Namespace: aibrix-system
Labels: <none>
Annotations: <none>
API Version: gateway.networking.k8s.io/v1
Kind: HTTPRoute
Metadata:
Creation Timestamp: 2025-02-16T17:56:03Z
Generation: 1
Resource Version: 2641
UID: 2f3f9620-bf7c-487a-967e-2436c3809178
Spec:
Parent Refs:
Group: gateway.networking.k8s.io
Kind: Gateway
Name: aibrix-eg
Namespace: aibrix-system
Rules:
Backend Refs:
Group:
Kind: Service
Name: deepseek-r1-distill-llama-8b
Namespace: default
Port: 8000
Weight: 1
Matches:
Headers:
Name: model
Type: Exact
Value: deepseek-r1-distill-llama-8b
Path:
Type: PathPrefix
Value: /
Timeouts:
Request: 120s
Status:
Parents:
Conditions:
Last Transition Time: 2025-02-16T17:56:03Z
Message: Route is accepted
Observed Generation: 1
Reason: Accepted
Status: True
Type: Accepted
Last Transition Time: 2025-02-16T17:56:03Z
Message: Resolved all the Object references for the Route
Observed Generation: 1
Reason: ResolvedRefs
Status: True
Type: ResolvedRefs
Controller Name: gateway.envoyproxy.io/gatewayclass-controller
Parent Ref:
Group: gateway.networking.k8s.io
Kind: Gateway
Name: aibrix-eg
Namespace: aibrix-system
Events: <none>
In most Kubernetes setups, ``LoadBalancer`` is supported by default. You can retrieve the external IP using the following command:

.. code-block:: bash
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"
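
If your cluster does not assign external IPs to ``LoadBalancer`` services (common in local clusters), a port-forward can stand in; a sketch under that assumption:

.. code-block:: bash
# fallback when no external IP is provisioned; 8888 is an arbitrary local port
kubectl -n envoy-gateway-system port-forward svc/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &
ENDPOINT="localhost:8888"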
Sample request: get the external ``ip:port`` from the first step and the model name from the deployment label ``model.aibrix.ai/name``.
The model name, such as ``deepseek-r1-distill-llama-8b``, must match the label ``model.aibrix.ai/name`` in your deployment.

.. code-block:: bash
curl -v http://${ENDPOINT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
.. attention::

AIBrix exposes the public endpoint to the internet. Please enable authentication to secure your endpoint.
If you use vLLM, you can pass the ``--api-key`` argument or the ``VLLM_API_KEY`` environment variable to make the server check for an API key in the request header.
Check `vLLM OpenAI-Compatible Server <https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server>`_ for more details.
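
For example, a minimal sketch of launching the vLLM server with API key checking enabled (the model path and key below are placeholders):

.. code-block:: bash
# placeholders; replace the model and key with your own values
python3 -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --api-key your_key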

After you enable authentication, you can query the model with ``-H "Authorization: Bearer your_key"`` like this:

.. code-block:: bash
:emphasize-lines: 3
curl -v http://${ENDPOINT}/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer any_key" \
-d '{
"model": "your-model-name",
"model": "deepseek-r1-distill-llama-8b",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
Routing Strategies
------------------

@@ -56,16 +147,16 @@ Below are the routing strategies the gateway supports

.. code-block:: bash
curl -v http://${ENDPOINT}/v1/chat/completions \
-H "routing-strategy: least-request" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer any_key" \
-d '{
"model": "your-model-name",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
Rate Limiting
-------------

@@ -76,7 +167,7 @@ To set up rate limiting, add the user header in the request, like this:

.. code-block:: bash
curl -v http://<ip>:<port>/v1/chat/completions \
curl -v http://${ENDPOINT}/v1/chat/completions \
-H "user: your-user-id" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer any_key" \
84 changes: 78 additions & 6 deletions docs/source/features/lora-dynamic-loading.rst
@@ -59,20 +59,74 @@ Create base model
.. literalinclude:: ../../../samples/adapter/base.yaml
:language: yaml
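
To create the base model from the sample manifest (a sketch assuming the repository layout):

.. code-block:: bash
kubectl create -f samples/adapter/base.yaml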

.. code-block:: bash
# Expose endpoint
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"
# send request to base model
curl -v http://${ENDPOINT}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-coder-1-5b-instruct",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}'
Create LoRA model adapter
^^^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: ../../../samples/adapter/adapter.yaml
:language: yaml
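
As with the base model, create the adapter from the sample manifest (path assumed from the repository layout):

.. code-block:: bash
kubectl create -f samples/adapter/adapter.yaml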

If you run ``kubectl describe modeladapter qwen-code-lora``, you will see the status of the LoRA adapter.

.. code-block:: bash
$ kubectl describe modeladapter qwen-code-lora
.....
Status:
Conditions:
Last Transition Time: 2025-02-16T19:14:50Z
Message: Starting reconciliation
Reason: ModelAdapterPending
Status: Unknown
Type: Initialized
Last Transition Time: 2025-02-16T19:14:50Z
Message: ModelAdapter default/qwen-code-lora has been allocated to pod default/qwen-coder-1-5b-instruct-5587f4c57d-kml6s
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-02-16T19:14:55Z
Message: ModelAdapter default/qwen-code-lora is ready
Reason: ModelAdapterAvailable
Status: True
Type: Ready
Instances:
qwen-coder-1-5b-instruct-5587f4c57d-kml6s
Phase: Running
Events: <none>
Send a request to the gateway using the LoRA model name.

.. code-block:: bash
# Expose endpoint
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"
# send request to the LoRA model
curl -v http://${ENDPOINT}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-code-lora",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}'
Here are the resources created and associated with the LoRA custom resource.

@@ -84,7 +138,7 @@ Here's the resources created associated with the lora custom resource.

1. A new Kubernetes service will be created with the exact same name as the ModelAdapter.

2. The ``podSelector`` is used to filter the matching pods. In this case, it will match pods with the label ``model.aibrix.ai/name=qwen-coder-1-5b-instruct``. Make sure your base model has this label (see the verification sketch after this list).
This ensures that the LoRA adapter is correctly associated with the right pods.
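
Both resources can be verified with standard ``kubectl`` commands; a sketch assuming the names used in this example:

.. code-block:: bash
# the service shares the ModelAdapter's name
kubectl get svc qwen-code-lora
# pods matched by the podSelector label
kubectl get pods -l model.aibrix.ai/name=qwen-coder-1-5b-instruct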

.. attention::
@@ -121,3 +175,21 @@ In that case, lora model adapter can not query the vLLM server correctly, showin

.. literalinclude:: ../../../samples/adapter/adapter-api-key.yaml
:language: yaml

Runtime Support Sidecar
^^^^^^^^^^^^^^^^^^^^^^^

Starting from v0.2.0, the controller manager by default talks to the runtime sidecar to register the LoRA first, and the runtime sidecar then syncs with the inference engine to finish the registration.
This builds an abstraction between the controller manager and the inference engine. If you would like to sync directly with vLLM to load the LoRAs, update the controller manager with ``kubectl edit deployment aibrix-controller-manager -n aibrix-system``
and remove the ``--enable-runtime-sidecar`` flag:

.. code-block:: yaml
:emphasize-lines: 7
spec:
containers:
- args:
- --leader-elect
- --health-probe-bind-address=:8081
- --metrics-bind-address=0
- --enable-runtime-sidecar # this line should be removed
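
If you prefer a non-interactive change, a JSON patch can remove the flag as well; this is a sketch, and the args index (3) assumes exactly the order shown above:

.. code-block:: bash
# removes --enable-runtime-sidecar (index 3 of the args array above)
kubectl -n aibrix-system patch deployment aibrix-controller-manager --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/args/3"}]'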