[docs] Polish feature examples and user guidance (#686)
* Fix the list display issue

* Update the quick start tutorials and model manifest

1. use create instead of apply
2. use ClusterIP for model deployment
3. update the tutorial to expose the external IP in an elegant way

* Polish the gateway plugin docs

* Update the lora samples and guidance

* Update distributed kv cache example

* Update multi-node inference examples

Signed-off-by: Varun Gupta <varungup90@gmail.com>
Jeffwan authored and varungup90 committed Feb 20, 2025
1 parent 9aacdd6 commit 63e8fe8
Showing 20 changed files with 462 additions and 140 deletions.
Binary file removed docs/source/assets/images/httproute.png
Binary file removed docs/source/assets/images/lora-describe-status.png
58 changes: 23 additions & 35 deletions docs/source/features/distributed-kv-cache.rst
@@ -24,44 +24,20 @@ Example
.. note::
We use a customized version of `vineyard <https://v6d.io/>`_ as the backend for distributed KV cache and an internal version of vLLM integrated with distributed KV cache support to showcase the usage. We are working with the vLLM community to upstream the distributed KV cache API and plugin.

After launching AIBrix's AI Runtime, we can use the following yaml to deploy a distributed KV cache cluster:

.. code-block:: yaml
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: KVCache
metadata:
name: aibrix-model-deepseek-coder-33b-kvcache
namespace: default
annotations:
kvcache.orchestration.aibrix.ai/node-affinity-gpu-type: NVIDIA-L20
kvcache.orchestration.aibrix.ai/pod-affinity-workload: aibrix-model-deepseek-coder-33b-instruct
spec:
replicas: 1
service:
type: ClusterIP
port: 9600
cacheSpec:
image: aibrix/vineyardd:20241120
imagePullPolicy: IfNotPresent
.. note::
``kvcache.orchestration.aibrix.ai/pod-affinity-workload`` MUST match the ``metadata.name`` of the inference service deployment below

After deployment, we can see all the components by using the ``kubectl get pods -n aibrix-system`` command:

.. code-block:: RST
NAME                                                     READY   STATUS    RESTARTS   AGE
aibrix-model-deepseek-coder-7b-kvcache-596965997-p86cx   1/1     Running   0          2m
aibrix-model-deepseek-coder-7b-kvcache-etcd-0            1/1     Running   0          2m
After all components are running, we can use the following yaml to deploy the inference service:

.. literalinclude:: ../../../samples/kvcache/deployment.yaml
:language: yaml


.. note::
* ``metadata.name`` MUST match ``kvcache.orchestration.aibrix.ai/pod-affinity-workload`` in the KV cache deployment
* We need to include the Unix domain socket used by the distributed KV cache as a volume in the inference service pod (i.e., ``kvcache-socket`` in the example above)
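
Assuming the sample manifest above is saved locally (the path below mirrors the repository layout), it can be created directly; this is a sketch, so adjust the path as needed:

.. code-block:: bash
# run from the repository root; adjust the path if your copy lives elsewhere
kubectl create -f samples/kvcache/deployment.yaml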
@@ -73,13 +49,25 @@ Now let's use ``kubectl get pods`` command to ensure the inference service is ru

.. code-block:: RST
NAME                                                       READY   STATUS    RESTARTS   AGE
download-model                                             1/1     Running   0          12m
aibrix-model-deepseek-coder-7b-instruct-6b885ffd8b-2kfjv   2/2     Running   0          4m
After launching AIBrix's deployment, we can use the following yaml to deploy a distributed KV cache cluster:

.. literalinclude:: ../../../samples/kvcache/kvcache.yaml
:language: yaml
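
The KV cache cluster can be created the same way; a sketch assuming the repository layout:

.. code-block:: bash
# run from the repository root; adjust the path if your copy lives elsewhere
kubectl create -f samples/kvcache/kvcache.yaml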

.. note::

1. ``kvcache.orchestration.aibrix.ai/pod-affinity-workload`` MUST match the ``metadata.name`` of the inference service deployment below.
2. ``kvcache.orchestration.aibrix.ai/node-affinity-gpu-type`` is unnecessary unless you deploy the model across different GPU types.

Once the inference service is running, let's set up port forwarding so that we can test the service locally:

* Run ``kubectl get svc -n envoy-gateway-system`` to get the name of the Envoy Gateway service.

.. code-block:: RST
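
With the service name from the command above, a port-forward is one way to reach the gateway locally. The sketch below assumes the service name shown in the gateway plugin docs and an arbitrary local port:

.. code-block:: bash
# substitute the service name returned by the command above; 8888 is arbitrary
kubectl -n envoy-gateway-system port-forward svc/envoy-aibrix-system-aibrix-eg-903790dc 8888:80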
@@ -101,7 +89,7 @@ Now, let's test the service:
-H "Content-Type: application/json" \
-H "Authorization: XXXXXXXXXXXXXXXXXXXXXXXX" \
-d '{
"model": "deepseek-coder-33b-instruct",
"model": "deepseek-coder-7b-instruct",
"messages": [{"role": "user", "content": "Created container vllm-openai"}],
"temperature": 0.7
}'
@@ -133,7 +121,7 @@ and its output would be:
"id": "chat-60f0247aa9294f8abb61e8f24c1503c2",
"object": "chat.completion",
"created": 1738281009,
"model": "deepseek-coder-33b-instruct",
"model": "deepseek-coder-7b-instruct",
"choices": [
{
"index": 0,
@@ -160,4 +148,4 @@ Distributed KV cache metrics can be viewed in the AIBrix Engine Dashboard. The fo
.. figure:: ../assets/images/aibrix-dist-kv-cache-dashboard.png
:alt: distributed-kv-cache-dashboard
:width: 100%
:align: center
117 changes: 104 additions & 13 deletions docs/source/features/gateway-plugins.rst
@@ -18,32 +18,123 @@ First, get the external IP and port of the Envoy proxy to access the gateway.

.. code-block:: bash
kubectl -n envoy-gateway-system get service
NAME                                     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                                   AGE
envoy-aibrix-system-aibrix-eg-903790dc   LoadBalancer   10.96.239.246   101.18.0.4    80:32079/TCP                              10d
envoy-gateway                            ClusterIP      10.96.166.226   <none>        18000/TCP,18001/TCP,18002/TCP,19001/TCP   10d
When a model or LoRA adapter is deployed, its controller creates an HTTPRoute object, which the gateway dynamically discovers and uses to forward incoming user requests. Make sure to verify that the HTTPRoute status is ``Accepted``.

.. code-block:: bash
$ kubectl get httproute -A
NAMESPACE NAME HOSTNAMES AGE
aibrix-system aibrix-reserved-router 17m # reserved router
aibrix-system deepseek-r1-distill-llama-8b-router 14m # created for each model deployment
....
.. code-block:: bash
$ kubectl describe httproute deepseek-r1-distill-llama-8b-router -n aibrix-system
Name: deepseek-r1-distill-llama-8b-router
Namespace: aibrix-system
Labels: <none>
Annotations: <none>
API Version: gateway.networking.k8s.io/v1
Kind: HTTPRoute
Metadata:
Creation Timestamp: 2025-02-16T17:56:03Z
Generation: 1
Resource Version: 2641
UID: 2f3f9620-bf7c-487a-967e-2436c3809178
Spec:
Parent Refs:
Group: gateway.networking.k8s.io
Kind: Gateway
Name: aibrix-eg
Namespace: aibrix-system
Rules:
Backend Refs:
Group:
Kind: Service
Name: deepseek-r1-distill-llama-8b
Namespace: default
Port: 8000
Weight: 1
Matches:
Headers:
Name: model
Type: Exact
Value: deepseek-r1-distill-llama-8b
Path:
Type: PathPrefix
Value: /
Timeouts:
Request: 120s
Status:
Parents:
Conditions:
Last Transition Time: 2025-02-16T17:56:03Z
Message: Route is accepted
Observed Generation: 1
Reason: Accepted
Status: True
Type: Accepted
Last Transition Time: 2025-02-16T17:56:03Z
Message: Resolved all the Object references for the Route
Observed Generation: 1
Reason: ResolvedRefs
Status: True
Type: ResolvedRefs
Controller Name: gateway.envoyproxy.io/gatewayclass-controller
Parent Ref:
Group: gateway.networking.k8s.io
Kind: Gateway
Name: aibrix-eg
Namespace: aibrix-system
Events: <none>
In most Kubernetes setups, ``LoadBalancer`` is supported by default. You can retrieve the external IP using the following command:

.. code-block:: bash
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"
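
If your cluster does not assign external IPs to ``LoadBalancer`` services (common in local clusters), a port-forward can stand in; a sketch under that assumption:

.. code-block:: bash
# fallback when no external IP is provisioned; 8888 is an arbitrary local port
kubectl -n envoy-gateway-system port-forward svc/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &
ENDPOINT="localhost:8888"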
Sample request: get the external ``ip:port`` from the first step and the model name from the deployment label ``model.aibrix.ai/name``.
The model name, such as ``deepseek-r1-distill-llama-8b``, must match the label ``model.aibrix.ai/name`` in your deployment.

.. code-block:: bash
curl -v http://${ENDPOINT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
.. attention::

AIBrix exposes the public endpoint to the internet. Please enable authentication to secure your endpoint.
If you use vLLM, you can pass the ``--api-key`` argument or the ``VLLM_API_KEY`` environment variable to make the server check for an API key in the request header.
Check `vLLM OpenAI-Compatible Server <https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server>`_ for more details.
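
For example, a minimal sketch of launching the vLLM server with API key checking enabled (the model path and key below are placeholders):

.. code-block:: bash
# placeholders; replace the model and key with your own values
python3 -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --api-key your_key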

After you enable authentication, you can query the model with ``-H "Authorization: Bearer your_key"`` like this:

.. code-block:: bash
:emphasize-lines: 3
curl -v http://${ENDPOINT}/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer any_key" \
-d '{
"model": "your-model-name",
"model": "deepseek-r1-distill-llama-8b",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
Routing Strategies
------------------

@@ -56,16 +147,16 @@ Below are the routing strategies the gateway supports

.. code-block:: bash
curl -v http://${ENDPOINT}/v1/chat/completions \
-H "routing-strategy: least-request" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer any_key" \
-d '{
"model": "your-model-name",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
Rate Limiting
-------------

@@ -76,7 +167,7 @@ To set up rate limiting, add the user header in the request, like this:

.. code-block:: bash
curl -v http://<ip>:<port>/v1/chat/completions \
curl -v http://${ENDPOINT}/v1/chat/completions \
-H "user: your-user-id" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer any_key" \
84 changes: 78 additions & 6 deletions docs/source/features/lora-dynamic-loading.rst
@@ -59,20 +59,74 @@ Create base model
.. literalinclude:: ../../../samples/adapter/base.yaml
:language: yaml
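
To create the base model from the sample manifest (a sketch assuming the repository layout):

.. code-block:: bash
kubectl create -f samples/adapter/base.yaml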

.. code-block:: bash
# Expose endpoint
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"
# send request to base model
curl -v http://${ENDPOINT}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-coder-1-5b-instruct",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}'
Create LoRA model adapter
^^^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: ../../../samples/adapter/adapter.yaml
:language: yaml
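
As with the base model, create the adapter from the sample manifest (path assumed from the repository layout):

.. code-block:: bash
kubectl create -f samples/adapter/adapter.yaml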

If you run ``kubectl describe modeladapter qwen-code-lora``, you will see the status of the LoRA adapter.

.. code-block:: bash
$ kubectl describe modeladapter qwen-code-lora
.....
Status:
Conditions:
Last Transition Time: 2025-02-16T19:14:50Z
Message: Starting reconciliation
Reason: ModelAdapterPending
Status: Unknown
Type: Initialized
Last Transition Time: 2025-02-16T19:14:50Z
Message: ModelAdapter default/qwen-code-lora has been allocated to pod default/qwen-coder-1-5b-instruct-5587f4c57d-kml6s
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-02-16T19:14:55Z
Message: ModelAdapter default/qwen-code-lora is ready
Reason: ModelAdapterAvailable
Status: True
Type: Ready
Instances:
qwen-coder-1-5b-instruct-5587f4c57d-kml6s
Phase: Running
Events: <none>
Send a request to the gateway using the LoRA model name.

.. code-block:: bash
# Expose endpoint
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"
# send request to the LoRA model
curl -v http://${ENDPOINT}/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-code-lora",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}'
Here are the resources created and associated with the LoRA custom resource.

@@ -84,7 +138,7 @@ Here's the resources created associated with the lora custom resource.

1. A new Kubernetes service will be created with the exact same name as the ModelAdapter.

2. The ``podSelector`` is used to filter the matching pods. In this case, it will match pods with the label ``model.aibrix.ai/name=qwen-coder-1-5b-instruct``. Make sure your base model has this label (see the verification sketch after this list).
This ensures that the LoRA adapter is correctly associated with the right pods.
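
Both resources can be verified with standard ``kubectl`` commands; a sketch assuming the names used in this example:

.. code-block:: bash
# the service shares the ModelAdapter's name
kubectl get svc qwen-code-lora
# pods matched by the podSelector label
kubectl get pods -l model.aibrix.ai/name=qwen-coder-1-5b-instruct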

.. attention::
@@ -121,3 +175,21 @@ In that case, lora model adapter can not query the vLLM server correctly, showin

.. literalinclude:: ../../../samples/adapter/adapter-api-key.yaml
:language: yaml

Runtime Support Sidecar
^^^^^^^^^^^^^^^^^^^^^^^

Starting from v0.2.0, the controller manager by default talks to the runtime sidecar to register the LoRA first, and the runtime sidecar then syncs with the inference engine to finish the registration.
This builds an abstraction between the controller manager and the inference engine. If you would like to sync directly with vLLM to load the LoRAs, update the controller manager with ``kubectl edit deployment aibrix-controller-manager -n aibrix-system``
and remove the ``--enable-runtime-sidecar`` flag:

.. code-block:: yaml
:emphasize-lines: 7
spec:
containers:
- args:
- --leader-elect
- --health-probe-bind-address=:8081
- --metrics-bind-address=0
- --enable-runtime-sidecar # this line should be removed
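
If you prefer a non-interactive change, a JSON patch can remove the flag as well; this is a sketch, and the args index (3) assumes exactly the order shown above:

.. code-block:: bash
# removes --enable-runtime-sidecar (index 3 of the args array above)
kubectl -n aibrix-system patch deployment aibrix-controller-manager --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/args/3"}]'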