[docs][lmi] standardize structure of backend user guides #1625

Merged: 1 commit, Mar 13, 2024
8 changes: 6 additions & 2 deletions serving/docs/lmi_new/user_guides/deepspeed_user_guide.md
@@ -95,7 +95,7 @@ Auto Tensor-Parallelism:

## Quick Start Configurations

You can leverage DeepSpeed with LMI using the following starter configurations:
You can leverage `deepspeed` with LMI using the following starter configurations:

### serving.properties

@@ -105,20 +105,24 @@ option.entryPoint=djl_python.deepspeed
option.tensor_parallel_degree=max
option.rolling_batch=deepspeed
option.model_id=<your model id>
# Adjust the following based on model size and instance type
option.max_rolling_batch_size=64
```

You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---servingproperties) to deploy a model with serving.properties configuration on SageMaker.

### environment variables

```
HF_MODEL_ID=<your model id>
OPTION_ENTRYPOINT=djl_python.deepspeed
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=deepspeed
# Adjust the following based on model size and instance type
OPTION_MAX_ROLLING_BATCH_SIZE=64
```

You can use the [SageMaker deployment template](../README.md#using-the-sagemaker-python-sdk-to-deploy-your-first-model-with-lmi) to deploy the model with environment variables.
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---environment-variables) to deploy a model with environment variable configuration on SageMaker.
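
For reference, here is a minimal sketch of deploying this environment-variable configuration with the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk), following the same pattern as the SageMaker tutorial in these docs. The container version (`0.26.0`), instance type, and endpoint name below are assumptions; adjust them for your model and account.

```
import sagemaker
from sagemaker import image_uris, Model

# Set up the SageMaker session and execution role
iam_role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.session.Session()
region = sagemaker_session._region_name

# Fetch the uri of the DeepSpeed LMI container (version is an assumption)
container_image_uri = image_uris.retrieve(framework="djl-deepspeed", version="0.26.0", region=region)

# Create the SageMaker Model object with the environment-variable configuration above
model = Model(
    image_uri=container_image_uri,
    role=iam_role,
    env={
        "HF_MODEL_ID": "<your model id>",
        "OPTION_ENTRYPOINT": "djl_python.deepspeed",
        "TENSOR_PARALLEL_DEGREE": "max",
        "OPTION_ROLLING_BATCH": "deepspeed",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "64",
    },
)

# Deploy the model to a SageMaker endpoint (instance type is an assumption)
endpoint_name = sagemaker.utils.name_from_base("deepspeed-lmi-endpoint")
model.deploy(instance_type="ml.g5.12xlarge", initial_instance_count=1, endpoint_name=endpoint_name)
```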

## Quantization Support

44 changes: 24 additions & 20 deletions serving/docs/lmi_new/user_guides/lmi-dist_user_guide.md
@@ -1,5 +1,9 @@
# LMI-Dist Engine User Guide

## Model Artifact Structure

LMI-Dist expects the model to be in the [standard HuggingFace format](../deployment_guide/model-artifacts.md).

## Supported Model Architectures

The model architectures that we test for lmi-dist (in CI):
@@ -31,25 +35,6 @@ Optimized models:

`lmi-dist` can also run other models that are supported by huggingface transformers but not listed above, although **without** the optimizations and without `gptq`/`awq` quantization support. `transformers` library support can be found for [CausalLM](https://github.com/huggingface/transformers/blob/v4.34.0/src/transformers/models/auto/modeling_auto.py#L381) and [Seq2Seq](https://github.com/huggingface/transformers/blob/v4.34.0/src/transformers/models/auto/modeling_auto.py#L652) models.


### Quantization

Currently, we allow customer to use `option.quantize=<quantization-type>` or `OPTION_QUANTIZE=<quantization-type>` to load a quantized model in `lmi-dist`.

We support the following `<quantization-type>`:

* awq (LMI container versions >= 0.26.0)
* [DEPRECATED] bitsandbytes (LMI container versions >= 0.24.0 and < 0.27.0)
* [DEPRECATED] bitsandbytes8 (LMI container versions >= 0.25.0 and < 0.27.0)
* gptq (LMI container versions >= 0.24.0)

When using pre-quantized models make sure to use the correct model artifacts e.g. `TheBloke/Llama-2-13B-chat-GPTQ`, `TheBloke/Llama-2-13B-chat-AWQ`.


## Model Artifact Structure

`lmi-dist` expects the model to be [standard HuggingFace format](../deployment_guide/model-artifacts.md#huggingface-transformers-pretrained-format).

## Quick Start Configurations

You can leverage `lmi-dist` with LMI using the following starter configurations:
@@ -61,19 +46,38 @@ engine=MPI
option.tensor_parallel_degree=max
option.rolling_batch=lmi-dist
option.model_id=<your model id>
# Adjust the following based on model size and instance type
option.max_rolling_batch_size=64
option.max_rolling_batch_prefill_tokens=4096
```

You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---servingproperties) to deploy a model with serving.properties configuration on SageMaker.

### environment variables

```
HF_MODEL_ID=<your model id>
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=lmi-dist
# Adjust the following based on model size and instance type
OPTION_MAX_ROLLING_BATCH_SIZE=64
OPTION_MAX_ROLLING_BATCH_PREFILL_TOKENS=4096
```

You can use the [SageMaker deployment template](../README.md#using-the-sagemaker-python-sdk-to-deploy-your-first-model-with-lmi) to deploy the model with environment variables.
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---environment-variables) to deploy a model with environment variable configuration on SageMaker.

## Quantization Support

Currently, we allow customers to use `option.quantize=<quantization-type>` or `OPTION_QUANTIZE=<quantization-type>` to load a quantized model in `lmi-dist`.

We support the following `<quantization-type>`:

* awq (LMI container versions >= 0.26.0)
* [DEPRECATED] bitsandbytes (LMI container versions >= 0.24.0 and < 0.27.0)
* [DEPRECATED] bitsandbytes8 (LMI container versions >= 0.25.0 and < 0.27.0)
* gptq (LMI container versions >= 0.24.0)

When using pre-quantized models, make sure to use the correct model artifacts, e.g. `TheBloke/Llama-2-13B-chat-GPTQ` or `TheBloke/Llama-2-13B-chat-AWQ`.
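
As an illustrative sketch (the model id is only an example), the quick start serving.properties above can be combined with the quantization option to load a pre-quantized AWQ model:

```
engine=MPI
option.model_id=TheBloke/Llama-2-13B-chat-AWQ
option.tensor_parallel_degree=max
option.rolling_batch=lmi-dist
option.quantize=awq
```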

## Advanced LMI-Dist Configurations

32 changes: 15 additions & 17 deletions serving/docs/lmi_new/user_guides/tnx_user_guide.md
@@ -2,7 +2,9 @@

## Model Artifacts Structure

LMI Transformers-NeuronX expect the model to be standard HuggingFace format for runtime compilation. For loading of compiled models, both Optimum compiled models and split-models with a separate `neff` cache (compiled models must be compiled with the same Neuron compiler version and model settings).
LMI Transformers-NeuronX expects the model to be [standard HuggingFace format](../deployment_guide/model-artifacts.md) for runtime compilation.

For loading compiled models, both Optimum compiled models and split-models with a separate `neff` cache are supported (compiled models must be compiled with the same Neuron compiler version and model settings).
The source of the model could be:

- model_id string from huggingface
@@ -36,46 +38,42 @@ The model architectures that are tested daily for LMI Transformers-NeuronX (in C

We will add support for more models and test them in future versions. Please feel free to [file an issue](https://github.com/deepjavalibrary/djl-serving/issues/new/choose) to request more model coverage in CI.

### Quantization

Currently, we allow customer to use `option.quantize=static_int8` or `OPTION_QUANTIZE=static_int8` to load the model using `int8` weight quantization.

## Quick Start Configurations

Most LMI Transformers-NeuronX models use the following template (quantization and other advanced features require additional configuration, which is described [here](../../lmi/configurations_large_model_inference_containers.md#transformers-neuronx-doc)):

### Use serving.properties

You can deploy with a serving.properties:
### serving.properties

```
engine=Python
option.tensor_parallel_degree=4
option.model_id=<your model>
option.entryPoint=djl_python.transformers_neuronx
option.max_rolling_batch_size=8
option.rolling_batch=auto
# Adjust the following based on model size and instance type
option.tensor_parallel_degree=4
option.max_rolling_batch_size=8
option.model_loading_timeout=1600
```

This is the standard no-code experience DJL-Serving provided.
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---servingproperties) to deploy a model with serving.properties configuration on SageMaker.

### Use environment variables

You can also deploy without even providing any artifacts to run with LMI through specifying everything in ENV:
### environment variables

```
HF_MODEL_ID=<your model>
OPTION_ENTRYPOINT=djl_python.transformers_neuronx
OPTION_ROLLING_BATCH=auto
# Adjust the following based on model size and instance type
TENSOR_PARALLEL_DEGREE=4
OPTION_MAX_ROLLING_BATCH_SIZE=8
OPTION_ROLLING_BATCH=auto
OPTION_MODEL_LOADING_TIMEOUT=1600
```

### SageMaker notebook deployment samples
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---environment-variables) to deploy a model with environment variable configuration on SageMaker.

## Quantization

Here you can find deployment samples with SageMaker notebooks [tutorial](../README.md).
Currently, we allow customers to use `option.quantize=static_int8` or `OPTION_QUANTIZE=static_int8` to load the model using `int8` weight quantization.
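
For example, a sketch of the quick start serving.properties extended with `int8` weight quantization (values are illustrative and should be adjusted for your model and instance type):

```
engine=Python
option.model_id=<your model>
option.entryPoint=djl_python.transformers_neuronx
option.rolling_batch=auto
option.tensor_parallel_degree=4
option.max_rolling_batch_size=8
option.model_loading_timeout=1600
# Load the model with int8 weight quantization
option.quantize=static_int8
```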

## Advanced Transformers NeuronX Configurations

89 changes: 29 additions & 60 deletions serving/docs/lmi_new/user_guides/trt_llm_user_guide.md
@@ -7,8 +7,6 @@ TRT-LLM LMI supports two options for model artifacts
1. [Standard HuggingFace model format](../deployment_guide/model-artifacts.md#huggingface-transformers-pretrained-format): In this case, TRT-LLM LMI will build TRT-LLM engines from HuggingFace model and package them with HuggingFace model config files during model load time.
2. [Custom TRT-LLM LMI model format](../deployment_guide/model-artifacts.md#tensorrt-llmtrt-llm-lmi-model-format): In this case, artifacts are loaded directly without the need for model compilation, resulting in faster load times.



## Supported Model Architectures

The below model architectures are supported for JIT model compilation and tested in our CI.
@@ -28,73 +26,44 @@ TRT-LLM LMI v8 0.26.0 containers come with [TRT-LLM 0.7.1](https://github.com/NV

We will add more model support to our CI in future versions. Please feel free to [file an issue](https://github.com/deepjavalibrary/djl-serving/issues/new/choose) if you are looking for support for a specific model.

## Quick Start Configurations

## SageMaker Deployment Tutorial
Users need to provide the model id of the model they want to deploy. Model id can be provided using `OPTION_MODEL_ID` environment variable which can take one of the following values:

* Hugging Face model id
* s3 uri of Hugging Face model stored in s3
* s3 uri of pre-compiled TRT-LLM LMI model artifacts
You can leverage `tensorrtllm` with LMI using the following starter configurations:

We also need to set `SERVING_LOAD_MODELS` environment variable which can be set as below.
### serving.properties

```
HF_MODEL_ID=<your model id>
engine=MPI
option.tensor_parallel_degree=max
option.rolling_batch=trtllm
option.model_id=<your model id>
# Adjust the following based on model size and instance type
option.max_rolling_batch_size=64
option.max_input_len=1024
option.max_output_len=512
```
In addition to these required parameters, users may want to set these parameters according to their use case:

* `TENSOR_PARALLEL_DEGREE`: Determines number of gpus across model will be split into. By default, model will be split across all the gpus available in the instance. For some model architectures, this default behavior will not work. In such cases, users can set this parameter to a value that works for the specific model.
* `OPTION_MAX_INPUT_LEN`: Determines maximum input prompt length the model can process. Default is 1024. Users can decrease/increase this value if they know their application's precise limit.
* `OPTION_MAX_OUTPUT_LEN`: Determines maximum output tokens expected from the model. Default is 512. Users can decrease/increase this value if they know their application's precise limit.
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---servingproperties) to deploy a model with serving.properties configuration on SageMaker.

We also support customizing additional parameters to boost performance as per specific use case. Please refer to `Common` and `TensorRT-LLM` sections in this [doc](../../lmi/configurations_large_model_inference_containers.md) for advanced configuration
### environment variables

In this tutorial, we will use [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to deploy the model on SageMaker. The below code can be run in SageMaker environment to deploy llama2-13b on g5.12xlarge instance.
````
HF_MODEL_ID=<your model id>
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=trtllm
# Adjust the following based on model size and instance type
OPTION_MAX_ROLLING_BATCH_SIZE=64
OPTION_MAX_INPUT_LEN=1024
OPTION_MAX_OUTPUT_LEN=512
````

```
import sagemaker
from sagemaker import image_uris, Model, Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Setup role and sagemaker session
iam_role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.session.Session()
region = sagemaker_session._region_name

# Fetch the uri of the TRT-LLM LMI container
container_image_uri = image_uris.retrieve(framework="djl-tensorrtllm", version="0.26.0", region=region)

# Create the SageMaker Model object. In this example we'll use TensorRT-LLM as our inference backend
model = Model(
    image_uri=container_image_uri,
    role=iam_role,
    env={
        "HF_MODEL_ID": "TheBloke/Llama-2-13B-fp16",
    }
)

# Deploy your model to a SageMaker Endpoint and create a Predictor to make inference requests
endpoint_name = sagemaker.utils.name_from_base("llama-13b-trtllm-endpoint")
model.deploy(instance_type="ml.g5.12xlarge", initial_instance_count=1, endpoint_name=endpoint_name)
predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)


# Make an inference request against the llama2-13b endpoint
outputs = predictor.predict({
    "inputs": "The diamondback terrapin was the first reptile to be",
    "parameters": {
        "do_sample": True,
        "max_new_tokens": 256,
    }
})
print(outputs)
```
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---environment-variables) to deploy a model with environment variable configuration on SageMaker.

## Quantization Support

We support two methods of quantization when using TensorRT-LLM with LMI: SmoothQuant and AWQ.
You can enable these quantization strategies using `option.quantize=<smoothquant|awq>` in serving.properties, or `OPTION_QUANTIZE=<smoothquant|awq>` environment variable.
More details about additional (optional) quantization configurations are available in the advanced configuration table below.
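
As a sketch, enabling AWQ quantization on top of the environment-variable quick start configuration above could look like the following (values are illustrative):

```
HF_MODEL_ID=<your model id>
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=trtllm
OPTION_MAX_ROLLING_BATCH_SIZE=64
OPTION_QUANTIZE=awq
```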


## Advanced TensorRT-LLM Configurations
20 changes: 9 additions & 11 deletions serving/docs/lmi_new/user_guides/vllm_user_guide.md
@@ -48,36 +48,34 @@ Currently, we allow customer to use `option.quantize=awq` or `OPTION_QUANTIZE=aw

We will add GPTQ support for vLLM in an upcoming version.
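
As an illustrative sketch combining this option with the quick start configuration shown below, a serving.properties for a pre-quantized AWQ model could look like:

```
engine=Python
option.model_id=<your awq quantized model id>
option.tensor_parallel_degree=max
option.rolling_batch=vllm
option.quantize=awq
```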

## Deployment tutorial
## Quick Start Configurations

Most of the vLLM model could fall under the following templates:
You can leverage `vllm` with LMI using the following starter configurations:

### serving.properties

You can deploy with a serving.properties:

```
engine=Python
option.tensor_parallel_degree=max
option.model_id=<your model>
option.max_rolling_batch_size=64
option.rolling_batch=vllm
# Adjust the following based on model size and instance type
option.max_rolling_batch_size=64
```

This is the standard no-code experience DJL-Serving provided.
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---servingproperties) to deploy a model with serving.properties configuration on SageMaker.

### All environment variables

You can also deploy without even providing any artifacts to run with LMI through specifying everything in ENV:
### environment variables

```
HF_MODEL_ID=<your model>
TENSOR_PARALLEL_DEGREE=max
OPTION_MAX_ROLLING_BATCH_SIZE=64
OPTION_ROLLING_BATCH=vllm
# Adjust the following based on model size and instance type
OPTION_MAX_ROLLING_BATCH_SIZE=64
```

You can use [SageMaker deployment template](../README.md#using-the-sagemaker-python-sdk-to-deploy-your-first-model-with-lmi) to deploy the model with environment variables.
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---environment-variables) to deploy a model with environment variable configuration on SageMaker.

### Advanced vLLM Configurations
