[docs][lmi] standardize structure of backend user guides #1625

Merged: 1 commit, Mar 13, 2024
8 changes: 6 additions & 2 deletions serving/docs/lmi_new/user_guides/deepspeed_user_guide.md
@@ -95,7 +95,7 @@ Auto Tensor-Parallelism:

## Quick Start Configurations

You can leverage DeepSpeed with LMI using the following starter configurations:
You can leverage `deepspeed` with LMI using the following starter configurations:

### serving.properties

@@ -105,20 +105,24 @@ option.entryPoint=djl_python.deepspeed
option.tensor_parallel_degree=max
option.rolling_batch=deepspeed
option.model_id=<your model id>
# Adjust the following based on model size and instance type
option.max_rolling_batch_size=64
```

You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---servingproperties) to deploy a model with serving.properties configuration on SageMaker.

### environment variables

```
HF_MODEL_ID=<your model id>
OPTION_ENTRYPOINT=djl_python.deepspeed
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=deepspeed
# Adjust the following based on model size and instance type
OPTION_MAX_ROLLING_BATCH_SIZE=64
```

You can use the [SageMaker deployment template](../README.md#using-the-sagemaker-python-sdk-to-deploy-your-first-model-with-lmi) to deploy the model with environment variables.
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---environment-variables) to deploy a model with environment variable configuration on SageMaker.
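
For reference, here is a minimal sketch of deploying this environment-variable configuration with the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk), following the same pattern as the SageMaker tutorial in these docs. The container version (`0.26.0`), instance type, and endpoint name below are assumptions; adjust them for your model and account.

```
import sagemaker
from sagemaker import image_uris, Model

# Set up the SageMaker session and execution role
iam_role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.session.Session()
region = sagemaker_session._region_name

# Fetch the uri of the DeepSpeed LMI container (version is an assumption)
container_image_uri = image_uris.retrieve(framework="djl-deepspeed", version="0.26.0", region=region)

# Create the SageMaker Model object with the environment-variable configuration above
model = Model(
    image_uri=container_image_uri,
    role=iam_role,
    env={
        "HF_MODEL_ID": "<your model id>",
        "OPTION_ENTRYPOINT": "djl_python.deepspeed",
        "TENSOR_PARALLEL_DEGREE": "max",
        "OPTION_ROLLING_BATCH": "deepspeed",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "64",
    },
)

# Deploy the model to a SageMaker endpoint (instance type is an assumption)
endpoint_name = sagemaker.utils.name_from_base("deepspeed-lmi-endpoint")
model.deploy(instance_type="ml.g5.12xlarge", initial_instance_count=1, endpoint_name=endpoint_name)
```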

## Quantization Support

44 changes: 24 additions & 20 deletions serving/docs/lmi_new/user_guides/lmi-dist_user_guide.md
@@ -1,5 +1,9 @@
# LMI-Dist Engine User Guide

## Model Artifact Structure

LMI-Dist expects the model to be in the [standard HuggingFace format](../deployment_guide/model-artifacts.md).

## Supported Model Architectures

The model architectures that we test for lmi-dist (in CI):
@@ -31,25 +35,6 @@ Optimized models:

`lmi-dist` can also run other models that are supported by huggingface transformers but not listed above, although **without** the optimizations and without `gptq`/`awq` quantization support. `transformers` library support can be found for [CausalLM](https://github.com/huggingface/transformers/blob/v4.34.0/src/transformers/models/auto/modeling_auto.py#L381) and [Seq2Seq](https://github.com/huggingface/transformers/blob/v4.34.0/src/transformers/models/auto/modeling_auto.py#L652) models.


### Quantization

Currently, we allow customer to use `option.quantize=<quantization-type>` or `OPTION_QUANTIZE=<quantization-type>` to load a quantized model in `lmi-dist`.

We support the following `<quantization-type>`:

* awq (LMI container versions >= 0.26.0)
* [DEPRECATED] bitsandbytes (LMI container versions >= 0.24.0 and < 0.27.0)
* [DEPRECATED] bitsandbytes8 (LMI container versions >= 0.25.0 and < 0.27.0)
* gptq (LMI container versions >= 0.24.0)

When using pre-quantized models make sure to use the correct model artifacts e.g. `TheBloke/Llama-2-13B-chat-GPTQ`, `TheBloke/Llama-2-13B-chat-AWQ`.


## Model Artifact Structure

`lmi-dist` expects the model to be [standard HuggingFace format](../deployment_guide/model-artifacts.md#huggingface-transformers-pretrained-format).

## Quick Start Configurations

You can leverage `lmi-dist` with LMI using the following starter configurations:
@@ -61,19 +46,38 @@ engine=MPI
option.tensor_parallel_degree=max
option.rolling_batch=lmi-dist
option.model_id=<your model id>
# Adjust the following based on model size and instance type
option.max_rolling_batch_size=64
option.max_rolling_batch_prefill_tokens=4096
```

You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---servingproperties) to deploy a model with serving.properties configuration on SageMaker.

### environment variables

```
HF_MODEL_ID=<your model id>
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=lmi-dist
# Adjust the following based on model size and instance type
OPTION_MAX_ROLLING_BATCH_SIZE=64
OPTION_MAX_ROLLING_BATCH_PREFILL_TOKENS=4096
```

You can use the [SageMaker deployment template](../README.md#using-the-sagemaker-python-sdk-to-deploy-your-first-model-with-lmi) to deploy the model with environment variables.
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---environment-variables) to deploy a model with environment variable configuration on SageMaker.

## Quantization Support

Currently, we allow customers to use `option.quantize=<quantization-type>` or `OPTION_QUANTIZE=<quantization-type>` to load a quantized model in `lmi-dist`.

We support the following `<quantization-type>`:

* awq (LMI container versions >= 0.26.0)
* [DEPRECATED] bitsandbytes (LMI container versions >= 0.24.0 and < 0.27.0)
* [DEPRECATED] bitsandbytes8 (LMI container versions >= 0.25.0 and < 0.27.0)
* gptq (LMI container versions >= 0.24.0)

When using pre-quantized models, make sure to use the correct model artifacts, e.g. `TheBloke/Llama-2-13B-chat-GPTQ` or `TheBloke/Llama-2-13B-chat-AWQ`.
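
As an illustrative sketch (the model id is only an example), the quick start serving.properties above can be combined with the quantization option to load a pre-quantized AWQ model:

```
engine=MPI
option.model_id=TheBloke/Llama-2-13B-chat-AWQ
option.tensor_parallel_degree=max
option.rolling_batch=lmi-dist
option.quantize=awq
```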

## Advanced LMI-Dist Configurations

32 changes: 15 additions & 17 deletions serving/docs/lmi_new/user_guides/tnx_user_guide.md
@@ -2,7 +2,9 @@

## Model Artifacts Structure

LMI Transformers-NeuronX expect the model to be standard HuggingFace format for runtime compilation. For loading of compiled models, both Optimum compiled models and split-models with a separate `neff` cache (compiled models must be compiled with the same Neuron compiler version and model settings).
LMI Transformers-NeuronX expects the model to be [standard HuggingFace format](../deployment_guide/model-artifacts.md) for runtime compilation.

For loading compiled models, both Optimum compiled models and split-models with a separate `neff` cache are supported (compiled models must be compiled with the same Neuron compiler version and model settings).
The source of the model could be:

- model_id string from huggingface
@@ -36,46 +38,42 @@ The model architectures that are tested daily for LMI Transformers-NeuronX (in C

We will add support for more models and test them in future versions. Please feel free to [file an issue](https://github.com/deepjavalibrary/djl-serving/issues/new/choose) to request more model coverage in CI.

### Quantization

Currently, we allow customer to use `option.quantize=static_int8` or `OPTION_QUANTIZE=static_int8` to load the model using `int8` weight quantization.

## Quick Start Configurations

Most LMI Transformers-NeuronX models use the following template (quantization and other advanced features require additional configuration, which is described [here](../../lmi/configurations_large_model_inference_containers.md#transformers-neuronx-doc)):

### Use serving.properties

You can deploy with a serving.properties:
### serving.properties

```
engine=Python
option.tensor_parallel_degree=4
option.model_id=<your model>
option.entryPoint=djl_python.transformers_neuronx
option.max_rolling_batch_size=8
option.rolling_batch=auto
# Adjust the following based on model size and instance type
option.tensor_parallel_degree=4
option.max_rolling_batch_size=8
option.model_loading_timeout=1600
```

This is the standard no-code experience DJL-Serving provided.
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---servingproperties) to deploy a model with serving.properties configuration on SageMaker.

### Use environment variables

You can also deploy without even providing any artifacts to run with LMI through specifying everything in ENV:
### environment variables

```
HF_MODEL_ID=<your model>
OPTION_ENTRYPOINT=djl_python.transformers_neuronx
OPTION_ROLLING_BATCH=auto
# Adjust the following based on model size and instance type
TENSOR_PARALLEL_DEGREE=4
OPTION_MAX_ROLLING_BATCH_SIZE=8
OPTION_ROLLING_BATCH=auto
OPTION_MODEL_LOADING_TIMEOUT=1600
```

### SageMaker notebook deployment samples
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---environment-variables) to deploy a model with environment variable configuration on SageMaker.

## Quantization

Here you can find deployment samples with SageMaker notebooks [tutorial](../README.md).
Currently, we allow customers to use `option.quantize=static_int8` or `OPTION_QUANTIZE=static_int8` to load the model using `int8` weight quantization.
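
For example, a sketch of the quick start serving.properties extended with `int8` weight quantization (values are illustrative and should be adjusted for your model and instance type):

```
engine=Python
option.model_id=<your model>
option.entryPoint=djl_python.transformers_neuronx
option.rolling_batch=auto
option.tensor_parallel_degree=4
option.max_rolling_batch_size=8
option.model_loading_timeout=1600
# Load the model with int8 weight quantization
option.quantize=static_int8
```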

## Advanced Transformers NeuronX Configurations

89 changes: 29 additions & 60 deletions serving/docs/lmi_new/user_guides/trt_llm_user_guide.md
@@ -7,8 +7,6 @@ TRT-LLM LMI supports two options for model artifacts
1. [Standard HuggingFace model format](../deployment_guide/model-artifacts.md#huggingface-transformers-pretrained-format): In this case, TRT-LLM LMI will build TRT-LLM engines from HuggingFace model and package them with HuggingFace model config files during model load time.
2. [Custom TRT-LLM LMI model format](../deployment_guide/model-artifacts.md#tensorrt-llmtrt-llm-lmi-model-format): In this case, artifacts are loaded directly without the need for model compilation, resulting in faster load times.



## Supported Model Architectures

The below model architectures are supported for JIT model compilation and tested in our CI.
@@ -28,73 +26,44 @@ TRT-LLM LMI v8 0.26.0 containers come with [TRT-LLM 0.7.1](https://github.com/NV

We will add more model support to our CI in future versions. Please feel free to [file an issue](https://github.com/deepjavalibrary/djl-serving/issues/new/choose) if you are looking for support for a specific model.

## Quick Start Configurations

## SageMaker Deployment Tutorial
Users need to provide the model id of the model they want to deploy. Model id can be provided using `OPTION_MODEL_ID` environment variable which can take one of the following values:

* Hugging Face model id
* s3 uri of Hugging Face model stored in s3
* s3 uri of pre-compiled TRT-LLM LMI model artifacts
You can leverage `tensorrtllm` with LMI using the following starter configurations:

We also need to set `SERVING_LOAD_MODELS` environment variable which can be set as below.
### serving.properties

```
HF_MODEL_ID=<your model id>
engine=MPI
option.tensor_parallel_degree=max
option.rolling_batch=trtllm
option.model_id=<your model id>
# Adjust the following based on model size and instance type
option.max_rolling_batch_size=64
option.max_input_len=1024
option.max_output_len=512
```
In addition to these required parameters, users may want to set these parameters according to their use case:

* `TENSOR_PARALLEL_DEGREE`: Determines number of gpus across model will be split into. By default, model will be split across all the gpus available in the instance. For some model architectures, this default behavior will not work. In such cases, users can set this parameter to a value that works for the specific model.
* `OPTION_MAX_INPUT_LEN`: Determines maximum input prompt length the model can process. Default is 1024. Users can decrease/increase this value if they know their application's precise limit.
* `OPTION_MAX_OUTPUT_LEN`: Determines maximum output tokens expected from the model. Default is 512. Users can decrease/increase this value if they know their application's precise limit.
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---servingproperties) to deploy a model with serving.properties configuration on SageMaker.

We also support customizing additional parameters to boost performance as per specific use case. Please refer to `Common` and `TensorRT-LLM` sections in this [doc](../../lmi/configurations_large_model_inference_containers.md) for advanced configuration
### environment variables

In this tutorial, we will use [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to deploy the model on SageMaker. The below code can be run in SageMaker environment to deploy llama2-13b on g5.12xlarge instance.
````
HF_MODEL_ID=<your model id>
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=trtllm
# Adjust the following based on model size and instance type
OPTION_MAX_ROLLING_BATCH_SIZE=64
OPTION_MAX_INPUT_LEN=1024
OPTION_MAX_OUTPUT_LEN=512
````

```
import sagemaker
from sagemaker import image_uris, Model, Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Setup role and sagemaker session
iam_role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.session.Session()
region = sagemaker_session._region_name

# Fetch the uri of the TRT-LLM LMI container
container_image_uri = image_uris.retrieve(framework="djl-tensorrtllm", version="0.26.0", region=region)

# Create the SageMaker Model object. In this example we'll use TensorRT-LLM as our inference backend
model = Model(
    image_uri=container_image_uri,
    role=iam_role,
    env={
        "HF_MODEL_ID": "TheBloke/Llama-2-13B-fp16",
    }
)

# Deploy your model to a SageMaker Endpoint and create a Predictor to make inference requests
endpoint_name = sagemaker.utils.name_from_base("llama-13b-trtllm-endpoint")
model.deploy(instance_type="ml.g5.12xlarge", initial_instance_count=1, endpoint_name=endpoint_name)
predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)


# Make an inference request against the llama2-13b endpoint
outputs = predictor.predict({
    "inputs": "The diamondback terrapin was the first reptile to be",
    "parameters": {
        "do_sample": True,
        "max_new_tokens": 256,
    }
})
print(outputs)
```
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---environment-variables) to deploy a model with environment variable configuration on SageMaker.

## Quantization Support

We support two methods of quantization when using TensorRT-LLM with LMI: SmoothQuant and AWQ.
You can enable these quantization strategies using `option.quantize=<smoothquant|awq>` in serving.properties, or `OPTION_QUANTIZE=<smoothquant|awq>` environment variable.
More details about additional (optional) quantization configurations are available in the advanced configuration table below.
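
As a sketch, enabling AWQ quantization on top of the environment-variable quick start configuration above could look like the following (values are illustrative):

```
HF_MODEL_ID=<your model id>
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=trtllm
OPTION_MAX_ROLLING_BATCH_SIZE=64
OPTION_QUANTIZE=awq
```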


## Advanced TensorRT-LLM Configurations
20 changes: 9 additions & 11 deletions serving/docs/lmi_new/user_guides/vllm_user_guide.md
@@ -48,36 +48,34 @@ Currently, we allow customer to use `option.quantize=awq` or `OPTION_QUANTIZE=aw

We will add GPTQ support for vLLM in an upcoming version.
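
As an illustrative sketch combining this option with the quick start configuration shown below, a serving.properties for a pre-quantized AWQ model could look like:

```
engine=Python
option.model_id=<your awq quantized model id>
option.tensor_parallel_degree=max
option.rolling_batch=vllm
option.quantize=awq
```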

## Deployment tutorial
## Quick Start Configurations

Most of the vLLM model could fall under the following templates:
You can leverage `vllm` with LMI using the following starter configurations:

### serving.properties

You can deploy with a serving.properties:

```
engine=Python
option.tensor_parallel_degree=max
option.model_id=<your model>
option.max_rolling_batch_size=64
option.rolling_batch=vllm
# Adjust the following based on model size and instance type
option.max_rolling_batch_size=64
```

This is the standard no-code experience DJL-Serving provided.
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---servingproperties) to deploy a model with serving.properties configuration on SageMaker.

### All environment variables

You can also deploy without even providing any artifacts to run with LMI through specifying everything in ENV:
### environment variables

```
HF_MODEL_ID=<your model>
TENSOR_PARALLEL_DEGREE=max
OPTION_MAX_ROLLING_BATCH_SIZE=64
OPTION_ROLLING_BATCH=vllm
# Adjust the following based on model size and instance type
OPTION_MAX_ROLLING_BATCH_SIZE=64
```

You can use [SageMaker deployment template](../README.md#using-the-sagemaker-python-sdk-to-deploy-your-first-model-with-lmi) to deploy the model with environment variables.
You can follow [this example](../deployment_guide/deploying-your-endpoint.md#configuration---environment-variables) to deploy a model with environment variable configuration on SageMaker.

### Advanced vLLM Configurations
