diff --git a/serving/docs/lmi/user_guides/lmi-dist_user_guide.md b/serving/docs/lmi/user_guides/lmi-dist_user_guide.md
index 050f46bf9..61db47329 100644
--- a/serving/docs/lmi/user_guides/lmi-dist_user_guide.md
+++ b/serving/docs/lmi/user_guides/lmi-dist_user_guide.md
@@ -10,31 +10,45 @@ The model architecture that we test for lmi-dist (in CI):
 - Llama (un-quantized and with GPTQ)
 - Falcon
-- GPT-NeoX (un-quantized and with BitsAndBytes)
+- GPT-NeoX
 - MPT
 - Mistral (un-quantized and with AWQ)
 - Mixtral
 - T5
+- Gemma
+- StarCoder
+- Phi
+- DBRX
-### Complete Model Set
-
-Optimized models:
-* falcon
-* gpt-neox
-* llama
-* llama-2
-* llava
-* mistral
-* mixtral
-* mpt
-* santacoder
-* t5
+### Complete Model Set
-*Note: PEFT is also supported for optimized models*
+- Aquila & Aquila2 (`BAAI/AquilaChat2-7B`, `BAAI/AquilaChat2-34B`, `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.)
+- Baichuan & Baichuan2 (`baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.)
+- BLOOM (`bigscience/bloom`, `bigscience/bloomz`, etc.)
+- ChatGLM (`THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc.)
+- DBRX (`databricks/dbrx-base`, `databricks/dbrx-instruct`, etc.)
+- DeciLM (`Deci/DeciLM-7B`, `Deci/DeciLM-7B-instruct`, etc.)
+- Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.)
+- Gemma (`google/gemma-2b`, `google/gemma-7b`, etc.)
+- GPT-2 (`gpt2`, `gpt2-xl`, etc.)
+- GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
+- GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.)
+- GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.)
+- InternLM (`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.)
+- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
+- Mistral (`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.)
+- Mixtral (`mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, etc.)
+- MPT (`mosaicml/mpt-7b`, `mosaicml/mpt-30b`, etc.)
+- OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)
+- Phi (`microsoft/phi-1_5`, `microsoft/phi-2`, etc.)
+- Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.)
+- Qwen2 (`Qwen/Qwen2-beta-7B`, `Qwen/Qwen2-beta-7B-Chat`, etc.)
+- Yi (`01-ai/Yi-6B`, `01-ai/Yi-34B`, etc.)
+- T5 (`google/flan-t5-xxl`, `google/flan-t5-base`, etc.)
+
+We will add and test more models in future versions. Please feel free to [file an issue](https://github.com/deepjavalibrary/djl-serving/issues/new/choose) to request additional model coverage in CI.
-`lmi-dist` can also run other models which are supported by huggingface transformers but not listed above, although **without** the optimizations and `gptq`, `awq` quantization support.
-`transformers` library support can be found for [CausalLM](https://github.com/huggingface/transformers/blob/v4.34.0/src/transformers/models/auto/modeling_auto.py#L381) and [Seq2Seq](https://github.com/huggingface/transformers/blob/v4.34.0/src/transformers/models/auto/modeling_auto.py#L652) models.
 ## Quick Start Configurations
@@ -75,6 +89,7 @@ We support the following ``:
 * [DEPRECATED] bitsandbytes (LMI container versions >= 0.24.0 and < 0.27.0)
 * [DEPRECATED] bitsandbytes8 (LMI container versions >= 0.25.0 and < 0.27.0)
 * gptq (LMI container versions >= 0.24.0)
+* squeezellm (LMI container versions >= 0.27.0)
 When using pre-quantized models make sure to use the correct model artifacts e.g.
 `TheBloke/Llama-2-13B-chat-GPTQ`, `TheBloke/Llama-2-13B-chat-AWQ`.
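+
+As a minimal sketch (the model id is taken from the example above; the other values are only illustrative), a `serving.properties` that serves a pre-quantized AWQ model with `lmi-dist` could look like:
+
+```
+engine=MPI
+option.rolling_batch=lmi-dist
+# pre-quantized AWQ artifact from the example above
+option.model_id=TheBloke/Llama-2-13B-chat-AWQ
+option.tensor_parallel_degree=4
+option.quantize=awq
+```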
@@ -83,8 +98,20 @@ When using pre-quantized models make sure to use the correct model artifacts e.g
 Here are the advanced parameters that are available when using LMI-Dist.
-| Item | LMI Version | Description | Example value |
-|------|-------------|-------------|---------------|
-| option.quantize | \>= 0.23.0 | Quantize the model with the supported quantization methods | Default: `None` |
-| option.max_rolling_batch_prefill_tokens [Deprecated since 0.25.0] | \>= 0.24.0 | Limits the number of tokens for caching. This needs to be tuned based on batch size and input sequence length to avoid GPU OOM. Currently we are calculating the best value for you from 0.25.0, this is no longer required | Default: 4096 |
-
+| Item | LMI Version | Configuration Type | Description | Example value |
+|------|-------------|--------------------|-------------|---------------|
+| option.quantize | \>= 0.23.0 | LMI | Quantize the model with the supported quantization methods (`gptq`, `awq`, `squeezellm`). | `awq` Default: `None` |
+| option.max_rolling_batch_prefill_tokens | \>= 0.24.0 | Pass Through | Limits the number of tokens for prefill (a.k.a. prompt processing). This needs to be tuned based on the available GPU memory and request lengths. Setting this value too high can limit the number of KV cache blocks or cause GPU OOM. If you don't set this, `lmi-dist` will default to the max model length from the Hugging Face config (which also accounts for rope scaling if applicable). | Default: `None` |
+| option.max_model_len | \>= 0.27.0 | Pass Through | The maximum length (input+output) of a request. Generation stops once this many tokens have been produced. `lmi-dist` defaults to the max model length from the Hugging Face config (which also accounts for rope scaling if applicable). For models that support a larger maximum length (e.g. 32k for Mistral 7B), this could lead to GPU OOM; in such cases, reduce this value to deploy on smaller instances. | Default: `None` |
+| option.load_format | \>= 0.27.0 | Pass Through | The checkpoint format of the model. The default is `auto`, which means bin/safetensors checkpoints will be used if found. | Default: `auto` |
+| option.enforce_eager | \>= 0.27.0 | Pass Through | By default, `lmi-dist` runs with CUDA graph optimization for the best performance. However, when GPU memory is scarce, enabling CUDA graphs can cause OOM. If you set this option to true, PyTorch eager mode is used and CUDA graphs are disabled, saving several GB of memory. The `T5` model never uses CUDA graphs. | Default: `False` |
+| option.gpu_memory_utilization | \>= 0.27.0 | Pass Through | This config controls the amount of GPU memory allocated to the KV cache. Setting a higher value allocates more memory for the KV cache. The default is 0.9. It is recommended to reduce this value if GPU OOMs are encountered. | Default: `0.9` |
+| option.speculative_draft_model | \>= 0.27.0 | Pass Through | Model id or path of the speculative decoding draft model. | Default: `None` |
+| option.draft_model_tp_size | \>= 0.27.0 | Pass Through | Tensor parallel degree of the speculative decoding draft model. Accepted values are `1` and the target model's tensor parallel size (`option.tensor_parallel_degree`). | Default: `1` |
+| option.speculative_length | \>= 0.27.0 | Pass Through | Determines the number of tokens the draft model generates before they are verified against the target model. | Default: `5` |
+| option.record_acceptance_rate | \>= 0.27.0 | LMI | Enables logging of the speculative decoding acceptance rate. | Default: `False` |
+| option.enable_lora | \>= 0.27.0 | Pass Through | This config enables support for LoRA adapters. | Default: `false` |
+| option.max_loras | \>= 0.27.0 | Pass Through | This config determines the maximum number of LoRA adapters that can be run at once. GPU memory is allocated for that number of adapters. | Default: `4` |
+| option.max_lora_rank | \>= 0.27.0 | Pass Through | This config determines the maximum rank allowed for a LoRA adapter. Set this value to the maximum rank of your adapters. Setting a larger value will enable more adapters at a greater memory usage cost. | Default: `16` |
+| option.lora_extra_vocab_size | \>= 0.27.0 | Pass Through | This config determines the maximum additional vocabulary that can be added through a LoRA adapter. | Default: `256` |
+| option.max_cpu_loras | \>= 0.27.0 | Pass Through | This config determines the maximum number of LoRA adapters to cache in memory. All others will be evicted to disk. | Default: `None` |
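+
+As a sketch of how the speculative decoding options above fit together (the draft model id shown is only an illustration and not part of the tested model set; the other values are example values), a `serving.properties` could look like:
+
+```
+engine=MPI
+option.rolling_batch=lmi-dist
+option.model_id=meta-llama/Llama-2-70b-hf
+option.tensor_parallel_degree=8
+# the draft model runs with tensor parallel degree 1 (see option.draft_model_tp_size)
+option.speculative_draft_model=TinyLlama/TinyLlama-1.1B-Chat-v1.0
+option.draft_model_tp_size=1
+# number of draft tokens generated before verification against the target model
+option.speculative_length=5
+option.record_acceptance_rate=true
+```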
diff --git a/serving/docs/lmi/user_guides/lmi_input_output_schema.md b/serving/docs/lmi/user_guides/lmi_input_output_schema.md
index 7fff2537c..79f4147ab 100644
--- a/serving/docs/lmi/user_guides/lmi_input_output_schema.md
+++ b/serving/docs/lmi/user_guides/lmi_input_output_schema.md
@@ -84,7 +84,21 @@ Decoding method supported in DeepSpeed: Greedy (Default) and Sampling.
 LmiDistRollingBatchParameters : {
     'typical_p' : float (default= 1.0),
     'truncate' : integer (default = None),
-    'ignore_eos_token' : boolean (default = false)
+    'ignore_eos_token' : boolean (default = false),
+    'top_k' : integer (default = -1),
+    'min_p': float (default = 0.0),
+    'presence_penalty': float (default = 0.0),
+    'frequency_penalty' : float (default = 0.0),
+    'num_beams': integer (default = 1), (set this greater than 1 to enable beam search)
+    'length_penalty' : float (default = 1.0),
+    'early_stopping' : boolean (default = false),
+    'stop_token_ids': list (default = None),
+    'include_stop_str_in_output' : boolean (default = false),
+    'logprobs' : int (default = None),
+    'prompt_logprobs' : int (default = None),
+    'skip_special_tokens': boolean (default = true),
+    'spaces_between_special_tokens': boolean (default = true),
 }
 ```
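+
+As a sketch, a request that sets a few of the `LmiDistRollingBatchParameters` above could look like the following (`max_new_tokens` and `do_sample` are common generation parameters rather than `lmi-dist`-specific ones; the concrete values, including the stop token id, are only illustrative):
+
+```
+{
+  "inputs": "The future of large model inference is",
+  "parameters": {
+    "max_new_tokens": 128,
+    "do_sample": true,
+    "top_k": 50,
+    "presence_penalty": 0.5,
+    "stop_token_ids": [2],
+    "skip_special_tokens": true
+  }
+}
+```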
@@ -96,11 +110,12 @@ Decoding method supported in LmiDist : Greedy (Default) and Sampling.
 ```
 vLLMRollingBatchParameters : {
     'top_k' : integer (default = -1)
-    'min_p': float (default = 0.0),
     'presence_penalty': float (default = 0.0),
     'frequency_penalty' : float (default = 0.0),
     'num_beams': integer (default = 1), (set this greater than 1 to enable beam search)
+    'length_penalty' : float (default = 1.0),
+    'early_stopping' : boolean (default = false),
     'stop_token_ids': list (default = None),
     'include_stop_str_in_output' : boolean (default = false),
     'ignore_eos_token' : boolean (default = false),
diff --git a/serving/docs/lmi/user_guides/vllm_user_guide.md b/serving/docs/lmi/user_guides/vllm_user_guide.md
index 2af2e2517..cfef32fcd 100644
--- a/serving/docs/lmi/user_guides/vllm_user_guide.md
+++ b/serving/docs/lmi/user_guides/vllm_user_guide.md
@@ -26,6 +26,7 @@ The model architecture that we carefully tested for vLLM (in CI):
 - ChatGLM (`THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc.)
 - DeciLM (`Deci/DeciLM-7B`, `Deci/DeciLM-7B-instruct`, etc.)
 - Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.)
+- Gemma (`google/gemma-2b`, `google/gemma-7b`, etc.)
 - GPT-2 (`gpt2`, `gpt2-xl`, etc.)
 - GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
 - GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.)
@@ -94,14 +95,14 @@ If you encounter an issue with a `Pass Through` configuration, it is likely an i
 | Item | LMI Version | Configuration Type | Description | Example value |
 |------|-------------|--------------------|-------------|---------------|
 | option.quantize | \>= 0.26.0 | LMI | Quantize the model with the supported quantization methods. LMI uses this to set the right quantization configs in VLLM | `awq` Default: `None` |
-| option.max_rolling_batch_prefill_tokens | \>= 0.26.0 | LMI | Limits the number of tokens for caching. This needs to be tuned based on batch size and input sequence length to avoid GPU OOM. If you don't set this, vLLM will compute a value. LMI uses this to set `max-num-batched-tokens` in vLLM. | Default: `None` |
+| option.max_rolling_batch_prefill_tokens | \>= 0.26.0 | LMI | Limits the number of tokens for prefill (a.k.a. prompt processing). This needs to be tuned based on the available GPU memory and request lengths. Setting this value too high can limit the number of KV cache blocks or cause OOM. If you don't set this, `vllm` will default to the max model length from the Hugging Face config (which also accounts for rope scaling if applicable). | Default: `None` |
 | option.max_model_len | \>= 0.26.0 | Pass Through | the maximum length (input+output) vLLM should preserve memory for. If not specified, will use the default length the model is capable in config.json. in verion like 0.27.0, sometimes model's maximum length could go to 32k (Mistral 7B) and way beyond the supported KV token size. In that case to deploy on a small instance, we need to adjust this value within the range of KV Cache limit. | Default: `None` |
 | option.load_format | \>= 0.26.0 | Pass Through | The checkpoint format of the model. Default is auto and means bin/safetensors will be used if found. | Default: `auto` |
 | option.enforce_eager | \>= 0.27.0 | Pass Through | vLLM by default will run with CUDA graph optimization to reach to the best performance. However, in the situation of very less GPU memory, having CUDA graph enabled will cause OOM. So if you set this option to true, we will use PyTorch Eager mode and disable CUDA graph to save some GBs of memory. | Default: `False` |
-| option.gpu_memory_utilization | \>= 0.27.0 | Pass Through | This config controls the percentage of memory to be allocated to PagedAttention. Default to 0.9 (90%). We don't recommend to change this value because this impact the overall GPU memory allocations. | Default: `0.9` |
-| option.enable_loras | \>= 0.27.0 | Pass Through | This config enables support for LoRA adapters. | Default: `false` |
-| option.max_loras | \>= 0.27.0 | Pass Through | This config determines the maximum number of LoRA adapters that can be run at once. Allocates more GPU memory for those adapters. | Default: `4` |
-| option.max_lora_rank | \>= 0.27.0 | Pass Through | This config determines the maximum rank allowed for a LoRA adapter. Setting a larger value will enable more adapters at a greater memory usage cost. | Default: `16` |
+| option.gpu_memory_utilization | \>= 0.27.0 | Pass Through | This config controls the amount of GPU memory allocated to the KV cache. Setting a higher value allocates more memory for the KV cache. The default is 0.9. It is recommended to reduce this value if GPU OOMs are encountered. | Default: `0.9` |
+| option.enable_lora | \>= 0.27.0 | Pass Through | This config enables support for LoRA adapters. | Default: `false` |
+| option.max_loras | \>= 0.27.0 | Pass Through | This config determines the maximum number of LoRA adapters that can be run at once. GPU memory is allocated for that number of adapters. | Default: `4` |
+| option.max_lora_rank | \>= 0.27.0 | Pass Through | This config determines the maximum rank allowed for a LoRA adapter. Set this value to the maximum rank of your adapters. Setting a larger value will enable more adapters at a greater memory usage cost. | Default: `16` |
 | option.lora_extra_vocab_size | \>= 0.27.0 | Pass Through | This config determines the maximum additional vocabulary that can be added through a LoRA adapter. | Default: `256` |
 | option.max_cpu_loras | \>= 0.27.0 | Pass Through | This config determines the maximum number of LoRA adapters to cache in memory. All others will be evicted to disk. | Default: `None` |
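+
+As a sketch of how the LoRA-related options above combine for vLLM (the model id, tensor parallel degree, and limit values are only example assumptions), a `serving.properties` could look like:
+
+```
+engine=Python
+option.rolling_batch=vllm
+option.model_id=mistralai/Mistral-7B-v0.1
+option.tensor_parallel_degree=1
+# leave a bit more headroom than the 0.9 default
+option.gpu_memory_utilization=0.8
+option.enable_lora=true
+option.max_loras=4
+option.max_lora_rank=16
+option.max_cpu_loras=8
+```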