diff --git a/serving/docs/configuration.md b/serving/docs/configuration.md index ed5f0b285..4c3d8e318 100644 --- a/serving/docs/configuration.md +++ b/serving/docs/configuration.md @@ -1,162 +1,47 @@ -# DJLServing startup configuration +# DJL Serving Configuration -## Environment variables +DJL Serving is a multi-layer system and has many different forms of configuration across those layers. -User can set environment variables to change DJL Serving behavior, following is a list of -variables that user can set for DJL Serving: +## Global -* JAVA_HOME -* JAVA_OPTS -* SERVING_OPTS -* MODEL_SERVER_HOME +At the beginning, there are [global configurations](configurations_global.md). +These configurations are passed through startup arguments, the config file, and environment variables. -**Note:** environment variable has higher priority that command line or config.properties. -It will override other property values. +As part of the startup, you are able to specify several different categories of options: -**Note:** For tunable parameters for Large Language Models please refer to [this](configurations_large_model_inference_containers.md) guide. +- Global Java settings with environment variables like `$JAVA_HOME` and `$JAVA_OPTS`. +- Loading behavior with the `model_store` and what models to load on startup +- Network settings such as the port and SSL -## Command line parameters +## Engine -User can use the following parameters to start djl-serving, those parameters will override default behavior: +DJL Serving is powered by [DeepJavaLibrary](djl.ai) and most of the functionality exists through the use of [DJL engines](http://docs.djl.ai/docs/engine.html). +As part of this, many of the engines along with DJL itself can be configured through the use of environment variables and system properties. -``` -djl-serving -h +The [engine configuration](configurations.md) document lists these configurations. +These include both the ones global to DJL as well as lists for each engine. +There are configurations for paths, versions, performance, settings, and debugging. +All engine configurations are shared between all models and workers using that engine. -usage: djl-serving [OPTIONS] - -f,--config-file Path to the configuration properties file. - -h,--help Print this help. - -m,--models Models to be loaded at startup. - -s,--model-store Model store location where models can be loaded. -``` +## Workflow -Details about the models, model-store, and workflows can be found in the equivalent configuration properties. +Next, you are able to add and configure a [Workflow](workflows.md). +DJL Serving has a custom solution for handling workflows that is configured through a `workflow.json` or `workflow.yml` file. -## config.properties file +## Model -DJL Serving use a `config.properties` file to store configurations. +Next, it is possible to specify [model configuration](configurations_model.md). +This is mostly done by using a `serving.properties` file, although there are environment variables that can be used as well. -### Configure listening port +These configurations are also optional. +If no `serving.properties` is provided, some basic properties such as which engine to use will be inferred. +The rest will back back to the global defaults. -DJL Serving only allows localhost access by default. 
+## Application -* inference_address: inference API binding address, default: http://127.0.0.1:8080 -* management_address: management API binding address, default: http://127.0.0.1:8081 +Alongside the configurations that determine how DJL Serving runs the model, there are also options that can be passed into the model itself. +The primary way is through the [DJL Model](https://javadoc.io/doc/ai.djl/api/latest/ai/djl/Model.html) properties or [DJL Criteria](https://javadoc.io/doc/ai.djl/api/latest/ai/djl/repository/zoo/Criteria.html) arguments. +These settings are ultimately dependent on the individual model. +But, here are some documented applications that have additional configurations: -Here are a couple of examples: - -```properties -# bind inference API to all network interfaces with SSL enabled -inference_address=https://0.0.0.0:8443 - -# bind inference API to private network interfaces -inference_address=https://172.16.1.10:8443 -``` - -### Configure initial models and workflows - -**Model Store** - -The `model_store` config property can be used to define a directory where each file/folder in it is a model to be loaded. -It will then attempt to load all of them by default. -Here is an example: - -```properties -model_store=build/models -``` - -**Load Models** - -The `load_models` config property can be used to define a list of models (or workflows) to be loaded. -The list should be defined as a comma separated list of urls to load models from. - -Each model can be defined either as a URL directly or optionally with prepended endpoint data like `[EndpointData]=modelUrl`. -The endpoint is a list of data items separated by commas. -The possible variations are: - -- `[modelName]` -- `[modelName:version]` -- `[modelName:version:engine]` -- `[modelName:version:engine:deviceNames]` - -The version can be an arbitrary string. -The engines uses the standard DJL `Engine` names. - -Possible deviceNames strings include `*` for all devices and a `;` separated list of device names following the format defined in DJL `Device.fromName`. -If no device is specified, it will use the DJL default device (usually GPU if available else CPU). - -```properties -load_models=https://resources.djl.ai/test-models/mlp.tar.gz,[mlp:v1:MXNet:*]=https://resources.djl.ai/test-models/mlp.tar.gz -``` - -**Workflows** - -Use the `load_models` config property to define initial workflows that should be loaded on startup. - -```properties -load_models=https://resources.djl.ai/test-models/basic-serving-workflow.json -``` - -View the [workflow documentation](workflows.md) to see more information about workflows and their configuration format. - -### Enable SSL - -For users who want to enable HTTPs, you can change `inference_address` or `management_addrss` -protocol from http to https, for example: `inference_addrss=https://127.0.0.1`. -This will make DJL Serving listen on localhost 443 port to accepting https request. - -User also must provide certificate and private keys to enable SSL. DJL Serving support two ways to configure SSL: - -1. Use keystore - * keystore: Keystore file location, if multiple private key entry in the keystore, first one will be picked. - * keystore_pass: keystore password, key password (if applicable) MUST be the same as keystore password. - * keystore_type: type of keystore, default: PKCS12 - -2. Use private-key/certificate files - * private_key_file: private key file location, support both PKCS8 and OpenSSL private key. - * certificate_file: X509 certificate chain file location. 
- -#### Self-signed certificate example - -This is a quick example to enable SSL with self-signed certificate - -##### User java keytool to create keystore - -```bash -keytool -genkey -keyalg RSA -alias djl -keystore keystore.p12 -storepass changeit -storetype PKCS12 -validity 3600 -keysize 2048 -dname "CN=www.MY_DOMSON.com, OU=Cloud Service, O=model server, L=Palo Alto, ST=California, C=US" -``` - - Config following property in config.properties: - -```properties -inference_address=https://127.0.0.1:8443 -management_address=https://127.0.0.1:8444 -keystore=keystore.p12 -keystore_pass=changeit -keystore_type=PKCS12 -``` - -##### User OpenSSL to create private key and certificate - -```bash -# generate a private key with the correct length -openssl genrsa -out private-key.pem 2048 - -# generate corresponding public key -openssl rsa -in private-key.pem -pubout -out public-key.pem - -# create a self-signed certificate -openssl req -new -x509 -key private-key.pem -out cert.pem -days 360 - -# convert pem to pfx/p12 keystore -openssl pkcs12 -export -inkey private-key.pem -in cert.pem -out keystore.p12 -``` - - Config following property in config.properties: - -```properties -inference_address=https://127.0.0.1:8443 -management_address=https://127.0.0.1:8444 -keystore=keystore.p12 -keystore_pass=changeit -keystore_type=PKCS12 -``` +- [Large Language Model Configurations](configurations_large_model_inference_containers.md) diff --git a/serving/docs/configurations.md b/serving/docs/configurations.md index c044d1669..e464df085 100644 --- a/serving/docs/configurations.md +++ b/serving/docs/configurations.md @@ -1,8 +1,6 @@ -# All DJL configuration options +# Engine Configuration -DJL serving is highly configurable. This document tries to capture those configurations in a single document. - -**Note:** For tunable parameters for Large Language Models please refer to [this](configurations_large_model_inference_containers.md) guide. +This covers the available configurations for DJL and engines. ## DJL settings @@ -83,134 +81,6 @@ DJLServing build on top of Deep Java Library (DJL). Here is a list of settings f | ai.djl.python.disable_alternative | system prop | Disable alternative engine | | TENSOR_PARALLEL_DEGREE | env var | Set tensor parallel degree.
For mpi mode, the default is the number of accelerators.
Use "max" for non-mpi mode to use all GPUs for tensor parallel. | -DJLServing provides a few alias for Python engine to make it easy for common LLM configurations. - -- `engine=DeepSpeed`, equivalent to: - -``` -engine=Python -option.mpi_mode=true -option.entryPoint=djl_python.deepspeed -``` - -- `engine=FasterTransformer`, this is equivalent to: - -``` -engine=Python -option.mpi_mode=true -option.entryPoint=djl_python.fastertransformer -``` - -- `engine=MPI`, this is equivalent to: - -``` -engine=Python -option.mpi_mode=true -option.entryPoint=djl_python.huggingface -``` - -## Global Model Server settings - -Global settings are configured at model server level. Change to these settings usually requires -restart model server to take effect. - -Most of the model server specific configuration can be configured in `conf/config.properties` file. -You can find the configuration keys here: -[ConfigManager.java](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/main/java/ai/djl/serving/util/ConfigManager.java#L52-L79) - -Each configuration key can also be override by environment variable with `SERVING_` prefix, for example: - -``` -export SERVING_JOB_QUEUE_SIZE=1000 # This will override JOB_QUEUE_SIZE in the config -``` - -| Key | Type | Description | -|-------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| MODEL_SERVER_HOME | env var | DJLServing home directory, default: Installation directory (e.g. /usr/local/Cellar/djl-serving/0.19.0/) | -| DEFAULT_JVM_OPTS | env var | default: `-Dlog4j.configurationFile=${APP_HOME}/conf/log4j2.xml`
Override default JVM startup options and system properties. | -| JAVA_OPTS | env var | default: `-Xms1g -Xmx1g -XX:+ExitOnOutOfMemoryError`
Add extra JVM options. | -| SERVING_OPTS | env var | default: N/A
Add serving related JVM options.
Some of DJL configuration can only be configured by JVM system properties, user has to set DEFAULT_JVM_OPTS environment variable to configure them.
- `-Dai.djl.pytorch.num_interop_threads=2`, this will override interop threads for PyTorch
- `-Dai.djl.pytorch.num_threads=2`, this will override OMP_NUM_THREADS for PyTorch
- `-Dai.djl.logging.level=debug` change DJL loggging level | - -## Model specific settings - -You set per model settings by adding a [serving.properties](modes.md#servingproperties) file in the root of your model directory (or .zip). -Some of the options can be override by environment variable with `OPTION_` prefix, for example: - -``` -# to enable rolling batch with only environment variable: -export OPTION_ROLLING_BATCH=auto -``` - -You can set number of workers for each model: -https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/test/resources/identity/serving.properties#L4-L8 - -For example, set minimum workers and maximum workers for your model: - -``` -minWorkers=32 -maxWorkers=64 -``` - -Or you can configure minimum workers and maximum workers differently for GPU and CPU: - -``` -gpu.minWorkers=2 -gpu.maxWorkers=3 -cpu.minWorkers=2 -cpu.maxWorkers=4 -``` - -job queue size, batch size, max batch delay, max worker idle time can be configured at -per model level, this will override global settings: - -``` -job_queue_size=10 -batch_size=2 -max_batch_delay=1 -max_idle_time=120 -``` - -You can configure which device to load the model on, default is *: - -``` -load_on_devices=gpu4;gpu5 -# or simply: -load_on_devices=4;5 -``` - -### Python (DeepSpeed) - -For Python (DeepSpeed) engine, DJL load multiple workers sequentially by default to avoid run -out of memory. You can reduced model loading time by parallel loading workers if you know the -peak memory won’t cause out of memory: - -``` -# Allows to load DeepSpeed workers in parallel -option.parallel_loading=true -# specify tensor parallel degree (number of partitions) -option.tensor_parallel_degree=2 -# specify per model timeout -option.model_loading_timeout=600 -option.predict_timeout=240 -# mark the model as failure after python process crashing 10 times -retry_threshold=0 - -# enable virtual environment -option.enable_venv=true - -# use built-in DeepSpeed handler -option.entryPoint=djl_python.deepspeed -# passing extra options to model.py or built-in handler -option.model_id=gpt2 -option.data_type=fp32 -option.max_new_tokens=50 - -# defines custom environment variables -env=LARGE_TENSOR=1 -# specify the path to the python executable -option.pythonExecutable=/usr/bin/python3 -``` - ## Engine specific settings DJL support 12 deep learning frameworks, each framework has their own settings. Please refer to @@ -229,52 +99,3 @@ The follow table show some engine specific environment variables that is overrid | TF_CPP_MIN_LOG_LEVEL | TensorFlow | default 1 | | MXNET_ENGINE_TYPE | MXNet | this value must be `NaiveEngine` | -## Appendix - -### How to configure logging - -#### Option 1: enable debug log: - -``` -export SERVING_OPTS="-Dai.djl.logging.level=debug" -``` - -#### Option 2: use your log4j2.xml - -``` -export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/MY_CONF/log4j2.xml -``` - -DJLServing provides a few built-in `log4j2-XXX.xml` files in DJLServing containers. 
-Use the following environment variable to print HTTP access log to console: - -``` -export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.23.0/conf/log4j2-access.xml -``` - -Use the following environment variable to print both access log, server metrics and model metrics to console: - -``` -export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.23.0/conf/log4j2-console.xml -``` - -### How to download uncompressed model from S3 -To enable fast model downloading, you can store your model artifacts (weights) in a S3 bucket, and -only keep the model code and metadata in the `model.tar.gz` (.zip) file. DJL can leverage -[s5cmd](https://github.com/peak/s5cmd) to download uncompressed files from S3 with extremely fast -speed. - -To enable `s5cmd` downloading, you can configure `serving.properties` as the following: - -``` -option.model_id=s3://YOUR_BUCKET/... -``` - -### How to resolve python package conflict between models -If you want to deploy multiple python models, but their dependencies has conflict, you can enable -[python virtual environments](https://docs.python.org/3/tutorial/venv.html) for your model: - -``` -option.enable_venv=true -``` - diff --git a/serving/docs/configurations_global.md b/serving/docs/configurations_global.md new file mode 100644 index 000000000..cf62bec7c --- /dev/null +++ b/serving/docs/configurations_global.md @@ -0,0 +1,215 @@ +# Global Configuration + +This covers configurations that are used globally and as part of startup for DJL Serving. + +## Command line parameters + +User can use the following parameters to start djl-serving, those parameters will override default behavior: + +``` +djl-serving -h + +usage: djl-serving [OPTIONS] + -f,--config-file Path to the configuration properties file. + -h,--help Print this help. + -m,--models Models to be loaded at startup. + -s,--model-store Model store location where models can be loaded. +``` + +Details about the models, model-store, and workflows can be found in the equivalent configuration properties. + +## config.properties file + +DJL Serving use a `config.properties` file to store configurations. + +### Configure listening port + +DJL Serving only allows localhost access by default. + +* inference_address: inference API binding address, default: http://127.0.0.1:8080 +* management_address: management API binding address, default: http://127.0.0.1:8081 + +Here are a couple of examples: + +```properties +# bind inference API to all network interfaces with SSL enabled +inference_address=https://0.0.0.0:8443 + +# bind inference API to private network interfaces +inference_address=https://172.16.1.10:8443 +``` + +### Configure initial models and workflows + +**Model Store** + +The `model_store` config property can be used to define a directory where each file/folder in it is a model to be loaded. +It will then attempt to load all of them by default. +Here is an example: + +```properties +model_store=build/models +``` + +**Load Models** + +The `load_models` config property can be used to define a list of models (or workflows) to be loaded. +The list should be defined as a comma separated list of urls to load models from. + +Each model can be defined either as a URL directly or optionally with prepended endpoint data like `[EndpointData]=modelUrl`. +The endpoint is a list of data items separated by commas. 
+The possible variations are:
+
+- `[modelName]`
+- `[modelName:version]`
+- `[modelName:version:engine]`
+- `[modelName:version:engine:deviceNames]`
+
+The version can be an arbitrary string.
+The engine uses the standard DJL `Engine` names.
+
+Possible deviceNames strings include `*` for all devices and a `;`-separated list of device names following the format defined in DJL `Device.fromName`.
+If no device is specified, the DJL default device is used (usually GPU if available, otherwise CPU).
+
+```properties
+load_models=https://resources.djl.ai/test-models/mlp.tar.gz,[mlp:v1:MXNet:*]=https://resources.djl.ai/test-models/mlp.tar.gz
+```
+
+**Workflows**
+
+Use the `load_models` config property to define initial workflows that should be loaded on startup.
+
+```properties
+load_models=https://resources.djl.ai/test-models/basic-serving-workflow.json
+```
+
+View the [workflow documentation](workflows.md) for more information about workflows and their configuration format.
+
+### Enable SSL
+
+To enable HTTPS, change the `inference_address` or `management_address`
+protocol from http to https, for example: `inference_address=https://127.0.0.1`.
+This will make DJL Serving listen on localhost port 443 and accept https requests.
+
+You must also provide a certificate and private key to enable SSL. DJL Serving supports two ways to configure SSL:
+
+1. Use a keystore
+    * keystore: Keystore file location. If the keystore contains multiple private key entries, the first one will be picked.
+    * keystore_pass: keystore password; the key password (if applicable) MUST be the same as the keystore password.
+    * keystore_type: type of keystore, default: PKCS12
+
+2. Use private-key/certificate files
+    * private_key_file: private key file location, supports both PKCS8 and OpenSSL private keys.
+    * certificate_file: X509 certificate chain file location.
+
+#### Self-signed certificate example
+
+This is a quick example of enabling SSL with a self-signed certificate.
+
+##### Use java keytool to create a keystore
+
+```bash
+keytool -genkey -keyalg RSA -alias djl -keystore keystore.p12 -storepass changeit -storetype PKCS12 -validity 3600 -keysize 2048 -dname "CN=www.MY_DOMAIN.com, OU=Cloud Service, O=model server, L=Palo Alto, ST=California, C=US"
+```
+
+Configure the following properties in config.properties:
+
+```properties
+inference_address=https://127.0.0.1:8443
+management_address=https://127.0.0.1:8444
+keystore=keystore.p12
+keystore_pass=changeit
+keystore_type=PKCS12
+```
+
+##### Use OpenSSL to create a private key and certificate
+
+```bash
+# generate a private key with the correct length
+openssl genrsa -out private-key.pem 2048
+
+# generate corresponding public key
+openssl rsa -in private-key.pem -pubout -out public-key.pem
+
+# create a self-signed certificate
+openssl req -new -x509 -key private-key.pem -out cert.pem -days 360
+
+# convert pem to pfx/p12 keystore
+openssl pkcs12 -export -inkey private-key.pem -in cert.pem -out keystore.p12
+```
+
+Configure the following properties in config.properties:
+
+```properties
+inference_address=https://127.0.0.1:8443
+management_address=https://127.0.0.1:8444
+keystore=keystore.p12
+keystore_pass=changeit
+keystore_type=PKCS12
+```
+
+## Environment variables
+
+Users can set environment variables to change DJL Serving behavior. The following is a list of
+variables that users can set for DJL Serving:
+
+* JAVA_HOME
+* JAVA_OPTS
+* SERVING_OPTS
+* MODEL_SERVER_HOME
+
+**Note:** environment variables have higher priority than command line arguments or config.properties.
+They will override other property values.
+
+### Global Model Server settings
+
+Global settings are configured at the model server level. Changes to these settings usually require
+a model server restart to take effect.
+
+Most of the model server specific configuration can be configured in the `conf/config.properties` file.
+You can find the configuration keys here:
+[ConfigManager.java](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/main/java/ai/djl/serving/util/ConfigManager.java#L52-L79)
+
+Each configuration key can also be overridden by an environment variable with the `SERVING_` prefix, for example:
+
+```
+export SERVING_JOB_QUEUE_SIZE=1000 # This will override JOB_QUEUE_SIZE in the config
+```
+
+| Key | Type | Description |
+|-------------------|---------|-------------|
+| MODEL_SERVER_HOME | env var | DJLServing home directory, default: Installation directory (e.g. /usr/local/Cellar/djl-serving/0.19.0/) |
+| DEFAULT_JVM_OPTS | env var | default: `-Dlog4j.configurationFile=${APP_HOME}/conf/log4j2.xml`
Override default JVM startup options and system properties. | +| JAVA_OPTS | env var | default: `-Xms1g -Xmx1g -XX:+ExitOnOutOfMemoryError`
Add extra JVM options. | +| SERVING_OPTS | env var | default: N/A
Add serving-related JVM options.
Some DJL configurations can only be set through JVM system properties; users have to set the DEFAULT_JVM_OPTS environment variable to configure them.
- `-Dai.djl.pytorch.num_interop_threads=2` overrides the number of interop threads for PyTorch
- `-Dai.djl.pytorch.num_threads=2` overrides OMP_NUM_THREADS for PyTorch
- `-Dai.djl.logging.level=debug` changes the DJL logging level |
+
+## Appendix
+
+### How to configure logging
+
+#### Option 1: enable debug log:
+
+```
+export SERVING_OPTS="-Dai.djl.logging.level=debug"
+```
+
+#### Option 2: use your log4j2.xml
+
+```
+export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/MY_CONF/log4j2.xml"
+```
+
+DJLServing provides a few built-in `log4j2-XXX.xml` files in DJLServing containers.
+Use the following environment variable to print the HTTP access log to the console:
+
+```
+export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.23.0/conf/log4j2-access.xml"
+```
+
+Use the following environment variable to print the access log, server metrics, and model metrics to the console:
+
+```
+export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.23.0/conf/log4j2-console.xml"
+```
+
diff --git a/serving/docs/configurations_large_model_inference_containers.md b/serving/docs/configurations_large_model_inference_containers.md
index 5d5691889..ff34cc760 100644
--- a/serving/docs/configurations_large_model_inference_containers.md
+++ b/serving/docs/configurations_large_model_inference_containers.md
@@ -1,7 +1,7 @@
# Large Model Inference Containers

-DJL serving is highly configurable. This document tries to capture those configurations
-for [Large Model Inference Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).
+There are a number of shared configurations for Python models running large language models.
+They are also available through the [Large Model Inference Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).

### Common ([doc](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-configuration.html))

@@ -25,6 +25,7 @@ for [Large Model Inference Containers](https://github.com/aws/deep-learning-cont
| option.return_tuple | No | Whether transformer layers need to return a tuple or a tensor. | `false` |
| option.training_mp_size | No | If the model was trained with DeepSpeed, this indicates the tensor parallelism degree with which the model was trained. Can be different than the tensor parallel degree desired for inference. | `2` |
| option.checkpoint | No | Path to DeepSpeed compatible checkpoint file. | `ds_inference_checkpoint.json` |
+| option.parallel_loading | No | Loads multiple workers in parallel (faster but risks running out of memory). | `true` |

### FasterTransformer ([doc](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-configuration.html))

@@ -56,3 +57,31 @@ for [Large Model Inference Containers](https://github.com/aws/deep-learning-cont
|--------------------|----------|----------------------------------------------------------------------------------------------------------------------------------------|-----------------|
| option.n_positions | No | Input sequence length | Default: `128` |
| option.unroll | No | Unroll the model graph for compilation. With `unroll=None` compiler will have more opportunities to do optimizations across the layers | Default: `None` |
+
+## Aliases
+
+DJLServing provides a few aliases for the Python engine to simplify common LLM configurations.
+ +- `engine=DeepSpeed`, equivalent to: + +``` +engine=Python +option.mpi_mode=true +option.entryPoint=djl_python.deepspeed +``` + +- `engine=FasterTransformer`, this is equivalent to: + +``` +engine=Python +option.mpi_mode=true +option.entryPoint=djl_python.fastertransformer +``` + +- `engine=MPI`, this is equivalent to: + +``` +engine=Python +option.mpi_mode=true +option.entryPoint=djl_python.huggingface +``` diff --git a/serving/docs/configurations_model.md b/serving/docs/configurations_model.md new file mode 100644 index 000000000..394a98310 --- /dev/null +++ b/serving/docs/configurations_model.md @@ -0,0 +1,212 @@ +# Model Configuration + +You set per model settings by adding a serving.properties file in the root of your model directory (or .zip). +These apply for all engines and modes. + +An example `serving.properties` can be found [here](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/test/resources/identity/serving.properties). + +## Main properties + +In `serving.properties`, you can set the following properties. Model properties are accessible to `Translator` +and python handler functions. + +- `engine`: Which Engine to use, values include MXNet, PyTorch, TensorFlow, ONNX, PaddlePaddle, DeepSpeed, etc. +- `load_on_devices`: A ; delimited devices list, which the model to be loaded on, default to load on all devices. +- `translatorFactory`: Specify the TranslatorFactory. +- `job_queue_size`: Specify the job queue size at model level, this will override global `job_queue_size`, default is `1000`. +- `batch_size`: the dynamic batch size, default is `1`. +- `max_batch_delay` - the maximum delay for batch aggregation in millis, default value is `100` milliseconds. +- `max_idle_time` - the maximum idle time in seconds before the worker thread is scaled down, default is `60` seconds. +- `log_model_metric`: Enable model metrics (inference, pre-process and post-process latency) logging. +- `metrics_aggregation`: Number of model metrics to aggregate, default is `1000`. +- `minWorkers`: Minimum number of workers, default is `1`. +- `maxWorkers`: Maximum number of workers, default is `#CPU/OMP_NUM_THREAD` for CPU, GPU default is `2`, inferentia default is `2` (PyTorch engine), `1` (Python engine) . +- `gpu.minWorkers`: Minimum number of workers for GPU. +- `gpu.maxWorkers`: Maximum number of workers for GPU. +- `cpu.minWorkers`: Minimum number of workers for CPU. +- `cpu.maxWorkers`: Maximum number of workers for CPU. +- `required_memory_mb`: Specify the required memory (CPU and GPU) in MB to load the model. +- `gpu.required_memory_mb`: Specify the required GPU memory in MB to load the model. +- `reserved_memory_mb`: Reserve memory in MB to avoid system out of memory. +- `gpu.reserved_memory_mb`: Reserve GPU memory in MB to avoid system out of memory. + +## Option Properties + +In `serving.properties`, you can also set options (prefixed with `option`) and properties. +The options will be passed to `Model.load(Path modelPath, String prefix, Map options)` API. +It allows you to set engine specific configurations. 
+Here are some of the available option properties: + +``` +# set model file name prefix if different from folder name +option.modeName=resnet18_v1 + +# PyTorch options +option.mapLocation=true +option.extraFiles=foo.txt,bar.txt + +# ONNXRuntime options +option.interOpNumThreads=2 +option.intraOpNumThreads=2 +option.executionMode=SEQUENTIAL +option.optLevel=BASIC_OPT +option.memoryPatternOptimization=true +option.cpuArenaAllocator=true +option.disablePerSessionThreads=true +option.customOpLibrary=myops.so +option.disablePerSessionThreads=true +option.ortDevice=TensorRT/ROCM/CoreML + +# Python model options +retry_threshold=10 # Mark model as failure after python process crashing 10 times +option.pythonExecutable=python3 +option.entryPoint=deepspeed.py +option.handler=hanlde +option.predict_timeout=120 +option.model_loading_timeout=10 +option.parallel_loading=true +option.tensor_parallel_degree=2 +option.enable_venv=true +option.rolling_batch=auto +#option.rolling_batch=lmi-dist +option.max_rolling_batch_size=64 +option.paged_attention=false +option.max_rolling_batch_prefill_tokens=1088 +``` + +Most of the options can also be overriden by an environment variable with the `OPTION_` prefix and all caps. +For example: + +``` +# to enable rolling batch with only environment variable: +export OPTION_ROLLING_BATCH=auto +``` + +## Basic Model Configurations + +You can set number of workers for each model: +https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/test/resources/identity/serving.properties#L4-L8 + +For example, set minimum workers and maximum workers for your model: + +``` +minWorkers=32 +maxWorkers=64 +``` + +Or you can configure minimum workers and maximum workers differently for GPU and CPU: + +``` +gpu.minWorkers=2 +gpu.maxWorkers=3 +cpu.minWorkers=2 +cpu.maxWorkers=4 +``` + +job queue size, batch size, max batch delay, max worker idle time can be configured at +per model level, this will override global settings: + +``` +job_queue_size=10 +batch_size=2 +max_batch_delay=1 +max_idle_time=120 +``` + +You can configure which device to load the model on, default is *: + +``` +load_on_devices=gpu4;gpu5 +# or simply: +load_on_devices=4;5 +``` + +## Python model configuration + +#### number of workers + +For Python engine, we recommend set `minWorkers` and `maxWorkers` to be the same since python +worker scale up and down is expensive. + +You may also need to consider `OMP_NUM_THREAD` when setting number workers. `OMP_NUM_THREAD` is default +to `1`, you can unset `OMP_NUM_THREAD` by setting `NO_OMP_NUM_THREADS=true`. If `OMP_NUM_THREAD` is unset, +the `maxWorkers` will be default to 2 (larger `maxWorkers` with non 1 `OMP_NUM_THREAD` can cause thread +contention, and reduce throughput). + +Set minimum workers and maximum workers for your model: + +``` +minWorkers=32 +maxWorkers=64 +# idle time in seconds before the worker thread is scaled down +max_idle_time=120 +``` + +Or set minimum workers and maximum workers differently for GPU and CPU: + +``` +gpu.minWorkers=2 +gpu.maxWorkers=3 +cpu.minWorkers=2 +cpu.maxWorkers=4 +``` + +**Note**: Loading model in Python mode is pretty heavy. We recommend to set `minWorker` and `maxWorker` to be the same value to avoid unnecessary load and unload. 
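+
+#### custom entry point
+
+The `option.entryPoint` and `option.handler` options above point at the Python code that serves the model.
+As a rough, illustrative sketch (see the [Python mode documentation](modes.md) for the authoritative description), a custom entry point is a `model.py` that exposes a `handle` function built on the `djl_python` `Input`/`Output` classes:
+
+```python
+# model.py -- minimal illustrative handler sketch following the djl_python convention
+from djl_python import Input, Output
+
+
+def handle(inputs: Input) -> Output:
+    if inputs.is_empty():
+        # An empty request is typically a warm-up/initialization call
+        return None
+    data = inputs.get_as_json()
+    # A real handler would run inference here; this sketch just echoes the payload
+    return Output().add_as_json({"received": data})
+```
+
+The built-in handlers referenced by `option.entryPoint` (for example `djl_python.huggingface`) follow the same convention.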
+ + +#### job queue size +Or override global `job_queue_size`: + +``` +job_queue_size=10 +``` + +#### dynamic batching +To enable dynamic batching: + +``` +batch_size=2 +max_batch_delay=1 +``` + +#### rolling batch +To enable rolling batch for Python engine: + +``` +# lmi-dist and vllm requires running mpi mode +engine=MPI +option.rolling_batch=auto +# use FlashAttention +#option.rolling_batch=lmi-dist +#option.rolling_batch=scheduler +option.max_rolling_batch_size=64 + +# increase max_rolling_batch_prefill_tokens for long sequence +option.max_rolling_batch_prefill_tokens=1088 + +# disable PagedAttention if run into OOM +option.paged_attention=false +``` + +## Appendix + +### How to download uncompressed model from S3 +To enable fast model downloading, you can store your model artifacts (weights) in a S3 bucket, and +only keep the model code and metadata in the `model.tar.gz` (.zip) file. DJL can leverage +[s5cmd](https://github.com/peak/s5cmd) to download uncompressed files from S3 with extremely fast +speed. + +To enable `s5cmd` downloading, you can configure `serving.properties` as the following: + +``` +option.model_id=s3://YOUR_BUCKET/... +``` + +### How to resolve python package conflict between models +If you want to deploy multiple python models, but their dependencies has conflict, you can enable +[python virtual environments](https://docs.python.org/3/tutorial/venv.html) for your model: + +``` +option.enable_venv=true +``` + diff --git a/serving/docs/modes.md b/serving/docs/modes.md index 5227eeb1d..9a9eb4280 100644 --- a/serving/docs/modes.md +++ b/serving/docs/modes.md @@ -8,141 +8,7 @@ DJL Serving is a high-performance serving system for deep learning models. DJL S 2. [Java Mode](#java-mode) 3. [Binary Mode](#binary-mode) -### serving.properties - -In addition to the mode specific files, the `serving.properties` is a configuration file that can be used in all modes. -Place `serving.properties` in the same directory with your model file to specify configuration for each model. - -In `serving.properties`, you can set options (prefixed with `option`) and properties. The options -will be passed to `Model.load(Path modelPath, String prefix, Map options)` API. It allows -you set engine specific configurations, for example: - -``` -# set model file name prefix if different from folder name -option.modeName=resnet18_v1 - -# PyTorch options -option.mapLocation=true -option.extraFiles=foo.txt,bar.txt - -# ONNXRuntime options -option.interOpNumThreads=2 -option.intraOpNumThreads=2 -option.executionMode=SEQUENTIAL -option.optLevel=BASIC_OPT -option.memoryPatternOptimization=true -option.cpuArenaAllocator=true -option.disablePerSessionThreads=true -option.customOpLibrary=myops.so -option.disablePerSessionThreads=true -option.ortDevice=TensorRT/ROCM/CoreML - -# Python model options -option.pythonExecutable=python3 -option.entryPoint=deepspeed.py -option.handler=hanlde -option.predict_timeout=120 -option.model_loading_timeout=10 -option.parallel_loading=true -option.tensor_parallel_degree=2 -option.enable_venv=true -option.rolling_batch=auto -#option.rolling_batch=lmi-dist -option.max_rolling_batch_size=64 -option.paged_attention=false -option.max_rolling_batch_prefill_tokens=1088 -``` - -In `serving.properties`, you can set the following properties. Model properties are accessible to `Translator` -and python handler functions. - -- `engine`: Which Engine to use, values include MXNet, PyTorch, TensorFlow, ONNX, PaddlePaddle, DeepSpeed, etc. 
-- `load_on_devices`: A ; delimited devices list, which the model to be loaded on, default to load on all devices. -- `translatorFactory`: Specify the TranslatorFactory. -- `job_queue_size`: Specify the job queue size at model level, this will override global `job_queue_size`, default is `1000`. -- `batch_size`: the dynamic batch size, default is `1`. -- `max_batch_delay` - the maximum delay for batch aggregation in millis, default value is `100` milliseconds. -- `max_idle_time` - the maximum idle time in seconds before the worker thread is scaled down, default is `60` seconds. -- `log_model_metric`: Enable model metrics (inference, pre-process and post-process latency) logging. -- `metrics_aggregation`: Number of model metrics to aggregate, default is `1000`. -- `minWorkers`: Minimum number of workers, default is `1`. -- `maxWorkers`: Maximum number of workers, default is `#CPU/OMP_NUM_THREAD` for CPU, GPU default is `2`, inferentia default is `2` (PyTorch engine), `1` (Python engine) . -- `gpu.minWorkers`: Minimum number of workers for GPU. -- `gpu.maxWorkers`: Maximum number of workers for GPU. -- `cpu.minWorkers`: Minimum number of workers for CPU. -- `cpu.maxWorkers`: Maximum number of workers for CPU. -- `required_memory_mb`: Specify the required memory (CPU and GPU) in MB to load the model. -- `gpu.required_memory_mb`: Specify the required GPU memory in MB to load the model. -- `reserved_memory_mb`: Reserve memory in MB to avoid system out of memory. -- `gpu.reserved_memory_mb`: Reserve GPU memory in MB to avoid system out of memory. - - -#### number of workers -For Python engine, we recommend set `minWorkers` and `maxWorkers` to be the same since python -worker scale up and down is expensive. - -You may also need to consider `OMP_NUM_THREAD` when setting number workers. `OMP_NUM_THREAD` is default -to `1`, you can unset `OMP_NUM_THREAD` by setting `NO_OMP_NUM_THREADS=true`. If `OMP_NUM_THREAD` is unset, -the `maxWorkers` will be default to 2 (larger `maxWorkers` with non 1 `OMP_NUM_THREAD` can cause thread -contention, and reduce throughput). - -Set minimum workers and maximum workers for your model: - -``` -minWorkers=32 -maxWorkers=64 -# idle time in seconds before the worker thread is scaled down -max_idle_time=120 -``` - -Or set minimum workers and maximum workers differently for GPU and CPU: - -``` -gpu.minWorkers=2 -gpu.maxWorkers=3 -cpu.minWorkers=2 -cpu.maxWorkers=4 -``` - -**Note**: Loading model in Python mode is pretty heavy. We recommend to set `minWorker` and `maxWorker` to be the same value to avoid unnecessary load and unload. - - -#### job queue size -Or override global `job_queue_size`: - -``` -job_queue_size=10 -``` - -#### dynamic batching -To enable dynamic batching: - -``` -batch_size=2 -max_batch_delay=1 -``` - -#### rolling batch -To enable rolling batch for Python engine: - -``` -# lmi-dist and vllm requires running mpi mode -engine=MPI -option.rolling_batch=auto -# use FlashAttention -#option.rolling_batch=lmi-dist -#option.rolling_batch=scheduler -option.max_rolling_batch_size=64 - -# increase max_rolling_batch_prefill_tokens for long sequence -option.max_rolling_batch_prefill_tokens=1088 - -# disable PagedAttention if run into OOM -option.paged_attention=false -``` - - -An example `serving.properties` can be found [here](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/test/resources/identity/serving.properties). +Also see the options for [model configurations](configurations_model.md). ## Python Mode