diff --git a/serving/docs/configuration.md b/serving/docs/configuration.md index ed5f0b285..4c3d8e318 100644 --- a/serving/docs/configuration.md +++ b/serving/docs/configuration.md @@ -1,162 +1,47 @@ -# DJLServing startup configuration +# DJL Serving Configuration -## Environment variables +DJL Serving is a multi-layer system and has many different forms of configuration across those layers. -User can set environment variables to change DJL Serving behavior, following is a list of -variables that user can set for DJL Serving: +## Global -* JAVA_HOME -* JAVA_OPTS -* SERVING_OPTS -* MODEL_SERVER_HOME +At the beginning, there are [global configurations](configurations_global.md). +These configurations are passed through startup arguments, the config file, and environment variables. -**Note:** environment variable has higher priority that command line or config.properties. -It will override other property values. +As part of the startup, you are able to specify several different categories of options: -**Note:** For tunable parameters for Large Language Models please refer to [this](configurations_large_model_inference_containers.md) guide. +- Global Java settings with environment variables like `$JAVA_HOME` and `$JAVA_OPTS`. +- Loading behavior with the `model_store` and what models to load on startup +- Network settings such as the port and SSL -## Command line parameters +## Engine -User can use the following parameters to start djl-serving, those parameters will override default behavior: +DJL Serving is powered by [DeepJavaLibrary](djl.ai) and most of the functionality exists through the use of [DJL engines](http://docs.djl.ai/docs/engine.html). +As part of this, many of the engines along with DJL itself can be configured through the use of environment variables and system properties. -``` -djl-serving -h +The [engine configuration](configurations.md) document lists these configurations. +These include both the ones global to DJL as well as lists for each engine. +There are configurations for paths, versions, performance, settings, and debugging. +All engine configurations are shared between all models and workers using that engine. -usage: djl-serving [OPTIONS] - -f,--config-file Path to the configuration properties file. - -h,--help Print this help. - -m,--models Models to be loaded at startup. - -s,--model-store Model store location where models can be loaded. -``` +## Workflow -Details about the models, model-store, and workflows can be found in the equivalent configuration properties. +Next, you are able to add and configure a [Workflow](workflows.md). +DJL Serving has a custom solution for handling workflows that is configured through a `workflow.json` or `workflow.yml` file. -## config.properties file +## Model -DJL Serving use a `config.properties` file to store configurations. +Next, it is possible to specify [model configuration](configurations_model.md). +This is mostly done by using a `serving.properties` file, although there are environment variables that can be used as well. -### Configure listening port +These configurations are also optional. +If no `serving.properties` is provided, some basic properties such as which engine to use will be inferred. +The rest will back back to the global defaults. -DJL Serving only allows localhost access by default. 
+## Application -* inference_address: inference API binding address, default: http://127.0.0.1:8080 -* management_address: management API binding address, default: http://127.0.0.1:8081 +Alongside the configurations that determine how DJL Serving runs the model, there are also options that can be passed into the model itself. +The primary way is through the [DJL Model](https://javadoc.io/doc/ai.djl/api/latest/ai/djl/Model.html) properties or [DJL Criteria](https://javadoc.io/doc/ai.djl/api/latest/ai/djl/repository/zoo/Criteria.html) arguments. +These settings are ultimately dependent on the individual model. +But, here are some documented applications that have additional configurations: -Here are a couple of examples: - -```properties -# bind inference API to all network interfaces with SSL enabled -inference_address=https://0.0.0.0:8443 - -# bind inference API to private network interfaces -inference_address=https://172.16.1.10:8443 -``` - -### Configure initial models and workflows - -**Model Store** - -The `model_store` config property can be used to define a directory where each file/folder in it is a model to be loaded. -It will then attempt to load all of them by default. -Here is an example: - -```properties -model_store=build/models -``` - -**Load Models** - -The `load_models` config property can be used to define a list of models (or workflows) to be loaded. -The list should be defined as a comma separated list of urls to load models from. - -Each model can be defined either as a URL directly or optionally with prepended endpoint data like `[EndpointData]=modelUrl`. -The endpoint is a list of data items separated by commas. -The possible variations are: - -- `[modelName]` -- `[modelName:version]` -- `[modelName:version:engine]` -- `[modelName:version:engine:deviceNames]` - -The version can be an arbitrary string. -The engines uses the standard DJL `Engine` names. - -Possible deviceNames strings include `*` for all devices and a `;` separated list of device names following the format defined in DJL `Device.fromName`. -If no device is specified, it will use the DJL default device (usually GPU if available else CPU). - -```properties -load_models=https://resources.djl.ai/test-models/mlp.tar.gz,[mlp:v1:MXNet:*]=https://resources.djl.ai/test-models/mlp.tar.gz -``` - -**Workflows** - -Use the `load_models` config property to define initial workflows that should be loaded on startup. - -```properties -load_models=https://resources.djl.ai/test-models/basic-serving-workflow.json -``` - -View the [workflow documentation](workflows.md) to see more information about workflows and their configuration format. - -### Enable SSL - -For users who want to enable HTTPs, you can change `inference_address` or `management_addrss` -protocol from http to https, for example: `inference_addrss=https://127.0.0.1`. -This will make DJL Serving listen on localhost 443 port to accepting https request. - -User also must provide certificate and private keys to enable SSL. DJL Serving support two ways to configure SSL: - -1. Use keystore - * keystore: Keystore file location, if multiple private key entry in the keystore, first one will be picked. - * keystore_pass: keystore password, key password (if applicable) MUST be the same as keystore password. - * keystore_type: type of keystore, default: PKCS12 - -2. Use private-key/certificate files - * private_key_file: private key file location, support both PKCS8 and OpenSSL private key. - * certificate_file: X509 certificate chain file location. 
- -#### Self-signed certificate example - -This is a quick example to enable SSL with self-signed certificate - -##### User java keytool to create keystore - -```bash -keytool -genkey -keyalg RSA -alias djl -keystore keystore.p12 -storepass changeit -storetype PKCS12 -validity 3600 -keysize 2048 -dname "CN=www.MY_DOMSON.com, OU=Cloud Service, O=model server, L=Palo Alto, ST=California, C=US" -``` - - Config following property in config.properties: - -```properties -inference_address=https://127.0.0.1:8443 -management_address=https://127.0.0.1:8444 -keystore=keystore.p12 -keystore_pass=changeit -keystore_type=PKCS12 -``` - -##### User OpenSSL to create private key and certificate - -```bash -# generate a private key with the correct length -openssl genrsa -out private-key.pem 2048 - -# generate corresponding public key -openssl rsa -in private-key.pem -pubout -out public-key.pem - -# create a self-signed certificate -openssl req -new -x509 -key private-key.pem -out cert.pem -days 360 - -# convert pem to pfx/p12 keystore -openssl pkcs12 -export -inkey private-key.pem -in cert.pem -out keystore.p12 -``` - - Config following property in config.properties: - -```properties -inference_address=https://127.0.0.1:8443 -management_address=https://127.0.0.1:8444 -keystore=keystore.p12 -keystore_pass=changeit -keystore_type=PKCS12 -``` +- [Large Language Model Configurations](configurations_large_model_inference_containers.md) diff --git a/serving/docs/configurations.md b/serving/docs/configurations.md index c044d1669..e464df085 100644 --- a/serving/docs/configurations.md +++ b/serving/docs/configurations.md @@ -1,8 +1,6 @@ -# All DJL configuration options +# Engine Configuration -DJL serving is highly configurable. This document tries to capture those configurations in a single document. - -**Note:** For tunable parameters for Large Language Models please refer to [this](configurations_large_model_inference_containers.md) guide. +This covers the available configurations for DJL and engines. ## DJL settings @@ -83,134 +81,6 @@ DJLServing build on top of Deep Java Library (DJL). Here is a list of settings f | ai.djl.python.disable_alternative | system prop | Disable alternative engine | | TENSOR_PARALLEL_DEGREE | env var | Set tensor parallel degree.
For mpi mode, the default is the number of accelerators.
Use "max" for non-mpi mode to use all GPUs for tensor parallel. | -DJLServing provides a few alias for Python engine to make it easy for common LLM configurations. - -- `engine=DeepSpeed`, equivalent to: - -``` -engine=Python -option.mpi_mode=true -option.entryPoint=djl_python.deepspeed -``` - -- `engine=FasterTransformer`, this is equivalent to: - -``` -engine=Python -option.mpi_mode=true -option.entryPoint=djl_python.fastertransformer -``` - -- `engine=MPI`, this is equivalent to: - -``` -engine=Python -option.mpi_mode=true -option.entryPoint=djl_python.huggingface -``` - -## Global Model Server settings - -Global settings are configured at model server level. Change to these settings usually requires -restart model server to take effect. - -Most of the model server specific configuration can be configured in `conf/config.properties` file. -You can find the configuration keys here: -[ConfigManager.java](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/main/java/ai/djl/serving/util/ConfigManager.java#L52-L79) - -Each configuration key can also be override by environment variable with `SERVING_` prefix, for example: - -``` -export SERVING_JOB_QUEUE_SIZE=1000 # This will override JOB_QUEUE_SIZE in the config -``` - -| Key | Type | Description | -|-------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| MODEL_SERVER_HOME | env var | DJLServing home directory, default: Installation directory (e.g. /usr/local/Cellar/djl-serving/0.19.0/) | -| DEFAULT_JVM_OPTS | env var | default: `-Dlog4j.configurationFile=${APP_HOME}/conf/log4j2.xml`
Override default JVM startup options and system properties. | -| JAVA_OPTS | env var | default: `-Xms1g -Xmx1g -XX:+ExitOnOutOfMemoryError`
Add extra JVM options. | -| SERVING_OPTS | env var | default: N/A
Add serving related JVM options.
Some of DJL configuration can only be configured by JVM system properties, user has to set DEFAULT_JVM_OPTS environment variable to configure them.
- `-Dai.djl.pytorch.num_interop_threads=2`, this will override interop threads for PyTorch
- `-Dai.djl.pytorch.num_threads=2`, this will override OMP_NUM_THREADS for PyTorch
- `-Dai.djl.logging.level=debug` change DJL loggging level | - -## Model specific settings - -You set per model settings by adding a [serving.properties](modes.md#servingproperties) file in the root of your model directory (or .zip). -Some of the options can be override by environment variable with `OPTION_` prefix, for example: - -``` -# to enable rolling batch with only environment variable: -export OPTION_ROLLING_BATCH=auto -``` - -You can set number of workers for each model: -https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/test/resources/identity/serving.properties#L4-L8 - -For example, set minimum workers and maximum workers for your model: - -``` -minWorkers=32 -maxWorkers=64 -``` - -Or you can configure minimum workers and maximum workers differently for GPU and CPU: - -``` -gpu.minWorkers=2 -gpu.maxWorkers=3 -cpu.minWorkers=2 -cpu.maxWorkers=4 -``` - -job queue size, batch size, max batch delay, max worker idle time can be configured at -per model level, this will override global settings: - -``` -job_queue_size=10 -batch_size=2 -max_batch_delay=1 -max_idle_time=120 -``` - -You can configure which device to load the model on, default is *: - -``` -load_on_devices=gpu4;gpu5 -# or simply: -load_on_devices=4;5 -``` - -### Python (DeepSpeed) - -For Python (DeepSpeed) engine, DJL load multiple workers sequentially by default to avoid run -out of memory. You can reduced model loading time by parallel loading workers if you know the -peak memory won’t cause out of memory: - -``` -# Allows to load DeepSpeed workers in parallel -option.parallel_loading=true -# specify tensor parallel degree (number of partitions) -option.tensor_parallel_degree=2 -# specify per model timeout -option.model_loading_timeout=600 -option.predict_timeout=240 -# mark the model as failure after python process crashing 10 times -retry_threshold=0 - -# enable virtual environment -option.enable_venv=true - -# use built-in DeepSpeed handler -option.entryPoint=djl_python.deepspeed -# passing extra options to model.py or built-in handler -option.model_id=gpt2 -option.data_type=fp32 -option.max_new_tokens=50 - -# defines custom environment variables -env=LARGE_TENSOR=1 -# specify the path to the python executable -option.pythonExecutable=/usr/bin/python3 -``` - ## Engine specific settings DJL support 12 deep learning frameworks, each framework has their own settings. Please refer to @@ -229,52 +99,3 @@ The follow table show some engine specific environment variables that is overrid | TF_CPP_MIN_LOG_LEVEL | TensorFlow | default 1 | | MXNET_ENGINE_TYPE | MXNet | this value must be `NaiveEngine` | -## Appendix - -### How to configure logging - -#### Option 1: enable debug log: - -``` -export SERVING_OPTS="-Dai.djl.logging.level=debug" -``` - -#### Option 2: use your log4j2.xml - -``` -export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/MY_CONF/log4j2.xml -``` - -DJLServing provides a few built-in `log4j2-XXX.xml` files in DJLServing containers. 
-Use the following environment variable to print HTTP access log to console: - -``` -export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.23.0/conf/log4j2-access.xml -``` - -Use the following environment variable to print both access log, server metrics and model metrics to console: - -``` -export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.23.0/conf/log4j2-console.xml -``` - -### How to download uncompressed model from S3 -To enable fast model downloading, you can store your model artifacts (weights) in a S3 bucket, and -only keep the model code and metadata in the `model.tar.gz` (.zip) file. DJL can leverage -[s5cmd](https://github.com/peak/s5cmd) to download uncompressed files from S3 with extremely fast -speed. - -To enable `s5cmd` downloading, you can configure `serving.properties` as the following: - -``` -option.model_id=s3://YOUR_BUCKET/... -``` - -### How to resolve python package conflict between models -If you want to deploy multiple python models, but their dependencies has conflict, you can enable -[python virtual environments](https://docs.python.org/3/tutorial/venv.html) for your model: - -``` -option.enable_venv=true -``` - diff --git a/serving/docs/configurations_global.md b/serving/docs/configurations_global.md new file mode 100644 index 000000000..cf62bec7c --- /dev/null +++ b/serving/docs/configurations_global.md @@ -0,0 +1,215 @@ +# Global Configuration + +This covers configurations that are used globally and as part of startup for DJL Serving. + +## Command line parameters + +User can use the following parameters to start djl-serving, those parameters will override default behavior: + +``` +djl-serving -h + +usage: djl-serving [OPTIONS] + -f,--config-file Path to the configuration properties file. + -h,--help Print this help. + -m,--models Models to be loaded at startup. + -s,--model-store Model store location where models can be loaded. +``` + +Details about the models, model-store, and workflows can be found in the equivalent configuration properties. + +## config.properties file + +DJL Serving use a `config.properties` file to store configurations. + +### Configure listening port + +DJL Serving only allows localhost access by default. + +* inference_address: inference API binding address, default: http://127.0.0.1:8080 +* management_address: management API binding address, default: http://127.0.0.1:8081 + +Here are a couple of examples: + +```properties +# bind inference API to all network interfaces with SSL enabled +inference_address=https://0.0.0.0:8443 + +# bind inference API to private network interfaces +inference_address=https://172.16.1.10:8443 +``` + +### Configure initial models and workflows + +**Model Store** + +The `model_store` config property can be used to define a directory where each file/folder in it is a model to be loaded. +It will then attempt to load all of them by default. +Here is an example: + +```properties +model_store=build/models +``` + +**Load Models** + +The `load_models` config property can be used to define a list of models (or workflows) to be loaded. +The list should be defined as a comma separated list of urls to load models from. + +Each model can be defined either as a URL directly or optionally with prepended endpoint data like `[EndpointData]=modelUrl`. +The endpoint is a list of data items separated by commas. 
+The possible variations are:
+
+- `[modelName]`
+- `[modelName:version]`
+- `[modelName:version:engine]`
+- `[modelName:version:engine:deviceNames]`
+
+The version can be an arbitrary string.
+The engine uses the standard DJL `Engine` names.
+
+Possible deviceNames strings include `*` for all devices and a `;`-separated list of device names following the format defined in DJL `Device.fromName`.
+If no device is specified, the DJL default device is used (usually GPU if available, otherwise CPU).
+
+```properties
+load_models=https://resources.djl.ai/test-models/mlp.tar.gz,[mlp:v1:MXNet:*]=https://resources.djl.ai/test-models/mlp.tar.gz
+```
+
+**Workflows**
+
+Use the `load_models` config property to define initial workflows that should be loaded on startup.
+
+```properties
+load_models=https://resources.djl.ai/test-models/basic-serving-workflow.json
+```
+
+View the [workflow documentation](workflows.md) for more information about workflows and their configuration format.
+
+### Enable SSL
+
+To enable HTTPS, change the `inference_address` or `management_address`
+protocol from http to https, for example: `inference_address=https://127.0.0.1`.
+This will make DJL Serving listen on localhost port 443 and accept https requests.
+
+You must also provide a certificate and private key to enable SSL. DJL Serving supports two ways to configure SSL:
+
+1. Use a keystore
+    * keystore: Keystore file location. If the keystore contains multiple private key entries, the first one will be picked.
+    * keystore_pass: keystore password; the key password (if applicable) MUST be the same as the keystore password.
+    * keystore_type: type of keystore, default: PKCS12
+
+2. Use private-key/certificate files
+    * private_key_file: private key file location, supports both PKCS8 and OpenSSL private keys.
+    * certificate_file: X509 certificate chain file location.
+
+#### Self-signed certificate example
+
+This is a quick example of enabling SSL with a self-signed certificate.
+
+##### Use java keytool to create a keystore
+
+```bash
+keytool -genkey -keyalg RSA -alias djl -keystore keystore.p12 -storepass changeit -storetype PKCS12 -validity 3600 -keysize 2048 -dname "CN=www.MY_DOMAIN.com, OU=Cloud Service, O=model server, L=Palo Alto, ST=California, C=US"
+```
+
+Configure the following properties in config.properties:
+
+```properties
+inference_address=https://127.0.0.1:8443
+management_address=https://127.0.0.1:8444
+keystore=keystore.p12
+keystore_pass=changeit
+keystore_type=PKCS12
+```
+
+##### Use OpenSSL to create a private key and certificate
+
+```bash
+# generate a private key with the correct length
+openssl genrsa -out private-key.pem 2048
+
+# generate corresponding public key
+openssl rsa -in private-key.pem -pubout -out public-key.pem
+
+# create a self-signed certificate
+openssl req -new -x509 -key private-key.pem -out cert.pem -days 360
+
+# convert pem to pfx/p12 keystore
+openssl pkcs12 -export -inkey private-key.pem -in cert.pem -out keystore.p12
+```
+
+Configure the following properties in config.properties:
+
+```properties
+inference_address=https://127.0.0.1:8443
+management_address=https://127.0.0.1:8444
+keystore=keystore.p12
+keystore_pass=changeit
+keystore_type=PKCS12
+```
+
+## Environment variables
+
+Users can set environment variables to change DJL Serving behavior. The following is a list of
+variables that users can set for DJL Serving:
+
+* JAVA_HOME
+* JAVA_OPTS
+* SERVING_OPTS
+* MODEL_SERVER_HOME
+
+**Note:** environment variables have higher priority than command line arguments or config.properties.
+They will override other property values.
+
+### Global Model Server settings
+
+Global settings are configured at the model server level. Changes to these settings usually require
+a model server restart to take effect.
+
+Most of the model server specific configuration can be configured in the `conf/config.properties` file.
+You can find the configuration keys here:
+[ConfigManager.java](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/main/java/ai/djl/serving/util/ConfigManager.java#L52-L79)
+
+Each configuration key can also be overridden by an environment variable with the `SERVING_` prefix, for example:
+
+```
+export SERVING_JOB_QUEUE_SIZE=1000 # This will override JOB_QUEUE_SIZE in the config
+```
+
+| Key | Type | Description |
+|-------------------|---------|-------------|
+| MODEL_SERVER_HOME | env var | DJLServing home directory, default: Installation directory (e.g. /usr/local/Cellar/djl-serving/0.19.0/) |
+| DEFAULT_JVM_OPTS | env var | default: `-Dlog4j.configurationFile=${APP_HOME}/conf/log4j2.xml`
Override default JVM startup options and system properties. | +| JAVA_OPTS | env var | default: `-Xms1g -Xmx1g -XX:+ExitOnOutOfMemoryError`
Add extra JVM options. | +| SERVING_OPTS | env var | default: N/A
Add serving-related JVM options.
Some DJL configurations can only be set through JVM system properties; users have to set the DEFAULT_JVM_OPTS environment variable to configure them.
- `-Dai.djl.pytorch.num_interop_threads=2` overrides the number of interop threads for PyTorch
- `-Dai.djl.pytorch.num_threads=2` overrides OMP_NUM_THREADS for PyTorch
- `-Dai.djl.logging.level=debug` changes the DJL logging level |
+
+## Appendix
+
+### How to configure logging
+
+#### Option 1: enable debug log:
+
+```
+export SERVING_OPTS="-Dai.djl.logging.level=debug"
+```
+
+#### Option 2: use your log4j2.xml
+
+```
+export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/MY_CONF/log4j2.xml"
+```
+
+DJLServing provides a few built-in `log4j2-XXX.xml` files in DJLServing containers.
+Use the following environment variable to print the HTTP access log to the console:
+
+```
+export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.23.0/conf/log4j2-access.xml"
+```
+
+Use the following environment variable to print the access log, server metrics, and model metrics to the console:
+
+```
+export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.23.0/conf/log4j2-console.xml"
+```
+
diff --git a/serving/docs/configurations_large_model_inference_containers.md b/serving/docs/configurations_large_model_inference_containers.md
index 5d5691889..ff34cc760 100644
--- a/serving/docs/configurations_large_model_inference_containers.md
+++ b/serving/docs/configurations_large_model_inference_containers.md
@@ -1,7 +1,7 @@
# Large Model Inference Containers

-DJL serving is highly configurable. This document tries to capture those configurations
-for [Large Model Inference Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).
+There are a number of shared configurations for Python models running large language models.
+They are also available through the [Large Model Inference Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).

### Common ([doc](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-configuration.html))

@@ -25,6 +25,7 @@ for [Large Model Inference Containers](https://github.com/aws/deep-learning-cont
| option.return_tuple | No | Whether transformer layers need to return a tuple or a tensor. | `false` |
| option.training_mp_size | No | If the model was trained with DeepSpeed, this indicates the tensor parallelism degree with which the model was trained. Can be different than the tensor parallel degree desired for inference. | `2` |
| option.checkpoint | No | Path to DeepSpeed compatible checkpoint file. | `ds_inference_checkpoint.json` |
+| option.parallel_loading | No | Loads multiple workers in parallel (faster but risks running out of memory). | `true` |

### FasterTransformer ([doc](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-configuration.html))

@@ -56,3 +57,31 @@ for [Large Model Inference Containers](https://github.com/aws/deep-learning-cont
|--------------------|----------|----------------------------------------------------------------------------------------------------------------------------------------|-----------------|
| option.n_positions | No | Input sequence length | Default: `128` |
| option.unroll | No | Unroll the model graph for compilation. With `unroll=None` compiler will have more opportunities to do optimizations across the layers | Default: `None` |
+
+## Aliases
+
+DJLServing provides a few aliases for the Python engine to simplify common LLM configurations.
+ +- `engine=DeepSpeed`, equivalent to: + +``` +engine=Python +option.mpi_mode=true +option.entryPoint=djl_python.deepspeed +``` + +- `engine=FasterTransformer`, this is equivalent to: + +``` +engine=Python +option.mpi_mode=true +option.entryPoint=djl_python.fastertransformer +``` + +- `engine=MPI`, this is equivalent to: + +``` +engine=Python +option.mpi_mode=true +option.entryPoint=djl_python.huggingface +``` diff --git a/serving/docs/configurations_model.md b/serving/docs/configurations_model.md new file mode 100644 index 000000000..394a98310 --- /dev/null +++ b/serving/docs/configurations_model.md @@ -0,0 +1,212 @@ +# Model Configuration + +You set per model settings by adding a serving.properties file in the root of your model directory (or .zip). +These apply for all engines and modes. + +An example `serving.properties` can be found [here](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/test/resources/identity/serving.properties). + +## Main properties + +In `serving.properties`, you can set the following properties. Model properties are accessible to `Translator` +and python handler functions. + +- `engine`: Which Engine to use, values include MXNet, PyTorch, TensorFlow, ONNX, PaddlePaddle, DeepSpeed, etc. +- `load_on_devices`: A ; delimited devices list, which the model to be loaded on, default to load on all devices. +- `translatorFactory`: Specify the TranslatorFactory. +- `job_queue_size`: Specify the job queue size at model level, this will override global `job_queue_size`, default is `1000`. +- `batch_size`: the dynamic batch size, default is `1`. +- `max_batch_delay` - the maximum delay for batch aggregation in millis, default value is `100` milliseconds. +- `max_idle_time` - the maximum idle time in seconds before the worker thread is scaled down, default is `60` seconds. +- `log_model_metric`: Enable model metrics (inference, pre-process and post-process latency) logging. +- `metrics_aggregation`: Number of model metrics to aggregate, default is `1000`. +- `minWorkers`: Minimum number of workers, default is `1`. +- `maxWorkers`: Maximum number of workers, default is `#CPU/OMP_NUM_THREAD` for CPU, GPU default is `2`, inferentia default is `2` (PyTorch engine), `1` (Python engine) . +- `gpu.minWorkers`: Minimum number of workers for GPU. +- `gpu.maxWorkers`: Maximum number of workers for GPU. +- `cpu.minWorkers`: Minimum number of workers for CPU. +- `cpu.maxWorkers`: Maximum number of workers for CPU. +- `required_memory_mb`: Specify the required memory (CPU and GPU) in MB to load the model. +- `gpu.required_memory_mb`: Specify the required GPU memory in MB to load the model. +- `reserved_memory_mb`: Reserve memory in MB to avoid system out of memory. +- `gpu.reserved_memory_mb`: Reserve GPU memory in MB to avoid system out of memory. + +## Option Properties + +In `serving.properties`, you can also set options (prefixed with `option`) and properties. +The options will be passed to `Model.load(Path modelPath, String prefix, Map options)` API. +It allows you to set engine specific configurations. 
+Here are some of the available option properties: + +``` +# set model file name prefix if different from folder name +option.modeName=resnet18_v1 + +# PyTorch options +option.mapLocation=true +option.extraFiles=foo.txt,bar.txt + +# ONNXRuntime options +option.interOpNumThreads=2 +option.intraOpNumThreads=2 +option.executionMode=SEQUENTIAL +option.optLevel=BASIC_OPT +option.memoryPatternOptimization=true +option.cpuArenaAllocator=true +option.disablePerSessionThreads=true +option.customOpLibrary=myops.so +option.disablePerSessionThreads=true +option.ortDevice=TensorRT/ROCM/CoreML + +# Python model options +retry_threshold=10 # Mark model as failure after python process crashing 10 times +option.pythonExecutable=python3 +option.entryPoint=deepspeed.py +option.handler=hanlde +option.predict_timeout=120 +option.model_loading_timeout=10 +option.parallel_loading=true +option.tensor_parallel_degree=2 +option.enable_venv=true +option.rolling_batch=auto +#option.rolling_batch=lmi-dist +option.max_rolling_batch_size=64 +option.paged_attention=false +option.max_rolling_batch_prefill_tokens=1088 +``` + +Most of the options can also be overriden by an environment variable with the `OPTION_` prefix and all caps. +For example: + +``` +# to enable rolling batch with only environment variable: +export OPTION_ROLLING_BATCH=auto +``` + +## Basic Model Configurations + +You can set number of workers for each model: +https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/test/resources/identity/serving.properties#L4-L8 + +For example, set minimum workers and maximum workers for your model: + +``` +minWorkers=32 +maxWorkers=64 +``` + +Or you can configure minimum workers and maximum workers differently for GPU and CPU: + +``` +gpu.minWorkers=2 +gpu.maxWorkers=3 +cpu.minWorkers=2 +cpu.maxWorkers=4 +``` + +job queue size, batch size, max batch delay, max worker idle time can be configured at +per model level, this will override global settings: + +``` +job_queue_size=10 +batch_size=2 +max_batch_delay=1 +max_idle_time=120 +``` + +You can configure which device to load the model on, default is *: + +``` +load_on_devices=gpu4;gpu5 +# or simply: +load_on_devices=4;5 +``` + +## Python model configuration + +#### number of workers + +For Python engine, we recommend set `minWorkers` and `maxWorkers` to be the same since python +worker scale up and down is expensive. + +You may also need to consider `OMP_NUM_THREAD` when setting number workers. `OMP_NUM_THREAD` is default +to `1`, you can unset `OMP_NUM_THREAD` by setting `NO_OMP_NUM_THREADS=true`. If `OMP_NUM_THREAD` is unset, +the `maxWorkers` will be default to 2 (larger `maxWorkers` with non 1 `OMP_NUM_THREAD` can cause thread +contention, and reduce throughput). + +Set minimum workers and maximum workers for your model: + +``` +minWorkers=32 +maxWorkers=64 +# idle time in seconds before the worker thread is scaled down +max_idle_time=120 +``` + +Or set minimum workers and maximum workers differently for GPU and CPU: + +``` +gpu.minWorkers=2 +gpu.maxWorkers=3 +cpu.minWorkers=2 +cpu.maxWorkers=4 +``` + +**Note**: Loading model in Python mode is pretty heavy. We recommend to set `minWorker` and `maxWorker` to be the same value to avoid unnecessary load and unload. 
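+
+#### custom entry point
+
+The `option.entryPoint` and `option.handler` options above point at the Python code that serves the model.
+As a rough, illustrative sketch (see the [Python mode documentation](modes.md) for the authoritative description), a custom entry point is a `model.py` that exposes a `handle` function built on the `djl_python` `Input`/`Output` classes:
+
+```python
+# model.py -- minimal illustrative handler sketch following the djl_python convention
+from djl_python import Input, Output
+
+
+def handle(inputs: Input) -> Output:
+    if inputs.is_empty():
+        # An empty request is typically a warm-up/initialization call
+        return None
+    data = inputs.get_as_json()
+    # A real handler would run inference here; this sketch just echoes the payload
+    return Output().add_as_json({"received": data})
+```
+
+The built-in handlers referenced by `option.entryPoint` (for example `djl_python.huggingface`) follow the same convention.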
+ + +#### job queue size +Or override global `job_queue_size`: + +``` +job_queue_size=10 +``` + +#### dynamic batching +To enable dynamic batching: + +``` +batch_size=2 +max_batch_delay=1 +``` + +#### rolling batch +To enable rolling batch for Python engine: + +``` +# lmi-dist and vllm requires running mpi mode +engine=MPI +option.rolling_batch=auto +# use FlashAttention +#option.rolling_batch=lmi-dist +#option.rolling_batch=scheduler +option.max_rolling_batch_size=64 + +# increase max_rolling_batch_prefill_tokens for long sequence +option.max_rolling_batch_prefill_tokens=1088 + +# disable PagedAttention if run into OOM +option.paged_attention=false +``` + +## Appendix + +### How to download uncompressed model from S3 +To enable fast model downloading, you can store your model artifacts (weights) in a S3 bucket, and +only keep the model code and metadata in the `model.tar.gz` (.zip) file. DJL can leverage +[s5cmd](https://github.com/peak/s5cmd) to download uncompressed files from S3 with extremely fast +speed. + +To enable `s5cmd` downloading, you can configure `serving.properties` as the following: + +``` +option.model_id=s3://YOUR_BUCKET/... +``` + +### How to resolve python package conflict between models +If you want to deploy multiple python models, but their dependencies has conflict, you can enable +[python virtual environments](https://docs.python.org/3/tutorial/venv.html) for your model: + +``` +option.enable_venv=true +``` + diff --git a/serving/docs/modes.md b/serving/docs/modes.md index 5227eeb1d..9a9eb4280 100644 --- a/serving/docs/modes.md +++ b/serving/docs/modes.md @@ -8,141 +8,7 @@ DJL Serving is a high-performance serving system for deep learning models. DJL S 2. [Java Mode](#java-mode) 3. [Binary Mode](#binary-mode) -### serving.properties - -In addition to the mode specific files, the `serving.properties` is a configuration file that can be used in all modes. -Place `serving.properties` in the same directory with your model file to specify configuration for each model. - -In `serving.properties`, you can set options (prefixed with `option`) and properties. The options -will be passed to `Model.load(Path modelPath, String prefix, Map options)` API. It allows -you set engine specific configurations, for example: - -``` -# set model file name prefix if different from folder name -option.modeName=resnet18_v1 - -# PyTorch options -option.mapLocation=true -option.extraFiles=foo.txt,bar.txt - -# ONNXRuntime options -option.interOpNumThreads=2 -option.intraOpNumThreads=2 -option.executionMode=SEQUENTIAL -option.optLevel=BASIC_OPT -option.memoryPatternOptimization=true -option.cpuArenaAllocator=true -option.disablePerSessionThreads=true -option.customOpLibrary=myops.so -option.disablePerSessionThreads=true -option.ortDevice=TensorRT/ROCM/CoreML - -# Python model options -option.pythonExecutable=python3 -option.entryPoint=deepspeed.py -option.handler=hanlde -option.predict_timeout=120 -option.model_loading_timeout=10 -option.parallel_loading=true -option.tensor_parallel_degree=2 -option.enable_venv=true -option.rolling_batch=auto -#option.rolling_batch=lmi-dist -option.max_rolling_batch_size=64 -option.paged_attention=false -option.max_rolling_batch_prefill_tokens=1088 -``` - -In `serving.properties`, you can set the following properties. Model properties are accessible to `Translator` -and python handler functions. - -- `engine`: Which Engine to use, values include MXNet, PyTorch, TensorFlow, ONNX, PaddlePaddle, DeepSpeed, etc. 
-- `load_on_devices`: A ; delimited devices list, which the model to be loaded on, default to load on all devices. -- `translatorFactory`: Specify the TranslatorFactory. -- `job_queue_size`: Specify the job queue size at model level, this will override global `job_queue_size`, default is `1000`. -- `batch_size`: the dynamic batch size, default is `1`. -- `max_batch_delay` - the maximum delay for batch aggregation in millis, default value is `100` milliseconds. -- `max_idle_time` - the maximum idle time in seconds before the worker thread is scaled down, default is `60` seconds. -- `log_model_metric`: Enable model metrics (inference, pre-process and post-process latency) logging. -- `metrics_aggregation`: Number of model metrics to aggregate, default is `1000`. -- `minWorkers`: Minimum number of workers, default is `1`. -- `maxWorkers`: Maximum number of workers, default is `#CPU/OMP_NUM_THREAD` for CPU, GPU default is `2`, inferentia default is `2` (PyTorch engine), `1` (Python engine) . -- `gpu.minWorkers`: Minimum number of workers for GPU. -- `gpu.maxWorkers`: Maximum number of workers for GPU. -- `cpu.minWorkers`: Minimum number of workers for CPU. -- `cpu.maxWorkers`: Maximum number of workers for CPU. -- `required_memory_mb`: Specify the required memory (CPU and GPU) in MB to load the model. -- `gpu.required_memory_mb`: Specify the required GPU memory in MB to load the model. -- `reserved_memory_mb`: Reserve memory in MB to avoid system out of memory. -- `gpu.reserved_memory_mb`: Reserve GPU memory in MB to avoid system out of memory. - - -#### number of workers -For Python engine, we recommend set `minWorkers` and `maxWorkers` to be the same since python -worker scale up and down is expensive. - -You may also need to consider `OMP_NUM_THREAD` when setting number workers. `OMP_NUM_THREAD` is default -to `1`, you can unset `OMP_NUM_THREAD` by setting `NO_OMP_NUM_THREADS=true`. If `OMP_NUM_THREAD` is unset, -the `maxWorkers` will be default to 2 (larger `maxWorkers` with non 1 `OMP_NUM_THREAD` can cause thread -contention, and reduce throughput). - -Set minimum workers and maximum workers for your model: - -``` -minWorkers=32 -maxWorkers=64 -# idle time in seconds before the worker thread is scaled down -max_idle_time=120 -``` - -Or set minimum workers and maximum workers differently for GPU and CPU: - -``` -gpu.minWorkers=2 -gpu.maxWorkers=3 -cpu.minWorkers=2 -cpu.maxWorkers=4 -``` - -**Note**: Loading model in Python mode is pretty heavy. We recommend to set `minWorker` and `maxWorker` to be the same value to avoid unnecessary load and unload. - - -#### job queue size -Or override global `job_queue_size`: - -``` -job_queue_size=10 -``` - -#### dynamic batching -To enable dynamic batching: - -``` -batch_size=2 -max_batch_delay=1 -``` - -#### rolling batch -To enable rolling batch for Python engine: - -``` -# lmi-dist and vllm requires running mpi mode -engine=MPI -option.rolling_batch=auto -# use FlashAttention -#option.rolling_batch=lmi-dist -#option.rolling_batch=scheduler -option.max_rolling_batch_size=64 - -# increase max_rolling_batch_prefill_tokens for long sequence -option.max_rolling_batch_prefill_tokens=1088 - -# disable PagedAttention if run into OOM -option.paged_attention=false -``` - - -An example `serving.properties` can be found [here](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/test/resources/identity/serving.properties). +Also see the options for [model configurations](configurations_model.md). ## Python Mode