
Update engines and configuration options
frankfliu committed Jun 5, 2024
1 parent 6f93207 commit b197688
Showing 1 changed file with 19 additions and 14 deletions.
serving/docs/lmi/user_guides/embedding-user-guide.md
# LMI Text Embedding User Guide

Text Embedding refers to the process of converting text data into numerical vectors.
These embeddings capture the semantic meaning of the text and can be used for various
tasks such as semantic search and similarity detection.

The inference process involves:

1. **Loading a Model**: Loading a model from a local directory, S3, the DJL model zoo, or the Hugging Face Hub.
2. **Tokenization**: Breaking down the input text into tokens that the model can understand.
3. **Embeddings**: Passing the tokens through the model to produce embeddings. An embedding is a
multi-dimensional vector that can be used for RAG or general embedding search, as sketched below.
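
The following sketch illustrates steps 2 and 3 using the `transformers` library directly, with the
CLS-token pooling commonly used by `bge`-style models. It is not LMI's internal implementation,
just the equivalent flow shown for clarity:

```
# Illustrative sketch only: the tokenize -> embed flow, shown with the
# transformers library. LMI performs the equivalent steps in the model server.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "BAAI/bge-base-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("What is Deep Java Library?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# CLS-token pooling followed by L2 normalization
embedding = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], dim=-1)
print(embedding.shape)  # torch.Size([1, 768])
```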

LMI supports Text Embedding Inference with the following engines:

- OnnxRuntime
- PyTorch
- Rust
- Python

Currently, the OnnxRuntime engine provides the best performance for text embedding in LMI.
The following text models are supported:
- XLMRoberta (`intfloat/multilingual-e5-base`, `intfloat/multilingual-e5-small`, `intfloat/multilingual-e5-large`, etc.)
- NomicBert (`nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v1.5`, etc.)
- JinaBert (`jinaai/jina-embeddings-v2-base-en`, `jinaai/jina-embeddings-v2-small-en`, etc.)
- XLMRobertaForSequenceClassification (`BAAI/bge-reranker-base`, `BAAI/bge-reranker-large`, etc.)

Other embedding models are also supported, but may not be as performant as the model architectures listed above.

You can leverage LMI Text Embedding inference using the following starter configurations:

### DJL model zoo

You can specify the `djl://` model URL to load a model from the DJL model zoo.

```
OPTION_ENGINE=OnnxRuntime
HF_MODEL_ID=djl://ai.djl.huggingface.onnxruntime/BAAI/bge-base-en-v1.5
# Optional
OPTION_BATCH_SIZE=32
```

### environment variables

You can specify the `HF_MODEL_ID` environment variable to load a model from the Hugging Face Hub. DJLServing
will download the model from the Hugging Face Hub and optimize it with OnnxRuntime at runtime.

```
OPTION_ENGINE=OnnxRuntime
HF_MODEL_ID=TaylorAI/bge-micro-v2
OPTION_BATCH_SIZE=32
```

You can follow [this example](../deployment_guide/deploying-your-endpoint.md#option-2-configuration---environment-variables)
to deploy a model with environment variable configuration on SageMaker.
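
For illustration, here is a minimal sketch of passing the same environment variables through the
SageMaker Python SDK; the container image URI, role ARN, and instance type are placeholders, not
values prescribed by this guide:

```
# Hypothetical deployment sketch using the SageMaker Python SDK.
# The image URI, role ARN, and instance type below are placeholders.
from sagemaker.model import Model

model = Model(
    image_uri="<lmi-container-image-uri>",  # placeholder: the LMI container for your region
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
    env={
        "OPTION_ENGINE": "OnnxRuntime",
        "HF_MODEL_ID": "TaylorAI/bge-micro-v2",
        "OPTION_BATCH_SIZE": "32",
    },
)
model.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")
```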

### serving.properties

```
translatorFactory=ai.djl.huggingface.translator.TextEmbeddingTranslatorFactory
batch_size=32
```

You can follow [this example](../deployment_guide/deploying-your-endpoint.md#option-1-configuration---servingproperties)
to deploy a model with serving.properties configuration on SageMaker.
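
Whichever configuration style you use, the resulting endpoint accepts a JSON payload. Below is a
hedged invocation sketch assuming a locally running DJLServing instance on port 8080 exposing the
SageMaker-compatible `/invocations` route; the route and response shape may differ in your setup:

```
# Hypothetical local invocation sketch; the port and route are assumptions.
import requests

response = requests.post(
    "http://localhost:8080/invocations",
    json={"inputs": "What is Deep Java Library?"},
)
response.raise_for_status()
embedding = response.json()  # expected: embedding vector(s) as lists of floats
print(type(embedding))
```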

## Deploy model to SageMaker

This is an optional config, and defaults to `False`.

**OPTION_ENGINE**

This option represents the engine to use; values include `OnnxRuntime`, `PyTorch`, `Rust`, etc.

**OPTION_BATCH_SIZE**

This is an optional config, and defaults to `1`.

This option represents the maximum number of workers.

This is an optional config; the default is `#CPU` (the number of CPU cores) on CPU, and `2` on GPU.

When running the Text Embedding task on GPU, benchmarking results show that `SERVING_MAX_WORKERS=1` gives better performance.
We recommend using the same value for `SERVING_MIN_WORKERS` and `SERVING_MAX_WORKERS` on GPU to avoid worker scaling overhead.
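
For example, a GPU deployment following this recommendation could pin both values (a sketch in the
same configuration style as the examples above):

```
SERVING_MIN_WORKERS=1
SERVING_MAX_WORKERS=1
```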

### Additional Configurations

