
[Model] Add support for 'gte-Qwen2' embedding models #6282

Closed
wants to merge 4 commits

Conversation

0xWelt commented Jul 10, 2024

FIX #6015
FIX #5827
FIX #5611
FIX #5600

This should work for Alibaba-NLP/gte-Qwen2-7B-instruct and Alibaba-NLP/gte-Qwen2-1.5B-instruct

You can serve an OpenAI-compatible API with:

python -m vllm.entrypoints.openai.api_server \
  --served-model-name gte-Qwen2-7B-instruct \
  --model Alibaba-NLP/gte-Qwen2-7B-instruct \
  --dtype bfloat16 \
  --trust-remote-code
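
Once the server is up, you can sanity-check the embeddings endpoint with the openai Python client. A minimal sketch, assuming the server is listening on the default localhost:8000 (the API key is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.embeddings.create(
    model="gte-Qwen2-7B-instruct",  # must match --served-model-name
    input=["What is the capital of France?"],
)
print(len(resp.data[0].embedding))  # embedding dimension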

However, the current version has an embedding-consistency issue, so it cannot pass the following test. This should be fixed before merging.

pytest tests/models/test_embedding.py

# FAILED tests/models/test_embedding.py::test_models[half-Alibaba-NLP/gte-Qwen2-7B-instruct] - AssertionError: Not all values are within 0.01 of 1.0
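
For context, the failure message suggests the test compares each vLLM embedding against the HF reference and requires their cosine similarity to be within 0.01 of 1.0. An illustrative version of that kind of check (not the test's actual code):

import torch
import torch.nn.functional as F

def assert_embeddings_close(ref_embeddings, vllm_embeddings, tol=1e-2):
    # Each pair of reference/vLLM embeddings should be nearly identical,
    # i.e. have cosine similarity within tol of 1.0.
    for ref, out in zip(ref_embeddings, vllm_embeddings):
        sim = F.cosine_similarity(torch.tensor(ref), torch.tensor(out), dim=0)
        assert abs(sim.item() - 1.0) < tol, f"cosine similarity {sim.item():.4f} is not within {tol} of 1.0"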

@0xWelt changed the title from "Model] Add support for 'gte-Qwen2' embedding models" to "[Model] Add support for 'gte-Qwen2' embedding models" on Jul 10, 2024
Comment on lines +191 to +193
# FIXME: Special handling for gte-Qwen2
if "gte-Qwen2" in self.model:
architectures = ["Qwen2EmbeddingModel"]
Member

This hardcoded check based on the model id/path is not acceptable. For instance, it wouldn't work when a user has downloaded the model locally and passed in a path like --model ~/my-model/

0xWelt (Author) commented Jul 11, 2024

The gte-Qwen2 embedding model's architecture is "Qwen2ForCausalLM", the same as the Qwen2 LLMs. Is there a better way to resolve this ambiguity?

Perhaps we could add an option to the argument parser to specify whether the model is an embedding model, rather than inferring it from the model architecture.
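
For reference, the ambiguity is visible directly in the checkpoint's config; an illustrative check:

from transformers import AutoConfig

# The embedding checkpoint advertises the same architecture as the ordinary
# Qwen2 chat/LLM checkpoints, so it cannot be told apart by architecture alone.
cfg = AutoConfig.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct", trust_remote_code=True)
print(cfg.architectures)  # ['Qwen2ForCausalLM']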

Contributor

How about working with upstream to change the "architectures" list, or add an extra "Qwen2EmbeddingModel" entry to it?

Member

#9424 should be able to solve this.

zifeitong (Contributor) commented Jul 11, 2024

You can serve an OpenAI-compatible API with:

python -m vllm.entrypoints.openai.api_server \
  --served-model-name gte-Qwen2-7B-instruct \
  --model Alibaba-NLP/gte-Qwen2-7B-instruct \
  --dtype bfloat16 \
  --trust-remote-code

Is trust-remote-code required?

0xWelt (Author) commented Jul 12, 2024

@zifeitong I set it according to the official example; you can check further whether it is necessary:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct", trust_remote_code=True)
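
One way to check is a hypothetical experiment: load once without the flag and see whether anything breaks (the flag may simply have been copied from the model card's example):

from sentence_transformers import SentenceTransformer

# If this loads and encodes without error, trust_remote_code is not strictly required.
model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct")  # no trust_remote_code
print(model.encode(["hello world"]).shape)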

ghost commented Jul 16, 2024

Any chance this will land in the next version?

0xWelt (Author) commented Jul 16, 2024

Any chance this will land in the next version?

The issues below need to be resolved. I am not very familiar with vLLM and can only provide limited assistance; it would help to invite someone who is familiar with this area.

  • Ensure that vLLM correctly recognizes embedding models, especially when LLMs share the same architecture (e.g. Qwen2ForCausalLM).
  • Pass pytest tests/models/test_embedding.py

ybbz commented Aug 1, 2024

Can this feature be merged?

waters222

Can this feature be merged?

I don't think anyone is actively working on this. The current state outputs wrong embedding values, so no.

Opdoop commented Aug 2, 2024

The gte-Qwen2 model extends its attention to be bi-directional. I think we need to pass causal=False to attn_func, as they did when evaluating the model:

https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct/blob/main/scripts/eval_mteb.py#L553

0xWelt (Author) commented Aug 5, 2024

The gte-Qwen2 model extends its attention to be bi-directional. I think we need to pass causal=False to attn_func, as they did when evaluating the model:

https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct/blob/main/scripts/eval_mteb.py#L553

It may cause a flash_attn NotImplementedError.

Opdoop commented Aug 5, 2024

@Nickydusk In a word: to enable bi-directional attention for gte-Qwen2, we need to pass causal=False from vLLM down to flash attention. Currently there is no named argument for causal; it is hardcoded as causal=True for decoder-only models.

vLLM's causal=True is hardcoded here:

flash attention has causal=False as default though:
https://github.com/Dao-AILab/flash-attention/blob/3f6ff1c1c52fa3d148b502e465ffb7bc88f7a50e/hopper/flash_attn_interface.py#L259
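
For illustration, a generic PyTorch sketch (not vLLM's actual attention path) of what the causal flag changes; scaled_dot_product_attention here stands in for the flash attention call:

import torch
import torch.nn.functional as F

# (batch, heads, seq, head_dim) — toy shapes for illustration only
q = k = v = torch.randn(1, 8, 16, 64)

# Decoder-only LM behaviour: each token attends only to its prefix.
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# What gte-Qwen2 needs: full bi-directional attention over the sequence.
bidir_out = F.scaled_dot_product_attention(q, k, v, is_causal=False)

print(torch.allclose(causal_out, bidir_out))  # False (in general)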

zhaochenyang20

@Nickydusk Hey Nick. I think you should not use Qwen2ForCausalLM for the embedding model, since:

In [2]: model
Out[2]: 
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: Qwen2Model 
  (1): Pooling({'word_embedding_dimension': 3584, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
  (2): Normalize()
)

The embedding model should be Qwen2Model.
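
The module dump above shows what the pipeline adds on top of Qwen2Model: last-token pooling (pooling_mode_lasttoken) followed by L2 normalization. A minimal sketch of those two stages, assuming right-padded batches:

import torch
import torch.nn.functional as F

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq, hidden); attention_mask: (batch, seq)
    last_idx = attention_mask.sum(dim=1) - 1                  # last non-padding position
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    pooled = hidden_states[batch_idx, last_idx]               # (batch, hidden)
    return F.normalize(pooled, p=2, dim=1)                    # unit-length embeddings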

zhaochenyang20

Sorry for the previous wrong statement. You can check my PR to see how I support gte-Qwen2 in SGLang:

sgl-project/sglang#1186

trillionmonster commented Sep 4, 2024

How to run this with GGUF? (gte-qwen2-7b-instruct-q5_k_m.gguf)

docker run --gpus '"device=3"' \
  -v /data/gte-Qwen2-7B-instruct-Q5_K_M-GGUF:/model \
  --name gte-Qwen2-7B-instruct-Q5_K_M-GGUF \
  -d \
  -p 18222:8000 \
  --shm-size=16g \
  vllm/vllm-openai:latest \
  --model /model/gte-qwen2-7b-instruct-q5_k_m.gguf \
  --gpu-memory-utilization 0.1 \
  --tensor-parallel-size 1 \
  --max-model-len 6000
