
[Model] Add support for 'gte-Qwen2' embedding models #6282

Closed
wants to merge 4 commits

Conversation

0xWelt commented Jul 10, 2024

FIX #6015
FIX #5827
FIX #5611
FIX #5600

This should work for Alibaba-NLP/gte-Qwen2-7B-instruct and Alibaba-NLP/gte-Qwen2-1.5B-instruct

You can serve an OpenAI-compatible API with:

python -m vllm.entrypoints.openai.api_server \
  --served-model-name gte-Qwen2-7B-instruct \
  --model Alibaba-NLP/gte-Qwen2-7B-instruct \
  --dtype bfloat16 \
  --trust-remote-code
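
Once the server is up, you can sanity-check the embeddings endpoint with the openai Python client. A minimal sketch, assuming the server is listening on the default localhost:8000 (the API key is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.embeddings.create(
    model="gte-Qwen2-7B-instruct",  # must match --served-model-name
    input=["What is the capital of France?"],
)
print(len(resp.data[0].embedding))  # embedding dimension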

However, the current version has an embedding-consistency issue, so it cannot pass the following test. This should be fixed before merging.

pytest tests/models/test_embedding.py

# FAILED tests/models/test_embedding.py::test_models[half-Alibaba-NLP/gte-Qwen2-7B-instruct] - AssertionError: Not all values are within 0.01 of 1.0
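
For context, the failure message suggests the test compares each vLLM embedding against the HF reference and requires their cosine similarity to be within 0.01 of 1.0. An illustrative version of that kind of check (not the test's actual code):

import torch
import torch.nn.functional as F

def assert_embeddings_close(ref_embeddings, vllm_embeddings, tol=1e-2):
    # Each pair of reference/vLLM embeddings should be nearly identical,
    # i.e. have cosine similarity within tol of 1.0.
    for ref, out in zip(ref_embeddings, vllm_embeddings):
        sim = F.cosine_similarity(torch.tensor(ref), torch.tensor(out), dim=0)
        assert abs(sim.item() - 1.0) < tol, f"cosine similarity {sim.item():.4f} is not within {tol} of 1.0"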

@0xWelt changed the title from "Model] Add support for 'gte-Qwen2' embedding models" to "[Model] Add support for 'gte-Qwen2' embedding models" on Jul 10, 2024
Comment on lines +191 to +193
# FIXME: Special handling for gte-Qwen2
if "gte-Qwen2" in self.model:
architectures = ["Qwen2EmbeddingModel"]
Member

This hardcoded check based on the model id/path is not acceptable. For instance, it wouldn't work when a user has downloaded the model locally and passed in a path like --model ~/my-model/

0xWelt (Author) commented Jul 11, 2024

The gte-Qwen2 embedding model's architecture is "Qwen2ForCausalLM", the same as the Qwen2 LLMs. Is there a better way to resolve this ambiguity?

Perhaps we could add an option to the argument parser to specify whether the model is an embedding model, rather than inferring it from the model architecture.
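
For reference, the ambiguity is visible directly in the checkpoint's config; an illustrative check:

from transformers import AutoConfig

# The embedding checkpoint advertises the same architecture as the ordinary
# Qwen2 chat/LLM checkpoints, so it cannot be told apart by architecture alone.
cfg = AutoConfig.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct", trust_remote_code=True)
print(cfg.architectures)  # ['Qwen2ForCausalLM']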

Contributor

How about working with upstream to change the "architectures" list, or add an extra "Qwen2EmbeddingModel" entry to it?

Member

#9424 should be able to solve this.

zifeitong (Contributor) commented Jul 11, 2024

You can serve an OpenAI-compatible API with:

python -m vllm.entrypoints.openai.api_server \
  --served-model-name gte-Qwen2-7B-instruct \
  --model Alibaba-NLP/gte-Qwen2-7B-instruct \
  --dtype bfloat16 \
  --trust-remote-code

Is trust-remote-code required?

0xWelt (Author) commented Jul 12, 2024

@zifeitong I set it according to the official example; you can check further whether it is necessary:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct", trust_remote_code=True)
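
One way to check is a hypothetical experiment: load once without the flag and see whether anything breaks (the flag may simply have been copied from the model card's example):

from sentence_transformers import SentenceTransformer

# If this loads and encodes without error, trust_remote_code is not strictly required.
model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct")  # no trust_remote_code
print(model.encode(["hello world"]).shape)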

ghost commented Jul 16, 2024

Any chance this will land in the next version?

0xWelt (Author) commented Jul 16, 2024

Any chance this will land in the next version?

The issues below need to be resolved. I am not very familiar with vLLM and can only provide limited assistance; it would help to invite someone who is familiar with this area.

  • Ensure that vLLM correctly recognizes embedding models, especially when LLMs share the same architecture (e.g. Qwen2ForCausalLM).
  • Pass pytest tests/models/test_embedding.py

ybbz commented Aug 1, 2024

Can this feature be merged?

waters222

Can this feature be merged?

I don't think anyone is actively working on this. The current state outputs wrong embedding values, so no.

Opdoop commented Aug 2, 2024

The gte-Qwen2 model extends its attention to be bi-directional. I think we need to pass causal=False to attn_func, as they did when evaluating the model:

https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct/blob/main/scripts/eval_mteb.py#L553

0xWelt (Author) commented Aug 5, 2024

The gte-Qwen2 model extends its attention to be bi-directional. I think we need to pass causal=False to attn_func, as they did when evaluating the model:

https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct/blob/main/scripts/eval_mteb.py#L553

It may cause a flash_attn NotImplementedError.

Opdoop commented Aug 5, 2024

@Nickydusk In a word: to enable bi-directional attention for gte-Qwen2, we need to pass causal=False from vLLM down to flash attention. Currently there is no named argument for causal; it is hardcoded as causal=True for decoder-only models.

vLLM's causal=True is hardcoded here:

flash attention has causal=False as default though:
https://github.com/Dao-AILab/flash-attention/blob/3f6ff1c1c52fa3d148b502e465ffb7bc88f7a50e/hopper/flash_attn_interface.py#L259
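
For illustration, a generic PyTorch sketch (not vLLM's actual attention path) of what the causal flag changes; scaled_dot_product_attention here stands in for the flash attention call:

import torch
import torch.nn.functional as F

# (batch, heads, seq, head_dim) — toy shapes for illustration only
q = k = v = torch.randn(1, 8, 16, 64)

# Decoder-only LM behaviour: each token attends only to its prefix.
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# What gte-Qwen2 needs: full bi-directional attention over the sequence.
bidir_out = F.scaled_dot_product_attention(q, k, v, is_causal=False)

print(torch.allclose(causal_out, bidir_out))  # False (in general)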

zhaochenyang20

@Nickydusk Hey Nick. I think you should not use Qwen2ForCausalLM for the embedding model, since:

In [2]: model
Out[2]: 
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: Qwen2Model 
  (1): Pooling({'word_embedding_dimension': 3584, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
  (2): Normalize()
)

The embedding model should be Qwen2Model.
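
The module dump above shows what the pipeline adds on top of Qwen2Model: last-token pooling (pooling_mode_lasttoken) followed by L2 normalization. A minimal sketch of those two stages, assuming right-padded batches:

import torch
import torch.nn.functional as F

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq, hidden); attention_mask: (batch, seq)
    last_idx = attention_mask.sum(dim=1) - 1                  # last non-padding position
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    pooled = hidden_states[batch_idx, last_idx]               # (batch, hidden)
    return F.normalize(pooled, p=2, dim=1)                    # unit-length embeddings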

zhaochenyang20

Sorry for the previous wrong statement. You can check my PR to see how I support gte-Qwen2 in SGLang:

sgl-project/sglang#1186

trillionmonster commented Sep 4, 2024

How to run this with GGUF? (gte-qwen2-7b-instruct-q5_k_m.gguf)

docker run --gpus '"device=3"' \
  -v /data/gte-Qwen2-7B-instruct-Q5_K_M-GGUF:/model \
  --name gte-Qwen2-7B-instruct-Q5_K_M-GGUF \
  -d \
  -p 18222:8000 \
  --shm-size=16g \
  vllm/vllm-openai:latest \
  --model /model/gte-qwen2-7b-instruct-q5_k_m.gguf \
  --gpu-memory-utilization 0.1 \
  --tensor-parallel-size 1 \
  --max-model-len 6000
