[Model] Add support for 'gte-Qwen2' embedding models #6282
Conversation
```python
# FIXME: Special handling for gte-Qwen2
if "gte-Qwen2" in self.model:
    architectures = ["Qwen2EmbeddingModel"]
```
This hardcoded check based on the model ID/path is not acceptable. For instance, it wouldn't work when a user has downloaded the model locally and passed in a path like `--model ~/my-model/`.
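To make the failure concrete, here is a minimal sketch (the helper name and the local path are hypothetical):

```python
import os

def is_gte_qwen2(model: str) -> bool:
    # Substring match on the model id/path, as in the diff above.
    return "gte-Qwen2" in model

print(is_gte_qwen2("Alibaba-NLP/gte-Qwen2-7B-instruct"))  # True: repo id contains the string
print(is_gte_qwen2(os.path.expanduser("~/my-model/")))    # False: same weights, local path
```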
The gte-Qwen2 embedding model's architecture is "Qwen2ForCausalLM", the same as the Qwen2 LLMs. Is there a better way to eliminate this ambiguity?
Perhaps we could add an option to the argparser that specifies whether a model is an embedding model, rather than inferring it from the model architecture.
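A rough sketch of that idea; the `--is-embedding-model` flag is hypothetical, not an existing vLLM option:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, required=True)
# Hypothetical flag: let the user declare the model type explicitly
# instead of vLLM inferring it from config.json's "architectures" field.
parser.add_argument("--is-embedding-model", action="store_true",
                    help="Treat the model as an embedding model.")
args = parser.parse_args()

if args.is_embedding_model:
    architectures = ["Qwen2EmbeddingModel"]
```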
How about working with upstream to change the "architectures" list, or to add an extra "Qwen2EmbeddingModel" entry to it?
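Concretely, the idea is that the model repo's config.json could advertise a distinct architecture so vLLM can dispatch unambiguously (hypothetical contents; the actual upstream file currently lists only "Qwen2ForCausalLM"):

```json
{
  "architectures": ["Qwen2EmbeddingModel"],
  "model_type": "qwen2"
}
```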
#9424 should be able to solve this.
Is `trust_remote_code=True` necessary?
@zifeitong I set it according to the official example. You can further check whether it is necessary.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct", trust_remote_code=True)
```
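For reference, a typical way to exercise that model with the standard sentence-transformers API (continuing from the snippet above; the example sentences are made up):

```python
from sentence_transformers.util import cos_sim

# Encode a couple of sentences and compare them.
embeddings = model.encode(["What is the capital of China?",
                           "Beijing is the capital of China."])
print(embeddings.shape)                       # (2, embedding_dim)
print(cos_sim(embeddings[0], embeddings[1]))  # similarity of the two sentences
```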
Any chance this will land in the next version?
The issues below need to be resolved. I am not very familiar with vLLM and can only provide limited assistance; you may want to invite someone familiar with this area to help.
Can this feature be merged?
I don't think anyone is actively working on this. The current state outputs wrong embedding values, so no.
The gte-Qwen2 model extends its attention to be bi-directional. I think we need to pass the same setting as in https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct/blob/main/scripts/eval_mteb.py#L553
It may cause a NotImplementedError in flash_attn.
@Nickydusk In a word, to enable bi-directional attention for gte-Qwen2, we need to pass it to vLLM at vllm/vllm/attention/backends/flash_attn.py, line 528 (commit 82a1b1a). Flash attention has …
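To illustrate what bi-directional vs. causal attention means here, a generic PyTorch sketch (not vLLM's actual flash_attn backend code):

```python
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 16, 64)  # (batch, heads, seq_len, head_dim)

# Decoder-style attention, as in Qwen2ForCausalLM: future tokens are masked.
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Bi-directional attention, as gte-Qwen2 needs for embeddings:
# every token attends to the whole sequence.
bidir_out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
```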
@Nickydusk Hey Nick. I think that you should not use … The embedding model should be …
Sorry for the previous wrong statement. You can check my PR to see how I support gte-Qwen2 in SGLang:
How can I run this with GGUF?

```
docker run --gpus '"device=3"' …
```
FIX #6015
FIX #5827
FIX #5611
FIX #5600
This should work for Alibaba-NLP/gte-Qwen2-7B-instruct and Alibaba-NLP/gte-Qwen2-1.5B-instruct.
You can serve an OpenAI-compatible API with:
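For example, a sketch assuming the standard `api_server` entrypoint (the exact flags may differ):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Alibaba-NLP/gte-Qwen2-7B-instruct \
    --trust-remote-code
```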
However, the current version has a consistency issue with embeddings, which means it cannot pass the following test. This should be fixed before merging.
```
pytest tests/models/test_embedding.py
# FAILED tests/models/test_embedding.py::test_models[half-Alibaba-NLP/gte-Qwen2-7B-instruct] - AssertionError: Not all values are within 0.01 of 1.0
```
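For context, the assertion corresponds to a check of roughly this shape (a simplified sketch, not the actual test code): vLLM's embeddings are compared against the HuggingFace reference via cosine similarity, which should be ~1.0 for every prompt.

```python
import torch
import torch.nn.functional as F

def check_embedding_consistency(vllm_embeds: torch.Tensor,
                                hf_embeds: torch.Tensor,
                                tol: float = 1e-2) -> None:
    # Cosine similarity between each vLLM embedding and its HF counterpart.
    sims = F.cosine_similarity(vllm_embeds, hf_embeds, dim=-1)
    # Identical embeddings give similarity 1.0; allow a small tolerance.
    assert torch.all((sims - 1.0).abs() < tol), \
        f"Not all values are within {tol} of 1.0"
```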