
Add qwen 2.5 #8355

Open · jackzhxng wants to merge 11 commits into main from jz/export_qwen
Conversation

@jackzhxng (Contributor) commented Feb 11, 2025

Summary

Add the Qwen 2.5 model. Instead of loading the model and exporting it directly, we convert the checkpoint into a format that export_llama can consume via llama_transformer.py.

The process was relatively painless. TorchTune checkpointing utilities were used, though only for convenience; they weren't strictly necessary.
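
For reference, a minimal sketch of what the conversion step could look like, assuming TorchTune's FullModelHFCheckpointer and a hand-written key remapping; this is not the actual convert_weights.py in the PR:

# Hedged sketch only: illustrates the approach described above, not the PR's script.
# Checkpoint file names, model_type, and the key mapping are assumptions.
import argparse

import torch
from torchtune.training import FullModelHFCheckpointer

# Hypothetical remapping from TorchTune/HF-style parameter names to the names
# llama_transformer.py expects; the real mapping is more complete.
_KEY_MAP = {
    "attn.q_proj": "attention.wq",
    "attn.k_proj": "attention.wk",
    "attn.v_proj": "attention.wv",
    "attn.output_proj": "attention.wo",
}


def _remap(key: str) -> str:
    for old, new in _KEY_MAP.items():
        key = key.replace(old, new)
    return key


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("checkpoint_dir")
    parser.add_argument("output_path")
    args = parser.parse_args()

    # Load the Hugging Face checkpoint through TorchTune's checkpointing utils.
    checkpointer = FullModelHFCheckpointer(
        checkpoint_dir=args.checkpoint_dir,
        checkpoint_files=["model.safetensors"],  # assumption: single-shard checkpoint
        model_type="QWEN2",
        output_dir=args.checkpoint_dir,
    )
    state_dict = checkpointer.load_checkpoint()["model"]

    # Re-key the weights so export_llama can consume them via llama_transformer.py.
    converted = {_remap(k): v for k, v in state_dict.items()}
    torch.save(converted, args.output_path)


if __name__ == "__main__":
    main()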

Test plan

Convert weights:

python examples/models/qwen2_5/convert_weights.py <path-to-checkpoint-dir> <output-path>

Export (without quantization and at fp32, because the torchao quantizer doesn't support bias in 8da4w and XNNPACK doesn't support bf16):

./install_executorch.sh --pybind xnnpack && python -m examples.models.llama.export_llama   \
--model qwen2_5 --params examples/models/qwen2_5/1_5b_config.json  \
--checkpoint qwen2_5-1_5b.pth -kv --use_sdpa_with_kv_cache  \
-X -d fp32 --metadata '{"get_bos_id":151643, "get_eos_ids":[151643]}'  \
--output_name="qwen2_5-1_5b.pte" --verbose
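
After export, the .pte can be smoke-tested from Python through the pybindings built by install_executorch.sh. A rough sketch, assuming the usual (tokens, input_pos) calling convention of the llama example models:

# Hedged sketch: load the exported program and run a single decode step.
# The (tokens, input_pos) input signature is an assumption based on the llama examples.
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

module = _load_for_executorch("qwen2_5-1_5b.pte")

tokens = torch.tensor([[151643]], dtype=torch.long)  # BOS id from the metadata above
input_pos = torch.tensor([0], dtype=torch.long)

logits = module.forward((tokens, input_pos))[0]
print(logits.shape)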

Generate:

./install_executorch.sh --pybind xnnpack && python -m examples.models.llama.runner.native \
--model qwen2_5 --pte qwen2_5-1_5b.pte \
--tokenizer ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-1.5B/snapshots/8faed761d45a263340a0528343f099c05c9a4323/tokenizer.json \
--prompt "Who is the founder of Meta?" --params examples/models/qwen2_5/1_5b_config.json \
--max_len 64 --temperature 0 -kv

cc @mergennachin @iseeyuan @lucylq @helunwencser @tarun292 @kimishpatel

pytorch-bot (bot) commented Feb 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8355

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures

As of commit 44aa34d with merge base 1858086:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the "CLA Signed" label (Feb 11, 2025)
@jackzhxng added the "module: examples" label (Feb 11, 2025)
@jackzhxng added the "release notes: examples" label (Feb 12, 2025)
@jackzhxng jackzhxng requested a review from iseeyuan February 12, 2025 20:11
@jackzhxng jackzhxng marked this pull request as ready for review February 12, 2025 20:11
@jackzhxng jackzhxng marked this pull request as draft February 12, 2025 20:12
@jackzhxng jackzhxng marked this pull request as ready for review February 13, 2025 19:37
@larryliu0820 (Contributor)

Can we add a test? Basically extend EagerModelBase and add it to examples/models/__init__.py
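
For context, a minimal sketch of what such a wrapper might look like; the class name, registration note, and construction details are assumptions rather than the PR's actual code:

# Hedged sketch of an EagerModelBase wrapper for the test harness.
# The real Qwen model class in this PR may look quite different.
import torch

from executorch.examples.models.model_base import EagerModelBase


class Qwen25Model(EagerModelBase):  # hypothetical name
    def __init__(self):
        pass

    def get_eager_model(self) -> torch.nn.Module:
        # In the real example this would build the shared llama_transformer from the
        # converted checkpoint and 1_5b_config.json; a stand-in keeps the sketch runnable.
        return torch.nn.Linear(8, 8)

    def get_example_inputs(self):
        # Matches the stand-in above; the real model would take token ids plus a
        # KV-cache position, as in the llama examples.
        return (torch.randn(1, 8),)


# It would then be registered in examples/models/__init__.py, e.g. by adding an entry
# to the model-name registry there (exact registry name is an assumption).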

@jackzhxng force-pushed the jz/export_qwen branch 5 times, most recently from 5ef3fef to a0f91b0 (February 18, 2025 23:32)
@jackzhxng (Contributor, Author) commented Feb 18, 2025

@larryliu0820 I had to make some model loading changes to get the EagerModelBase test working; made a separate PR: #8552

@mergennachin (Contributor)

Can you also add a README.md page?

@guangy10 (Contributor)

@mergennachin Thank you for tagging me on this PR.

BTW, this model was enabled in Hugging Face optimum-executorch a while ago using HF's modeling code, with the simplest XNNPACK recipe (fp32) as a proof of concept. Anyone can quickly run it by following the Quick Start and replacing the model_id with "Qwen/Qwen2.5-0.5B".
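
For anyone unfamiliar, the Quick Start flow looks roughly like the sketch below; this is from memory, and the from_pretrained arguments (e.g. recipe, plus export in older versions) may differ between optimum-executorch releases:

# Hedged sketch of the optimum-executorch Quick Start with the Qwen model_id.
# API details are recalled from the optimum-executorch README and may be out of date.
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ExecuTorchModelForCausalLM.from_pretrained(model_id, recipe="xnnpack")

# Greedy generation through the ExecuTorch runtime.
print(model.text_generation(tokenizer=tokenizer, prompt="Who is the founder of Meta?", max_seq_len=64))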

IMO, the real gap preventing users from deploying this model on-device is converting the Hugging Face tokenizer into a binary format recognizable by llama_runner. Enabling this is tracked in pytorch/executorch#6813. The issue surfaces as soon as you leave the Hugging Face ecosystem and try to deploy on an actual target device. I explored this a while ago but haven't had time to revisit it. We should prioritize filling this gap if we consider it high priority.

I can see that this PR adds an additional path for users to convert HF weights using the ET-friendly modeling code (llama_transformer.py), which is good. However, it also adds more examples, which I'm somewhat concerned about, depending on how far we want to go down this path. Do we plan to keep adding tens of thousands of examples to our repo? If not, where do we draw the line?

@jackzhxng (Contributor, Author)

@guangy10 I don't think this is apples-to-apples with what's on Hugging Face. What's here isn't just a proof of concept; it can freely leverage all of our previous work optimizing the llama_transformer for our Llama models. This involves source transformation for custom flash attention kernels (--use_sdpa_with_kv_cache), quantization (embedding, kv_cache, and linear), etc. I haven't done any benchmarks yet since I still have a blocker on the torchao side for quantization, but I expect this to be very performant out of the box.

As for the tokenizer, we just added a json tokenizer for Python to use with pybindings here, and @larryliu0820 is working on integrating one for C++ into ExecuTorch.
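
For Python-side experimentation, the tokenizer.json can also be loaded directly with the Hugging Face tokenizers package; this is just an illustration, not necessarily the wrapper the runner uses:

# Hedged illustration: consume Qwen's tokenizer.json directly from Python.
# The runner's own json tokenizer may expose a different API.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
ids = tok.encode("Who is the founder of Meta?").ids
print(ids)
print(tok.decode(ids))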

Also, adding this specific example wasn't very difficult at all; it's much easier than writing a model from scratch. I had a few hiccups along the way, such as setting up the json tokenizer and (currently) bias support for quantization, but overall it wasn't too involved. The goal is that future transformer decoder models can be integrated quickly this way and leverage all of our pre-built performance features.

@guangy10 (Contributor)

This involves source transformation for custom flash attention kernels (--use_sdpa_with_kv_cache), quantization (embedding, kv_cache, and linear), etc.

What prevents us from using these optimizations as building blocks outside the ExecuTorch repo?

@larryliu0820 (Contributor)

@guangy10 I think it's a very good user experience to leverage optimum-executorch whenever it's possible. I can see some limitations though:

  1. Do we have a good story for the models not supported by optimum-executorch?
  2. What should we tell a user to do, if they want to lower the model to a backend other than xnnpack? What about quantization?

Moving forward I suppose optimum-executorch will gradually increase coverage but we should have a coherent solution for people wanting to use ExecuTorch for the scenarios mentioned above.

@guangy10 (Contributor) commented Feb 20, 2025

@guangy10 I think it's a very good user experience to leverage optimum-executorch whenever it's possible. I can see some limitations though:

  1. Do we have a good story for the models not supported by optimum-executorch?

Trying to get more clarity on this. Do you mean models that are not published to Hugging Face? For example, private modeling code or weights from an end user?

  2. What should we tell a user to do, if they want to lower the model to a backend other than xnnpack? What about quantization?

Moving forward I suppose optimum-executorch will gradually increase coverage but we should have a coherent solution for people wanting to use ExecuTorch for the scenarios mentioned above.

My initial thinking was that we should host all model code rewriting and recipes (for other backends and with quantization) in optimum, since that's how other accelerators and backends work on HF. After chatting with @tarun292, I'm open to hosting canonical recipes in our repo and allowing customization elsewhere, by leveraging @tarun292's work. I shared the doc with you 😄 (both my proposal and Tarun's are WIP, so I won't share them publicly here), and we can discuss it in tomorrow's meeting.

@jackzhxng (Contributor, Author)

@guangy10 Definitely possible. I'm just highlighting that optimum-executorch at the moment is pretty basic and only supports a simple XNNPACK export, so it will likely take a good amount of effort to match the performance of our llama_transformer.

self.wq = nn.Linear(self.dim, self.n_heads * self.head_dim, bias=False)
self.wk = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=False)
self.wv = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=False)
self.attention_qkv_bias = args.attention_qkv_bias
A Contributor commented on this diff:
Should we add the same to static_attention, so that Qwen would work with accelerators?
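
A guess at how the flag would be threaded through the projections, here and presumably in static_attention as well; this is a sketch of the intent, not the literal diff:

# Hedged sketch: apply the new attention_qkv_bias flag to the q/k/v projections
# (Qwen 2.5 uses biased qkv projections, unlike the original Llama layers).
self.attention_qkv_bias = args.attention_qkv_bias
self.wq = nn.Linear(self.dim, self.n_heads * self.head_dim, bias=self.attention_qkv_bias)
self.wk = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=self.attention_qkv_bias)
self.wv = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=self.attention_qkv_bias)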

@larryliu0820 (Contributor)

Do you mean models that are not published to Hugging Face? For example, private modeling code or weights from an end user?

No, I mean something like "Falconsai/text_summarization": although it's hosted on HF, that model can't leverage optimum-executorch. I haven't looked into the details, but from a user's perspective I'd be curious to learn which models are supported by optimum-executorch.
