
Add qwen 2.5 #8355

Open · jackzhxng wants to merge 11 commits into main from jz/export_qwen
Conversation

@jackzhxng (Contributor) commented Feb 11, 2025

Summary

Add the Qwen 2.5 model. Instead of loading the model and exporting it directly, we convert the checkpoint into a format that export_llama can consume via llama_transformer.py.

The process was relatively painless. TorchTune checkpointing utilities were used, though only for convenience; they weren't strictly necessary.
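
For reference, a minimal sketch of what the conversion step could look like, assuming TorchTune's FullModelHFCheckpointer and a hand-written key remapping; this is not the actual convert_weights.py in the PR:

# Hedged sketch only: illustrates the approach described above, not the PR's script.
# Checkpoint file names, model_type, and the key mapping are assumptions.
import argparse

import torch
from torchtune.training import FullModelHFCheckpointer

# Hypothetical remapping from TorchTune/HF-style parameter names to the names
# llama_transformer.py expects; the real mapping is more complete.
_KEY_MAP = {
    "attn.q_proj": "attention.wq",
    "attn.k_proj": "attention.wk",
    "attn.v_proj": "attention.wv",
    "attn.output_proj": "attention.wo",
}


def _remap(key: str) -> str:
    for old, new in _KEY_MAP.items():
        key = key.replace(old, new)
    return key


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("checkpoint_dir")
    parser.add_argument("output_path")
    args = parser.parse_args()

    # Load the Hugging Face checkpoint through TorchTune's checkpointing utils.
    checkpointer = FullModelHFCheckpointer(
        checkpoint_dir=args.checkpoint_dir,
        checkpoint_files=["model.safetensors"],  # assumption: single-shard checkpoint
        model_type="QWEN2",
        output_dir=args.checkpoint_dir,
    )
    state_dict = checkpointer.load_checkpoint()["model"]

    # Re-key the weights so export_llama can consume them via llama_transformer.py.
    converted = {_remap(k): v for k, v in state_dict.items()}
    torch.save(converted, args.output_path)


if __name__ == "__main__":
    main()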

Test plan

Convert weights:

python examples/models/qwen2_5/convert_weights.py <path-to-checkpoint-dir> <output-path>

Export (without quantization and at fp32, because the torchao quantizer doesn't support bias in 8da4w and XNNPACK doesn't support bf16):

./install_executorch.sh --pybind xnnpack && python -m examples.models.llama.export_llama   \
--model qwen2_5 --params examples/models/qwen2_5/1_5b_config.json  \
--checkpoint qwen2_5-1_5b.pth -kv --use_sdpa_with_kv_cache  \
-X -d fp32 --metadata '{"get_bos_id":151643, "get_eos_ids":[151643]}'  \
--output_name="qwen2_5-1_5b.pte" --verbose
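
After export, the .pte can be smoke-tested from Python through the pybindings built by install_executorch.sh. A rough sketch, assuming the usual (tokens, input_pos) calling convention of the llama example models:

# Hedged sketch: load the exported program and run a single decode step.
# The (tokens, input_pos) input signature is an assumption based on the llama examples.
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

module = _load_for_executorch("qwen2_5-1_5b.pte")

tokens = torch.tensor([[151643]], dtype=torch.long)  # BOS id from the metadata above
input_pos = torch.tensor([0], dtype=torch.long)

logits = module.forward((tokens, input_pos))[0]
print(logits.shape)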

Generate:

./install_executorch.sh --pybind xnnpack && python -m examples.models.llama.runner.native \
--model qwen2_5 --pte qwen2_5-1_5b.pte \
--tokenizer ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-1.5B/snapshots/8faed761d45a263340a0528343f099c05c9a4323/tokenizer.json \
--prompt "Who is the founder of Meta?" --params examples/models/qwen2_5/1_5b_config.json \
--max_len 64 --temperature 0 -kv

cc @mergennachin @iseeyuan @lucylq @helunwencser @tarun292 @kimishpatel

pytorch-bot (bot) commented Feb 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8355

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures

As of commit 44aa34d with merge base 1858086:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the "CLA Signed" label (Feb 11, 2025)
@jackzhxng added the "module: examples" label (Feb 11, 2025)
@jackzhxng added the "release notes: examples" label (Feb 12, 2025)
@jackzhxng jackzhxng requested a review from iseeyuan February 12, 2025 20:11
@jackzhxng jackzhxng marked this pull request as ready for review February 12, 2025 20:11
@jackzhxng jackzhxng marked this pull request as draft February 12, 2025 20:12
@jackzhxng jackzhxng marked this pull request as ready for review February 13, 2025 19:37
@larryliu0820 (Contributor)

Can we add a test? Basically extend EagerModelBase and add it to examples/models/__init__.py
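
For context, a minimal sketch of what such a wrapper might look like; the class name, registration note, and construction details are assumptions rather than the PR's actual code:

# Hedged sketch of an EagerModelBase wrapper for the test harness.
# The real Qwen model class in this PR may look quite different.
import torch

from executorch.examples.models.model_base import EagerModelBase


class Qwen25Model(EagerModelBase):  # hypothetical name
    def __init__(self):
        pass

    def get_eager_model(self) -> torch.nn.Module:
        # In the real example this would build the shared llama_transformer from the
        # converted checkpoint and 1_5b_config.json; a stand-in keeps the sketch runnable.
        return torch.nn.Linear(8, 8)

    def get_example_inputs(self):
        # Matches the stand-in above; the real model would take token ids plus a
        # KV-cache position, as in the llama examples.
        return (torch.randn(1, 8),)


# It would then be registered in examples/models/__init__.py, e.g. by adding an entry
# to the model-name registry there (exact registry name is an assumption).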

@jackzhxng force-pushed the jz/export_qwen branch 5 times, most recently from 5ef3fef to a0f91b0 (February 18, 2025 23:32)
@jackzhxng (Contributor, Author) commented Feb 18, 2025

@larryliu0820 I had to make some model loading changes to get the EagerModelBase test working; made a separate PR: #8552

@mergennachin (Contributor)

Can you also add a README.md page?

@guangy10 (Contributor)

@mergennachin Thank you for tagging me on this PR.

BTW, this model was enabled in Hugging Face optimum-executorch a while ago using HF's modeling code, with the simplest XNNPACK recipe (fp32) as a proof of concept. Anyone can quickly run it by following the Quick Start and replacing the model_id with "Qwen/Qwen2.5-0.5B".
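
For anyone unfamiliar, the Quick Start flow looks roughly like the sketch below; this is from memory, and the from_pretrained arguments (e.g. recipe, plus export in older versions) may differ between optimum-executorch releases:

# Hedged sketch of the optimum-executorch Quick Start with the Qwen model_id.
# API details are recalled from the optimum-executorch README and may be out of date.
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ExecuTorchModelForCausalLM.from_pretrained(model_id, recipe="xnnpack")

# Greedy generation through the ExecuTorch runtime.
print(model.text_generation(tokenizer=tokenizer, prompt="Who is the founder of Meta?", max_seq_len=64))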

IMO, the real gap preventing users from deploying this model on-device is converting the Hugging Face tokenizer into a binary format recognizable by llama_runner. Enabling this is tracked in pytorch/executorch#6813. The issue surfaces as soon as you leave the Hugging Face ecosystem and try to deploy on an actual target device. I explored this a while ago but haven't had time to revisit it. We should prioritize filling this gap if we consider it high priority.

I can see that this PR adds an additional path for users to convert HF weights using the ET-friendly modeling code (llama_transformer.py), which is good. However, it also adds more examples, which I'm somewhat concerned about, depending on how far we want to go down this path. Do we plan to keep adding tens of thousands of examples to our repo? If not, where do we draw the line?

@jackzhxng (Contributor, Author)

@guangy10 I don't think this is apples-to-apples with what's on Hugging Face. What's here isn't just a proof of concept; it can freely leverage all of our previous work optimizing the llama_transformer for our Llama models. This involves source transformation for custom flash attention kernels (--use_sdpa_with_kv_cache), quantization (embedding, kv_cache, and linear), etc. I haven't done any benchmarks yet since I still have a blocker on the torchao side for quantization, but I expect this to be very performant out of the box.

As for the tokenizer, we just added a json tokenizer for Python to use with pybindings here, and @larryliu0820 is working on integrating one for C++ into ExecuTorch.
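
For Python-side experimentation, the tokenizer.json can also be loaded directly with the Hugging Face tokenizers package; this is just an illustration, not necessarily the wrapper the runner uses:

# Hedged illustration: consume Qwen's tokenizer.json directly from Python.
# The runner's own json tokenizer may expose a different API.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
ids = tok.encode("Who is the founder of Meta?").ids
print(ids)
print(tok.decode(ids))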

Also, adding this specific example wasn't very difficult at all; it's much easier than writing a model from scratch. I had a few hiccups along the way, such as setting up the json tokenizer and (currently) bias support for quantization, but overall it wasn't too involved. The goal is that future transformer decoder models can be integrated quickly this way and leverage all of our pre-built performance features.

@guangy10 (Contributor)

This involves source transformation for custom flash attention kernels (--use_sdpa_with_kv_cache), quantization (embedding, kv_cache, and linear), etc.

What prevents us from using these optimizations as building blocks outside the ExecuTorch repo?

@larryliu0820 (Contributor)

@guangy10 I think it's a very good user experience to leverage optimum-executorch whenever it's possible. I can see some limitations though:

  1. Do we have a good story for the models not supported by optimum-executorch?
  2. What should we tell a user to do, if they want to lower the model to a backend other than xnnpack? What about quantization?

Moving forward I suppose optimum-executorch will gradually increase coverage but we should have a coherent solution for people wanting to use ExecuTorch for the scenarios mentioned above.

@guangy10 (Contributor) commented Feb 20, 2025

@guangy10 I think it's a very good user experience to leverage optimum-executorch whenever it's possible. I can see some limitations though:

  1. Do we have a good story for the models not supported by optimum-executorch?

Trying to get more clarity on this. Do you mean models that are not published to Hugging Face? For example, private modeling code or weights from an end user?

  2. What should we tell a user to do, if they want to lower the model to a backend other than xnnpack? What about quantization?

Moving forward I suppose optimum-executorch will gradually increase coverage but we should have a coherent solution for people wanting to use ExecuTorch for the scenarios mentioned above.

My initial thinking was that we should host all model code rewriting and recipes (for other backends and with quantization) in optimum, since that's how other accelerators and backends work on HF. After chatting with @tarun292, I'm open to hosting canonical recipes in our repo and allowing customization elsewhere, by leveraging @tarun292's work. I shared the doc with you 😄 (both my proposal and Tarun's are WIP, so I won't share them publicly here), and we can discuss it in tomorrow's meeting.

@jackzhxng (Contributor, Author)

@guangy10 Definitely possible. I'm just highlighting that optimum-executorch at the moment is pretty basic and only supports a simple XNNPACK export, so it will likely take a good amount of effort to match the performance of our llama_transformer.

self.wq = nn.Linear(self.dim, self.n_heads * self.head_dim, bias=False)
self.wk = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=False)
self.wv = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=False)
self.attention_qkv_bias = args.attention_qkv_bias
A Contributor commented on this diff:
Should we add the same to static_attention, so that Qwen would work with accelerators?
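
A guess at how the flag would be threaded through the projections, here and presumably in static_attention as well; this is a sketch of the intent, not the literal diff:

# Hedged sketch: apply the new attention_qkv_bias flag to the q/k/v projections
# (Qwen 2.5 uses biased qkv projections, unlike the original Llama layers).
self.attention_qkv_bias = args.attention_qkv_bias
self.wq = nn.Linear(self.dim, self.n_heads * self.head_dim, bias=self.attention_qkv_bias)
self.wk = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=self.attention_qkv_bias)
self.wv = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=self.attention_qkv_bias)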

@larryliu0820 (Contributor)

Do you mean models that are not published to Hugging Face? For example, private modeling code or weights from an end user?

No, I mean something like "Falconsai/text_summarization": although it's hosted on HF, that model can't leverage optimum-executorch. I haven't looked into the details, but from a user's perspective I'd be curious to learn which models are supported by optimum-executorch.
