Add qwen 2.5 #8355
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8355
Note: Links to docs will display an error until the docs builds have been completed.
❌ 6 New Failures as of commit 44aa34d with merge base 1858086. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 36152fb to 9cc5238.
Force-pushed from 3cb27bc to d4a3f91.
Can we add a test? Basically extend
Force-pushed from 5ef3fef to a0f91b0.
Force-pushed from a0f91b0 to 52d7a11.
@larryliu0820 had to make some model loading changes to make the
Can you also add a README.md page?
@mergennachin Thank you for tagging me on this PR. BTW, this model was enabled in Hugging Face optimum-executorch a while ago using HF's modeling code, with the simplest XNNPACK recipe (

IMO, the real gap preventing users from deploying this model on-device is converting the Hugging Face tokenizer into a binary format recognizable by llama_runner. Enabling this feature is tracked in pytorch/executorch#6813. The issue arises after leaving the Hugging Face ecosystem and attempting deployment on the actual target device. I explored this a while ago but haven't had the time to revisit it yet. We should prioritize filling this gap if we think it's high priority.

I can see that this PR adds an additional path for users to convert HF weights using the ET-friendly modeling code (
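For reference, the optimum-executorch XNNPACK path mentioned above looks roughly like the sketch below; the CLI flags and the model id are assumptions and may differ by version.

```bash
# Hedged sketch: export Qwen 2.5 through optimum-executorch with the plain
# XNNPACK recipe. Flag names and the model id are assumptions.
optimum-cli export executorch \
  --model Qwen/Qwen2.5-0.5B \
  --task text-generation \
  --recipe xnnpack \
  --output_dir qwen2_5_et
```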
@guangy10 I don't think this is apples-to-apples with what's on Hugging Face. What's here isn't just a proof of concept; it can freely leverage all of our previous work in optimizing the

As for the tokenizer, we just added a json tokenizer for Python to use with pybindings here, and @larryliu0820 is working on integrating one for C++ into ExecuTorch.

Also, adding this specific example wasn't very difficult at all; it's much easier than writing a model. I had a few hiccups along the way dealing with setting up things such as the json tokenizer and (currently) bias support for quantization, but overall it wasn't too involved. The goal is that future transformer decoder models can quickly be integrated this way to leverage all of our pre-built performance features.
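As an illustration of the artifact involved (not the ExecuTorch wrapper itself), a `tokenizer.json` can be consumed from Python roughly like this, using the Hugging Face `tokenizers` package:

```python
# Illustration only: loading a Hugging Face tokenizer.json from Python.
# The ExecuTorch json tokenizer mentioned above has its own wrapper; this
# just shows the kind of artifact the runner needs to understand.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # placeholder path
encoding = tok.encode("Who is the president of the US?")
print(encoding.ids)              # token ids
print(tok.decode(encoding.ids))  # round-trip back to text
```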
What prevents using these optimizations as building blocks outside the ExecuTorch repo?
@guangy10 I think it's a very good user experience to leverage optimum-executorch whenever possible. I can see some limitations, though:
Moving forward, I suppose optimum-executorch will gradually increase coverage, but we should have a coherent solution for people wanting to use ExecuTorch for the scenarios mentioned above.
Trying to get more clarity on this. Do you mean if the model is not published to Hugging Face? For example, some private modeling code or weights from an end user?
Initially I think we should host all model code rewriting and recipes (to other backends and with quantization) in
@guangy10 Definitely possible. I'm just highlighting that optimum-executorch at the moment is pretty basic and only supports a simple XNNPACK export, so it will likely take a good amount of effort to match the performance of our llama_transformer.
```python
self.wq = nn.Linear(self.dim, self.n_heads * self.head_dim, bias=False)
self.wk = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=False)
self.wv = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=False)
self.attention_qkv_bias = args.attention_qkv_bias
```
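Qwen 2.5 uses biases on the query/key/value projections, which is what the new `attention_qkv_bias` flag is for. A hedged sketch of how the flag would be applied to these projections (not the exact diff in this PR; it assumes the flag is set before the linear layers are constructed):

```python
# Sketch: enable Q/K/V bias when the model config requests it (e.g. Qwen 2.5).
# Assumes `args.attention_qkv_bias` exists on the model args.
self.attention_qkv_bias = args.attention_qkv_bias
self.wq = nn.Linear(self.dim, self.n_heads * self.head_dim, bias=self.attention_qkv_bias)
self.wk = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=self.attention_qkv_bias)
self.wv = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=self.attention_qkv_bias)
```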
Should we add the same to static_attention, so that Qwen would work with accelerators?
No. Like "Falconsai/text_summarization": although it's hosted by HF, that model can't leverage optimum-executorch. I haven't looked into the details, but from a user's perspective I would be curious to learn which models are supported by optimum-executorch.
Summary
Add the Qwen 2.5 model. Instead of loading the model and directly exporting, we load it into a format for `export_llama` to consume via `llama_transformer.py`. It was a relatively painless process; I used TorchTune checkpointing utils, but didn't need to (they were only for convenience).
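The general idea behind the conversion is to remap Hugging Face checkpoint keys to the names `llama_transformer.py` expects. A minimal sketch, assuming a safetensors input and illustrative key names (the real conversion script in this PR will differ in detail):

```python
# Hedged sketch of converting a Hugging Face Qwen 2.5 checkpoint into a
# llama_transformer-style state dict. Key names are illustrative assumptions.
import torch
from safetensors.torch import load_file

hf_state = load_file("Qwen2.5-1.5B/model.safetensors")  # placeholder path

# Top-level remaps; per-layer keys (e.g. self_attn.q_proj -> attention.wq)
# would be handled with a similar table plus a loop over layer indices.
key_map = {
    "model.embed_tokens.weight": "tok_embeddings.weight",
    "model.norm.weight": "norm.weight",
    "lm_head.weight": "output.weight",
}

converted = {key_map.get(k, k): v for k, v in hf_state.items()}
torch.save(converted, "qwen2_5_checkpoint.pth")
```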
Test plan
Convert weights:
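A sketch of this step, assuming the conversion script added in this PR lives under `examples/models/qwen2_5/` (the path and arguments are assumptions):

```bash
# Assumed script location and arguments; adjust to the actual convert_weights.py.
python examples/models/qwen2_5/convert_weights.py \
  ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-1.5B/snapshots/<snapshot> \
  /tmp/qwen2_5_checkpoint.pth
```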
Export (without quantization and at `fp32`, because the ao quantizer doesn't support bias in 8da4w and XNNPACK doesn't support bf16), then generate:
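A hedged sketch of both steps; the flags, config path, and runner invocation are assumptions based on the usual `export_llama` / llama runner flow and may not match the exact commands used for this PR:

```bash
# Export to XNNPACK at fp32, no quantization (assumed flags).
python -m examples.models.llama.export_llama \
  --checkpoint /tmp/qwen2_5_checkpoint.pth \
  --params examples/models/qwen2_5/config.json \
  -kv -X -d fp32 \
  --output_name qwen2_5.pte

# Generate with the C++ llama runner (assumes a CMake build of llama_main;
# the tokenizer format it accepts may differ, see the tokenizer discussion above).
cmake-out/examples/models/llama/llama_main \
  --model_path qwen2_5.pte \
  --tokenizer_path tokenizer.json \
  --prompt "Who is the president of the US?"
```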
cc @mergennachin @iseeyuan @lucylq @helunwencser @tarun292 @kimishpatel