workaround of AWQ for Turing GPUs #1252
Conversation
@twaka Thanks for this work. Would you mind upstreaming this into AutoAWQ as well? |
@casper-hansen I'm glad to see it added to AutoAWQ, and the benchmark results are excellent! To be honest, I'm not yet familiar enough with the AutoAWQ codebase to find where the changes need to be made ;) |
@twaka Hi, thanks for sharing this workaround. It works for me with tensor parallelism of 2. When I use TP4, it throws the error: "Group size should be a multiple of 32 in gemm kernel". My AWQ model uses group size 128 and 4-bit. Do you have any ideas? |
@esmeetu I'm happy to hear it works with TP2. Though I don't have an environment to run with TP4, I think we can isolate your issue by running TP4 with Ampere GPUs to see if the error persists. |
@twaka Yes, it should be a common problem. I opened a new issue about that. |
I observed the same output from T4 vs other GPUs. I will see if I have time to measure perplexity before I merge it - I expect it to be the same. 4 vs 2 instructions is obviously going to be slower but it’s still decently fast on T4 to the point of being usable. |
It's a bit tricky to test this branch. I had to fork my own branch and merge #1290 to be able to build, and set… The root cause looks to be an… Has anyone found a reliable way to test this branch on Turing GPUs? Let me know; I would gladly appreciate the chance to test the AWQ export of my model with vLLM. |
When you merged the PR into your fork, you forgot to merge the updated kernels from this PR @wasertech |
It works! I’ve seen a remarkable improvement, going from approximately 39 tokens per second to a speedy 86! What’s even more impressive is that the model size is now less than 4 GB on disk. I’d like to extend a special shoutout to @casper-hansen for consistently steering me in the right direction and to @twaka for making the process of quantization on vLLM much more accessible. Of course, a heartfelt thanks to everyone who made this achievement possible! |
@wasertech I am curious which model size you use? I cannot achieve that speed on my T4. I get 35 t/s for a 7B AWQ model, and 7B FP16 only gets 16 t/s. |
@esmeetu, you’ve certainly doubled your token throughput 😅. I used to use assistant-llama2-7b-chat, but now I’m using assistant-llama2-7b-chat-awq; both are fine-tuned models based on a PEFT adapter for QLoRA, from Photolens/llama-2-7b-langchain-chat. Edit: I’m not using a T4 but an RTX Titan, which can explain the difference in throughput. |
@wasertech Yeah, but I meant: how did you get 86 tokens/s? Might your batch size be 2? 🫨 |
@esmeetu I’m using the default batch size. I took the server example code and slightly modified it here to only stream back the output, not the input with it. I really think the GPU is the biggest difference at play here… |
@wasertech Which GPU do you use? |
@esmeetu https://www.nvidia.com/en-us/deep-learning-ai/products/titan-rtx/ |
@wasertech Thank you! I'm sorry that I missed that message you sent.😆 |
As I noted in my PR in AutoAWQ, this PR in vLLM enables older GPUs. |
@twaka Thanks for the fix! Sorry for the late reply; I was very busy over the last week. Left a very minor comment. Please check it out.
BTW, I've also checked that this PR works for 1 and 2 T4 GPUs. @twaka Thanks for the great work! |
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
@WoosukKwon Thanks, updated. |
@twaka LGTM! Many thanks again for the fix!
I used the project (https://github.com/casper-hansen/AutoAWQ) to quantize the yi-34b-chat model; then, when I ran the vLLM demo, I also hit “RuntimeError: CUDA error: device-side assert triggered”. How did you solve it?
[truncated environment dump: PyTorch 2.1.2+cu118 (via modelscope), nvcc present, pip list cut off; prompt processing stalls at 0%] |
@mingyangAbc As @casper-hansen greatly pointed out… Anyway, I don't really remember exactly what I did, but here is the gist: using Docker, I built vLLM and that's all: DOCKER_BUILDKIT=1 docker build . --target vllm --tag vllm --build-arg max_jobs=24. My mistake was trying to build with the wrong… In any case, your issue might not be caused by the same situation. It's probably best to open a new issue rather than commenting on a PR merged 3 months ago. Also, maybe your model (Yi-34B) doesn't support AWQ quantization at the moment? There is so much difference between your case and mine that you should just open a proper issue. |
ok, thanks. I will open a new issue. |
I was just wondering whether support can be extended to Pascal-class GPUs such as the P100? I'm not sure which intrinsics are missing compared to Turing (if any). |
@cduk Unfortunately, your GPU has a compute capability of 6.0, which is insufficient for AWQ quantization; it requires a compute capability of 7.5 or above. The Pascal architecture lacks several key features present in more recent architectures like Turing, such as…
This PR is already a godsend for the Turing architecture, as we need to compute two operations where newer architectures (compute capability >= 8.0) only have to compute one… I'm not saying it couldn't be done, but it would require more computation steps (4, maybe 8 or more) and would therefore be slow. |
As far as I saw, only the mma.sync.aligned.m16n8k16 op requires sm_80. For sm_75, using two mma.sync.aligned.m16n8k8 ops can yield the same result. It may not be optimal for performance, but it works, at least for anyone who wants to try AWQ with Turing GPUs.
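For anyone curious what that substitution looks like at the kernel level, here is a minimal sketch, not the exact vLLM kernel: the fragment arrays and the helper name mma_m16n8k16_f32acc are illustrative assumptions. It shows one m16n8k16 MMA being replaced on sm_75 by two m16n8k8 MMAs that split the K dimension and accumulate into the same output fragment, gated on __CUDA_ARCH__.

```cuda
// Sketch only: illustrates the m16n8k16 -> 2x m16n8k8 substitution described
// above. Fragment layouts follow the PTX ISA docs; this is not the vLLM kernel.
#include <cuda_fp16.h>

// a[0..3]: packed f16 A fragment (a[0], a[1] cover k = 0..7; a[2], a[3] cover k = 8..15)
// b[0..1]: packed f16 B fragment (b[0] covers k = 0..7; b[1] covers k = 8..15)
// c[0..3], d[0..3]: f32 accumulator in / out
__device__ __forceinline__ void mma_m16n8k16_f32acc(float d[4],
                                                    const unsigned a[4],
                                                    const unsigned b[2],
                                                    const float c[4]) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
  // Ampere and newer: one instruction covers the whole 16x8x16 tile.
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
      : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]),
        "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
#else
  // Turing (sm_75): split K in half. The first MMA consumes k = 0..7 ...
  asm volatile(
      "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
      : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
      : "r"(a[0]), "r"(a[1]), "r"(b[0]),
        "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
  // ... and the second MMA consumes k = 8..15, accumulating on top of d.
  asm volatile(
      "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
      : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
      : "r"(a[2]), "r"(a[3]), "r"(b[1]),
        "f"(d[0]), "f"(d[1]), "f"(d[2]), "f"(d[3]));
#endif
}
```

The accumulate-through pattern in the second MMA is why Turing ends up issuing two tensor-core instructions per tile where Ampere issues one, which matches the slower-but-still-usable behaviour reported on T4 earlier in the thread.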