Add AWQ quant support #762
Conversation
Great benchmarks. I was looking into implementing this myself but was waiting on your implementation. Here are my 2 cents.

EDIT: Everyone should also note that the GEMM kernels are optimized for Ampere and later architectures (e.g. RTX 3000-4000, A5000, A6000, A100, H100, etc.), i.e. they are unlikely to work well on a V100 GPU. However, I would argue this does not matter, as using a V100 would be vastly inferior in terms of both cost and speed for deployment.

**Design question**

I noticed a few extra Quant classes need to be added for every model. Here are my thoughts on how it could (potentially) be reduced to a simpler method. Instead of implementing QuantLlamaMLP and other classes for each part of every model, why not implement the replacement of the Linear layers at a lower level? For example, at the F.linear() calls inside RowParallelLinear and ColumnParallelLinear. E.g. a very naive example to instantiate:

```python
if quant_config.method is not None:
    self.linear = get_quantized_layer(in_features, out_features, quant_config)
else:
    self.linear = F.linear
```

This way, you don't have to modify the model files much, since you can just pass down your quant_config and decide at a lower level. In summary, we could simplify the code and make it easier to extend in the future.

**Replacing activations**

I noticed activations are not replaced (not sure if you tested this). In AWQ, they also replace activations in some functions with a ScaledActivation. Not sure if this makes a difference, but wanted to highlight it. |
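As a rough illustration of the factory idea above, the dispatch could look something like the sketch below. The names get_quantized_layer, QUANT_LINEAR_REGISTRY, and the quant_config fields are hypothetical, not existing vLLM APIs.

```python
import torch.nn as nn

# Hypothetical registry mapping a quantization method name ("awq", ...) to a
# drop-in nn.Module replacement for the dense linear layer.
QUANT_LINEAR_REGISTRY: dict = {}


def get_quantized_layer(in_features: int, out_features: int, quant_config) -> nn.Module:
    """Return a quantized replacement for the F.linear call, chosen from quant_config."""
    try:
        layer_cls = QUANT_LINEAR_REGISTRY[quant_config.method]
    except KeyError:
        raise ValueError(f"Unsupported quantization method: {quant_config.method}")
    return layer_cls(in_features, out_features, quant_config)
```

With something like this, RowParallelLinear and ColumnParallelLinear would only need the small if/else shown above instead of per-model Quant classes.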
@casperbh96 thanks for the feedback. Going to look into improving some of the design. Scaled activations (https://github.com/mit-han-lab/llm-awq/blob/main/awq/quantize/quantizer.py#L14): looking at the AWQ code, it seems that is only applied to MPT, Bloom and Falcon models and has no effect for Llama. Plus I have run some tests of inference quality and it seems to be fine. |
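For readers following along, the ScaledActivation wrapper in the linked AWQ repo looks roughly like the sketch below (a paraphrase, not a verbatim copy): it wraps an activation module and divides its output by learned per-channel scales so they can be folded into the surrounding quantized linears.

```python
import torch
import torch.nn as nn


class ScaledActivation(nn.Module):
    """Rough paraphrase of AWQ's ScaledActivation: wrap an activation module
    and divide its output by per-channel scales."""

    def __init__(self, module: nn.Module, scales: torch.Tensor):
        super().__init__()
        self.act = module
        self.scales = nn.Parameter(scales.data)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x) / self.scales.view(1, 1, -1).to(x.device)
```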
Updated with some improvements + removed the draft / WIP status. |
I have tested some models: your 13B Vicuna model and LLaMA 7B. These models are measured solely on tokens/s instead of throughput. Hardware is an RTX 3090 + Threadripper Pro 3955WX. Multiple prompts are measured individually. TL;DR: performance seemingly reaches 85-90% of the original work.
Note: I also tested A100 and RTX 6000 Ada, but they are not yielding better results.

vLLM example:

```python
import time
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Write me a letter to Sam Altman",
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="rirv938/wizard-vicuna-13b-uncensored-awq-4bit-g128", **{'quantization': 'awq'})

start = time.time()
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    tokens = output.outputs[0].token_ids
    end = time.time()
    elapsed = end - start
    print(output)
    print(len(tokens) / elapsed, 'tokens/s')
```

TinyChat example (need to git clone huggingface first):
|
@casperbh96 looks like tinychat does some other things like fusing layers. Might be a more efficient implementation. In particular, tinychat does:

```python
if args.precision == "W4A16" and args.model_type.lower() == 'llama':
    from tinychat.modules import make_quant_norm, make_quant_attn, make_fused_mlp
    make_quant_attn(model, args.device)
    make_quant_norm(model)
    make_fused_mlp(model)
```

For A6000, when you say it didn't yield better results, do you mean that they performed worse or comparably? I'm particularly interested in this because I intend to deploy to A5000 / A6000 hardware. |
TinyChat has a few extra things happening, yes. The performance discrepancy is so small that I would not focus on it, but if you wanted to, you should focus on the T5LayerNorm kernel that they adapted from FasterTransformer. I meant that A100, A6000, 4090, 3090 all yield similar results on the quantized models, with the A6000 being a little slower than the others. This is to be expected and is probably due to the CPU being the bottleneck. |
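For context, the T5LayerNorm mentioned there is an RMS-style layer norm (no mean subtraction, no bias); the FasterTransformer-derived kernel presumably fuses the same computation. A reference (unfused) PyTorch sketch:

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Reference RMS-style layer norm as used by T5 / LLaMA: scale by the
    reciprocal root-mean-square of the hidden dimension, no mean-centering."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)
```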
Throughput increases a lot when the QKV and gate + up proj layers are merged. EDIT: my initial estimates of the throughput increase were incorrect; it's only a modest improvement, I think. |
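For readers unfamiliar with the gate + up proj merge mentioned above, a minimal sketch is below. It is illustrative only, not the code in this PR; layer names follow the Hugging Face LLaMA convention.

```python
import torch
import torch.nn as nn


class MergedGateUpMLP(nn.Module):
    """Illustrative sketch: run gate_proj and up_proj as a single matmul and
    split the result, instead of two separate linear layers."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # One weight matrix holding gate_proj and up_proj stacked along the output dim.
        self.gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up_proj(x).chunk(2, dim=-1)
        return self.down_proj(self.act_fn(gate) * up)
```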
Good to hear! How generalizable is this to other models like MPT? |
Only tried with Llama. I'm assuming merging the linear layers can be done for most models with attention blocks, but I don't know much about the MPT model. |
From what I could find, it's only LLaMa models that can have their qkv projection fused, because they are the only ones that have one linear layer for each of q, k, v, which makes them slower. So LLaMa and InternLM seem like the ones that can benefit from this. Falcon, MPT, Qwen, and Baichuan models have their qkv operations fused already, so they should already be optimized by AWQ quantization, since they define it like this:

```python
self.Wqkv = nn.Linear(self.hidden_size, 3 * self.hidden_size, bias=False)
```

Versus LLaMa:

```python
self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
```
|
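For comparison, fusing LLaMA's separate q/k/v projections into a single matmul, as MPT's Wqkv already does, can be sketched as follows. This is illustrative only and assumes plain multi-head attention where q, k, and v all use num_heads heads (no grouped-query attention).

```python
import torch
import torch.nn as nn


class FusedQKVProjection(nn.Module):
    """Illustrative sketch: one matmul for q, k, and v instead of three."""

    def __init__(self, hidden_size: int, num_heads: int, head_dim: int):
        super().__init__()
        self.qkv_proj = nn.Linear(hidden_size, 3 * num_heads * head_dim, bias=False)

    def forward(self, x: torch.Tensor):
        # Split the single projection back into q, k, v along the last dimension.
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        return q, k, v
```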
I see this supports 4 bits. Is there a plan to add support for 8 bit quantization? |
Ultimately, I'm finding that an unquantised A100 is cheaper than quantised on an A5000 or A6000. In other words, the cheaper-hardware benefit is not making it cheaper overall. That's why I think we need better CUDA kernels for this. @jhartman @casperbh96 |
That seems surprising to me. Can you please try my branch? I integrated AWQ into RowParallel and ColumnParallel. The implementation is a little hacky for loading the model, but it worked fine for me. I suspect this could make a difference? https://github.com/casperbh96/vllm-quantisation/tree/add_awq_quant_support EDIT: I added you |
Loading a model, I get "Unable to import awq_inference_engine: run setup.py to install AWQ CUDA kernels" |
You need to run |
@ri938 Thanks for the awesome work! And sorry for the late response. I was tracking this PR, but didn't have the bandwidth to look into it. We'd love to merge this PR into our main branch. That said, we'd like to clean up the code in this PR, as we found several files are redundant. For example, IIUC, the CUDA kernels besides the ones in Thanks again for the wonderful work. I'd also like to thank everyone in the discussion: @casperbh96, @TheBloke, @jhartman. |
@WoosukKwon thanks for checking in. Much of what you mentioned has already been done in a fork. I believe there are a few items that need doing:
Please do make any improvements you see fit. The model loading code is what I see lacking the most, as it's not easy to extend to other models. |
remove not needed files
Organise
@WoosukKwon thanks and no problem with the delay.
I know @casperbh96 has some code to make it work with MPT models and also a refactor to get tensor parallelism working. I didn't merge that into this change because I didn't have the time to test and review it at the moment, so I thought it better to leave it as a future merge request. |
dont error if user doesnt have kernels installed
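Relating the "dont error if user doesnt have kernels installed" change to the import error quoted earlier: a common pattern is to defer the hard failure until AWQ is actually requested. The sketch below shows that pattern under the assumption that awq_inference_engine is the compiled kernel module; it is not necessarily how this PR implements it.

```python
# Defer failure: importing this module should not crash when the AWQ CUDA
# kernels are absent; only raise when quantized layers are actually used.
try:
    import awq_inference_engine  # compiled AWQ CUDA kernels
except ImportError:
    awq_inference_engine = None


def require_awq_kernels() -> None:
    """Raise a clear error only when AWQ kernels are needed but missing."""
    if awq_inference_engine is None:
        raise ImportError(
            "Unable to import awq_inference_engine: run setup.py to install AWQ CUDA kernels"
        )
```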
@ri938 @casperbh96 Awesome! Thanks for cleaning up the code! Could we take over the PR and do additional cleanup? Actually @julian-q has some ideas to make the quantization-related code more modular and reusable (and he also added support for TP). Of course, you will be recognized as a coauthor of this PR. |
@WoosukKwon yes, it's OK for you to take over the PR. Thanks. |
Boss, are you going to work on the tensor parallelism? Because I have 16x A100 and it is going to be a nightmare to run them one by one. |
I believe @WoosukKwon mentioned that tensor parallelism will be supported with AWQ. |
@WoosukKwon @julian-q Not sure if I'm a bit late to this, but I have a version of Row/Col layers that integrate TP & AWQ. Would love to discuss further if I can be any help. |
@belericant I believe this was already implemented by Julian. See code below. |
Just a quick note of thanks and to say that I have tested this PR and it works really well. I have had to subclass langchain so that will need a small PR once this is live:
|
@ri938 any ideas why this model would produce garbage output? rirv938/WizardLM-33B-V1.0-Uncensored-awq-4bit-g128 hd canciónbólści Zum framідnexProgram nov напskieWrapperabi go totaleacional Stuartárs;"club Phil Doctor}$- FIFAdwékpit internally premiers quatre retaya Variableello incorrectlyCy timer АлександрRemote Branch ProductionButt flying Aw Clar марта onde materialah Altern>{ Amtun thereforeUD recommendedpythonmeck Liste Blozychlej dig amongINCT Product chooseutableätz Sarah ColeFrameworkowanebeginidenteacingconde:" Ukraineindre équipeamomedia Kapuka segucitepності Ged hurt forecremote wieś`) пяysisiedinnerisiónfg converteraget patientCh+$GERзько Moписок ezboxp sede sorti acc call carri phys encontrPr systvaavigator Mattјаtac conventidenoteца KidPath позво , |
@ri938 @casper-hansen #1032 succeeded this PR and now it's merged. We've refactored the code a bit to make it extensible to other quantization methods like GPTQ and SqueezeLLM. Thanks again for the great PR! |
test:

Issues
- some TODOs to resolve (e.g. hard-coding the device for loading quantised layers)
- currently only supports Llama (not intending to add support for this in this PR)
- it scales poorly with larger batch sizes. Would be good for more optimisation. (I think this is a separate PR after / community work)