Yi-9B model 4-bit quantization fails with an error; how can this be resolved? #457
codeman0987 asked this question in Q&A (Unanswered)
Replies: 1 comment
-
I have only tried llama.cpp's q4 quantization. Subjectively, instruction-following generation still felt fairly good with it. For reference:
https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-and-quantize
-
I am running AWQ quantization with the official code:
python quantization/awq/quant_autoawq.py --model models/01-ai__Yi-9B/ --output_dir models/yi-9b-int4 --bits 4 --group_size 128 --trust_remote_code
It fails with the following error:
Generating validation split: 214670 examples [00:03, 55198.64 examples/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (8947 > 4096). Running this sequence through the model will result in indexing errors
AWQ: 2%|████▏ | 1/48 [00:22<17:49, 22.75s/it]
Traceback (most recent call last):
File "quantization/awq/quant_autoawq.py", line 53, in <module>
run_quantization(args)
File "quantization/awq/quant_autoawq.py", line 21, in run_quantization
model.quantize(tokenizer, quant_config=quant_config)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/awq/models/base.py", line 176, in quantize
self.quantizer.quantize()
File "/opt/conda/lib/python3.8/site-packages/awq/quantize/quantizer.py", line 147, in quantize
input_feat = self._get_input_feat(self.modules[i], named_linears)
File "/opt/conda/lib/python3.8/site-packages/awq/quantize/quantizer.py", line 535, in _get_input_feat
self.inps = layer(self.inps, **module_kwargs)[0]
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 740, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 377, in forward
causal_mask = attention_mask[:, :, cache_position, : key_states.shape[-2]]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)
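
The final line of the traceback mentions `cuda:1`, which suggests the model was spread across more than one GPU and that the index tensor and the attention mask ended up on different devices. One thing that may be worth trying is exposing only a single GPU to the process through the standard `CUDA_VISIBLE_DEVICES` environment variable, so that every tensor lands on the same device. The sketch below is only that: a minimal, untested-against-this-setup illustration. The helper names `single_gpu_env` and `run_quantization_on_one_gpu` are hypothetical, the script path and flags are copied from the command above, and it assumes Yi-9B fits in the memory of one GPU during calibration.

```python
import os
import subprocess
import sys

# The quantization command from the question, with `python` replaced by the
# current interpreter. The script path and flags are copied as posted.
QUANT_CMD = [
    sys.executable, "quantization/awq/quant_autoawq.py",
    "--model", "models/01-ai__Yi-9B/",
    "--output_dir", "models/yi-9b-int4",
    "--bits", "4",
    "--group_size", "128",
    "--trust_remote_code",
]


def single_gpu_env(gpu_index=0, base_env=None):
    """Return a copy of the environment in which CUDA sees only one GPU.

    Hypothetical helper: it only sets CUDA_VISIBLE_DEVICES and leaves every
    other variable untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_quantization_on_one_gpu(gpu_index=0):
    """Run the quantization script with a single visible GPU.

    Returns the exit code of the script. Whether this avoids the device
    mismatch is an assumption, not something confirmed in this thread.
    """
    completed = subprocess.run(QUANT_CMD, env=single_gpu_env(gpu_index), check=False)
    return completed.returncode
```

Calling `run_quantization_on_one_gpu(0)` from the repository root would launch the same command as above with only GPU 0 visible. If the run then fails with an out-of-memory error instead, the single-GPU assumption does not hold for this hardware.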