Acceleration similar to llama.cpp? #51
Comments
I opened a PR against llama.cpp: ggerganov/llama.cpp#3009
With the latest version of llama.cpp:

- Install the Python dependencies: `python3 -m pip install -r requirements.txt`
- Use the latest development version of the gguf package: `cd llama.cpp/gguf-py` and install it from there
- Convert: `python convert-baichuan-hf-to-gguf.py /Users/wy/Downloads/Baichuan2-13B-Chat --outfile Baichuan2-13B-Chat-ggml-model-f16.gguf`
- Quantize: `./build/bin/quantize ./Baichuan2-13B-Chat-ggml-model-f16.gguf ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf q4_0`
- Run: `./build/bin/server -ngl 0 -m ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf -c 4096 --embedding`

On a 2015 MacBook Pro this runs inference perfectly. A consolidated script is sketched below.
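For convenience, the same workflow as a single script. This is only a sketch: the model path under /Users/wy/Downloads and the output file names are taken from the comment above, llama.cpp is assumed to be already built into ./build, and the editable install of gguf-py is my assumption about what "use the development version of gguf" means.

```bash
#!/usr/bin/env bash
set -e  # stop at the first failing step

cd llama.cpp

# 1. Install llama.cpp's Python dependencies and the in-tree (development) gguf package
python3 -m pip install -r requirements.txt
python3 -m pip install --editable ./gguf-py   # assumption: editable install of the dev gguf

# 2. Convert the Hugging Face checkpoint to a GGUF file in f16
python convert-baichuan-hf-to-gguf.py /Users/wy/Downloads/Baichuan2-13B-Chat \
    --outfile Baichuan2-13B-Chat-ggml-model-f16.gguf

# 3. Quantize to 4-bit (q4_0)
./build/bin/quantize ./Baichuan2-13B-Chat-ggml-model-f16.gguf \
    ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf q4_0

# 4. Serve the quantized model (CPU only: -ngl 0), 4096-token context, embeddings enabled
./build/bin/server -ngl 0 -m ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf -c 4096 --embedding
```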
Not sure why, but with llama.cpp the inference results through the server are worse than through the command line. The command line is fine; through the server the output gets strange. Model: Baichuan2-13B-Chat. (A sample server request is sketched below.)
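To reproduce the comparison, one way is to send the same prompt to the running server over HTTP. This is a sketch rather than part of the original comment: it assumes the server's default address (127.0.0.1:8080) and uses the POST /completion endpoint and JSON fields from llama.cpp's server documentation.

```bash
# Send the same prompt used on the command line to the HTTP server and pretty-print the reply.
# Host and port are the server defaults and an assumption here.
curl -s http://127.0.0.1:8080/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "从前有一只小狐狸,他", "n_predict": 128, "temperature": 0}' \
    | python3 -m json.tool
```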
@jameswu2014 @dlutsniper quantize gguf model failed on
Solved by building a docker image from
I followed these steps to quantize the chat version of Baichuan2-7B, but most of the answers come out in English and the quality is very poor. I don't know what the cause is.
Problem solved. You need to follow @jameswu2014's steps and convert Baichuan2 to Baichuan1 first, so the current version of llama.cpp cannot directly convert Baichuan2 models. (A sketch of that conversion follows.)
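For reference, I believe the conversion referred to here is the one described in the official Baichuan2 repository: Baichuan2 uses a normalized output head (NormHead), so normalizing lm_head.weight produces a checkpoint that the existing Baichuan1-style converter can handle. A minimal sketch, with the directory paths as placeholders rather than values from this thread:

```bash
# Normalize lm_head.weight so the Baichuan2 checkpoint can be converted like Baichuan1.
# Requires torch and transformers; the src/dst paths below are placeholders.
python3 - <<'EOF'
import torch
from transformers import AutoModelForCausalLM

src = "/path/to/Baichuan2-7B-Chat"        # original Baichuan2 checkpoint (placeholder)
dst = "/path/to/Baichuan2-7B-Chat-as-v1"  # output directory for the converted model (placeholder)

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float16, trust_remote_code=True)
model.lm_head.weight.data = torch.nn.functional.normalize(model.lm_head.weight.data)
model.save_pretrained(dst)
EOF
# Copy the tokenizer files from the original directory into the new one
# before running convert-baichuan-hf-to-gguf.py on it.
```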
This is what I get with $ ./quantize /models/baichuan2-13b-chat.gguf /models/baichuan2-13b-chat-Q4_K_M.gguf Q4_K
...
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
llama_model_quantize_internal: meta size = 2883232 bytes
[ 1/ 363] token_embd.weight - [ 5120, 125696, 1, 1], type = f16, quantizing to q4_K .. size = 1227.50 MB -> 345.23 MB | hist:
[ 2/ 363] blk.0.attn_output.weight - [ 5120, 5120, 1, 1], type = f16, quantizing to q4_K .. size = 50.00 MB -> 14.06 MB | hist:
[ 3/ 363] blk.0.ffn_gate.weight - [ 5120, 13696, 1, 1], type = f16, quantizing to q4_K .. size = 133.75 MB -> 37.62 MB | hist:
[ 4/ 363] blk.0.ffn_down.weight - [13696, 5120, 1, 1], type = f16,
get_k_quant_type : tensor cols 13696 x 5120 are not divisible by 256, required for k-quants
llama_model_quantize: failed to quantize: Unsupported tensor size encountered
main: failed to quantize model from '/output/baichuan2-13b-chat.gguf'
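The error above comes from the k-quant formats (Q4_K and friends), which pack weights in super-blocks of 256 values; the FFN dimension 13696 is not a multiple of 256, so those formats are rejected for this model. The older 32-element block formats such as q4_0 do not have that restriction, which matches the successful q4_0 quantization earlier in this thread. A fallback, reusing the file names from the failing command:

```bash
# 13696 % 256 != 0 breaks Q4_K, but 13696 % 32 == 0, so q4_0 works.
./quantize /models/baichuan2-13b-chat.gguf /models/baichuan2-13b-chat-q4_0.gguf q4_0
```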
(alpaca_env) chunzhamini@chunzhamini llama.cpp % ./main -m ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin -p '从前有一只小狐狸,他' --temp 0 -ngl 1
I have the same problem: it works fine on CPU, but not on GPU.
Same problem here. Any ideas on how to solve it?
That seems a bit different from my problem. What device are you running on?
Can a fine-tuned Baichuan2 model also be accelerated with this method?
Is there any acceleration similar to llama.cpp? None of the existing libraries seem to support accelerating Baichuan2.