
Acceleration similar to llama.cpp? #51

Open
lx0126z opened this issue Sep 7, 2023 · 15 comments

Comments

@lx0126z

lx0126z commented Sep 7, 2023

Is there acceleration similar to llama.cpp? The existing libraries don't seem to support acceleration for Baichuan2.

@jameswu2014
Contributor

I submitted a PR to llama.cpp: ggerganov/llama.cpp#3009
First convert the model's lm_head following the Baichuan2 -> Baichuan1 instructions in the Baichuan2 README; then you can use the changes in the linked PR.
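
For reference, the lm_head conversion that README describes amounts to a one-off weight normalization. A minimal sketch, assuming the Hugging Face checkpoint layout (the checkpoint path and output directory are placeholders):

import torch
from transformers import AutoModelForCausalLM

# Baichuan2 normalizes its output head at inference time ("NormHead").
# Baking that normalization into lm_head once yields a Baichuan1-style
# head that the existing llama.cpp converter understands.
model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/Baichuan2-13B-Chat",  # placeholder checkpoint path
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
model.lm_head.weight.data = torch.nn.functional.normalize(model.lm_head.weight.data)
model.save_pretrained("./Baichuan2-13B-Chat-converted")  # placeholder output dir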

@dlutsniper

Latest version of llama.cpp

Install Python dependencies

python3 -m pip install -r requirements.txt

Latest development version of gguf

cd llama.cpp/gguf-py
pip install --editable .

Convert

python convert-baichuan-hf-to-gguf.py /Users/wy/Downloads/Baichuan2-13B-Chat --outfile Baichuan2-13B-Chat-ggml-model-f16.gguf
Output: 27.8 GB

Quantize

./build/bin/quantize ./Baichuan2-13B-Chat-ggml-model-f16.gguf ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf q4_0
Output: 7.99 GB

Run

./build/bin/server -ngl 0 -m ./Baichuan2-13B-Chat-ggml-model-q4_0.gguf -c 4096 --embedding

Inference works perfectly on a 2015 MacBook Pro.
One small question: does the prompt template need to be adjusted? @jameswu2014
[image]
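
(On the prompt-template question: a hedged sketch of Baichuan2's chat framing. It assumes the reserved tokens from Baichuan2-13B-Chat's generation_config, where user_token_id = 195 and assistant_token_id = 196; the token strings below are assumptions, not something stated in this thread:)

# Frame a user turn with Baichuan2's reserved user/assistant tokens
# instead of a plain-text template.
def build_baichuan2_prompt(user_message: str) -> str:
    user_token = "<reserved_106>"       # assumed string for token id 195
    assistant_token = "<reserved_107>"  # assumed string for token id 196
    return f"{user_token}{user_message}{assistant_token}"

print(build_baichuan2_prompt("你好"))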

@dlutsniper

For some reason, inference results through the llama.cpp server are not as good as through the command line. The command line works fine, but the server output is quite strange.
Any idea what's going on here? @jameswu2014

Model: Baichuan2-13B-Chat
Latest version of llama.cpp

main inference command line
[image]
main inference test results
[image]

server command line
[image]
server test results
[image]
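
(One variable worth ruling out here: main feeds the model exactly the prompt you give it, while the server applies no chat template at all, so the client has to send the fully formatted prompt itself. A minimal sketch of querying the server started above, assuming the default port 8080 and the reserved-token framing from the earlier sketch:)

import json
import urllib.request

# POST a completion request to the llama.cpp server's /completion endpoint.
# The prompt must already carry Baichuan2's chat framing; the server will
# not add it for you.
payload = {
    "prompt": "<reserved_106>你好<reserved_107>",
    "n_predict": 128,
    "temperature": 0.0,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])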

@songkq

songkq commented Sep 18, 2023

@jameswu2014 @dlutsniper Quantizing the gguf model failed on an RTX 3090 with Driver Version 525.105.17 and CUDA Version 12.0. Could you please give some advice on this issue?

./quantize /workspace/llama.cpp/models/Baichuan2-13B-Chat-ggml-model-f16.gguf /workspace/llama.cpp/models/Baichuan2-13B-Chat-ggml-model-Q8_0.gguf 7

CUDA error 804 at /llama.cpp/ggml-cuda.cu:5522: forward compatibility was attempted on non supported HW
current device: 0

@songkq

songkq commented Sep 18, 2023


Solved by building a docker image from nvidia/cuda:12.0.0-devel-ubuntu22.04
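
(A hedged sketch of what that image build can look like; the clone URL and the LLAMA_CUBLAS make flag follow the llama.cpp README of that period, and everything else is illustrative:)

# Dockerfile: build llama.cpp with cuBLAS inside a CUDA 12.0 devel image,
# so the toolkit in the container matches the 525-series host driver.
FROM nvidia/cuda:12.0.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y build-essential git
RUN git clone https://github.com/ggerganov/llama.cpp /llama.cpp
WORKDIR /llama.cpp
RUN make LLAMA_CUBLAS=1

Then run quantize inside the container with the GPU exposed, e.g. docker run --gpus all -v /workspace:/workspace IMAGE /llama.cpp/quantize ...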

@zhangqiangauto

zhangqiangauto commented Sep 19, 2023


I quantized the chat version of Baichuan2-7B following these steps, but the Q&A output is mostly in English and of very poor quality; I don't know why.
Attached is the log: debug.txt

@zhangqiangauto


Problem solved. You need to follow @jameswu2014's steps and convert Baichuan2 to Baichuan1 first; the current version of llama.cpp cannot yet convert Baichuan2 models directly.

@aisensiy

I submitted a PR to llama.cpp: ggerganov/llama.cpp#3009 First convert the model's lm_head following the Baichuan2 -> Baichuan1 instructions in the Baichuan2 README; then you can use the changes in the linked PR.

This works for Q8_0, Q5_0, and Q4_0, but fails for the others with this error message:

$ ./quantize /models/baichuan2-13b-chat.gguf /models/baichuan2-13b-chat-Q4_K_M.gguf Q4_K

...
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:  282 tensors
llama_model_quantize_internal: meta size = 2883232 bytes
[   1/ 363]                    token_embd.weight - [ 5120, 125696,     1,     1], type =    f16, quantizing to q4_K .. size =  1227.50 MB ->   345.23 MB | hist: 
[   2/ 363]             blk.0.attn_output.weight - [ 5120,  5120,     1,     1], type =    f16, quantizing to q4_K .. size =    50.00 MB ->    14.06 MB | hist: 
[   3/ 363]                blk.0.ffn_gate.weight - [ 5120, 13696,     1,     1], type =    f16, quantizing to q4_K .. size =   133.75 MB ->    37.62 MB | hist: 
[   4/ 363]                blk.0.ffn_down.weight - [13696,  5120,     1,     1], type =    f16, 

get_k_quant_type : tensor cols 13696 x 5120 are not divisible by 256, required for k-quants
llama_model_quantize: failed to quantize: Unsupported tensor size encountered

main: failed to quantize model from '/output/baichuan2-13b-chat.gguf'
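
(The failing check is plain arithmetic: k-quants pack weights into 256-element super-blocks, so every tensor row must be a multiple of 256 wide, while Q8_0/Q5_0/Q4_0 use 32-element blocks. Baichuan-13B's feed-forward width fails the first test and passes the second; a quick check, not llama.cpp code:)

n_ff = 13696       # Baichuan-13B feed-forward width (the ffn_down rows above)
print(n_ff % 256)  # 128 -> not a multiple of 256, so k-quants (Q4_K etc.) refuse it
print(n_ff % 32)   # 0   -> a multiple of 32, so Q8_0/Q5_0/Q4_0 quantize fine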

@chunzha1

(alpaca_env) chunzhamini@chunzhamini llama.cpp % ./main -m ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin -p '从前有一只小狐狸,他' --temp 0 -ngl 1
Log start
main: warning: changing RoPE frequency base to 0 (default 10000.0)
main: warning: scaling RoPE frequency by 0 (default 1.0)
main: build = 1270 (c091cdf)
main: built with Apple clang version 14.0.3 (clang-1403.0.22.14.1) for arm64-apple-darwin22.5.0
main: seed = 1695699630
llama_model_loader: loaded meta data with 20 key-value pairs and 363 tensors from ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 5120, 125696, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 7: blk.1.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 8: blk.1.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 9: blk.1.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 13: blk.2.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 14: blk.2.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 15: blk.2.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 16: blk.2.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 17: blk.2.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 18: blk.2.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 19: blk.3.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 20: blk.3.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 21: blk.3.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 22: blk.3.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 23: blk.3.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 24: blk.3.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 25: blk.4.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 26: blk.4.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 27: blk.4.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 28: blk.4.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 29: blk.4.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 30: blk.4.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 31: blk.5.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 32: blk.5.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 33: blk.5.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 34: blk.5.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 35: blk.5.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 36: blk.5.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 37: blk.6.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 38: blk.6.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 39: blk.6.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 40: blk.6.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 41: blk.6.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 42: blk.6.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 43: blk.7.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 44: blk.7.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 45: blk.7.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 46: blk.7.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 47: blk.7.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 48: blk.7.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 49: blk.8.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 50: blk.8.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 51: blk.8.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 52: blk.8.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 53: blk.8.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 54: blk.8.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 55: blk.9.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 56: blk.9.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 57: blk.9.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 58: blk.9.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 59: blk.9.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 60: blk.9.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 61: blk.10.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 62: blk.10.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 63: blk.10.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 64: blk.10.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 65: blk.10.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 66: blk.10.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 67: blk.11.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 68: blk.11.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 69: blk.11.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 70: blk.11.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 71: blk.11.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 72: blk.11.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 73: blk.12.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 74: blk.12.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 75: blk.12.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 76: blk.12.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 77: blk.12.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 78: blk.12.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 79: blk.13.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 80: blk.13.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 81: blk.13.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 82: blk.0.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 83: blk.0.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 84: blk.0.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 85: blk.1.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 86: blk.1.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 87: blk.1.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 88: blk.2.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 89: blk.2.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 90: blk.2.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 91: blk.3.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 92: blk.3.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 93: blk.3.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 94: blk.4.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 95: blk.4.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 96: blk.4.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 97: blk.5.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 98: blk.5.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 99: blk.5.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 100: blk.6.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 101: blk.6.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 102: blk.6.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 103: blk.7.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 104: blk.7.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 105: blk.7.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 106: blk.8.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 107: blk.8.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 108: blk.8.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 109: blk.9.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 110: blk.9.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 111: blk.9.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 112: blk.10.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 113: blk.10.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 114: blk.10.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 115: blk.11.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 116: blk.11.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 117: blk.11.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 118: blk.12.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 119: blk.12.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 120: blk.12.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 121: blk.13.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 122: blk.13.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 123: blk.13.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 124: blk.13.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 127: blk.14.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 128: blk.14.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 129: blk.14.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 130: blk.14.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 131: blk.14.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 132: blk.14.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 133: blk.15.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 134: blk.15.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 135: blk.15.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 136: blk.15.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 137: blk.15.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 138: blk.15.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 139: blk.16.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 140: blk.16.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 141: blk.16.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 142: blk.16.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 143: blk.16.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 144: blk.16.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 145: blk.17.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 146: blk.17.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 147: blk.17.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 148: blk.17.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 149: blk.17.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 150: blk.17.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 151: blk.18.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 152: blk.18.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 153: blk.18.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 154: blk.18.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 155: blk.18.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 156: blk.18.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 157: blk.19.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 158: blk.19.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 159: blk.19.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 160: blk.19.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 161: blk.19.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 162: blk.19.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 163: blk.20.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 164: blk.20.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 165: blk.20.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 166: blk.20.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 167: blk.20.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 168: blk.20.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 169: blk.21.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 170: blk.21.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 171: blk.21.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 172: blk.21.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 173: blk.21.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 174: blk.21.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 175: blk.22.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 176: blk.22.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 177: blk.22.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 178: blk.22.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 179: blk.22.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 180: blk.22.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 181: blk.23.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 182: blk.23.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 183: blk.23.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 184: blk.23.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 185: blk.23.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 186: blk.23.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 187: blk.24.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 188: blk.24.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 189: blk.24.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 190: blk.24.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 191: blk.24.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 192: blk.24.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 193: blk.25.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 194: blk.25.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 195: blk.25.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 196: blk.25.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 197: blk.25.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 198: blk.25.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 199: blk.26.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 200: blk.26.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 201: blk.26.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 202: blk.26.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 203: blk.26.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 204: blk.26.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 205: blk.27.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 206: blk.27.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 207: blk.27.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 208: blk.27.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 209: blk.27.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 210: blk.27.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 211: blk.28.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 212: blk.28.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 213: blk.28.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 214: blk.28.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 215: blk.28.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 216: blk.28.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 217: blk.29.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 218: blk.29.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 219: blk.14.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 220: blk.14.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 221: blk.14.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 222: blk.15.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 223: blk.15.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 224: blk.15.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 225: blk.16.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 226: blk.16.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 227: blk.16.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 228: blk.17.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 229: blk.17.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 230: blk.17.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 231: blk.18.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 232: blk.18.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 233: blk.18.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 234: blk.19.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 235: blk.19.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 236: blk.19.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 237: blk.20.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 238: blk.20.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 239: blk.20.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 240: blk.21.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 241: blk.21.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 242: blk.21.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 243: blk.22.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 244: blk.22.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 245: blk.22.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 246: blk.23.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 247: blk.23.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 248: blk.23.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 249: blk.24.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 250: blk.24.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 251: blk.24.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 252: blk.25.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 253: blk.25.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 254: blk.25.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 255: blk.26.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 256: blk.26.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 257: blk.26.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 258: blk.27.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 259: blk.27.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 260: blk.27.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 261: blk.28.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 262: blk.28.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 263: blk.28.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 264: blk.29.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 265: blk.29.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 266: blk.29.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 267: blk.29.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 268: blk.29.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 269: blk.29.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 270: blk.29.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 271: blk.30.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 272: blk.30.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 273: blk.30.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 274: blk.30.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 275: blk.30.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 276: blk.30.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 277: blk.31.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 278: blk.31.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 279: blk.31.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 280: blk.31.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 281: blk.31.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 282: blk.31.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 283: blk.32.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 284: blk.32.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 285: blk.32.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 286: blk.32.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 287: blk.32.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 288: blk.32.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 289: blk.33.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 290: blk.33.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 291: blk.33.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 292: blk.33.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 293: blk.33.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 294: blk.33.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 295: blk.34.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 296: blk.34.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 297: blk.34.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 298: blk.34.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 299: blk.34.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 300: blk.34.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 301: blk.35.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 302: blk.35.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 303: blk.35.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 304: blk.35.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 305: blk.35.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 306: blk.35.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 307: blk.36.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 308: blk.36.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 309: blk.36.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 310: blk.36.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 311: blk.36.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 312: blk.36.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 313: blk.37.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 314: blk.37.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 315: blk.37.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 316: blk.37.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 317: blk.37.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 318: blk.37.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 319: blk.38.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 320: blk.38.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 321: blk.38.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 322: blk.38.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 323: blk.38.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 324: blk.38.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 325: blk.39.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 326: blk.39.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 327: blk.39.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ]
llama_model_loader: - tensor 328: blk.39.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ]
llama_model_loader: - tensor 329: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 330: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 331: output_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 332: output.weight q6_K [ 5120, 125696, 1, 1 ]
llama_model_loader: - tensor 333: blk.30.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 334: blk.30.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 335: blk.30.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 336: blk.31.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 337: blk.31.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 338: blk.31.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 339: blk.32.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 340: blk.32.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 341: blk.32.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 342: blk.33.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 343: blk.33.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 344: blk.33.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 345: blk.34.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 346: blk.34.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 347: blk.34.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 348: blk.35.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 349: blk.35.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 350: blk.35.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 351: blk.36.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 352: blk.36.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 353: blk.36.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 354: blk.37.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 355: blk.37.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 356: blk.37.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 357: blk.38.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 358: blk.38.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 359: blk.38.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 360: blk.39.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 361: blk.39.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 362: blk.39.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: baichuan.tensor_data_layout str
llama_model_loader: - kv 3: baichuan.context_length u32
llama_model_loader: - kv 4: baichuan.embedding_length u32
llama_model_loader: - kv 5: baichuan.block_count u32
llama_model_loader: - kv 6: baichuan.feed_forward_length u32
llama_model_loader: - kv 7: baichuan.rope.dimension_count u32
llama_model_loader: - kv 8: baichuan.attention.head_count u32
llama_model_loader: - kv 9: baichuan.attention.head_count_kv u32
llama_model_loader: - kv 10: baichuan.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - kv 19: general.file_type u32
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = baichuan
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 125696
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 512
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: n_ff = 13696
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 13.90 B
llm_load_print_meta: model size = 7.44 GiB (4.60 BPW)
llm_load_print_meta: general.name = Baichuan2-13B-Chat
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: mem required = 7614.46 MB (+ 400.00 MB per state)
...........................................................................................
llama_new_context_with_model: kv self size = 400.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: loading '/Volumes/WD_sn770/LLAMA2/llamacpp/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x119507430 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_row 0x119507c60 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul 0x119508180 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_row 0x1195087b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_scale 0x119508cd0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_silu 0x1195091f0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_relu 0x119509710 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu 0x119509c30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max 0x13cf059a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max_4 0x13ce07530 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf 0x13ce07b70 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf_8 0x13ce08340 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f32 0x13ce089f0 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f16 0x13ce090a0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_0 0x13ce09750 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_1 0x13ce09e00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q8_0 0x13ce0a4b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q2_K 0x13ce0ab60 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q3_K 0x13ce0b210 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_K 0x13ce0ba30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_K 0x13ce0c0e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q6_K 0x13ce0c790 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rms_norm 0x13ce0ce50 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_norm 0x13ce0d680 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f32_f32 0x13ce0dee0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x13ce0e740 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_1row 0x13ce0efa0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_l4 0x13ce0fa00 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x13ce10160 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x13ce10b20 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q8_0_f32 0x13ce11280 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x13ce119e0 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x13ce11f00 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x13ce12660 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x13ce12dc0 | th_max = 640 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x13ce13520 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x13ce13d30 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x13ce14540 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x13ce14d50 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q8_0_f32 0x13ce15560 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x13ce15d70 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x13ce16580 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x13ce16d90 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x13ce175a0 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x11950a320 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x11950ac50 | th_max = 768 | th_width = 32
ggml_metal_init: loaded kernel_rope 0x11950b3d0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_alibi_f32 0x11950bfa0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f16 0x11950c830 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f32_f32 0x11950d0c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_cpy_f16_f16 0x11950d950 | th_max = 1024 | th_width = 32
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 256.97 MB
llama_new_context_with_model: max tensor size = 503.47 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 7617.11 MB, ( 7617.61 / 10922.67)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1.48 MB, ( 7619.09 / 10922.67)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, ( 8021.09 / 10922.67)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 255.52 MB, ( 8276.61 / 10922.67)
GGML_ASSERT: ggml-metal.m:1146: false && "only power-of-two n_head implemented"
GGML_ASSERT: ggml-metal.m:1146: false && "only power-of-two n_head implemented"
zsh: abort ./main -m ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin -p 0
Following the steps above, GPU inference throws this error while CPU inference works fine. Could anyone help take a look? Mac mini M2 @jameswu2014
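
(The assert appears to come from the Metal ALiBi kernel, kernel_alibi_f32 in the load list above, which at that time only handled power-of-two head counts. Baichuan-13B uses ALiBi attention with n_head = 40, which is not a power of two, so the Metal path aborts while the CPU path, which has no such restriction, works. A quick check, not llama.cpp code:)

n_head = 40
print((n_head & (n_head - 1)) == 0)  # False -> 40 is not a power of two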

q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 255: blk.26.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 256: blk.26.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 257: blk.26.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 258: blk.27.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 259: blk.27.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 260: blk.27.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 261: blk.28.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 262: blk.28.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 263: blk.28.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 264: blk.29.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 265: blk.29.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 266: blk.29.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 267: blk.29.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 268: blk.29.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 269: blk.29.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 270: blk.29.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 271: blk.30.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 272: blk.30.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 273: blk.30.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 274: blk.30.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 275: blk.30.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 276: blk.30.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 277: blk.31.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 278: blk.31.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 279: blk.31.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 280: blk.31.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 281: blk.31.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 282: blk.31.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 283: blk.32.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 284: blk.32.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 285: blk.32.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 286: blk.32.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 287: blk.32.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 288: blk.32.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 289: blk.33.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 290: blk.33.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 291: blk.33.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 292: blk.33.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 293: blk.33.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 294: blk.33.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 295: blk.34.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 296: blk.34.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 297: blk.34.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 
] llama_model_loader: - tensor 298: blk.34.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 299: blk.34.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 300: blk.34.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 301: blk.35.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 302: blk.35.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 303: blk.35.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 304: blk.35.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 305: blk.35.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 306: blk.35.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 307: blk.36.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 308: blk.36.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 309: blk.36.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 310: blk.36.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 311: blk.36.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 312: blk.36.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 313: blk.37.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 314: blk.37.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 315: blk.37.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 316: blk.37.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 317: blk.37.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 318: blk.37.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 319: blk.38.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 320: blk.38.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 321: blk.38.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 322: blk.38.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 323: blk.38.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 324: blk.38.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 325: blk.39.attn_output.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 326: blk.39.ffn_gate.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 327: blk.39.ffn_down.weight q4_0 [ 13696, 5120, 1, 1 ] llama_model_loader: - tensor 328: blk.39.ffn_up.weight q4_0 [ 5120, 13696, 1, 1 ] llama_model_loader: - tensor 329: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 330: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 331: output_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 332: output.weight q6_K [ 5120, 125696, 1, 1 ] llama_model_loader: - tensor 333: blk.30.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 334: blk.30.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 335: blk.30.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 336: blk.31.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 337: blk.31.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 338: blk.31.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 339: blk.32.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 340: blk.32.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 341: 
blk.32.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 342: blk.33.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 343: blk.33.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 344: blk.33.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 345: blk.34.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 346: blk.34.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 347: blk.34.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 348: blk.35.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 349: blk.35.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 350: blk.35.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 351: blk.36.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 352: blk.36.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 353: blk.36.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 354: blk.37.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 355: blk.37.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 356: blk.37.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 357: blk.38.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 358: blk.38.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 359: blk.38.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 360: blk.39.attn_q.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 361: blk.39.attn_k.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 362: blk.39.attn_v.weight q4_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - kv 0: general.architecture str llama_model_loader: - kv 1: general.name str llama_model_loader: - kv 2: baichuan.tensor_data_layout str llama_model_loader: - kv 3: baichuan.context_length u32 llama_model_loader: - kv 4: baichuan.embedding_length u32 llama_model_loader: - kv 5: baichuan.block_count u32 llama_model_loader: - kv 6: baichuan.feed_forward_length u32 llama_model_loader: - kv 7: baichuan.rope.dimension_count u32 llama_model_loader: - kv 8: baichuan.attention.head_count u32 llama_model_loader: - kv 9: baichuan.attention.head_count_kv u32 llama_model_loader: - kv 10: baichuan.attention.layer_norm_rms_epsilon f32 llama_model_loader: - kv 11: tokenizer.ggml.model str llama_model_loader: - kv 12: tokenizer.ggml.tokens arr llama_model_loader: - kv 13: tokenizer.ggml.scores arr llama_model_loader: - kv 14: tokenizer.ggml.token_type arr llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 llama_model_loader: - kv 18: general.quantization_version u32 llama_model_loader: - kv 19: general.file_type u32 llama_model_loader: - type f32: 81 tensors llama_model_loader: - type q4_0: 281 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_print_meta: format = GGUF V2 (latest) llm_load_print_meta: arch = baichuan llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 125696 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 4096 llm_load_print_meta: n_ctx = 512 llm_load_print_meta: n_embd = 5120 llm_load_print_meta: n_head = 40 llm_load_print_meta: n_head_kv = 40 llm_load_print_meta: n_layer = 40 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: f_norm_eps = 
0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: n_ff = 13696 llm_load_print_meta: freq_base = 10000.0 llm_load_print_meta: freq_scale = 1 llm_load_print_meta: model type = 13B llm_load_print_meta: model ftype = mostly Q4_0 llm_load_print_meta: model params = 13.90 B llm_load_print_meta: model size = 7.44 GiB (4.60 BPW) llm_load_print_meta: general.name = Baichuan2-13B-Chat llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 1099 '<0x0A>' llm_load_tensors: ggml ctx size = 0.12 MB llm_load_tensors: mem required = 7614.46 MB (+ 400.00 MB per state) ........................................................................................... llama_new_context_with_model: kv self size = 400.00 MB ggml_metal_init: allocating ggml_metal_init: found device: Apple M2 ggml_metal_init: picking default device: Apple M2 ggml_metal_init: loading '/Volumes/WD_sn770/LLAMA2/llamacpp/llama.cpp/ggml-metal.metal' ggml_metal_init: loaded kernel_add 0x119507430 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_add_row 0x119507c60 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_mul 0x119508180 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_mul_row 0x1195087b0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_scale 0x119508cd0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_silu 0x1195091f0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_relu 0x119509710 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_gelu 0x119509c30 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_soft_max 0x13cf059a0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_soft_max_4 0x13ce07530 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_diag_mask_inf 0x13ce07b70 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_diag_mask_inf_8 0x13ce08340 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_get_rows_f32 0x13ce089f0 | th_max = 896 | th_width = 32 ggml_metal_init: loaded kernel_get_rows_f16 0x13ce090a0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_get_rows_q4_0 0x13ce09750 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_get_rows_q4_1 0x13ce09e00 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_get_rows_q8_0 0x13ce0a4b0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_get_rows_q2_K 0x13ce0ab60 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_get_rows_q3_K 0x13ce0b210 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_get_rows_q4_K 0x13ce0ba30 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_get_rows_q5_K 0x13ce0c0e0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_get_rows_q6_K 0x13ce0c790 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_rms_norm 0x13ce0ce50 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_norm 0x13ce0d680 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_f32_f32 0x13ce0dee0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x13ce0e740 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_f16_f32_1row 0x13ce0efa0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_f16_f32_l4 0x13ce0fa00 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x13ce10160 | th_max = 
896 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x13ce10b20 | th_max = 896 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_q8_0_f32 0x13ce11280 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x13ce119e0 | th_max = 640 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x13ce11f00 | th_max = 576 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x13ce12660 | th_max = 576 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x13ce12dc0 | th_max = 640 | th_width = 32 ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x13ce13520 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x13ce13d30 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x13ce14540 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x13ce14d50 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q8_0_f32 0x13ce15560 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x13ce15d70 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x13ce16580 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x13ce16d90 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x13ce175a0 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x11950a320 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x11950ac50 | th_max = 768 | th_width = 32 ggml_metal_init: loaded kernel_rope 0x11950b3d0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_alibi_f32 0x11950bfa0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f32_f16 0x11950c830 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f32_f32 0x11950d0c0 | th_max = 1024 | th_width = 32 ggml_metal_init: loaded kernel_cpy_f16_f16 0x11950d950 | th_max = 1024 | th_width = 32 ggml_metal_init: hasUnifiedMemory = true ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB ggml_metal_init: maxTransferRate = built-in GPU llama_new_context_with_model: compute buffer total size = 256.97 MB llama_new_context_with_model: max tensor size = 503.47 MB ggml_metal_add_buffer: allocated 'data ' buffer, size = 7617.11 MB, ( 7617.61 / 10922.67) ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1.48 MB, ( 7619.09 / 10922.67) ggml_metal_add_buffer: allocated 'kv ' buffer, size = 402.00 MB, ( 8021.09 / 10922.67) ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 255.52 MB, ( 8276.61 / 10922.67) GGML_ASSERT: ggml-metal.m:1146: false && "only power-of-two n_head implemented" GGML_ASSERT: ggml-metal.m:1146: false && "only power-of-two n_head implemented" zsh: abort ./main -m ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin -p 0 按照上面的步骤,GPU推理报错,CPU下正常。大佬可以帮忙看下吗?MAC MINI M2 @jameswu2014
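The assert comes from llama.cpp's Metal ALiBi kernel: the Baichuan 13B models use n_head = 40 (visible in the log above), which is not a power of two, and at the time kernel_alibi_f32 only handled power-of-two head counts. Until that limitation is lifted upstream, one possible workaround (a hedged suggestion, not verified on this setup) is to keep all layers on the CPU so the ALiBi op never reaches the Metal kernel:

./main -m ./zh-models/baichuan/Baichuan2-13B-Chat-ggml-model-q4_0.bin -ngl 0 -p 0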

I have the same problem: CPU works, but GPU does not.
CUDA error 9 at ggml-cuda.cu:6829: invalid configuration argument

@wzp123123

Same problem here — is there any idea on how to solve it?

@wzp123123

ggerganov/llama.cpp#3740

@chunzha1

That doesn't look quite like my problem. What hardware are you running on?

@guoqiangqi

guoqiangqi commented Mar 6, 2024

With the latest llama.cpp, server inference results are inaccurate after quantization. I did not convert baichuan2-13b-chat to the Baichuan1 format before quantizing — could that be the cause?
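Skipping the Baichuan2 -> Baichuan1 lm_head conversion is a likely cause: Baichuan2 normalizes the lm_head weights (NormHead), so converting the raw checkpoint produces wrong output logits. A minimal sketch of the conversion, following the Baichuan2 README (local paths here are placeholders):

import torch
from transformers import AutoModelForCausalLM

# Load the original Baichuan2 checkpoint (path is a placeholder).
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/Baichuan2-13B-Chat",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# Baichuan2's NormHead normalizes each lm_head row; baking the
# normalization into the weights makes the checkpoint behave like a
# Baichuan1-style lm_head, which the llama.cpp converter expects.
model.lm_head.weight.data = torch.nn.functional.normalize(model.lm_head.weight.data)

model.save_pretrained("/path/to/Baichuan2-13B-Chat-lmhead-converted")

Then re-run convert-baichuan-hf-to-gguf.py and quantize on the converted checkpoint.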

@VJJJJJJ1

> I opened a PR for llama.cpp: ggerganov/llama.cpp#3009. First convert the model's lm_head following the Baichuan2 -> Baichuan1 conversion in the Baichuan2 README, then you can use the changes in the linked PR.

Can a fine-tuned Baichuan2 also be accelerated with this method?
