I've figured it out: you need to change the memory settings in chatglm.h. Also make sure the memory implied by "max_context_length" and "max_tokens" does not exceed that value.
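To make the constraint concrete, here is a minimal sketch in plain Python. The constant names and sizes below are hypothetical, not the actual symbols in chatglm.h; the point is only to illustrate the advice above: the scratch pool is a fixed-size buffer, so the memory required by the requested context length has to fit inside it.

```python
# Illustrative only -- these constants are hypothetical placeholders,
# not the real values or names from chatglm.h.
SCRATCH_SIZE = 1024 * 1024 * 1024   # assumed scratch pool size in bytes
BYTES_PER_CTX_TOKEN = 512 * 1024    # assumed per-token scratch cost

def fits_in_scratch(max_context_length: int) -> bool:
    """Return True if the requested context fits in the fixed scratch pool."""
    return max_context_length * BYTES_PER_CTX_TOKEN <= SCRATCH_SIZE

fits_in_scratch(2048)  # within budget
fits_in_scratch(2400)  # exceeds it -> the ggml scratch-pool error
```

With these assumed numbers, 2048 tokens just fits while 2400 overflows, which matches the 2048 threshold reported below; the real cutoff depends on the actual buffer sizes compiled into chatglm.cpp.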
First off, big praise for this project's inference speedup!
Environment: Linux, Python 3.8
I'm using the Python bindings of the compiled chatglm.cpp module with a q4_0-quantized chatglm2-6b.
Inference settings:
generation_kwargs = dict(
    max_length=6000,
    max_context_length=2400,
    do_sample=args.temp > 0,
    top_k=args.top_k,
    top_p=args.top_p,
    temperature=args.temp,
    repetition_penalty=args.repeat_penalty,
    stream=True,
)
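For reference, the settings above can be sanity-checked in plain Python before handing them to the pipeline. The check function below is just an illustration I wrote, not part of chatglm.cpp, and the concrete sampling values filled in for the `args.*` fields are made up for the example:

```python
# Hypothetical sanity check for the generation kwargs above
# (not part of chatglm.cpp).
def check_generation_kwargs(kwargs: dict) -> None:
    # The prompt (context) must leave room inside the overall length budget.
    if kwargs["max_context_length"] > kwargs["max_length"]:
        raise ValueError("max_context_length cannot exceed max_length")
    if kwargs["do_sample"] and not (0.0 < kwargs["top_p"] <= 1.0):
        raise ValueError("top_p must be in (0, 1]")

generation_kwargs = dict(
    max_length=6000,
    max_context_length=2400,
    do_sample=True,          # placeholder for args.temp > 0
    top_k=0,                 # placeholder for args.top_k
    top_p=0.7,               # placeholder for args.top_p
    temperature=0.95,        # placeholder for args.temp
    repetition_penalty=1.0,  # placeholder for args.repeat_penalty
    stream=True,
)
check_generation_kwargs(generation_kwargs)  # passes for these values
```

Note that these kwargs are internally consistent; the failure below comes from the fixed scratch buffer inside the C++ code, not from the Python-side settings.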
When I set max_context_length > 2048 (e.g. the very long contexts that come up in langchain scenarios), it fails with:
ggml_new_tensor_impl: not enough space in the scratch memory pool
Quite a few llama.cpp users seem to have hit this; a Google search turns up reports everywhere.
Searching on Google, I found that llama-cpp-python has a similar issue:
abetlen/llama-cpp-python#356
abetlen/llama-cpp-python#356 (comment) suggests it is a memory leak in llama.cpp.
The llama.cpp project has this bug as well, and some users apparently resolved it by rolling back versions (llama-cpp-python back to 0.1.74):
ggerganov/llama.cpp#29 (comment)
ggerganov/llama.cpp#2404 (comment)
How can I work around this bug when running inference with this project? Thanks!