Bug: sample time becomes very long when using Llama-3 #7554
Comments
I noticed there is a relevant issue: #4218.
The specific grammar might be relevant, so please provide that as well, along with the commands you are using.
The vocabulary of Llama-3 is much larger than Llama-2's, so I think it is expected that samplers will be slower. However, this seems excessive.
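To make the scaling argument concrete, here is a minimal, hypothetical C++ sketch (not llama.cpp's actual sampler code, and `grammar_accepts` is a stand-in): grammar-constrained sampling has to validate every candidate token against the grammar at each sampled position, so the per-token cost grows with vocabulary size.

```cpp
// Minimal, hypothetical sketch (NOT llama.cpp's actual sampler code) of why
// grammar-constrained sampling scales with vocabulary size: every candidate
// token must be validated against the grammar before one token is sampled.
#include <cstdio>
#include <vector>

struct Candidate {
    int   token_id;
    float logit;
};

// Stand-in for the real grammar check (an assumption for illustration);
// in llama.cpp the check decodes the token text and advances parser
// stacks, which is far more expensive than this toy rule.
static bool grammar_accepts(int token_id) {
    return token_id % 7 != 0; // arbitrary toy acceptance rule
}

int main() {
    const int n_vocab = 128256; // Llama-3 vocab size (vs. 32000 for Llama-2)

    std::vector<Candidate> candidates(n_vocab);
    for (int i = 0; i < n_vocab; ++i) {
        candidates[i] = { i, 0.0f };
    }

    // The grammar pass: one check per vocabulary entry, per sampled token.
    int rejected = 0;
    for (auto & c : candidates) {
        if (!grammar_accepts(c.token_id)) {
            c.logit = -1e9f; // effectively removes the token from sampling
            ++rejected;
        }
    }
    std::printf("checked %d candidates, rejected %d\n", n_vocab, rejected);
    return 0;
}
```

Since Llama-3's vocabulary (128,256 tokens) is roughly 4x the size of Llama-2's (32,000), this pass does about 4x the grammar checks per sampled token; a 4x factor alone, though, would not explain an 11x slowdown.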
I second this. Windows 10, running on:
No grammar:
With grammar:
11x slower.
Update:
Could you give #7587 a try and report the results?
Sure.
No grammar:
llama_print_timings: sample time = 56.97 ms / 229 runs (0.25 ms per token, 4019.52 tokens per second)
Grammar:
llama_print_timings: sample time = 167.80 ms / 150 runs (1.12 ms per token, 893.93 tokens per second)
Update: After (3e5d281):
llama_print_timings: sample time = 145.16 ms / 432 runs (0.34 ms per token, 2975.99 tokens per second)
Definitely an improvement, but still much slower with grammar.
Updated above.
What happened?
I was running Llama-3 on a 3090 and encountered the same performance problem as in #1376.
When using grammar files, sample time becomes very long and GPU utilization drops from 70%+ (when not using grammar) to about 10%.
I tried two different fine-tuned versions of Llama-3 and the problem remains.
With Llama-2 there is no such problem, so I believe it is caused by some kind of bug in llama.cpp.
I offloaded all layers to the GPU and I believe llama.cpp is properly configured.
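For reference, an invocation along these lines should reproduce this setup; the model file, grammar file, and prompt below are placeholders (assumptions), since the exact commands were not included in the report:

```sh
./main -m llama-3-8b-instruct.Q8_0.gguf -ngl 99 \
    --grammar-file grammars/json.gbnf \
    -p "Generate a JSON object describing a book." -n 256
```

Here `-ngl 99` offloads all layers to the GPU, and `grammars/json.gbnf` is one of the sample grammars that ships with the repository.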
Name and Version
version: 2998 (9588f19)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output