Bug: Phi3 output <|end|> randomly #8291
Comments
This happens on internlm2 too, as mentioned in the Hugging Face discussion |
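Since the symptom is reported for more than one model, a quick way to check generated text for it is to scan for literal end-of-turn markers. The following is a minimal sketch, not part of llama.cpp: the helper names are hypothetical, and the marker list is an assumption (`<|end|>` comes from this issue; `<|im_end|>` and `<|eot_id|>` are assumed to be the internlm2 and Llama 3 equivalents). Truncating hides the symptom on the client side and does not address the cause.

```python
# Hypothetical helpers for illustration; the marker list is an assumption,
# not something read from a model's tokenizer config.
END_MARKERS = ("<|end|>", "<|im_end|>", "<|eot_id|>")


def find_leaked_markers(text, markers=END_MARKERS):
    """Return (marker, index) pairs for each literal marker, sorted by position."""
    hits = []
    for marker in markers:
        start = text.find(marker)
        while start != -1:
            hits.append((marker, start))
            start = text.find(marker, start + len(marker))
    return sorted(hits, key=lambda hit: hit[1])


def truncate_at_first_marker(text, markers=END_MARKERS):
    """Cut the text at the first leaked marker; return it unchanged if none is found."""
    hits = find_leaked_markers(text, markers)
    return text[: hits[0][1]] if hits else text
```

Running `find_leaked_markers` over saved outputs gives a rough count of how often the marker leaks, which is easier to compare across versions than "randomly".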
It seems gemma-7b-it also has this issue randomly.
script
log
outputs
== Running in interactive mode. ==
Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision. User: Hello, Bob. End of Transcript In this transcript, the user interacts with Bob in a friendly and conversational way. Bob is always willing to help and provides accurate and concise information. He is also good at writing and is able to write a short story about a cat named Luna. The user's tone is positive and friendly, and Bob's tone is equally friendly and helpful. The conversation is well-structured and flows smoothly. It is clear that the user and Bob are enjoying their interaction. This snippet shows that |
@ngxson @ggerganov This is the same issue I have been seeing since version 3077 with llama3-instruct. It can be reproduced by running this command:
../llama.cpp/llama-cli --model ../../models/Meta-Llama-3-8B-Instruct_Q5_K_S.gguf --n-gpu-layers 35 -cnv --multiline-input --chat-template llama3
and feeding it a list of questions like the following, several times:
Answer the following questions: The day before two days after the day before tomorrow is Saturday. What day is it today?
The model randomly stops generating output and does not resume a proper dialog until llama.cpp is restarted. More on this issue here: #8253 (comment) |
@RunningLeon This is not a bug. You're using the model the wrong way. A chat model must be used with the proper conversation template:
Also, you're using an old version of llama.cpp (
@dspasyuk Unless you can confirm that the original model doesn't have this behavior, we cannot confirm whether this is a bug in llama.cpp. I'm sure that the original model is not trained to answer that many questions in one turn. |
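To illustrate what a conversation template means for the two models discussed in this thread, here is a minimal sketch. The function names are hypothetical, and the template strings are assumptions based on the published Phi-3 and Llama 3 instruct formats; check them against each model's tokenizer configuration before relying on them. This is not the snippet from the comment above, which was not preserved.

```python
def format_phi3(messages):
    """Render chat messages in the Phi-3 instruct layout (assumed format)."""
    parts = []
    for message in messages:
        # Each turn is wrapped as <|role|>\ncontent<|end|>\n
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


def format_llama3(messages):
    """Render chat messages in the Llama 3 instruct layout (assumed format)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Open the assistant header; generation should end with <|eot_id|>.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

With a raw completion prompt such as the "Transcript of a dialog" text, none of these markers are present, so an instruct model has no trained signal for where a turn starts or ends.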
@ngxson I do not need to prove anything. Just run llama-cli with the standard llama-instruct model from the Meta repo or any GGUF repo, in conversation mode with the commands I supplied, and you will see this bug pop up. I can reproduce this behavior on 3 different PCs with 3 different Linux distros. This bug has been here since version 3077: output randomly stops, and then the model either refuses to answer questions in full or outputs only a fraction of an answer. |
Can you prove that the original model doesn't do that? (i.e. run it with python
Then can you post the main.log? Results may also differ between CPU and GPU. It's best to isolate the problem instead of just saying "it doesn't work". And again, if the problem persists everywhere, then maybe that's the behavior of the original model, not llama.cpp.
Are there any logs for the versions before 3077? |
@dspasyuk I think either your model is broken or GPU support has some problem. On the latest master branch it answers all the questions for me (running on CPU):
|
@ngxson Like I said, I can reproduce this bug on 3 separate systems (P4, A100, A4500 GPUs) with models converted from the Meta repo or taken from other repos. The issue is random, but if you run this questionnaire or just chat for about 5000 tokens you will see it. Keep pasting the questions and generating output, and it eventually happens. Only sometimes does it happen on the first run. |
@dspasyuk If that's a problem, then I would like to fix it. But in this case we can't even know whether the original model is trained to behave that way. I won't reply until this is confirmed. It is very time-consuming for me to answer issues without proper logging and debugging. |
@ngxson You are correct. The sudden-stopping issues I have seen in past weeks are gone in the new version and in yesterday's version. This works with no problem on multiple GPUs and CPUs for over 24k generated tokens:
../llama.cpp/llama-cli --model ../../models/Meta-Llama-3-8B-Instruct_Q4_K_S.gguf --n-gpu-layers 25 -cnv -b 2048 --ctx_size 0 --temp 0.5 --top_k 10 --multiline-input --chat-template llama3 --logdir ./ |
I encountered the same issue when forcing a JSON schema on the /chat/completions endpoint of llama-server.
Here are the steps to reproduce it:
And the following (formatted) JSON is returned:
Edit: As suggested by unsubscribe on Hugging Face, this problem may be related to the conversion to GGUF format. To verify this assumption, I spun up llama-server hosting the phi3:3.8b-mini-4k-instruct-q6_K model provided by Ollama:
(I got the blob filename by inspecting the Modelfile with
Then, performing the same HTTP request results in the following correct output:
|
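The request and response from the comment above were not preserved, so the following is only a sketch of what a schema-constrained call and a validity check might look like. The function names are hypothetical. The `response_format` shape with a `schema` field and the OpenAI-style `choices[0].message.content` response layout are assumptions about llama-server's chat completions API; verify them against the server documentation for the version in use.

```python
import json


def build_chat_request(user_text, schema):
    """Build a chat completions payload that asks for schema-constrained JSON.

    Assumption: the server accepts response_format with a nested "schema".
    """
    return {
        "messages": [{"role": "user", "content": user_text}],
        "response_format": {"type": "json_object", "schema": schema},
    }


def parse_constrained_reply(response):
    """Return the parsed JSON reply, or None if the content is not valid JSON.

    Assumption: the response follows the OpenAI-style layout.
    """
    content = response["choices"][0]["message"]["content"]
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        # Trailing text after the JSON value, such as a stray end marker, lands here.
        return None
```

Comparing the result of `parse_constrained_reply` for the same request against two GGUF files of the same model is one way to test the conversion hypothesis without reading the output by eye.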
What happened?
microsoft/Phi-3-mini-128k-instruct outputs contain <|end|> randomly. Here is an example:
Name and Version
convert
run
version
What operating system are you seeing the problem on?
x86_64-linux
Relevant log output