-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
llama.cpp: infinite loop of context switch #1333
Comments
|
a workaround (not a solution) is available in #1339 - it still doesn't handle gracefully connections, but at least avoids the API to stall indefinetly |
What is weird to me is that I don't have this issue with ollama and they are using llama.cpp as well AFAIK. Model: TinyLlama-1.1B-Chat-v1.0
Just to be sure they are using exactly the same model I didn't pull the model with ollama. I downloaded and imported it manually using a modelfile based on the original and named it
|
ollama doesn't use the llama.cpp http/server code, indeed ggerganov/llama.cpp#3969 affects only the http implementation. When we switched away from using the bindings - we now rely directly on the llama.cpp code and we build grpc server around it in C++, and that brings us more close to llama.cpp implementation (with eventual bugs attached as well) |
Infinite context loop might as well trigger an infinite loop of context shifting if the model hallucinates and does not stop answering. This has the unpleasant effect that the predicion never terminates, which is the case especially on small models which tends to hallucinate. Workarounds #1333 by removing context-shifting. See also upstream issue: ggerganov/llama.cpp#3969
Infinite context loop might as well trigger an infinite loop of context shifting if the model hallucinates and does not stop answering. This has the unpleasant effect that the predicion never terminates, which is the case especially on small models which tends to hallucinate. Workarounds #1333 by removing context-shifting. See also upstream issue: ggerganov/llama.cpp#3969
@dionysius this is going to be fixed in #1704 |
Infinite context loop might as well trigger an infinite loop of context shifting if the model hallucinates and does not stop answering. This has the unpleasant effect that the predicion never terminates, which is the case especially on small models which tends to hallucinate. Workarounds #1333 by removing context-shifting. See also upstream issue: ggerganov/llama.cpp#3969
Infinite context loop might as well trigger an infinite loop of context shifting if the model hallucinates and does not stop answering. This has the unpleasant effect that the predicion never terminates, which is the case especially on small models which tends to hallucinate. Workarounds #1333 by removing context-shifting. See also upstream issue: ggerganov/llama.cpp#3969
This is fixed in LocalAI. Upstream workaround is as well to put a cap on max tokens as the models tends to hallucinate, infinite context shifting might actually lead to infinite answers too (see commit message in c56b6dd ). It was nice to see that upstream confirmed the issue with ggerganov/llama.cpp#3969 (comment) after the above workaround - it sounds much more safer to not expose the user at all to this by disabling it entirely, and I think what we do is to shield the user from such nuances. We can look at this again if someone really thinks this is an issue. Closing it for now |
This card is a tracker for ggerganov/llama.cpp#3969
This seems to happen to me as well with the llama.cpp backend only: I can reproduce it programmatically with certain text by using grammars
Update:
There is an "epic" here that we should keep an eye on: ggerganov/llama.cpp#4216
The text was updated successfully, but these errors were encountered: