Is there a possible memory leak in llama_cpp.llama_decode()? #924
Comments
Thanks @littlebai3618, I'm actually working on the same thing for #771 so will look into this. My guess is that llama.cpp is simply not shrinking the kv cache and repeated calls lead to excess fragmentation and cache bloat. Do you always OOM, or does it ever hit a ceiling?
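One rough way to check the cache-bloat hypothesis is to log KV-cache occupancy around each decode call. The sketch below is illustrative only: it assumes the low-level bindings expose `llama_get_kv_cache_token_count` and `llama_kv_cache_seq_rm` (names mirror llama.cpp and may differ between binding versions), and that `ctx`/`batch` are an already-initialized context and a filled batch; the helper names are made up for the example.

```python
# Hedged sketch: log KV-cache occupancy around each llama_decode call.
# Assumes `ctx` is an initialized llama_context and `batch` a filled llama_batch.
import llama_cpp

def decode_and_report(ctx, batch, step):
    before = llama_cpp.llama_get_kv_cache_token_count(ctx)
    rc = llama_cpp.llama_decode(ctx, batch)
    after = llama_cpp.llama_get_kv_cache_token_count(ctx)
    print(f"step {step}: decode rc={rc}, kv tokens {before} -> {after}")
    return rc

def free_sequence(ctx, seq_id):
    # Remove a finished sequence's cells so the cache can be reused;
    # in llama.cpp, p0 < 0 and p1 < 0 mean "the whole position range".
    llama_cpp.llama_kv_cache_seq_rm(ctx, seq_id, -1, -1)
```

If the token count keeps climbing even after finished sequences are removed, that would point at cache bloat rather than a leak in the bindings.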
Here are the latest developments:
I'm not aware of the progress on continuous batch processing, so I've submitted a draft of my simplified version. This draft code can handle 1000 sentences in parallel with 10 instances. However, it crashes when the prompt is too long. Please review it for any potential bugs. If this issue can be resolved, I'd be happy to assist you with implementing support for continuous batch processing.
After investigation, I found that I had misunderstood the batch parameter. The fragmented KV cache issue has been resolved, and the latest llama.cpp has fixed this problem.
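For anyone landing here with the same confusion about the batch parameter, the following is a minimal sketch (not the draft code from this issue) of how a parallel.cpp-style approach packs several sequences into one `llama_batch` through the low-level bindings. Field names follow llama.cpp's `llama_batch`; the exact Python access and the `llama_batch_init` signature may vary by version, and `fill_batch`/`decode_prompts` are hypothetical helper names.

```python
# Hedged sketch: pack tokens from several independent sequences into one
# llama_batch so a single llama_decode call serves all of them.
# Assumes `ctx` is an initialized context and `prompts` maps seq_id -> token list.
import llama_cpp

def fill_batch(batch, prompts):
    i = 0
    for seq_id, tokens in prompts.items():
        for pos, tok in enumerate(tokens):
            batch.token[i] = tok
            batch.pos[i] = pos
            batch.n_seq_id[i] = 1
            batch.seq_id[i][0] = seq_id  # which KV-cache sequence this cell belongs to
            batch.logits[i] = 1 if pos == len(tokens) - 1 else 0  # logits only for the last token
            i += 1
    batch.n_tokens = i

def decode_prompts(ctx, prompts, n_ctx):
    batch = llama_cpp.llama_batch_init(n_ctx, 0, len(prompts))
    try:
        fill_batch(batch, prompts)
        return llama_cpp.llama_decode(ctx, batch)
    finally:
        llama_cpp.llama_batch_free(batch)
```

The key point is that each token carries a sequence id, so the KV cache keeps the parallel prompts separate and each sequence's cells can later be dropped independently.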
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Is there a possible memory leak in llama_cpp.llama_decode()? If this is normal behavior, please let me know.
Background: I'm using the low-level API provided by llama_cpp.py to implement a Python version of continuous batch processing based on parallel.cpp in llama.cpp. During runtime, the program's memory and GPU memory usage keep increasing slowly, and eventually, the program crashes.
I used memory_profiler to observe the memory usage of each line in the self.eval method of the higher-level llama API. I found that the memory usage increases significantly at the line that calls llama_cpp.llama_decode() and is not released in subsequent runs.
My understanding is that any operations performed by decode should operate within pre-allocated memory instead of occupying new memory.
Please note that not every call to this line increases memory usage, but once it does increase, the memory is never released.
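For reference, the per-line measurement was done in the spirit of the sketch below, using memory_profiler's `@profile` decorator, which prints line-by-line RSS usage and increments. `MonitoredLlama` and the model path are placeholders for illustration, not the actual monitoring code used in this issue.

```python
# Hedged sketch: per-line memory measurement of the high-level eval path.
# Requires `pip install memory_profiler`; class name and model path are placeholders.
from memory_profiler import profile
from llama_cpp import Llama

class MonitoredLlama(Llama):
    @profile  # prints line-by-line memory usage for each eval() call
    def eval(self, tokens):
        return super().eval(tokens)

llm = MonitoredLlama(model_path="./codellama-7b.gguf", n_ctx=2048)
llm.eval(llm.tokenize(b"Hello, world"))
```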
Current Behavior
As mentioned above, the memory appears to be increasing abnormally.
Environment and Context
$ lscpu
$ nvidia-smi
$ uname -a
Linux aistudio-31572-prod-0 4.19.96 #1 SMP Tue Mar 10 10:34:01 CST 2020 x86_64 x86_64 x86_64 GNU/Linux
I am using the CodeLlama-7B-HF model (https://huggingface.co/codellama/CodeLlama-7b-hf), converted with the convert.py script included in the repository.
Steps to Reproduce
Failure Logs
Please ignore the line numbers; I have subclassed the Llama class to add memory-monitoring code.
Here are some of the detection results: