Offload the LLM #91
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #91 +/- ##
==========================================
- Coverage 38.57% 38.44% -0.13%
==========================================
Files 64 64
Lines 2331 2339 +8
==========================================
Hits 899 899
- Misses 1432 1440 +8
Continue to review full report in Codecov by Sentry.
Thanks for the PR! @LeonNissen
Had some questions regarding the implementation.
await MainActor.run {
    modelContainer = nil
}
MLX.GPU.clearCache()
}
Just making sure: Aren't there any other cleanup operations necessary, e.g. from MLX?
After the offload() function completes, is the entire memory footprint allocated by the local LLM session freed up again?
Also, please document this new method in the respective DocC articles / in-line docs
Unfortunately, I haven't found any other, more appropriate functions to offload a model, and the underlying mlx-swift-examples code does not include any documentation for that.
However, yes, the simple clear-cache function does free up the entire memory footprint 👍
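For anyone who wants to double-check this, here is a minimal sketch of how the footprint could be verified before and after offloading, assuming mlx-swift exposes the `GPU.activeMemory` and `GPU.cacheMemory` counters (the `session` value is hypothetical):

```swift
import MLX

// Sketch: compare MLX's GPU memory counters before and after offloading.
// Assumes `MLX.GPU.activeMemory` / `MLX.GPU.cacheMemory` are available in mlx-swift.
func logMLXMemory(_ label: String) {
    print("[\(label)] active: \(MLX.GPU.activeMemory) B, cache: \(MLX.GPU.cacheMemory) B")
}

// Hypothetical usage with a local LLM session:
// logMLXMemory("before offload")
// await session.offload()
// logMLXMemory("after offload")   // both counters should drop back to (near) zero
```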
> Unfortunately, I haven't found any other, more appropriate functions to offload a model, and the underlying mlx-swift-examples code does not include any documentation for that. However, yes, the simple clear-cache function does free up the entire memory footprint 👍
I would recommend going through the MLX transformer codebase a bit and checking if there's any undocumented functionality in there. I assume that someone from the MLX team has thought about offloading models before. Otherwise, we can also create an issue / question post in that repo.
/// Releases the memory associated with the current model.
///
/// This function frees up memory resources by clearing the model container and resetting the GPU cache, allowing e.g. a different model to be loaded.
public func offload() async {
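For context, a sketch of how the complete method could look once the state handling discussed further down is folded in; the body matches this diff, while resetting the state to `uninitialized` on the main actor is an assumption based on the later comments:

```swift
/// Releases the memory associated with the current model.
///
/// Frees up memory resources by clearing the model container and resetting the
/// GPU cache, allowing e.g. a different model to be loaded later via `setup()`.
public func offload() async {
    await MainActor.run {
        modelContainer = nil
        // Assumed addition from the discussion below: reset the session state so
        // that a subsequent `generate()` call triggers a fresh `setup()`.
        state = .uninitialized
    }
    MLX.GPU.clearCache()
}
```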
How does one re-load a model into memory?
Is a simple setup() call on the session enough to restore the entire underlying execution environment?
If yes, we should document that!
Back in the day, I implemented SpeziLLM to only support one concurrent local LLM session (for obvious performance reasons).
I'm not 100% sure if that's still the case, but if yes, wouldn't it then make more sense to put such a function on the LLMRunner? IMO that makes more sense semantically. The downside here is that the LLMRunner is broad and applies to all platforms, so one might need to find a more appropriate solution for local LLM offloading then.
Just making sure: Do we properly handle the error if a new inference request is dispatched when the model is offloaded? I guess you also need to update the state of the model appropriately (so self.state), and maybe even add a new state (offloaded or something similar, which is not the same as uninitialized).
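To illustrate that suggestion, here is a hypothetical sketch of a session state with a dedicated offloaded case; only uninitialized and offloaded are taken from this thread, the remaining names are placeholders:

```swift
// Hypothetical state enum with a dedicated `offloaded` case, distinct from
// `uninitialized`: a model was loaded before and has been explicitly released.
enum LocalLLMStateSketch: Equatable {
    case uninitialized
    case loading
    case ready
    case generating
    case offloaded
    case error(description: String)
}

// A guard at the start of an inference request could then either reload the
// model transparently or surface a dedicated error:
// if state == .offloaded { try await setup() }
```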
> How does one re-load a model into memory? Is a simple setup() call on the session enough to restore the entire underlying execution environment? If yes, we should document that!
Yes, it does. The setup function will reload the LLM when called, but even without it, the generate() function will automatically call the setup function if necessary.
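As a usage sketch of that behavior (the exact SpeziLLM signatures may differ; `offload()`, `setup()`, and `generate()` are used as described in this thread):

```swift
// Usage sketch based on the behavior described above; signatures are simplified.
func runAfterOffload(session: LLMLocalSession) async throws {
    // Free the model container and the GPU cache once the user is done.
    await session.offload()

    // Option 1: reload the weights explicitly before the next prompt.
    try await session.setup()

    // Option 2: simply generate again; `generate()` re-runs setup if needed.
    for try await token in try await session.generate() {
        print(token, terminator: "")
    }
}
```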
> Just making sure: Do we properly handle the error if a new inference request is dispatched when the model is offloaded? I guess you also need to update the state of the model appropriately (so self.state), and maybe even add a new state (offloaded or something similar, which is not the same as uninitialized).
Yes, you're right, I'd missed setting the state back to uninitialized; will fix that now 🚀.
Yep, totally agree with @PSchmiedmayer here, this behavior should definitely be configurable, as mentioned above. A maximum memory footprint is probably the best metric here. And yes, the offloading / setup of models should then happen fully transparently via the LLMRunner (or the LLMLocalPlatform).
I would definitely still aim for providing developers with an API to "manually" offload models, but yes, the LLMRunner should be able to automatically decide when to offload models in certain scenarios, e.g. when hitting the aforementioned memory footprint or when the app goes into the background, ...
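A rough sketch of what such an automatic policy could look like; none of this is part of the PR, and `AutoOffloadPolicy`, `OffloadableSession`, the memory budget, and `MLX.GPU.activeMemory` are illustrative assumptions rather than existing SpeziLLM API:

```swift
import MLX
import UIKit

// Sketch of an automatic offloading policy driven by a memory budget and
// app-backgrounding. All names below are hypothetical.
protocol OffloadableSession: AnyObject {
    var isGenerating: Bool { get }
    func offload() async
}

final class AutoOffloadPolicy {
    private let maxMemoryFootprint: Int                 // budget in bytes
    private var sessions: [OffloadableSession] = []
    private var backgroundObserver: NSObjectProtocol?

    init(maxMemoryFootprint: Int) {
        self.maxMemoryFootprint = maxMemoryFootprint
        // Offload everything that is idle once the app moves to the background.
        backgroundObserver = NotificationCenter.default.addObserver(
            forName: UIApplication.didEnterBackgroundNotification,
            object: nil,
            queue: .main
        ) { [weak self] _ in
            Task { await self?.offloadIdleSessions(force: true) }
        }
    }

    func register(_ session: OffloadableSession) {
        sessions.append(session)
    }

    /// Offloads idle sessions once the memory budget is exceeded (or unconditionally when forced).
    func offloadIdleSessions(force: Bool = false) async {
        guard force || MLX.GPU.activeMemory > maxMemoryFootprint else {
            return
        }
        for session in sessions where !session.isGenerating {
            await session.offload()
        }
    }
}
```

In practice such a policy would presumably live on the LLMLocalPlatform / LLMRunner, as discussed above, so that it stays fully transparent to the developer.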
> Back in the day, I implemented SpeziLLM to only support one concurrent local LLM session (for obvious performance reasons).
@philippzagar - Do you have a specific use case in mind where multiple sessions are needed? I'm considering the case of different chat sessions. But, even in this case, I'm not sure why multiple concurrent sessions would be necessary. When a chat view is closed, its session should end as well.
> > Back in the day, I implemented SpeziLLM to only support one concurrent local LLM session (for obvious performance reasons).
>
> @philippzagar - Do you have a specific use case in mind where multiple sessions are needed? I'm considering the case of different chat sessions. But, even in this case, I'm not sure why multiple concurrent sessions would be necessary. When a chat view is closed, its session should end as well.
@jdisho There are tons of use cases where multiple LLM sessions with different models can be required. Don't just think of the chat use case for LLMs, but more towards using LLMs for certain features / services within the application. Probably the best example is Apple Intelligence (at least the "idea" of it), which seamlessly integrates multiple small LLMs into iOS / macOS to power certain services, often at the same time.
Taking LLMonFHIR as an example, there could be two sessions: one using a model for resource interpretation and another for the summary. Did I get it right?
Yep, and these sessions run locally in parallel. However, if the maximum memory footprint is hit, the LLMRunner should be smart enough to automatically offload one of the models used in the respective session and then continue the offloaded session once the first session's generation completes. All totally seamless to the developer / user.
The more I'm thinking about this concept, the more I like it. We should definitely implement this @LeonNissen! 🚀
Offload the LLM
♻️ Current situation & Problem
When a model is loaded, the associated memory remains occupied, even after the user has finished using the model. This can lead to unnecessary resource consumption, potentially impacting performance and the overall user experience. To address this, resources should be released as soon as they are no longer needed. (see #90)
⚙️ Release Notes
This feature introduces an offload function, which releases the memory associated with the current model, optimizing resource utilization.
📝 Code of Conduct & Contributing Guidelines
By submitting this pull request, you agree to follow our Code of Conduct and Contributing Guidelines.