
Offload the LLM #91

Open
LeonNissen wants to merge 4 commits into main
Conversation

@LeonNissen (Contributor) commented Jan 19, 2025

Offload the LLM

♻️ Current situation & Problem

When a model is loaded, the associated memory remains occupied, even after the user has finished using the model. This can lead to unnecessary resource consumption, potentially impacting performance and the overall user experience. To address this, resources should be released as soon as they are no longer needed. (see #90)

⚙️ Release Notes

This feature introduces an offload function, which releases the memory associated with the current model, optimizing resource utilization.
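
For orientation, a minimal usage sketch from the consuming side, assuming an existing, already set-up `LLMLocalSession` (the surrounding setup code is not part of this PR and is only assumed here):

```swift
// Sketch: release the local model's memory once the user is done with it.
// Assumes `session` is an existing LLMLocalSession from SpeziLLMLocal.
func finishUsingModel(_ session: LLMLocalSession) async {
    // Frees the model container and clears the MLX GPU cache (see the diff further down).
    await session.offload()
}
```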

📝 Code of Conduct & Contributing Guidelines

By creating this pull request, you agree to follow our Code of Conduct and Contributing Guidelines:

codecov bot commented Jan 19, 2025

Codecov Report

Attention: Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.

Project coverage is 38.44%. Comparing base (26b1e07) to head (b3d55f7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| Sources/SpeziLLMLocal/LLMLocalSession.swift | 0.00% | 8 Missing ⚠️ |

@@            Coverage Diff             @@
##             main      #91      +/-   ##
==========================================
- Coverage   38.57%   38.44%   -0.13%     
==========================================
  Files          64       64              
  Lines        2331     2339       +8     
==========================================
  Hits          899      899              
- Misses       1432     1440       +8     
| Files with missing lines | Coverage Δ |
|---|---|
| ...eziLLMLocal/Configuration/LLMLocalParameters.swift | 0.00% <ø> (ø) |
| Sources/SpeziLLMLocal/LLMLocalSession.swift | 0.00% <0.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@philippzagar (Member) left a comment:

Thanks for the PR! @LeonNissen
Had some questions regarding the implementation.

Comment on lines 121 to 125

        await MainActor.run {
            modelContainer = nil
        }
        MLX.GPU.clearCache()
    }
Member:

Just making sure: Aren't there any other cleanup operations necessary, e.g. from MLX?
After the offload() function completes, is the entire memory footprint allocated by the local LLM session freed up again?

Member:

Also, please document this new method in the respective DocC articles / in-line docs

Contributor Author (@LeonNissen):

Unfortunately, I haven't found any other, more appropriate functions for offloading a model, and the underlying mlx-swift-examples code does not include any documentation for that.
However, yes, the simple clear-cache call does free up the entire memory footprint 👍

Member:

> Unfortunately, I haven't found any other, more appropriate functions for offloading a model, and the underlying mlx-swift-examples code does not include any documentation for that. However, yes, the simple clear-cache call does free up the entire memory footprint 👍

I would recommend going through the MLX transformer codebase a bit and checking whether there is any undocumented functionality in there. I assume someone on the MLX team has thought about offloading models before. Otherwise, we can also create an issue / question post in that repo.

    /// Releases the memory associated with the current model.
    ///
    /// This function frees up memory resources by clearing the model container and resetting the GPU cache, allowing, for example, a different model to be loaded.
    public func offload() async {
Member:

How does one re-load a model into memory?
Is a simple setup() call on the session enough to restore the entire underlying execution environment?
If yes, we should document that!

Member:

Back in the days, I implemented SpeziLLM to only support one concurrent local LLM session (for obvious performance reasons).
I'm not 100% sure if that's still the case, but if so, wouldn't it make more sense to put such a function on the LLMRunner? IMO that makes more sense semantically. The downside is that the LLMRunner is broad and applies to all platforms, so one might need to find a more appropriate solution for local LLM offloading.

Member:

Just making sure: Do we properly handle the error if a new inference request is dispatched when the model is offloaded? I guess you also need to update the state of the model appropriately (so self.state), and maybe even add a new state (offloaded or something similar, which is not the same as uninitialized)

Contributor Author (@LeonNissen):

> How does one re-load a model into memory? Is a simple setup() call on the session enough to restore the entire underlying execution environment? If yes, we should document that!

Yes, it does. The setup function will reload the LLM when called, but even without it, the generate() function will automatically call the setup function if necessary.
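
A rough sketch of that round trip; the shape of the generate() call is an assumption about SpeziLLM's session API and should be checked against the actual LLMSession signature:

```swift
// Sketch: offload the model, then trigger a new generation later.
await session.offload()                    // frees the model container and clears the GPU cache

// Later, a new request transparently reloads the model,
// since generate() is expected to call setup() again if necessary:
let stream = try await session.generate()  // assumed to return an AsyncThrowingStream<String, Error>
for try await token in stream {
    print(token, terminator: "")
}
```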

Contributor Author (@LeonNissen):

> Just making sure: Do we properly handle the error if a new inference request is dispatched when the model is offloaded? I guess you also need to update the state of the model appropriately (so self.state), and maybe even add a new state (offloaded or something similar, which is not the same as uninitialized)

Yes, you're right; I'd missed setting the state back to uninitialized. I will fix that now 🚀.
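
A sketch of what that fix might look like, combining the diff shown above with the state reset discussed in this thread (the exact state case and property access are assumptions, not the final implementation):

```swift
public func offload() async {
    await MainActor.run {
        modelContainer = nil
        state = .uninitialized   // assumption: reset the session state so a later setup()/generate() reloads the model
    }
    MLX.GPU.clearCache()
}
```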

@philippzagar (Member) commented Jan 20, 2025:

Yep, I totally agree with @PSchmiedmayer here: this behavior should definitely be configurable, as mentioned above. A maximum memory footprint is probably the best metric here. And yes, the offloading / setup of models should then happen fully transparently via the LLMRunner (or the LLMLocalPlatform).

I would definitely still aim to provide developers with an API to "manually" offload models, but yes, the LLMRunner should be able to automatically decide when to offload models in certain scenarios, e.g. when hitting the aforementioned memory footprint or when the app goes into the background.
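
Purely to illustrate the idea (none of these type or parameter names exist in SpeziLLM today; they are invented for this sketch), such a policy could be expressed as a small configuration value on the local platform:

```swift
// Hypothetical configuration sketch; all names are invented for illustration.
struct LLMLocalOffloadPolicy {
    var maximumMemoryFootprint: UInt64   // in bytes; offload models once this is exceeded
    var offloadOnBackground: Bool        // offload when the app moves to the background
}

let policy = LLMLocalOffloadPolicy(
    maximumMemoryFootprint: 4 * 1024 * 1024 * 1024,  // e.g. 4 GiB
    offloadOnBackground: true
)
```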

Collaborator:

> Back in the days, I implemented SpeziLLM to only support one concurrent local LLM session (for obvious performance reasons).

@philippzagar - Do you have a specific use case in mind where multiple sessions are needed? I'm considering the case of different chat sessions. But, even in this case, I'm not sure why multiple concurrent sessions would be necessary. When a chat view is closed, its session should end as well.

Member:

> Back in the days, I implemented SpeziLLM to only support one concurrent local LLM session (for obvious performance reasons).
>
> @philippzagar - Do you have a specific use case in mind where multiple sessions are needed? I'm considering the case of different chat sessions. But, even in this case, I'm not sure why multiple concurrent sessions would be necessary. When a chat view is closed, its session should end as well.

@jdisho There are tons of use cases where multiple LLM sessions with different models can be required. Don't just think of the chat use case for LLMs, but more towards using LLMs for certain features / services within the application. Probably the best example is Apple Intelligence (at least the "idea" of it), which seamlessly integrates multiple small LLMs into iOS / macOS to power certain services, often at the same time.

Collaborator:

Taking LLMonFHIR as an example, there could be two sessions: one using a model for resource interpretation and another for the summary. Did I get it right?

Member:

Yep, and these sessions run locally in parallel. However, if the maximum memory footprint is hit, the LLMRunner should be smart enough to automatically offload one of the models used in the respective session and then continue the offloaded session once the first session's generation completes, all totally seamless to the developer / user.
The more I think about this concept, the more I like it. We should definitely implement this, @LeonNissen! 🚀
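
To make that concrete, a heavily simplified and entirely hypothetical decision step inside such a runner could look like this (only offload() is real; every other name is invented for this sketch, and sessions are assumed to be reference types):

```swift
// Hypothetical sketch of runner-side offloading; not part of this PR.
func ensureCapacity(
    for newSession: LLMLocalSession,
    activeSessions: [LLMLocalSession],
    memoryLimit: UInt64,
    currentFootprint: () -> UInt64
) async {
    // Offload other sessions' models until the footprint fits under the limit again.
    for candidate in activeSessions where candidate !== newSession {
        guard currentFootprint() > memoryLimit else { break }
        await candidate.offload()   // frees that session's model container and GPU cache
    }
}
```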

@PSchmiedmayer added the enhancement (New feature or request) label on Jan 20, 2025.
Labels: enhancement (New feature or request)
Project status: In Progress
4 participants