Offload the LLM #91
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #91 +/- ##
==========================================
- Coverage 38.57% 38.44% -0.13%
==========================================
Files 64 64
Lines 2331 2339 +8
==========================================
Hits 899 899
- Misses 1432 1440 +8
Continue to review full report in Codecov by Sentry.
Thanks for the PR! @LeonNissen
Had some questions regarding the implementation.
await MainActor.run {
    modelContainer = nil
}
MLX.GPU.clearCache()
}
Just making sure: Aren't there any other cleanup operations necessary, e.g. from MLX?
After the offload() function completes, is the entire memory footprint allocated by the local LLM session freed up again?
Also, please document this new method in the respective DocC articles / in-line docs
Unfortunately, I haven't found any other, more appropriate functions to offload a model, and the underlying mlx-swift-examples code does not include any documentation for that.
However, yes, the simple clear-cache function does free up the entire memory footprint 👍
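For anyone who wants to double-check this, here is a minimal sketch of how the footprint could be verified before and after offloading, assuming mlx-swift exposes the `GPU.activeMemory` and `GPU.cacheMemory` counters (the `session` value is hypothetical):

```swift
import MLX

// Sketch: compare MLX's GPU memory counters before and after offloading.
// Assumes `MLX.GPU.activeMemory` / `MLX.GPU.cacheMemory` are available in mlx-swift.
func logMLXMemory(_ label: String) {
    print("[\(label)] active: \(MLX.GPU.activeMemory) B, cache: \(MLX.GPU.cacheMemory) B")
}

// Hypothetical usage with a local LLM session:
// logMLXMemory("before offload")
// await session.offload()
// logMLXMemory("after offload")   // both counters should drop back to (near) zero
```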
> Unfortunately, I haven't found any other, more appropriate functions to offload a model, and the underlying mlx-swift-examples code does not include any documentation for that. However, yes, the simple clear-cache function does free up the entire memory footprint 👍
I would recommend going through the MLX transformer codebase a bit and checking if there's any undocumented functionality in there. I assume that someone from the MLX team has thought about offloading models before. Otherwise, we can also create an issue / question post in that repo.
/// Releases the memory associated with the current model.
///
/// This function frees up memory resources by clearing the model container and resetting the GPU cache, allowing e.g. a different model to be loaded.
public func offload() async {
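For context, a sketch of how the complete method could look once the state handling discussed further down is folded in; the body matches this diff, while resetting the state to `uninitialized` on the main actor is an assumption based on the later comments:

```swift
/// Releases the memory associated with the current model.
///
/// Frees up memory resources by clearing the model container and resetting the
/// GPU cache, allowing e.g. a different model to be loaded later via `setup()`.
public func offload() async {
    await MainActor.run {
        modelContainer = nil
        // Assumed addition from the discussion below: reset the session state so
        // that a subsequent `generate()` call triggers a fresh `setup()`.
        state = .uninitialized
    }
    MLX.GPU.clearCache()
}
```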
How does one re-load a model into memory?
Is a simple setup() call on the session enough to restore the entire underlying execution environment?
If yes, we should document that!
Back in the day, I implemented SpeziLLM to only support one concurrent local LLM session (for obvious performance reasons).
I'm not 100% sure if that's still the case, but if yes, wouldn't it then make more sense to put such a function on the LLMRunner? IMO that makes more sense semantically. The downside here is that the LLMRunner is broad and applies to all platforms, so one might need to find a more appropriate solution for local LLM offloading then.
Just making sure: Do we properly handle the error if a new inference request is dispatched when the model is offloaded? I guess you also need to update the state of the model appropriately (so self.state), and maybe even add a new state (offloaded or something similar, which is not the same as uninitialized).
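To illustrate that suggestion, here is a hypothetical sketch of a session state with a dedicated offloaded case; only uninitialized and offloaded are taken from this thread, the remaining names are placeholders:

```swift
// Hypothetical state enum with a dedicated `offloaded` case, distinct from
// `uninitialized`: a model was loaded before and has been explicitly released.
enum LocalLLMStateSketch: Equatable {
    case uninitialized
    case loading
    case ready
    case generating
    case offloaded
    case error(description: String)
}

// A guard at the start of an inference request could then either reload the
// model transparently or surface a dedicated error:
// if state == .offloaded { try await setup() }
```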
> How does one re-load a model into memory? Is a simple setup() call on the session enough to restore the entire underlying execution environment? If yes, we should document that!
Yes, it does. The setup function will reload the LLM when called, but even without it, the generate() function will automatically call the setup function if necessary.
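As a usage sketch of that behavior (the exact SpeziLLM signatures may differ; `offload()`, `setup()`, and `generate()` are used as described in this thread):

```swift
// Usage sketch based on the behavior described above; signatures are simplified.
func runAfterOffload(session: LLMLocalSession) async throws {
    // Free the model container and the GPU cache once the user is done.
    await session.offload()

    // Option 1: reload the weights explicitly before the next prompt.
    try await session.setup()

    // Option 2: simply generate again; `generate()` re-runs setup if needed.
    for try await token in try await session.generate() {
        print(token, terminator: "")
    }
}
```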
> Just making sure: Do we properly handle the error if a new inference request is dispatched when the model is offloaded? I guess you also need to update the state of the model appropriately (so self.state), and maybe even add a new state (offloaded or something similar, which is not the same as uninitialized).
Yes, you're right, I'd missed setting the state back to uninitialized; will fix that now 🚀.
Yep, totally agree with @PSchmiedmayer here, this behavior should definitely be configurable, as mentioned above. A maximum memory footprint is probably the best metric here. And yes, the offloading / setup of models should then happen fully transparently via the LLMRunner (or the LLMLocalPlatform).
I would definitely still aim for providing developers with an API to "manually" offload models, but yes, the LLMRunner should be able to automatically decide when to offload models in certain scenarios, e.g. when hitting the aforementioned memory footprint or when the app goes into the background, ...
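A rough sketch of what such an automatic policy could look like; none of this is part of the PR, and `AutoOffloadPolicy`, `OffloadableSession`, the memory budget, and `MLX.GPU.activeMemory` are illustrative assumptions rather than existing SpeziLLM API:

```swift
import MLX
import UIKit

// Sketch of an automatic offloading policy driven by a memory budget and
// app-backgrounding. All names below are hypothetical.
protocol OffloadableSession: AnyObject {
    var isGenerating: Bool { get }
    func offload() async
}

final class AutoOffloadPolicy {
    private let maxMemoryFootprint: Int                 // budget in bytes
    private var sessions: [OffloadableSession] = []
    private var backgroundObserver: NSObjectProtocol?

    init(maxMemoryFootprint: Int) {
        self.maxMemoryFootprint = maxMemoryFootprint
        // Offload everything that is idle once the app moves to the background.
        backgroundObserver = NotificationCenter.default.addObserver(
            forName: UIApplication.didEnterBackgroundNotification,
            object: nil,
            queue: .main
        ) { [weak self] _ in
            Task { await self?.offloadIdleSessions(force: true) }
        }
    }

    func register(_ session: OffloadableSession) {
        sessions.append(session)
    }

    /// Offloads idle sessions once the memory budget is exceeded (or unconditionally when forced).
    func offloadIdleSessions(force: Bool = false) async {
        guard force || MLX.GPU.activeMemory > maxMemoryFootprint else {
            return
        }
        for session in sessions where !session.isGenerating {
            await session.offload()
        }
    }
}
```

In practice such a policy would presumably live on the LLMLocalPlatform / LLMRunner, as discussed above, so that it stays fully transparent to the developer.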
> Back in the day, I implemented SpeziLLM to only support one concurrent local LLM session (for obvious performance reasons).
@philippzagar - Do you have a specific use case in mind where multiple sessions are needed? I'm considering the case of different chat sessions. But, even in this case, I'm not sure why multiple concurrent sessions would be necessary. When a chat view is closed, its session should end as well.
> > Back in the day, I implemented SpeziLLM to only support one concurrent local LLM session (for obvious performance reasons).
>
> @philippzagar - Do you have a specific use case in mind where multiple sessions are needed? I'm considering the case of different chat sessions. But, even in this case, I'm not sure why multiple concurrent sessions would be necessary. When a chat view is closed, its session should end as well.
@jdisho There are tons of use cases where multiple LLM sessions with different models can be required. Don't just think of the chat use case for LLMs, but more towards using LLMs for certain features / services within the application. Probably the best example is Apple Intelligence (at least the "idea" of it), which seamlessly integrates multiple small LLMs into iOS / macOS to power certain services, often at the same time.
Taking LLMonFHIR as an example, there could be two sessions: one using a model for resource interpretation and another for the summary. Did I get it right?
Yep, and these sessions run locally in parallel. However, if the maximum memory footprint is hit, the LLMRunner should be smart enough to automatically offload one of the models used in the respective session and then continue the offloaded session once the first session's generation completes. All totally seamless to the developer / user.
The more I'm thinking about this concept, the more I like it. We should definitely implement this @LeonNissen! 🚀
Offload the LLM
♻️ Current situation & Problem
When a model is loaded, the associated memory remains occupied, even after the user has finished using the model. This can lead to unnecessary resource consumption, potentially impacting performance and the overall user experience. To address this, resources should be released as soon as they are no longer needed. (see #90)
⚙️ Release Notes
This feature introduces an offload function, which releases the memory associated with the current model, optimizing resource utilization.
📝 Code of Conduct & Contributing Guidelines
By submitting this pull request, you agree to follow our Code of Conduct and Contributing Guidelines.