fix #10119
Using residency sets keeps the allocated memory wired and almost completely eliminates the overhead observed in #10119. For example, on M2 Ultra with a 7B Q8_0 model, requests are ~250ms faster thanks to this change. It seems it is not necessary to attach the residency sets to the command queue and buffers, so the change is rather simple. For each buffer, we create an associated `MTLResidencySet` and add the `MTLBuffer` objects to it. After that, we commit it and request residency:
llama.cpp/ggml/src/ggml-metal/ggml-metal.m
Lines 1084 to 1091 in 225d2e0
build: b9126fe (4561)
Metal backend changes
The backend checks the environment variable `GGML_METAL_NO_RESIDENCY`. If it is set, no residency sets are created, which allows the OS to collect the GPU memory after 1 second of inactivity. This should rarely be needed because it hurts the performance of the application, but support is kept just in case.
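The opt-out described above can be sketched in plain C as a presence check on the variable. This is a minimal illustration, not the code from this PR: the helper name `residency_sets_enabled` is hypothetical, and treating any value (including an empty string) as "set" is an assumption based on the wording "If set".

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper (name not taken from the PR): reports whether
// residency sets should be created for Metal buffers.
// Assumption: only the presence of the variable matters, so any value,
// even an empty string, disables residency sets.
static bool residency_sets_enabled(void) {
    return getenv("GGML_METAL_NO_RESIDENCY") == NULL;
}
```

Under this sketch, launching the application with `GGML_METAL_NO_RESIDENCY=1` in the environment makes the helper return `false`, and the backend would then skip creating residency sets.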