llama : refactor model loader with backend registry #10026
Conversation
Force-pushed from 0c91485 to ac4b252
Force-pushed from bea440d to 24b79ca
@JohannesGaessler I would appreciate your opinion regarding the changes to …
Regarding the Kompute backend, I've started adding support for the backend/device register interface in #10045. I expect it to be ready for review in a couple of days.
I haven't yet looked at the code due to the statement: … Is this still accurate?
I still expect to make significant changes, so I don't see the point of doing a code review at this point, but the change to …
I think it would be fine to replace the current …

Long-term, I think the correct way to parallelize a transformer is to split the attention by heads, since each head can be computed independently of the others. That would then also naturally distribute the KV cache.
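To make the head-splitting idea concrete, here is a toy sketch, not something implemented in this PR: it assigns attention heads to devices in proportion to a per-device split, with the understanding that each device would then also own the KV cache for its heads. The device names, the ratios, and the `split_heads` helper are invented for the example.

```cpp
// Toy illustration: distribute n_head attention heads across devices
// proportionally; each device would hold the KV cache for its own heads.
#include <algorithm>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

struct head_range {
    std::string device;
    int first_head;
    int n_heads;
};

static std::vector<head_range> split_heads(int n_head,
        const std::vector<std::pair<std::string, float>> & devices) {
    float total = 0.0f;
    for (const auto & d : devices) {
        total += d.second;
    }
    std::vector<head_range> ranges;
    int assigned = 0;
    for (size_t i = 0; i < devices.size(); ++i) {
        int n = (i + 1 == devices.size())
            ? n_head - assigned // last device takes the remainder
            : std::min(n_head - assigned, (int)(n_head * devices[i].second / total + 0.5f));
        ranges.push_back({devices[i].first, assigned, n});
        assigned += n;
    }
    return ranges;
}

int main() {
    // e.g. 32 heads split 2:1 across two hypothetical GPUs
    for (const auto & r : split_heads(32, {{"GPU0", 2.0f}, {"GPU1", 1.0f}})) {
        std::printf("%s: heads [%d, %d) and the KV cache for those heads\n",
                    r.device.c_str(), r.first_head, r.first_head + r.n_heads);
    }
}
```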
Force-pushed from 24b79ca to 31b1a7c
Force-pushed from d06a75e to 916d263
Force-pushed from dac2953 to 9afae6a
Force-pushed from 6f0c502 to 1a408d7
Force-pushed from 1a408d7 to 63c47ab
This should be ready now. I will leave the flash attention changes for a different PR, since this is already becoming too big to review.
Do I understand the idea correctly that, for example, the CPU backend can "export" an extra "aarch64" buffer type that would be added at the beginning of the CPU buffer types list (after the ACCEL buffer types)? And that when we test whether a weight can be allocated with this extra buffer type, we check whether the hardware supports, for example, SVE or SME in the …
Yes, that's exactly it. The check for hardware support may also be done in …
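A minimal sketch, assuming the public ggml-backend API of this period, of what such a load-time check could look like: build a throwaway op that consumes the weight and ask the device whether it supports it. The helper name `weight_usable_with_buft`, the 4096×4096 shape, and the zero-size dummy buffer trick are illustrative choices, not code taken from this PR.

```cpp
// Sketch: can a weight placed in the candidate buffer type be used by this device?
#include "ggml.h"
#include "ggml-backend.h"

static bool weight_usable_with_buft(ggml_backend_dev_t dev, ggml_backend_buffer_type_t buft) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true, // only tensor metadata is needed, no data
    };
    struct ggml_context * ctx = ggml_init(params);

    // dummy Q4_0 weight and a matmul op that consumes it
    struct ggml_tensor * w  = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 4096, 4096);
    struct ggml_tensor * x  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32,  4096, 1);
    struct ggml_tensor * op = ggml_mul_mat(ctx, w, x);

    // attach a zero-size buffer of the candidate type so that supports_op
    // can see which buffer type the weight would live in
    w->buffer = ggml_backend_buft_alloc_buffer(buft, 0);
    const bool ok = w->buffer != nullptr && ggml_backend_dev_supports_op(dev, op);

    if (w->buffer) {
        ggml_backend_buffer_free(w->buffer);
        w->buffer = nullptr;
    }
    ggml_free(ctx);
    return ok;
}
```

The point of routing the decision through supports_op is that the backend itself can reject the extra buffer type when the hardware lacks the required features (e.g. SVE/SME), without the loader having to know about them.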
Please consider merging #10045 first to avoid breaking the Kompute backend upstream.
VRAM usage seems to be a lot higher.
@8XXD8 I cannot reproduce that, and I can't even imagine how any of that could happen. Without more details, I am just going to assume that it is an AMD driver issue.
There will be additional changes needed to the Kompute backend after this is merged to adapt to the interface changes here, which I will not be able to test. Ultimately, the Kompute backend needs a maintainer who is willing to keep it updated with these changes; I cannot be responsible for updating every backend.
On macOS, the following command fails:

```
./scripts/compare-commits.sh master sl/load-time-supports-op -m ./models/llama-3.2-1b-instruct/ggml-model-q4_0.gguf -r 1 -n 0
...
build: 63c47ab8 (3984)
+ ./scripts/compare-llama-bench.py -b master -c sl/load-time-supports-op
Traceback (most recent call last):
  File "./scripts/compare-llama-bench.py", line 307, in <module>
    gpu_blas = bool(rows_full[0][KEY_PROPERTIES.index("gpu_blas")])
               ~~~~~~~~~^^^
IndexError: list index out of range
```
Due to the changes to …

In short: …
Very cool work!
FWIW, I can help with the maintenance work for the Kompute backend.
Do I understand correctly that with the extra buft added in ggml_backend_dev_get_extra_bufts, the repack of Q4_0 could be done in ggml_backend_cpu_buffer_set_tensor? BTW, there is a small naming discrepancy in ggml_backend_cpu_get_proc_address, where "ggml_backend_dev_get_extra_bufts" should be "ggml_backend_cpu_get_extra_bufts".
More or less. The process for implementing this would be something like this: …

This is intentional.
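To make the general idea of the repack concrete, here is a toy sketch of the kind of row-block interleaving the aarch64 layouts use. In the scheme discussed above, a conversion of this kind would run inside the extra buffer type's set_tensor while the model weights are being loaded. The function name `repack_rows_x4` and the layout are simplified illustrations, not the real Q4_0_4x4 conversion, which also rearranges bytes inside each quantized block.

```cpp
// Toy illustration of repacking: interleave the quantized blocks of every
// 4 consecutive rows so that block b of rows r..r+3 becomes contiguous.
#include <cstdint>
#include <cstdio>
#include <cstring>

static void repack_rows_x4(const uint8_t * src, uint8_t * dst,
                           size_t n_rows, size_t blocks_per_row, size_t block_size) {
    // requires n_rows to be a multiple of 4
    for (size_t r = 0; r < n_rows; r += 4) {
        for (size_t b = 0; b < blocks_per_row; ++b) {
            for (size_t i = 0; i < 4; ++i) {
                const uint8_t * s = src + ((r + i) * blocks_per_row + b) * block_size;
                uint8_t       * d = dst + (r * blocks_per_row + b * 4 + i) * block_size;
                std::memcpy(d, s, block_size);
            }
        }
    }
}

int main() {
    // 4 rows, 2 blocks per row, 1-byte "blocks" so the interleaving is easy to see
    const uint8_t src[8] = { 0x00, 0x01, 0x10, 0x11, 0x20, 0x21, 0x30, 0x31 };
    uint8_t dst[8] = {};
    repack_rows_x4(src, dst, 4, 2, 1);
    for (uint8_t v : dst) {
        std::printf("%02x ", v); // prints: 00 10 20 30 01 11 21 31
    }
    std::printf("\n");
}
```

Doing the conversion in set_tensor means the repack happens once at load time, and the rest of llama.cpp only ever sees a weight that already lives in the layout expected by the optimized kernels.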
Since there are a few changes waiting on this, I will merge this now. @8XXD8 please open an issue with more details if you still see that problem after this is merged.
The problem still persists; I'll try a few things before opening an issue.
- … (-sm row): KV and other non-matrix weights are split among the available GPUs in the same way as split by layer mode.
- … -ngl, it is considered a CPU accelerator and used automatically in CPU layers. Unsupported types automatically use the CPU backend.
- … _Mapped suffix, so that it is easier to tell when a model is being loaded with mmap.
- llama-bench now uses the backend registry to determine the backend used and to obtain device descriptions.
- … -ngl.
- … alloc_buffer function is no longer called (backends no longer need to handle this case separately).
- GGML_SCHED_DEBUG now takes an integer value: 0 will not print any debugging trace, 1 will print the split headers only, 2 will print the entire graph.
- ggml_backend_dev_get_extra_bufts: optional function (returned with get_proc_address) for backends that have multiple buffer types. llama.cpp will automatically use these buffer types if available. This is intended to be used with buffer types that change tensor layouts, e.g. for automatic conversion of Q4_0 to the aarch64 types.
- ggml_backend_split_buffer_type now takes a device parameter, which represents the main device intended to be used with this split buffer. Only this device should be reported as supported in the supports_buft function. Backends that implement split buffer types (CUDA and SYCL) should update this function to support the changes to -sm row.

Note: when this is merged, support for backends that do not implement the reg/device interfaces will be dropped. This will affect the Kompute backend until #10045 is merged.
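For context, a minimal sketch of how a loader might enumerate these optional extra buffer types through the registry, assuming the public ggml-backend device/registry API described above. The typedef for the returned function pointer, the NULL-terminated-array convention, and the helper name `list_extra_bufts` are assumptions on my part; the PR text only says the function is optional and returned via get_proc_address.

```cpp
// Sketch: discover optional extra buffer types (e.g. aarch64 repacking types)
// exposed by a backend through get_proc_address.
#include <cstdio>
#include "ggml-backend.h"

// assumed signature: returns a NULL-terminated array of buffer types, or NULL
typedef ggml_backend_buffer_type_t * (*get_extra_bufts_t)(ggml_backend_dev_t dev);

static void list_extra_bufts() {
    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        ggml_backend_reg_t reg = ggml_backend_dev_backend_reg(dev);

        auto get_extra = (get_extra_bufts_t)
            ggml_backend_reg_get_proc_address(reg, "ggml_backend_dev_get_extra_bufts");
        if (!get_extra) {
            continue; // backend does not expose extra buffer types
        }

        for (ggml_backend_buffer_type_t * buft = get_extra(dev); buft && *buft; ++buft) {
            // a loader would try these before the default CPU buffer type,
            // subject to a supports_op/supports_buft check on each weight
            std::printf("%s: extra buffer type %s\n",
                        ggml_backend_dev_name(dev), ggml_backend_buft_name(*buft));
        }
    }
}
```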