ggml : adapt Metal to new ggml_backend interface #2258
Conversation
Force-pushed from 985457b to 90503f1
Metal can share RAM with the CPU and can use mmap without a temp buffer.
I'll need some mechanism to "map" the RAM to Metal buffers. (See lines 2819 to 2825 in 294f424.)
I guess I'll do something similar here. Edit: On second look, the
I think you only need to do that to support mmap, right? For the rest of the buffers, just implement
llama.cpp (Outdated)

```c
struct ggml_backend_buffer * buf_kv = ctx->kv_self.buf->backend_buffer;
LLAMA_METAL_CHECK_BUF(ggml_backend_metal_map_buffer(ctx->model.backend_metal, "eval", buf_compute->backend_data, buf_compute->backend_size, 0));
LLAMA_METAL_CHECK_BUF(ggml_backend_metal_map_buffer(ctx->model.backend_metal, "kv",   buf_kv->backend_data,      buf_kv->backend_size,      0));
```
Is the goal of mapping these buffers to make the tensors work with the CPU backend, to be able to use Accelerate when processing prompts?
The mapping is not needed only for mmap. The Apple Silicon chips have unified memory: the same memory block can be read and written by both the CPU and the GPU. So even if we don't use mmap, we want the CPU-allocated buffers to be used by Metal directly.
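The zero-copy idea can be sketched in plain C. All names below are hypothetical stand-ins: in the real backend this role is played by `MTLDevice`'s `newBufferWithBytesNoCopy:length:options:deallocator:`, which wraps an existing host allocation in an `MTLBuffer` without copying.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an MTLBuffer that wraps host memory without a
 * copy. On Apple Silicon the real call would be Metal's
 * newBufferWithBytesNoCopy, which shares the allocation with the GPU. */
typedef struct {
    void * host_ptr; /* the same pointer the CPU uses; no staging copy */
    size_t size;
} gpu_buffer_t;

/* "Map" an existing host allocation for GPU use (zero-copy on unified memory). */
static gpu_buffer_t gpu_map_host_ptr(void * ptr, size_t size) {
    gpu_buffer_t buf = { ptr, size };
    return buf;
}

/* Reads through the mapped buffer observe CPU writes immediately, because
 * both sides address the same memory. */
static float gpu_read_first_float(const gpu_buffer_t * buf) {
    return ((const float *) buf->host_ptr)[0];
}
```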
I understand that's the way the Metal backend works currently, but the backends interface is not designed to work this way. You are going to find a lot of problems if you try to implement it like this. If the Metal backend is capable of matrix-matrix multiplication (even if it is slow), I suggest that you implement it in the simplest way possible for now: implement `alloc_buffer`, `get_tensor` and `set_tensor`, and use the `data` member of the tensors in the Metal kernels directly instead of mapping addresses. The ggml-cuda backend should be a good example of this; just replace the CUDA memory allocation and copy functions with the Metal ones.
After that works, we can add the changes necessary to support Accelerate again.
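For illustration, a minimal sketch of the suggested shape (names hypothetical; plain `malloc`/`memcpy` stand in for the Metal allocation and copy calls, just as ggml-cuda uses `cudaMalloc`/`cudaMemcpy`):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the minimal backend suggested above: alloc_buffer
 * plus set_tensor/get_tensor that copy between host and "device" memory.
 * malloc/memcpy stand in for the Metal allocation and blit calls. */

typedef struct {
    void * device_base; /* device allocation backing the tensors in this buffer */
    size_t size;
} backend_buffer_t;

typedef struct {
    void * data;   /* device address inside a backend buffer */
    size_t nbytes;
} tensor_t;

static backend_buffer_t * backend_alloc_buffer(size_t size) {
    backend_buffer_t * buf = malloc(sizeof(backend_buffer_t));
    buf->device_base = malloc(size); /* would be a Metal allocation */
    buf->size = size;
    return buf;
}

/* Copy host -> device (would be a Metal upload). */
static void backend_set_tensor(tensor_t * t, const void * host_data) {
    memcpy(t->data, host_data, t->nbytes);
}

/* Copy device -> host (would be a Metal download). */
static void backend_get_tensor(const tensor_t * t, void * host_data) {
    memcpy(host_data, t->data, t->nbytes);
}
```

The point of this shape is that the backend owns the device memory, and data crosses the CPU/GPU boundary only through `set_tensor`/`get_tensor`.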
Ok, let me try to do it like this. The good thing is I just got it running, and thanks to the splits I can now run part of the model on the CPU and the rest on the GPU, which was not possible before.
Let's see if I can avoid all the hacks as you suggest.
Bonus: supports partial inference on the CPU
ggml-metal.m (Outdated)

```objc
id<MTLBuffer> id_dst  = dst  ? ggml_metal_get_buffer(ctx, dst, &offs_dst) : nil;
id<MTLBuffer> id_src0 = src0 ? src0->data : nil;
id<MTLBuffer> id_src1 = src1 ? src1->data : nil;
id<MTLBuffer> id_dst  = dst  ? dst->data  : nil;
```
So here, instead of `src0->data`, which is the GPU address of the memory, I need to get the Metal buffer that I created in the wrapper. How should I access it?
If you cannot use the pointer directly, an option could be to store the offset in `ggml_tensor::data` (just pass a NULL pointer to `ggml_allocator_simple_init`), and store the MTLBuffer in `ggml_tensor::extra`.
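The offset-plus-`extra` scheme could look roughly like this (hypothetical types; a real `MTLBuffer` handle would take the place of `device_buffer_t`):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of the suggested scheme: tensor->data stores an offset instead of a
 * pointer, and tensor->extra stores the device buffer handle, so the device
 * address is recovered as base + offset. Names are hypothetical. */

typedef struct {
    void * base; /* stands in for the MTLBuffer's underlying storage */
} device_buffer_t;

typedef struct {
    void * data;  /* holds an offset cast to a pointer, not a real address */
    void * extra; /* the device_buffer_t that owns the storage */
} tensor_t;

/* Recover the real device address of a tensor from its buffer and offset. */
static void * tensor_device_addr(const tensor_t * t) {
    const device_buffer_t * buf = t->extra;
    return (char *) buf->base + (uintptr_t) t->data;
}
```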
To set the `extra` pointer, you can implement `init_tensor` in the `ggml_backend_buffer_interface` in the same way as `free_data`. It is called by the allocator after allocating a tensor.
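A sketch of that hook (struct layout hypothetical, loosely mirroring the described `ggml_backend_buffer_interface`): the allocator calls `init_tensor` right after placing a tensor, giving the backend a chance to stash its per-buffer handle in `extra`.

```c
#include <assert.h>
#include <stddef.h>

typedef struct tensor {
    void * data;
    void * extra;
} tensor_t;

typedef struct backend_buffer backend_buffer_t;

/* Hypothetical buffer interface with the init_tensor callback. */
typedef struct {
    void (*init_tensor)(backend_buffer_t * buffer, tensor_t * tensor);
} backend_buffer_interface_t;

struct backend_buffer {
    backend_buffer_interface_t iface;
    void * backend_handle; /* e.g. the MTLBuffer backing this allocation */
};

/* Example backend implementation: stash the buffer's handle in extra. */
static void metal_init_tensor(backend_buffer_t * buffer, tensor_t * tensor) {
    tensor->extra = buffer->backend_handle;
}

/* Allocator side: after placing the tensor, notify the backend. */
static void allocator_alloc_tensor(backend_buffer_t * buffer, tensor_t * tensor) {
    if (buffer->iface.init_tensor) {
        buffer->iface.init_tensor(buffer, tensor);
    }
}
```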
Ok, will try to do it later tonight
If you do it this way, I think you will also have to handle views differently, because in some cases views are created in a different `ggml_buffer` than their sources. For example, the KV cache has its own `ggml_buffer`, but it is used via views that are created in the compute buffer.
So you would have to consider the OP of a tensor to determine if it is a view, and if so use the `MTLBuffer` of its parent.
It may also be possible to use the `gpuAddress` of `MTLBuffer` in the kernels directly, if you pass the pointer instead of going through a `MTLComputeCommandEncoder`. But I may be wrong about that; this is just what I could find in a quick search.
Ah, it's likely because `Vcur` is a "view" (i.e. `ggml_transpose`).
The latest version almost generates coherent text, but not quite. I'm missing something somewhere.
But overall the `extra` mechanism is not great, and we have to figure out something better. It is ok when the user code uses it, but with the current approach it looks like `ggml.c` has to be "aware" of it, which I think we should avoid. Maybe the automatic allocator that you are introducing would somehow resolve that.
The `extra` also seems incompatible with "view" operations. I suspect the bug that I have is somehow related to that part, but it's difficult to trace.
I'll leave this for now as there are other things piling up. I'm a bit worried that it will become more and more difficult to keep the branch up-to-date with `master`.
I agree that dealing with `extra` in `ggml.c` is not good. I will look into a better way to solve this. I think this could possibly be solved by adding a callback similar to `init_tensor` for views; the backend would then have an opportunity to set `extra` there, or do any other initialization it may need for views.
I think I have an idea how to implement this to fit the new interface and keep the Metal buffers in the Metal context, avoiding the use of `extra`. Will probably take another shot over the weekend.
If you rebase (I suggest you don't, the allocator needs more work), keep in mind that partial offloading to the GPU is currently broken; only full offloading works.
Force-pushed from d626b55 to 7252963
Force-pushed from 7252963 to 4daa5ee
Force-pushed from 4daa5ee to d45c163
Obsoleted by ggerganov/ggml#547
WIP