ggml : adapt Metal to new ggml_backend interface #2258
Conversation
Force-pushed from 985457b to 90503f1
Metal can share RAM with the CPU and can use mmap without a temp buffer.
I'll need some mechanism to "map" the RAM to Metal buffers. (See lines 2819 to 2825 in 294f424.)
I guess I'll do something similar here. Edit: On second look, the
I think you only need to do that to support mmap, right? For the rest of the buffers, just implement
llama.cpp (Outdated)

```c
struct ggml_backend_buffer * buf_kv = ctx->kv_self.buf->backend_buffer;
LLAMA_METAL_CHECK_BUF(ggml_backend_metal_map_buffer(ctx->model.backend_metal, "eval", buf_compute->backend_data, buf_compute->backend_size, 0));
LLAMA_METAL_CHECK_BUF(ggml_backend_metal_map_buffer(ctx->model.backend_metal, "kv",   buf_kv->backend_data,      buf_kv->backend_size,      0));
```
Is the goal of mapping these buffers to make the tensors work with the CPU backend, to be able to use Accelerate when processing prompts?
The mapping is not needed only for mmap. The Apple Silicon chips have unified memory: the same memory block can be read and written by both the CPU and the GPU. So even if we don't use mmap, we want the CPU-allocated buffers to be used by Metal directly.
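The zero-copy idea can be sketched in plain C. All names below are hypothetical stand-ins: in the real backend this role is played by `MTLDevice`'s `newBufferWithBytesNoCopy:length:options:deallocator:`, which wraps an existing host allocation in an `MTLBuffer` without copying.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an MTLBuffer that wraps host memory without a
 * copy. On Apple Silicon the real call would be Metal's
 * newBufferWithBytesNoCopy, which shares the allocation with the GPU. */
typedef struct {
    void * host_ptr; /* the same pointer the CPU uses; no staging copy */
    size_t size;
} gpu_buffer_t;

/* "Map" an existing host allocation for GPU use (zero-copy on unified memory). */
static gpu_buffer_t gpu_map_host_ptr(void * ptr, size_t size) {
    gpu_buffer_t buf = { ptr, size };
    return buf;
}

/* Reads through the mapped buffer observe CPU writes immediately, because
 * both sides address the same memory. */
static float gpu_read_first_float(const gpu_buffer_t * buf) {
    return ((const float *) buf->host_ptr)[0];
}
```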
I understand that's the way the Metal backend works currently, but the backends interface is not designed to work this way. You are going to find a lot of problems if you try to implement it like this. If the Metal backend is capable of matrix-matrix multiplication (even if it is slow), I suggest that you implement it in the simplest way possible for now: implement `alloc_buffer`, `get_tensor` and `set_tensor`, and use the `data` member of the tensors in the Metal kernels directly instead of mapping addresses. The ggml-cuda backend should be a good example of this; just replace the CUDA memory allocation and copy functions with the Metal ones.
After that works, we can add the changes necessary to support Accelerate again.
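For illustration, a minimal sketch of the suggested shape (names hypothetical; plain `malloc`/`memcpy` stand in for the Metal allocation and copy calls, just as ggml-cuda uses `cudaMalloc`/`cudaMemcpy`):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the minimal backend suggested above: alloc_buffer
 * plus set_tensor/get_tensor that copy between host and "device" memory.
 * malloc/memcpy stand in for the Metal allocation and blit calls. */

typedef struct {
    void * device_base; /* device allocation backing the tensors in this buffer */
    size_t size;
} backend_buffer_t;

typedef struct {
    void * data;   /* device address inside a backend buffer */
    size_t nbytes;
} tensor_t;

static backend_buffer_t * backend_alloc_buffer(size_t size) {
    backend_buffer_t * buf = malloc(sizeof(backend_buffer_t));
    buf->device_base = malloc(size); /* would be a Metal allocation */
    buf->size = size;
    return buf;
}

/* Copy host -> device (would be a Metal upload). */
static void backend_set_tensor(tensor_t * t, const void * host_data) {
    memcpy(t->data, host_data, t->nbytes);
}

/* Copy device -> host (would be a Metal download). */
static void backend_get_tensor(const tensor_t * t, void * host_data) {
    memcpy(host_data, t->data, t->nbytes);
}
```

The point of this shape is that the backend owns the device memory, and data crosses the CPU/GPU boundary only through `set_tensor`/`get_tensor`.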
Ok, let me try to do it like this. The good thing is I just got it running, and thanks to the splits I can now run part of the model on the CPU and the rest on the GPU, which was not possible before.
Let's see if I can avoid all the hacks as you suggest.
Bonus: supports partial inference on the CPU
ggml-metal.m (Outdated)

```objc
id<MTLBuffer> id_dst  = dst  ? ggml_metal_get_buffer(ctx, dst, &offs_dst) : nil;
id<MTLBuffer> id_src0 = src0 ? src0->data : nil;
id<MTLBuffer> id_src1 = src1 ? src1->data : nil;
id<MTLBuffer> id_dst  = dst  ? dst->data  : nil;
```
So here, instead of `src0->data`, which is the GPU address of the memory, I need to get the Metal buffer that I created in the wrapper. How should I access it?
If you cannot use the pointer directly, an option could be to store the offset in `ggml_tensor::data` (just pass a NULL pointer to `ggml_allocator_simple_init`), and store the MTLBuffer in `ggml_tensor::extra`.
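The offset-plus-`extra` scheme could look roughly like this (hypothetical types; a real `MTLBuffer` handle would take the place of `device_buffer_t`):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of the suggested scheme: tensor->data stores an offset instead of a
 * pointer, and tensor->extra stores the device buffer handle, so the device
 * address is recovered as base + offset. Names are hypothetical. */

typedef struct {
    void * base; /* stands in for the MTLBuffer's underlying storage */
} device_buffer_t;

typedef struct {
    void * data;  /* holds an offset cast to a pointer, not a real address */
    void * extra; /* the device_buffer_t that owns the storage */
} tensor_t;

/* Recover the real device address of a tensor from its buffer and offset. */
static void * tensor_device_addr(const tensor_t * t) {
    const device_buffer_t * buf = t->extra;
    return (char *) buf->base + (uintptr_t) t->data;
}
```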
To set the `extra` pointer, you can implement `init_tensor` in the `ggml_backend_buffer_interface` in the same way as `free_data`. It is called by the allocator after allocating a tensor.
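A sketch of that hook (struct layout hypothetical, loosely mirroring the described `ggml_backend_buffer_interface`): the allocator calls `init_tensor` right after placing a tensor, giving the backend a chance to stash its per-buffer handle in `extra`.

```c
#include <assert.h>
#include <stddef.h>

typedef struct tensor {
    void * data;
    void * extra;
} tensor_t;

typedef struct backend_buffer backend_buffer_t;

/* Hypothetical buffer interface with the init_tensor callback. */
typedef struct {
    void (*init_tensor)(backend_buffer_t * buffer, tensor_t * tensor);
} backend_buffer_interface_t;

struct backend_buffer {
    backend_buffer_interface_t iface;
    void * backend_handle; /* e.g. the MTLBuffer backing this allocation */
};

/* Example backend implementation: stash the buffer's handle in extra. */
static void metal_init_tensor(backend_buffer_t * buffer, tensor_t * tensor) {
    tensor->extra = buffer->backend_handle;
}

/* Allocator side: after placing the tensor, notify the backend. */
static void allocator_alloc_tensor(backend_buffer_t * buffer, tensor_t * tensor) {
    if (buffer->iface.init_tensor) {
        buffer->iface.init_tensor(buffer, tensor);
    }
}
```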
Ok, will try to do it later tonight
If you do it this way, I think you will also have to handle views differently, because in some cases views are created in a different `ggml_buffer` than their sources. For example, the KV cache has its own `ggml_buffer`, but it is used via views that are created in the compute buffer.
So you would have to consider the OP of a tensor to determine if it is a view, and if so use the `MTLBuffer` of its parent.
It may also be possible to use the `gpuAddress` of `MTLBuffer` in the kernels directly, if you pass the pointer instead of going through a `MTLComputeCommandEncoder`. But I may be wrong about that; this is just what I could find in a quick search.
Ah, it's likely because `Vcur` is a "view" (i.e. `ggml_transpose`).
The latest version almost generates coherent text, but not quite. I'm missing something somewhere.
But overall the `extra` mechanism is not great, and we have to figure out something better. It is ok when the user code uses it, but with the current approach it looks like `ggml.c` has to be "aware" of it, which I think we should avoid. Maybe the automatic allocator that you are introducing would somehow resolve that.
The `extra` also seems incompatible with "view" operations. I suspect the bug that I have is somehow related to that part, but it's difficult to trace.
I'll leave this for now as there are other things piling up. I'm a bit worried that it will become more and more difficult to keep the branch up-to-date with `master`.
I agree that dealing with `extra` in `ggml.c` is not good. I will look into a better way to solve this. I think this could possibly be solved by adding a callback similar to `init_tensor` for views; the backend would then have an opportunity to set `extra` there, or do any other initialization it may need for views.
I think I have an idea how to implement this to fit the new interface and keep the Metal buffers in the Metal context, avoiding the use of `extra`. Will probably take another shot over the weekend.
If you rebase (I suggest you don't, the allocator needs more work), keep in mind that partial offloading to the GPU is currently broken; only full offloading works.
Force-pushed from d626b55 to 7252963
Force-pushed from 7252963 to 4daa5ee
Force-pushed from 4daa5ee to d45c163
Obsoleted by ggerganov/ggml#547
WIP