
ggml : adapt Metal to new ggml_backend interface #2258

Closed
wants to merge 9 commits from the ggml-backends-metal branch

Conversation

ggerganov
Owner

Work in progress

@ggerganov ggerganov changed the title from "ggml : add adapt Metal to new ggml_backend interface" to "ggml : adapt Metal to new ggml_backend interface" on Jul 18, 2023
@ggerganov ggerganov force-pushed the ggml-backends-metal branch from 985457b to 90503f1 Compare July 18, 2023 14:54
Metal can share the RAM memory and can use mmap without a temp buffer
ggml-backend.h Outdated (resolved)
@ggerganov
Owner Author

ggerganov commented Jul 19, 2023

I'll need some mechanism to "map" the RAM to Metal buffers.
On master we do it like this:

llama.cpp/llama.cpp

Lines 2819 to 2825 in 294f424

LLAMA_METAL_CHECK_BUF(ggml_metal_add_buffer(ctx->ctx_metal, "data", data_ptr, data_size, max_size));
LLAMA_METAL_CHECK_BUF(ggml_metal_add_buffer(ctx->ctx_metal, "eval", ctx->buf_compute.addr, ctx->buf_compute.size, 0));
LLAMA_METAL_CHECK_BUF(ggml_metal_add_buffer(ctx->ctx_metal, "kv", ctx->kv_self.buf.addr, ctx->kv_self.buf.size, 0));
LLAMA_METAL_CHECK_BUF(ggml_metal_add_buffer(ctx->ctx_metal, "scr0", ctx->buf_scratch[0].addr, ctx->buf_scratch[0].size, 0));
LLAMA_METAL_CHECK_BUF(ggml_metal_add_buffer(ctx->ctx_metal, "scr1", ctx->buf_scratch[1].addr, ctx->buf_scratch[1].size, 0));

I guess I'll do something similar here.
It does not look like we need to extend the backend interface with new calls, but if you have a better idea, let me know.

Edit: On second look, ggml_metal_add_buffer() has to be part of the backend interface, as I cannot access the Metal context from llama.cpp. I will add a new interface call, something like buffer_map.
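
For reference, a rough sketch of how such a call might be declared; the name and the exact parameters are illustrative only, not a settled API:

// illustrative declaration only (not the final interface): register an existing
// host allocation with the Metal backend so kernels can address it directly,
// similar to what ggml_metal_add_buffer does on master
#include <stdbool.h>
#include <stddef.h>

struct ggml_backend; // opaque backend handle

bool ggml_backend_metal_map_buffer(
        struct ggml_backend * backend,
        const char          * name,      // debug label, e.g. "data", "kv"
        void                * data,      // host pointer to map
        size_t                size,      // size of the mapped region
        size_t                max_size); // optional upper bound, 0 if unused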

@slaren
Collaborator

slaren commented Jul 19, 2023

I think you only need to do that to support mmap, right? For the rest of the buffers, just implement alloc_buffer by allocating a normal Metal buffer, without bytesNoCopy.

mmap support isn't handled by the backend interface. For now, I would suggest adding a custom function to create a mapped metal buffer, then in load_all_data calculate the address and assign it to ggml_tensor::data. It's not great, but until we have a way to generalize support for mmap, it should work. An example of a custom backend function is ggml_backend_cpu_set_n_threads.
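
A minimal sketch of that suggestion; the context struct and function name are made up for illustration, and it assumes non-ARC compilation (as ggml-metal.m uses) so the id can live in a plain struct:

#import <Metal/Metal.h>
#include <stddef.h>
#include <stdio.h>

// illustrative backend context; the real one holds more state
struct metal_backend_ctx {
    id<MTLDevice> device;
};

// alloc_buffer-style allocation: a plain shared-storage Metal buffer,
// no bytesNoCopy mapping involved; data moves in/out via set/get_tensor
static id<MTLBuffer> metal_alloc_buffer(struct metal_backend_ctx * ctx, size_t size) {
    id<MTLBuffer> buf = [ctx->device newBufferWithLength:size
                                                 options:MTLResourceStorageModeShared];
    if (buf == nil) {
        fprintf(stderr, "%s: failed to allocate buffer of size %zu\n", __func__, size);
    }
    return buf;
}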

llama.cpp Outdated
struct ggml_backend_buffer * buf_kv = ctx->kv_self.buf->backend_buffer;

LLAMA_METAL_CHECK_BUF(ggml_backend_metal_map_buffer(ctx->model.backend_metal, "eval", buf_compute->backend_data, buf_compute->backend_size, 0));
LLAMA_METAL_CHECK_BUF(ggml_backend_metal_map_buffer(ctx->model.backend_metal, "kv", buf_kv->backend_data, buf_kv->backend_size, 0));
Collaborator

Is the goal of mapping these buffers to make the tensors work with the CPU backend, to be able to use Accelerate when processing prompts?

Owner Author

The mapping is not needed only for mmap. Apple Silicon chips have unified memory, which means the same memory block can be read and written by both the CPU and the GPU. So even if we don't use mmap, we want the CPU-allocated buffers to be used by Metal directly.
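
For illustration, this is roughly how a CPU-allocated region can be exposed to Metal without a copy on unified memory; newBufferWithBytesNoCopy requires the pointer and length to be page-aligned (a standalone sketch, not the backend code itself):

#import <Metal/Metal.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();

    // page-aligned, page-sized host allocation, as required by bytesNoCopy
    const size_t page = (size_t) getpagesize();
    const size_t size = ((16u*1024*1024 + page - 1)/page)*page;
    void * data = NULL;
    if (posix_memalign(&data, page, size) != 0) return 1;

    // wrap the same memory in an MTLBuffer: no copy is made, so the CPU and
    // the GPU read and write the exact same bytes
    id<MTLBuffer> buf = [device newBufferWithBytesNoCopy:data
                                                  length:size
                                                 options:MTLResourceStorageModeShared
                                             deallocator:nil];

    ((float *) data)[0] = 1.0f; // visible to Metal kernels via buf, no upload needed
    return (buf != nil && ((float *)[buf contents])[0] == 1.0f) ? 0 : 1;
}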

Collaborator

I understand that's the way the Metal backend works currently, but the backend interface is not designed to work this way, and you are going to run into a lot of problems if you try to implement it like this. If the Metal backend is capable of doing matrix-matrix multiplication (even if it is slow), I suggest implementing it in the simplest way possible for now: implement alloc_buffer, get_tensor and set_tensor, and use the data member of the tensors in the Metal kernels directly instead of mapping addresses. The ggml-cuda backend should be a good example of this; just replace the CUDA memory allocation and copy functions with the Metal ones.
After that works, we can add the changes necessary to support Accelerate again.
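
A minimal sketch of the set_tensor/get_tensor side for shared-storage Metal buffers; the function names and signatures here are illustrative, the real entry points live in the backend interface:

#import <Metal/Metal.h>
#include <string.h>

// with MTLResourceStorageModeShared the buffer contents are host-visible, so
// set/get reduce to a memcpy at the tensor's offset within the buffer
// (a discrete-GPU backend like ggml-cuda needs explicit device copies instead)
static void metal_buffer_set_data(id<MTLBuffer> buf, size_t offs, const void * src, size_t size) {
    memcpy((char *)[buf contents] + offs, src, size);
}

static void metal_buffer_get_data(id<MTLBuffer> buf, size_t offs, void * dst, size_t size) {
    memcpy(dst, (char *)[buf contents] + offs, size);
}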

Owner Author

Ok, let me try to do it like this. The good thing is that I just got it running, and thanks to the splits I can now run part of the model on the CPU and the rest on the GPU, which was not possible before.

Let's see if I can avoid all the hacks as you suggest.

ggml-metal.m Outdated (3 resolved review comments)
ggml-metal.m Outdated
-  id<MTLBuffer> id_dst = dst ? ggml_metal_get_buffer(ctx, dst, &offs_dst) : nil;
+  id<MTLBuffer> id_src0 = src0 ? src0->data : nil;
+  id<MTLBuffer> id_src1 = src1 ? src1->data : nil;
+  id<MTLBuffer> id_dst = dst ? dst->data : nil;
Owner Author

So here, instead of src0->data, which is the GPU address of the memory, I need to get the Metal buffer that I created in the wrapper. How should I access it?

Collaborator

If you cannot use the pointer directly, an option could be to store the offset in ggml_tensor::data (just pass a NULL pointer to ggml_allocator_simple_init), and store the MTLBuffer in ggml_tensor::extra.
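
A sketch of that scheme, assuming non-ARC compilation as in ggml-metal.m; the offset lives in ggml_tensor::data and the owning MTLBuffer in ggml_tensor::extra (the helper names are made up):

#import <Metal/Metal.h>
#include <stdint.h>
#include "ggml.h"

// after allocation: data holds only the byte offset, extra holds the buffer
static void metal_tensor_init(struct ggml_tensor * t, id<MTLBuffer> buf, size_t offs) {
    t->data  = (void *)(uintptr_t) offs; // offset within buf, not a real pointer
    t->extra = (void *) buf;             // owning Metal buffer (non-ARC cast)
}

// when encoding a kernel: recover the buffer and offset for the encoder
static void metal_bind_tensor(id<MTLComputeCommandEncoder> encoder,
                              const struct ggml_tensor * t, int idx) {
    id<MTLBuffer> buf  = (id<MTLBuffer>) t->extra;
    const size_t  offs = (size_t)(uintptr_t) t->data;
    [encoder setBuffer:buf offset:offs atIndex:idx];
}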

Collaborator

To set the extra pointer, you can implement init_tensor in the ggml_backend_buffer_interface in the same way as free_data. It is called by the allocator after allocating a tensor.
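
Roughly how that could look; the exact layout of ggml_backend_buffer_interface on this branch may differ, so the struct below is only an assumed shape:

#import <Metal/Metal.h>
#include "ggml.h"

// assumed shape of the buffer interface for this sketch
struct ggml_backend_buffer_interface_sketch {
    void (*init_tensor)(void * buffer_ctx, struct ggml_tensor * tensor);
    void (*free_data)  (void * buffer_ctx);
};

// init_tensor is called by the allocator right after a tensor is placed in the
// buffer; a natural place to stash the MTLBuffer in tensor->extra
// (assumes the buffer context is the MTLBuffer itself, non-ARC)
static void metal_buffer_init_tensor(void * buffer_ctx, struct ggml_tensor * tensor) {
    tensor->extra = buffer_ctx;
}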

Owner Author

Ok, will try to do it later tonight

Collaborator

If you do it this way, I think you will also have to handle views differently, because in some cases views are created in a different ggml_buffer than their sources. For example, the KV cache has its own ggml_buffer, but it is used via views that are created in the compute buffer.
So you would have to check the op of a tensor to determine whether it is a view, and if so use the MTLBuffer of its parent.

It may also be possible to use the gpuAddress of the MTLBuffer in the kernels directly, if you pass the pointer instead of going through a MTLComputeCommandEncoder. But I may be wrong about that; this is just what I could find in a quick search.
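
A sketch of the parent-buffer lookup for views mentioned above; which ops count as views, the src[0]/extra field names, and the chain walk are all assumptions for illustration:

#include <stdbool.h>
#include "ggml.h"

// assumption: these ops produce views into their first source
static bool is_view_op(enum ggml_op op) {
    return op == GGML_OP_VIEW      || op == GGML_OP_RESHAPE ||
           op == GGML_OP_TRANSPOSE || op == GGML_OP_PERMUTE;
}

// walk up the view chain to the tensor that actually owns the data,
// then use its Metal buffer (stored in extra in this sketch)
static void * metal_owner_extra(const struct ggml_tensor * t) {
    while (is_view_op(t->op) && t->src[0] != NULL) {
        t = t->src[0];
    }
    return t->extra;
}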

Owner Author

Ah, it's likely because Vcur is a "view" (i.e. ggml_transpose)

Owner Author

The latest version almost generates coherent text but not quite. I'm missing something somewhere.

But overall the extra mechanism is not great and we have to figure out something better.
It is ok when the user code uses it, but with the current approach it looks like ggml.c has to be "aware" of it, which I think we should avoid. Maybe the automatic allocator that you are introducing will somehow resolve that.

The extra also seems incompatible with "view" operations. I suspect the bug that I have is somehow related to that part, but it's difficult to trace.

I'll leave this for now as there are other things piling up. I am a bit worried that it will become more and more difficult to keep the branch up to date with master.

Collaborator

I agree that dealing with extra in ggml.c is not good. I will look into a better way to solve this. I think it could be solved by adding a callback similar to init_tensor for views; the backend would then have an opportunity to set extra there, or do any other initialization it may need for views.
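
A possible shape for such a callback; the name init_view and its parameters are purely speculative:

#include "ggml.h"

// speculative sketch: alongside init_tensor, a hook the allocator would call
// when a view is created, so the backend can derive extra from the view's parent
struct ggml_backend_buffer_view_hooks_sketch {
    void (*init_tensor)(void * buffer_ctx, struct ggml_tensor * tensor);
    void (*init_view)  (void * buffer_ctx, struct ggml_tensor * view,
                        struct ggml_tensor * parent);
};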

Owner Author

I think I have an idea of how to implement this to fit the new interface while keeping the Metal buffers in the Metal context, avoiding the use of extra. I will probably take another shot at it over the weekend.

Collaborator

If you rebase (I suggest you don't, since the allocator needs more work), keep in mind that partial offloading to the GPU is currently broken; only full offloading works.

@ggerganov ggerganov force-pushed the ggml-backends-metal branch 3 times, most recently from d626b55 to 7252963 Compare July 20, 2023 18:28
ggml-metal.m Outdated (resolved)
@ggerganov ggerganov force-pushed the ggml-backends-metal branch from 7252963 to 4daa5ee Compare July 20, 2023 18:44
@ggerganov ggerganov force-pushed the ggml-backends-metal branch from 4daa5ee to d45c163 Compare July 20, 2023 19:51
@ggerganov
Owner Author

Obsoleted by ggerganov/ggml#547

@ggerganov ggerganov closed this Oct 8, 2023