-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ggml backends interface v1 #547
Conversation
I will do a sync of |
@slaren Let's rebase on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing something similar as the comment in ggml-cuda.h
for all other backends, should be the goal IMO.
Introducing ggml to existing codebases would be easiest from the inside out as a plugin: first doing some leaves of the computation, then more and more towards the root.
@@ -41,6 +43,10 @@ GGML_API bool ggml_cuda_compute_forward(struct ggml_compute_params * params, s | |||
GGML_API int ggml_cuda_get_device_count(void); | |||
GGML_API void ggml_cuda_get_device_description(int device, char * description, size_t description_size); | |||
|
|||
// backend API | |||
GGML_API ggml_backend_t ggml_backend_cuda_init(void); // TODO: take a list of devices to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, it will be nice if besides devices, one could provide stream handles and cublas handles here. This will allow plugging ggml in existing code.
I currently have hacked ggml_init_cublas as it is in master to allow exactly this. Thus I created a pytorch plugin which uses the cuda instance from pytorch and now I directly assign pytorch tensor pointers to ggml tensors.
I was planning to make my hack more presentable, but this PR is basically the way to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future, support for the old API will be removed, and then all the globals will be moved into the backend context. At that point we could consider how to add this functionality, but I think it is too early for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's wrong with using the globals?
If one wants to add a plugin, it is very unlikely for them to need multiple backend instances. And even if they do, this can be marked as "not supported yet".
If one wants to use the old behavior, they won't provide handles and it will be just the same.
No existing functionality is going to break if this feature is added with the PR. On the contrary: plugins like mine will be possible with vanilla ggml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few problems with this:
- The backends interface is still not finished, many things are going to change in the next weeks
- I don't really like the idea exposing internals of the CUDA backend, especially at a point where things are going to change significantly
- This looks like an extremely niche application, so I am not convinced that it needs to be merged. The patch can be maintained in the relevant repository.
- I do not understand what are your requirements, so I cannot really do this myself in the first place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't say that it's a niche application. Consider an existing inference codebase which doesn't use ggml. Be it pytorch-based, pure cuda (FT, cuDNN, TRT...), tensorflow, or other. If one wants to migrate it to ggml for some reason, what can they do?
They can completely rewrite the inference in ggml which is quite error prone. Or they can implement parts of it in ggml. For example port parts of the computation graph and leave the rest as is. Gradually moving to entirely ggml-based inference from the inside out while having a working version at each step.
This is exactly my use case. I'm moving a pytorch-based application to ggml.
I think this can be quite beneficial for the library and people who want to evaluate it without having to go all the way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a function to change the main stream work? That would allow you to queue the ggml-cuda computations on the pytorch stream, which may improve performance due to reduced synchronization. Is there any reason to re-use the pytorch cuBLAS handle? Does each cuBLAS handle use a large amount of GPU resources?
Need to log off earlier today - will review tomorrow. |
The branch should be stable, I think this is already good enough for a v1. It should be easier to implement than the last time. There is a pointer to the buffer in the tensors, and that should help with backends that cannot abstract a buffer+offset into a pointer, like Metal. Views inherit this buffer from the parent automatically. But there are very likely going to be some rough edges. So I would say to not do it yet, unless you are willing to look into the internals and change anything that may be required. v2 will introduce partial offloading. Fallback to CPU will come at the same time, or a little later. After fallback to CPU is implemented, I will implement support for the OpenCL backend. I think that Metal is closer to OpenCL than to CUDA, so it may be in a better state to implement Metal support after that point. |
* ggml-backend : code style suggestions * ggml-backend : move ggml_backend and ggml_backend_buffer in the source file * ggml-backend : move structs back to header + rename type * ggml-backend : remove obsolete comment * fix leak in ggml_backend_buffer_free * ggml-backend : re-introduce typedefs as a declaration of intent --------- Co-authored-by: slaren <slarengh@gmail.com>
* ggml-backend : metal (WIP) * ggml-backend : metal (adapt CPU backend) * ggml-backend : working metal * ggml-backend : clean-up metal implementation * ggml-backend : add ggml_backend_is_metal()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this - I think it is a good addition to the framework
…ions correctly with `vocab_only` setting. Also confirmed that the code works as expected after running with reduced memory usage due to deletion of no-longer-needed variable. (ggerganov#547)
Initial version of the common backends interface.
Supports CPU and CUDA with full offloading, but not partial offloading or fallback to the CPU backend for unimplemented ops.
Modifies the gpt-2 example to use the backends interface. If built with CUDA support and executed with
-ngl 1
, the CUDA backend is used.Support for Metal added in #552