Add a device parameter to malloc_or_wait #1820

Open · wants to merge 1 commit into main
Conversation

zcbenz (Contributor) commented Feb 2, 2025

This PR adds a device parameter to mallocs, which makes code written for the unified memory backend also usable for a general GPU backend (#1789).

awni (Member) commented Feb 6, 2025

Per some offline discussion, we prefer not to merge this as it breaks one of the key design goals of MLX (namely, that it is intended for unified memory systems).

For example, if we ever added a CUDA backend we would want to use the unified memory programming model (which it supports): https://developer.nvidia.com/blog/unified-memory-cuda-beginners/

I think WebGPU does things a bit in reverse: if you are on a unified memory system then copies between CPU and GPU can be no-ops, but the programmer is always required to manage memory explicitly.

I think the main reason you wanted to add the device parameter is so you can reuse a lot of the code in common. So in the worst case, there will be some code duplication. It may be possible to refactor some of that code into more reusable components (feel free to make changes there).

I think there may also be an alternative solution by adding a level of indirection on top of the webgpu allocator which does lazy allocation of CPU / GPU pointers as needed (e.g. when actually called by the primitive).
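The indirection described above can be sketched roughly as follows: a wrapper that records only the requested size up front and defers the actual CPU or GPU allocation until a primitive first asks for that side's pointer. All names here (`LazyBuffer`, the simulated GPU handle, etc.) are illustrative assumptions, not MLX or WebGPU API.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical lazy dual-pointer buffer: neither side is allocated
// until it is actually requested by a primitive.
class LazyBuffer {
 public:
  explicit LazyBuffer(size_t nbytes) : nbytes_(nbytes) {}
  ~LazyBuffer() { std::free(cpu_ptr_); }

  // CPU pointer: allocated on first use by a common/CPU primitive.
  void* cpu() {
    if (!cpu_ptr_) cpu_ptr_ = std::malloc(nbytes_);
    return cpu_ptr_;
  }

  // GPU side: allocated on first use by a GPU primitive. A real
  // backend would create a device buffer; here an opaque id stands in.
  int gpu() {
    if (gpu_handle_ < 0) gpu_handle_ = next_id_++;
    return gpu_handle_;
  }

  bool has_cpu() const { return cpu_ptr_ != nullptr; }
  bool has_gpu() const { return gpu_handle_ >= 0; }

 private:
  size_t nbytes_;
  void* cpu_ptr_ = nullptr;
  int gpu_handle_ = -1;
  static int next_id_;
};
int LazyBuffer::next_id_ = 0;
```

With this scheme, code in common that only ever touches the CPU pointer never triggers a GPU allocation, and vice versa; the cost is an extra branch on each pointer access.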

zcbenz (Contributor, Author) commented Feb 7, 2025

I totally understand the decision. My only goal is to make the code reusable, not to change MLX's design goals.

> I think there may also be an alternative solution by adding a level of indirection on top of the webgpu allocator which does lazy allocation of CPU / GPU pointers as needed (e.g. when actually called by the primitive).

The GPU memory can be allocated lazily, but there is no way to do the same for the CPU side: we cannot hook the memcpy or *ptr accesses.

A workaround is to always allocate CPU memory, which only happens when using code from common; if done carefully there should not be much impact on performance or memory usage. But this design complicates things a lot and does not feel very clean.

> I think the main reason you wanted to add the device parameter is so you can reuse a lot of the code in common. So worst case scenario, there will be some code duplication. It may be possible to refactor some of that code into more reusable components (feel free to make changes there).

I can define a hook function like malloc_for_array which calls malloc_or_wait in the cpu/metal backends and something else in webgpu, and for the code that I want to reuse, I just replace calls like

out.set_data(allocator::malloc_or_wait(nbytes));

with:

out.set_data(malloc_for_array(out, nbytes));

Though I think this option is essentially just a variant of this PR. With this PR I did not intend to change every invocation of malloc_or_wait to pass a device argument; I only planned to change the code that can be reused, i.e. use malloc_or_wait(device, nbytes) for code that might be used in the webgpu backend and malloc_or_wait(nbytes) for everything else.
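The hook-function variant above can be sketched as follows. The stubs stand in for the real allocator entry points; MLX's actual `malloc_or_wait` signature and `Buffer`/`array` types differ, so everything here is a simplified assumption used only to show the dispatch.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Simplified stand-ins for the real types and allocator entry points.
struct Buffer { void* ptr; std::string backend; };
struct Array {};  // dtype/shape elided

Buffer malloc_or_wait(size_t nbytes) {   // cpu/metal path (stub)
  return {std::malloc(nbytes), "cpu"};
}
Buffer webgpu_malloc(size_t nbytes) {    // webgpu path (stub)
  return {std::malloc(nbytes), "webgpu"};
}

// The hook: common code calls malloc_for_array instead of
// malloc_or_wait directly, and each backend supplies its own
// definition at build time.
#ifdef MLX_WEBGPU
Buffer malloc_for_array(const Array&, size_t nbytes) {
  return webgpu_malloc(nbytes);
}
#else
Buffer malloc_for_array(const Array&, size_t nbytes) {
  return malloc_or_wait(nbytes);
}
#endif
```

Passing the array (rather than just a byte count) also gives the webgpu definition access to the dtype, which matters for the bool-size issue discussed later in the thread.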

zcbenz (Contributor, Author) commented Feb 11, 2025

I just realized a pitfall of webgpu: the bool type in webgpu takes 4 bytes instead of 1. So I likely have to do conversions on the fly when uploading to / downloading from the GPU, and I can no longer assume the data on the CPU and GPU have the same size.

Either alternative should still work, though, since the array's dtype information is available when doing allocations.
