Add a device parameter to malloc_or_wait #1820

Open · wants to merge 1 commit into main
Conversation

zcbenz (Contributor) commented Feb 2, 2025

This PR adds a device parameter to mallocs, which makes code written for the unified memory backend also usable for a general GPU backend (#1789).

awni (Member) commented Feb 6, 2025

Per some offline discussion, we prefer not to merge this as it breaks one of the key design goals of MLX (namely, that it is intended for unified memory systems).

For example, if we ever added a CUDA backend we would want to use the unified memory programming model (which it supports): https://developer.nvidia.com/blog/unified-memory-cuda-beginners/

I think WebGPU does things a bit in reverse: if you are on a unified memory system then copies between CPU and GPU can be no-ops, but the programmer is always required to manage memory explicitly.

I think the main reason you wanted to add the device parameter is so you can reuse a lot of the code in common. So in the worst case, there will be some code duplication. It may be possible to refactor some of that code into more reusable components (feel free to make changes there).

I think there may also be an alternative solution by adding a level of indirection on top of the webgpu allocator which does lazy allocation of CPU / GPU pointers as needed (e.g. when actually called by the primitive).
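The indirection described above can be sketched roughly as follows: a wrapper that records only the requested size up front and defers the actual CPU or GPU allocation until a primitive first asks for that side's pointer. All names here (`LazyBuffer`, the simulated GPU handle, etc.) are illustrative assumptions, not MLX or WebGPU API.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical lazy dual-pointer buffer: neither side is allocated
// until it is actually requested by a primitive.
class LazyBuffer {
 public:
  explicit LazyBuffer(size_t nbytes) : nbytes_(nbytes) {}
  ~LazyBuffer() { std::free(cpu_ptr_); }

  // CPU pointer: allocated on first use by a common/CPU primitive.
  void* cpu() {
    if (!cpu_ptr_) cpu_ptr_ = std::malloc(nbytes_);
    return cpu_ptr_;
  }

  // GPU side: allocated on first use by a GPU primitive. A real
  // backend would create a device buffer; here an opaque id stands in.
  int gpu() {
    if (gpu_handle_ < 0) gpu_handle_ = next_id_++;
    return gpu_handle_;
  }

  bool has_cpu() const { return cpu_ptr_ != nullptr; }
  bool has_gpu() const { return gpu_handle_ >= 0; }

 private:
  size_t nbytes_;
  void* cpu_ptr_ = nullptr;
  int gpu_handle_ = -1;
  static int next_id_;
};
int LazyBuffer::next_id_ = 0;
```

With this scheme, code in common that only ever touches the CPU pointer never triggers a GPU allocation, and vice versa; the cost is an extra branch on each pointer access.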

zcbenz (Contributor, Author) commented Feb 7, 2025

I totally understand the decision. My only goal is to make the code reusable, not to change MLX's design goals.

> I think there may also be an alternative solution by adding a level of indirection on top of the webgpu allocator which does lazy allocation of CPU / GPU pointers as needed (e.g. when actually called by the primitive).

The GPU memory can be allocated lazily, but there is no way to do the same for the CPU side: we cannot hook the memcpy or *ptr accesses.

A workaround is to always allocate CPU memory, which only happens when using code from common; if done carefully there should not be much impact on performance or memory usage. But this design complicates things a lot and does not feel very clean.

> I think the main reason you wanted to add the device parameter is so you can reuse a lot of the code in common. So worst case scenario, there will be some code duplication. It may be possible to refactor some of that code into more reusable components (feel free to make changes there).

I can define a hook function like malloc_for_array which calls malloc_or_wait in the cpu/metal backends and something else in webgpu, and for the code that I want to reuse, I just replace calls like

out.set_data(allocator::malloc_or_wait(nbytes));

with:

out.set_data(malloc_for_array(out, nbytes));

Though I think this option is essentially just a variant of this PR. With this PR I did not intend to change every invocation of malloc_or_wait to pass a device argument; I only planned to change the code that can be reused, i.e. use malloc_or_wait(device, nbytes) for code that might be used in the webgpu backend and malloc_or_wait(nbytes) for everything else.
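The hook-function variant above can be sketched as follows. The stubs stand in for the real allocator entry points; MLX's actual `malloc_or_wait` signature and `Buffer`/`array` types differ, so everything here is a simplified assumption used only to show the dispatch.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Simplified stand-ins for the real types and allocator entry points.
struct Buffer { void* ptr; std::string backend; };
struct Array {};  // dtype/shape elided

Buffer malloc_or_wait(size_t nbytes) {   // cpu/metal path (stub)
  return {std::malloc(nbytes), "cpu"};
}
Buffer webgpu_malloc(size_t nbytes) {    // webgpu path (stub)
  return {std::malloc(nbytes), "webgpu"};
}

// The hook: common code calls malloc_for_array instead of
// malloc_or_wait directly, and each backend supplies its own
// definition at build time.
#ifdef MLX_WEBGPU
Buffer malloc_for_array(const Array&, size_t nbytes) {
  return webgpu_malloc(nbytes);
}
#else
Buffer malloc_for_array(const Array&, size_t nbytes) {
  return malloc_or_wait(nbytes);
}
#endif
```

Passing the array (rather than just a byte count) also gives the webgpu definition access to the dtype, which matters for the bool-size issue discussed later in the thread.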

zcbenz (Contributor, Author) commented Feb 11, 2025

I just realized a pitfall of webgpu: the bool type in webgpu takes 4 bytes instead of 1. So I likely have to do conversions on the fly when uploading to / downloading from the GPU, and I can no longer assume the data on the CPU and GPU have the same size.

Either alternative should still work, though, since the array's dtype information is available when doing allocations.
