
Ability to opt out of / improved automatic synchronization between tasks for shared array usage #2617

Open · maleadt opened this issue Jan 11, 2025 · 9 comments

Labels: cuda array (Stuff about CuArray), good first issue (Good for newcomers), hard (This is difficult), speculative (Not sure about this one yet)

maleadt commented Jan 11, 2025

A single array may be used concurrently on different devices (when it's backed by unified memory), or just in different streams, in which case you don't want to synchronize the different streams involved. For example (pseudocode):

a = cu(rand(N, 2))

@async begin
  @cuda kernel(a[:, 1])
end

@async begin
  @cuda kernel(a[:, 2])
end
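
A runnable variant of the above, using views so both tasks share the same buffer (the kernel is a trivial placeholder, only there to have something to launch):

using CUDA

function kernel(x)
    i = threadIdx().x
    if i <= length(x)
        @inbounds x[i] += 1f0
    end
    return
end

N = 256
a = cu(rand(Float32, N, 2))

# each @async task runs on its own task-local stream
t1 = @async @cuda threads=N kernel(view(a, :, 1))
t2 = @async @cuda threads=N kernel(view(a, :, 2))
wait(t1); wait(t2)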

Here, the second kernel may end up waiting for the first one to complete, because we automatically synchronize when accessing the array from a different stream:

CUDA.jl/src/memory.jl

Lines 565 to 569 in a4a9166

# accessing memory on another stream: ensure the data is ready and take ownership
if managed.stream != state.stream
    maybe_synchronize(managed)
    managed.stream = state.stream
end

This was identified in #2615, but note that this doesn't necessarily involve multiple GPUs, and would manifest when attempting to overlap kernel execution as well.


It's not immediately clear to me how to best solve this. @pxl-th suggested never synchronizing automatically between different tasks, but that doesn't seem like a viable option to me:

  1. it would re-introduce the IMO surprising and hard-to-explain behavior of having to explicitly call synchronize() on each exit path of an @async block to even make it possible to read the data in a valid manner;
  2. we cannot easily identify when the synchronization is happening between different tasks, unless we would also track the tasks operating on an array, which doesn't seem straightforward.

The first point is crucial to me. I don't want to have to explain to users that they basically can't safely use CuArrays in an @async block without having to explain the asynchronous nature of GPU computing.
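
For instance, without automatic synchronization the following would require a synchronize() on every exit path of the task to be safe (sketch):

using CUDA

t = @async begin
    a = CUDA.rand(10)
    a .+= 1
    # an explicit synchronize() would be needed here (and on every
    # other exit path) before the result could be read from outside
    a
end
b = Array(fetch(t))   # valid today thanks to automatic synchronization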

To illustrate the second point:

device!(0)
a = cu(rand(N, 2))
@cuda kernel(a[:, 1])
device!(1)
@cuda kernel(a[:, 2])

# is detected the same as

device!(0)
a = cu(rand(N, 2))
@async begin
  device!(0)
  @cuda kernel(a[:, 1])
end
@async begin
  device!(1)
  @cuda kernel(a[:, 2])
end

Without having put too much thought into it, I wonder if we can't solve this differently. Essentially, what we want is a synchronization of the task-local stream before the task ends, so that you can safely fetch values from it. That isn't possible, so we opted for detecting when the fetched array is used on a different stream. I wonder if we should instead use a GPU version of @async that inserts this synchronization automatically? Seems like that would hurt portability, though.
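
Something like this, for illustration (name and implementation hypothetical):

# hypothetical sketch, not an actual CUDA.jl API
macro gpu_async(ex)
    quote
        @async begin
            try
                $(esc(ex))
            finally
                synchronize()   # drain the task-local stream on every exit path
            end
        end
    end
end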

Note that this also wouldn't entirely obviate the tracking mechanism: We still need to know which stream was last used by an array operation so that we can efficiently free the array (in a way that only synchronizes that stream and not the whole device). The same applies to tracking the owning device: We now automatically enable P2P access when accessing memory from another device.


Alternatively, we could offer a way to opt out of the automatic behavior, either at array construction time, or by toggling a flag. Seems a bit messy, but would be the simplest solution.

cc @vchuravy

maleadt added the cuda array, hard, and speculative labels on Jan 11, 2025
maleadt commented Jan 11, 2025

For the easy workaround, I'd add a simple synchronized field or so to the Managed struct and have maybe_synchronize bail out if it's set. Some non-exported (as they're supposed to be temporary) interface methods like CUDA.enable_synchronization!(a::CuArray, enabled::Bool=true) could be used to toggle that.
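
Roughly (a simplified sketch, not actual CUDA.jl code; the real Managed struct has more fields):

mutable struct Managed{M}
    mem::M
    stream::CuStream
    synchronized::Bool   # when set, treat the memory as always synchronized
end

function maybe_synchronize(managed::Managed)
    managed.synchronized && return   # user opted out: skip the automatic sync
    synchronize(managed.stream)      # otherwise wait for pending work
end

function enable_synchronization!(a::CuArray, enabled::Bool=true)
    a.data[].synchronized = !enabled   # hypothetical accessor to the Managed object
    return a
end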

maleadt commented Feb 13, 2025

@luraess ran into this:

The issue

CUDA 5.4 improved memory management with better tracking of memory allocations (as explained in https://juliagpu.org/post/2024-05-28-cuda_5.4/). One of the introduced features is implicit synchronisation, which removes the burden from the user of carefully synchronising data used in multiple tasks (see section 2, "Using multiple streams"). The PR introducing the change is #2335.

While this approach certainly has advantages, it also breaks the GPU-aware MPI pipeline we have in Chmy.jl to overlap MPI-related communication with stencil computation, using a task-based approach to leverage asynchronous device execution on different streams.

Profiling one of the MWEs (https://github.com/PTsolvers/Chmy.jl/blob/main/examples/diffusion_2d_mpi_perf.jl), we see that with CUDA >= 5.4, serialisation is happening:

  • CUDA 5.4 (incorrect pipeline): [profiler timeline screenshot]

  • CUDA 5.3.5 (correct pipeline): [profiler timeline screenshot]

The core issue comes from the fact that in Chmy, we are using an approach similar to

using CUDA
A = CUDA.zeros(2)
@sync begin
    Threads.@spawn begin
        @view(A[1:1]) .= 1
        synchronize()
    end
    Threads.@spawn begin
        @view(A[2:2]) .= 2
        synchronize()
    end
end

where views of an array are accessed by different tasks on different streams, while we can guarantee there are no data races.

Way forward

While there are advantages to implicit synchronisation, the current situation introduces a significant limitation when "a single array may be used concurrently on different devices (when it's backed by unified memory), or just in different streams, in which case you don't want to synchronize the different streams involved", see #2617. According to the reports from this issue, it would be ideal if an "opt-out" mechanism were available to

  • restore the previous functionality of using a task-based approach for asynchronous execution with full control;
  • prevent design divergence among different GPU backends (AMDGPU.jl here).

luraess commented Feb 13, 2025

For the easy workaround, I'd add a simple synchronized field or so to the Managed struct and have maybe_synchronize bail out if it's set. Some non-exported (as they're supposed to be temporary) interface methods like CUDA.enable_synchronization!(a::CuArray, enabled::Bool=true) could be used to toggle that.

Is this currently the way to go wrt opting out of implicit sync on a per-array basis? If so, would this be a viable alternative in Chmy, @utkinis?

maleadt commented Feb 13, 2025

Is this currently the way to go

That is a suggestion that's not implemented yet.

luraess commented Feb 13, 2025

Got it, is it the current "best candidate" to be implemented?

maleadt commented Feb 13, 2025

Pretty much, but I haven't put too much thought into it yet. Would be good to get feedback from people running into this.

utkinis commented Feb 18, 2025

Thanks, @maleadt and @vchuravy, for looking into this.

I appreciate the effort to make concurrency with CUDA as seamless as possible. However, in HPC, hiding communication latency is essential for scaling. To achieve this, the ability to use asynchronous computation on different sub-arrays is crucial. When I correctly use the CUDA async API as per the documentation, I expect operations on different streams to overlap, regardless of the arrays involved. If this does not happen, I consider it a bug rather than a "conservative way to ensure correctness." After all, if Base Julia implicitly serialized array access based on threadid(), then Threads.@threads would be very safe and correct, but also completely useless. Once users opt for tasks, they enter the realm of concurrency, which is inherently complex, and I don't believe implicit serialization will meaningfully simplify it.

That said, to be more constructive, if we continue moving in the direction of implicit synchronization as drafted in #2662, I think the following points should be addressed:

  1. Consistency between backends and KernelAbstractions.jl compatibility
    Currently, KernelAbstractions.jl users have no clear way to determine whether they need to explicitly synchronize at the end of a task. Additionally, disabling task synchronization has to be handled via the extension mechanism, despite being essentially identical across backends. The question, then, is whether all KA-compatible backends should be required to implement this function.
  2. Impact on users who do not use tasks but directly use streams
    If I understand correctly, as long as the same array is used across multiple streams, implicit synchronization will occur. This diverges from the behavior of lower-level CUDA APIs and could be confusing for users translating CUDA C code into equivalent CUDA.jl code (see the sketch after this list). Perhaps @omlins could comment on whether this affects ImplicitGlobalGrid.jl.
  3. Granularity of implicit synchronization control
    If a user opts out of implicit synchronization for a single array used within a task, they need an explicit synchronize() at the end of the task. This potentially makes implicit synchronization redundant for all other arrays in the same task. Should this setting be per-task or per-array, then? That is, should producing a pointer check a task-local state before calling maybe_synchronize, rather than relying on an array field? I might be overlooking something here, but I'd rather call disable_task_sync at the beginning of the task, which would then be matched by a synchronize at the end (sketched after this list).
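
To make point 2 concrete: as far as I understand, direct stream usage along these lines would also trigger the implicit synchronization (sketch):

using CUDA

a = CUDA.zeros(Float32, 10)
s1, s2 = CuStream(), CuStream()

stream!(s1) do
    @views a[1:5] .= 1f0    # the array is now tracked as last used on s1
end
stream!(s2) do
    @views a[6:10] .= 2f0   # accessing it from s2 synchronizes s1 first
end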

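And for point 3, a minimal sketch of the per-task variant using task-local storage (all names hypothetical, not an actual CUDA.jl API):

# hypothetical per-task opt-out; names are placeholders
disable_task_sync() = task_local_storage(:cuda_task_sync, false)
task_sync_enabled() = get(task_local_storage(), :cuda_task_sync, true)

# producing a pointer would consult the task-local flag before syncing
function maybe_synchronize(managed)
    task_sync_enabled() || return   # the whole task opted out
    # ... existing per-array synchronization logic ...
end

# usage: opt out once at the start, synchronize explicitly at the end
Threads.@spawn begin
    disable_task_sync()
    # ... launch kernels on views of shared arrays ...
    synchronize()   # the matching explicit synchronization
end
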
I'd be happy to contribute if we can agree on the design.

maleadt commented Feb 18, 2025

When I correctly use the CUDA async API as per the documentation, I expect operations on different streams to overlap, regardless of the arrays involved.

You are not using CUDA async APIs as per the documentation, though; you are using CUDA.jl, which provides different semantics that hopefully suit Julia code and users better. That applies to Julia's concurrent programming constructs, which CUDA.jl users expect to work without footguns. This is not just a matter of convenience; Julia packages and stdlibs actually use these concurrent programming constructs, so we need to handle them correctly. See e.g. #875, which basically boils down to:

da = fetch(@async CuArray([1]))
@assert Array(da) == [1]

The above would fail without automatic synchronization, while IMO being a far cry from the "implicitly serializing on threadid()" comparison you made. And it is actively used in the REPL, VSCode, Pluto, etc.

So I don't think it's an option to simply perform everything as it would happen in C. If HPC users, who are only a subset of the CUDA.jl users, have different needs, I'm happy to accommodate them as long as it doesn't break more common user codes. Specifically, I think this means that we need to keep the current auto-synchronizing behavior by default, unless somebody can come up with a better mechanism that's compatible with your applications. One such alternative @vchuravy and I considered is a task hook that automatically synchronizes at the end of a task, but that was not accepted upstream. So at this point it looks like the only alternative is an opt-out mechanism as suggested in #2662, but I'd be really happy to be proven wrong. (And on a personal note, I'd really like it if people would try, instead of asking for reverts or opt-outs.)

Finally, if you want full control, you can always use the underlying APIs. I've attempted to make that part as painless as possible by providing copious converting methods for all our high-level structures, so you should be fine calling cuMemcpy on a CuArray.
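
For example, something along these lines should work thanks to those conversions (a sketch; I'm using the raw driver wrapper here):

using CUDA

a = CuArray(Float32[1, 2, 3])
b = similar(a)

# call the driver API directly on high-level array pointers,
# bypassing CUDA.jl's stream tracking entirely
CUDA.cuMemcpyDtoD_v2(pointer(b), pointer(a), sizeof(a))
synchronize()
@assert Array(b) == Array(a)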

luraess commented Feb 19, 2025

And on a personal note, I'd really like it if people would try, instead of asking for reverts or opt-outs.

Having tested and profiled the suggestions on the ALPS supercomputer in a multi-GPU setting, after making the required changes in PTsolvers/Chmy.jl#65, there are two points that make the current proposition challenging, namely the behaviour and interaction with other GPU backends, and the return behaviour of CUDA.unsafe_disable_task_sync!.

The divergent behaviour wrt stream sync amongst the various GPU backends, particularly CUDA and AMDGPU, requires extra handling in order to achieve a backend-agnostic implementation with identical behaviour. If done using e.g. KernelAbstractions.jl, there is no way yet to trigger unsafe_disable_task_sync!, which thus requires adding an extra step. Because this step should be backend-agnostic, one would need to make it happen via extensions and dispatch over the backend, which right now looks more like a hack.

It could further be interesting to improve the return behaviour of unsafe_disable_task_sync! so that the array is returned instead of the status, which would allow applying the setting at array construction time. Right now,

julia> a = CUDA.rand(2, 1) |> CUDA.unsafe_disable_task_sync!
false

julia> a
false

whereas getting back a would avoid having to create temporary storage and extra lines of code.
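
Until then, a tiny helper restores the chaining (sketch):

# small wrapper so the toggle can be chained at construction time
no_task_sync(a) = (CUDA.unsafe_disable_task_sync!(a); a)

a = CUDA.rand(2, 1) |> no_task_sync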

Besides the above, which are rather practical concerns wrt using the suggested feature to opt out of implicit sync and restore the previous default and consistency amongst GPU backends, I would be interested in getting others' opinions on points 1-3 raised by @utkinis (especially 1 and 2).
