Ability to opt out of / improved automatic synchronization between tasks for shared array usage #2617
For the easy workaround, I'd add a simple …
@luraess ran into this:
Is this currently the way to go w.r.t. opting out of implicit sync on a per-array basis? If so, would this be a viable alternative in Chmy, @utkinis?
That is a suggestion that's not implemented yet.
Got it. Is it the current "best candidate" to be implemented?
Pretty much, but I haven't put too much thought into it yet. It would be good to get feedback from people running into this.
Thanks, @maleadt and @vchuravy, for looking into this. I appreciate the effort to make concurrency with CUDA as seamless as possible. However, in HPC, hiding communication latency is essential for scaling, and to achieve this, the ability to run asynchronous computation on different sub-arrays is crucial. When I correctly use the CUDA async API as per the documentation, I expect operations on different streams to overlap, regardless of the arrays involved. If this does not happen, I consider it a bug rather than a "conservative way to ensure correctness." After all, Base Julia does not implicitly serialize array access either.

That said, to be more constructive: if we continue moving in the direction of implicit synchronization as drafted in #2662, I think the following points should be addressed:
I'd be happy to contribute if we can agree on the design.
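The latency-hiding pattern described above — overlapping work on disjoint sub-arrays using one task (and thus one stream) each — can be sketched as follows. This is a hedged sketch, not code from the issue: `update!` is a hypothetical stand-in for a real stencil kernel, and the array sizes are illustrative.

```julia
using CUDA

# Hypothetical stand-in for a real compute kernel.
update!(x) = (x .+= 1)

a = CUDA.zeros(1024, 1024)
halo  = @view a[:, 1:2]      # boundary columns, e.g. needed for MPI exchange
inner = @view a[:, 3:end]    # bulk of the domain

# Each Julia task gets its own task-local CUDA stream, so in principle the
# two updates can execute concurrently and the halo work can be overlapped
# with communication.
@sync begin
    @async update!(halo)
    @async update!(inner)
end
```

With implicit synchronization, the second task may serialize against the first when both touch `a`, which is exactly the overlap this comment is concerned about.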
You are not using CUDA async APIs as per the documentation, though; you are using CUDA.jl, which provides different semantics that hopefully suit Julia code and users better. That applies to Julia's concurrent programming constructs, which CUDA.jl users expect to work without footguns. This is not just a matter of convenience; Julia packages and stdlibs actually use these concurrent programming constructs, so we need to handle them correctly. See e.g. #875, which basically boils down to:

```julia
da = fetch(@async CuArray([1]))
@assert Array(da) == [1]
```

The above would fail without automatic synchronization, while IMO being a far cry from the "implicitly serializing" scenario you mention. So I don't think it's an option to simply perform everything as it would happen in C. If HPC users, which are only a subset of the CUDA.jl users, have different needs, I'm happy to accommodate them as long as it doesn't break more common user code. Specifically, I think this means that we need to keep the current auto-synchronizing behavior by default, unless somebody can come up with a better mechanism that's compatible with your applications.

One such alternative @vchuravy and I considered is a task hook that automatically synchronizes at the end of a task, but that was not accepted upstream. So at this point it looks like the only alternative is an opt-out mechanism as suggested in #2662, but I'd be really happy to be proven wrong. (And on a personal note, I'd really appreciate it if people would try, instead of asking for reverts or opt-outs.)

Finally, if you want full control, you can always use the underlying APIs. I've attempted to make that part as painless as possible by providing copious converting methods for all our high-level structures, so you should be fine calling …
Having tested and profiled the suggestions on the ALPS supercomputer in a multi-GPU setting, after making the required changes in the code (PTsolvers/Chmy.jl#65), there are two points that make the current proposition challenging: the behaviour and interaction with other GPU backends, and the return behaviour of `CUDA.unsafe_disable_task_sync!`.

The divergent behaviour w.r.t. stream sync amongst the various GPU backends, particularly CUDA and AMDGPU, requires extra handling in order to achieve a backend-agnostic implementation with identical behaviour. If done using e.g. KernelAbstractions, there is no way yet to trigger the opt-out generically.

It could further be interesting to improve the return behaviour of `CUDA.unsafe_disable_task_sync!`:

```julia
julia> a = CUDA.rand(2, 1) |> CUDA.unsafe_disable_task_sync!
false

julia> a
false
```

while it could be interesting to get back the array `a` instead.

Besides the above, which are rather practical concerns w.r.t. using the suggested feature to opt out of implicit sync and restore the used-to-be default and consistency amongst GPU backends, I would be interested in getting others' opinions on points 1-3 raised by @utkinis (especially 1 and 2).
A single array may be used concurrently on different devices (when it's backed by unified memory), or just on different streams, in which case you don't want to synchronize the different streams involved. For example (pseudocode):
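A hedged sketch of that scenario, with hypothetical kernel names (`kernel!` is illustrative, not an actual CUDA.jl API):

```julia
using CUDA

a = CUDA.zeros(1000)

# Two tasks operate on disjoint halves of the same array; each task uses
# its own task-local stream, so ideally the kernels would overlap.
t1 = @async begin
    @cuda kernel!(view(a, 1:500))      # launches on this task's stream
end
t2 = @async begin
    @cuda kernel!(view(a, 501:1000))   # touches `a` from a different stream
end
wait(t1); wait(t2)
```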
Here, the second kernel may end up waiting for the first one to complete, because we automatically synchronize when accessing the array from a different stream (see CUDA.jl/src/memory.jl, lines 565 to 569 at a4a9166).
This was identified in #2615, but note that this doesn't necessarily involve multiple GPUs, and would manifest when attempting to overlap kernel execution as well.
It's not immediately clear to me how to best solve this. @pxl-th suggested never synchronizing automatically between different tasks, but that doesn't seem like a viable option to me:

- it would require calling `synchronize()` on each exit path outside of an `@async` block to even make it possible to read the data in a valid manner;

The first point is crucial to me. I don't want to have to explain to users that they basically can't safely use `CuArray`s in an `@async` block without having to explain the asynchronous nature of GPU computing.

To illustrate the second point:
Without having put too much thought into it, I wonder if we can't solve this differently. Essentially, what we want is a synchronization of the task-local stream before the task ends, so that you can safely `fetch` values from it. That isn't possible, so we opted for detecting when the fetched array is used on a different stream. I wonder if we should instead use a GPU version of `@async` that inserts this synchronization automatically? Seems like that would hurt portability, though.

Note that this also wouldn't entirely obviate the tracking mechanism: we still need to know which stream was last used by an array operation so that we can efficiently free the array (in a way that only synchronizes that stream and not the whole device). The same applies to tracking the owning device: we now automatically enable P2P access when accessing memory from another device.
Alternatively, we could offer a way to opt out of the automatic behavior, either at array construction time or by toggling a flag. Seems a bit messy, but it would be the simplest solution.
cc @vchuravy