[RFC] Improve argument passing / Support some non device copiable types #5320
Comments
While porting PyTorch kernel templates, we found that tensor dimensions and strides (arrays) captured as kernel functor members were copied into private memory before each kernel start. Besides the optimizations this prevents, the copy itself is suboptimal. @Naghasan Did your patch address this problem? |
Indirectly, this should ease SROA/DAE. I guess your problem has more to do with the way SYCL passes arguments to the device. If you get an unnecessary copy, I guess it is because one of those 2 passes bailed out. If you open a discussion about that, I'm happy to talk about it at greater length. |
@bader do you have an opinion on this? |
Replacing a copy loop with a mem copy sounds reasonable. According to my understanding,
This sounds a bit concerning. AFAIK, today the compiler rejects non-device_copyable types and saves users from potential UB/bugs. It would be great to make the non-standard behavior explicit and per-type, so that the default behavior stays "safe".
Could you elaborate more on that, please? |
That relates to the bit above, actually. If you have
T k;
q.single_task([k]{});
But you can specialize
template <> struct is_device_copyable<T> : std::true_type {};
T k;
q.single_task([k]{});
Supporting the second case is what I'm after; I'm not advocating changing anything for the first case. In our case:
struct T {
some_device_copyable_type1 a;
some_device_copyable_type2 b;
[...]
std::string unused_on_device_yet_defined;
[...]
};
So the proposal here is just to make sure this dtor call doesn't happen (which is fine spec-wise) and to mem copy the argument, which helps optimizations and also ensures no ctor/dtor is called (still fine spec-wise). I know this might look odd, but it helps support frameworks not designed around SYCL. I guess @joeatodd or @masterleinad could provide more insight into the context if needed. |
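To make the per-type opt-in discussed above concrete, here is a minimal sketch in plain C++. It does not use SYCL headers: `is_device_copyable` below is a hypothetical stand-in for the SYCL trait, assumed to default to `std::is_trivially_copyable`, and `Rejected`, the `int`/`float` fields, and `accepted_as_kernel_capture` are illustrative names that do not come from the thread.

```cpp
#include <string>
#include <type_traits>

// Hypothetical stand-in for the SYCL trait so the sketch builds without a
// SYCL toolchain. Assumption: the default follows std::is_trivially_copyable.
template <typename U>
struct is_device_copyable : std::is_trivially_copyable<U> {};

template <typename U>
inline constexpr bool is_device_copyable_v = is_device_copyable<U>::value;

// First case: no opt-in. The std::string member makes the type
// non-trivially-copyable, so the default trait reports false.
struct Rejected {
    int a;
    std::string host_only;
};

// Second case: same shape as the struct in the comment above, with int and
// float as placeholder field types.
struct T {
    int a;
    float b;
    std::string unused_on_device_yet_defined;
};

// Explicit, per-type promise that T may be copied bitwise to the device.
template <>
struct is_device_copyable<T> : std::true_type {};

// Illustrative check a front end could apply to a captured kernel argument.
template <typename F>
constexpr bool accepted_as_kernel_capture() {
    return is_device_copyable_v<F>;
}

static_assert(!accepted_as_kernel_capture<Rejected>(),
              "default behavior: still rejected");
static_assert(accepted_as_kernel_capture<T>(),
              "opted in through the specialization");
```

The default stays "safe" here: only the type that carries the explicit specialization passes the check.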
We actually just merged kokkos/kokkos#4637 in https://github.com/Kokkos/Kokkos that uses some more tricks to make |
While working on Kokkos, we looked at supporting directly called kernels; however, this implies supporting types that do not meet the device copyable criteria.
Context:
By default, Kokkos passes its arguments by storing the values in a USM buffer and passing the pointer to the kernel. This is a common approach that is also used for the CUDA backend.
So in pseudo code we have something like this:
The type T used in the pseudo code doesn't meet the requirements to be device copyable (for example, it has a non-trivial destructor), and specializing is_device_copyable_v is of no help. T contains a std::string field (unused) and generates a call to delete. However, T is intended to be bitwise copied to the device, so even if it is technically not device copyable, there is a "promise" that it is.
Furthermore, we also noticed that large arrays are copied over using a copy loop rather than a mem copy. This prevents SROA from operating (dynamic indexing makes it bail out) and, as a consequence, prevents DAE as well.
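The two copy strategies mentioned above can be sketched in plain C++ as follows. This is only an illustration of the pattern, not the code the compiler actually generates: `KernelArgs`, its array sizes, and the function names are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical kernel-argument aggregate holding two small arrays.
struct KernelArgs {
    int dims[8];
    int strides[8];
};

// Element-wise copy: each array is indexed by the loop variable, which is
// the dynamic-indexing pattern the issue says makes SROA bail out.
inline void copy_with_loop(KernelArgs &dst, const KernelArgs &src) {
    for (std::size_t i = 0; i < 8; ++i) {
        dst.dims[i] = src.dims[i];
        dst.strides[i] = src.strides[i];
    }
}

// Whole-object copy: one memcpy of sizeof(KernelArgs) bytes, with no
// per-element indexing.
inline void copy_with_memcpy(KernelArgs &dst, const KernelArgs &src) {
    static_assert(std::is_trivially_copyable_v<KernelArgs>,
                  "memcpy into an object requires a trivially copyable type");
    std::memcpy(&dst, &src, sizeof(KernelArgs));
}

// Field-by-field comparison of two argument aggregates.
inline bool same_contents(const KernelArgs &x, const KernelArgs &y) {
    for (std::size_t i = 0; i < 8; ++i) {
        if (x.dims[i] != y.dims[i] || x.strides[i] != y.strides[i]) return false;
    }
    return true;
}

// Self-check: both strategies produce the same destination contents.
inline void check_copies_agree() {
    KernelArgs src{};
    for (int i = 0; i < 8; ++i) {
        src.dims[i] = i + 1;
        src.strides[i] = 10 * i;
    }
    KernelArgs via_loop{}, via_memcpy{};
    copy_with_loop(via_loop, src);
    copy_with_memcpy(via_memcpy, src);
    assert(same_contents(via_loop, via_memcpy));
}
```

Both functions leave the destination in the same state; the difference that matters for the proposal is only the form of the copy that later passes have to analyze.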
Proposition:
During the generation of the OpenCL/SPIR-like kernel's body:
__init/__finalize to maintain current behavior.
The proposition aims to tackle 2 aspects:
This bends the spec, as it allows types that should normally be rejected, but it remains within its limits (https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec::device.copyable). Namely, this exploits:
It is unspecified whether the implementation actually calls the copy constructor, move constructor, copy assignment operator, or move assignment operator of a class declared as is_device_copyable_v when doing an inter-device copy.
The destructor has no effect when executed on the device
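As a rough host-side illustration of what these two statements permit, the sketch below copies only the bytes of an object into raw storage, so no constructor or destructor runs for the copy. It is plain C++ rather than SYCL, and `Functor`, `ArgStorage`, `pass_bitwise`, `read_a`, and the `dtor_calls` counter are hypothetical names introduced for the example; it is not the prototype mentioned in this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Counter that only exists to make destructor calls observable.
inline int dtor_calls = 0;

// Hypothetical functor-like type with a non-trivial (user-provided) destructor.
struct Functor {
    int a;
    float b;
    ~Functor() { ++dtor_calls; }
};

// Raw, suitably aligned bytes standing in for a kernel argument slot.
struct ArgStorage {
    alignas(Functor) unsigned char bytes[sizeof(Functor)];
};

// Copy the object representation only. No Functor is constructed in the
// storage, so no Functor destructor runs for it either.
inline ArgStorage pass_bitwise(const Functor &f) {
    ArgStorage s;
    std::memcpy(s.bytes, static_cast<const void *>(&f), sizeof(Functor));
    return s;
}

// Read one field back out of the raw bytes. Functor is standard-layout,
// so offsetof is well defined for it.
inline int read_a(const ArgStorage &s) {
    int a;
    std::memcpy(&a, s.bytes + offsetof(Functor, a), sizeof(a));
    return a;
}

// Self-check: the only destructor call comes from the original object.
inline void demo() {
    dtor_calls = 0;
    {
        Functor f{42, 1.5f};
        ArgStorage s = pass_bitwise(f);
        assert(read_a(s) == 42);
        assert(dtor_calls == 0);  // the bitwise copy never ran ~Functor
    }
    assert(dtor_calls == 1);      // only the original object was destroyed
}
```

A by-value lambda capture of `Functor` would instead run its copy constructor and, later, its destructor for the captured copy, which is the behavior the proposal wants to avoid on the device side.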
I have a prototype that is close to being finished; if the code owners are fine with the approach, I should be able to push it this week.