Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Improve argument passing / Support some non device copiable types #5320

Open
Naghasan opened this issue Jan 17, 2022 · 6 comments
Open
Assignees
Labels
confirmed enhancement New feature or request

Comments

@Naghasan
Copy link
Contributor

While working on Kokkos, we looked at supporting directly called kernels, but this however implies supporting types that are does not meet device copyable criteria.

Context:

By default Kokkos passes it arguments by storing the values into a USM buffer and pass the pointer to the kernel. This is a common approach that is also used for the CUDA backend.

So in pseudo code we have something like this:

auto * args = sycl::malloc_shared<T>(...)
*args = /* set struct */

[...]

parallel_for(range, [=](item i) {
  args->use_fields;
})

The type T used in the pseudo code doesn't meet the requirements to be device copyable (like non trivial destructor). And setting specializing is_device_copyable_v is of no help. T contains a std::string field (unused) and generate a call to delete. However, T is intended to be bitwise copied to the device, so even if it is technically not device copyable, there is a "promise" it is.

Further more, we also noticed that large arrays are copied over using a copy loop rather than a mem copy. This prevent SROA to operate (dynamic indexing makes it bails out) and as a consequence prevents DAE as well.

Proposition:

During the generation of the opencl/spir like kernel's body:

  • Wrap the sycl kernel inside a union instead of direct instantiation (this will prevent the call to the destructor);
  • Initialize each non special types using a memcopy;
  • Initialize each special types using a placement new (force the call to the default Ctor) and call __init/__finalize to maintain current behavior.

The proposition aims to tackle 2 aspects:

  • Allow non device copiable type to be usable
  • Shape code to be more SROA friendly

This bend the specs as it will allow types that should normally be rejected, but remains within the limits (https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec::device.copyable). Namely, this exploits:

  • It is unspecified whether the implementation actually calls the copy constructor, move constructor, copy assignment operator, or move assignment operator of a class declared as is_device_copyable_v when doing an inter-device copy.
  • The destructor has no effect when executed on the device

I have a prototype that is close to be finished, if code owners are fine with the approach, I should be able to push it this week.

@Naghasan Naghasan added the enhancement New feature or request label Jan 17, 2022
@CaoZhongZ
Copy link

When we porting PyTorch kernel templates, we found out that tensor dimensions and strides (arrays) captured by kernel functor member was constantly copied into private memory before each start of kernel. despite the optimizations it prevent, the copy itself are suboptimal.

@Naghasan Did your patch address this problem?

@Naghasan
Copy link
Contributor Author

Indirectly, this should ease SROA/DAE.

I guess your problem has more to do with the way SYCL passes argument to the device. If you get an unnecessary copy, I guess it is because one of those 2 passes bailed out. If you open a discussion about that, I'm happy to talk about that at greater length.

@Naghasan
Copy link
Contributor Author

@bader do you have opinions on this ?

@bader
Copy link
Contributor

bader commented Jan 26, 2022

  1. "Further more, we also noticed that large arrays are copied over using a copy loop rather than a mem copy. This prevent SROA to operate (dynamic indexing makes it bails out) and as a consequence prevents DAE as well."

Replacing a copy loop with a mem copy sounds reasonable. According to my understanding, device_copyable property makes this transformation legal.

  1. "supporting types that are does not meet device copyable criteria."

"This bend the specs as it will allow types that should normally be rejected, but remains within the limits (https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec::device.copyable). "

This sounds a bit concerning. AFAIK, today compiler rejects non-device_copyable types and saves users from potential UB/bugs. It would be great to make non-standard behavior explicit and per-type to have "safe" default behavior.

setting specializing is_device_copyable_v is of no help.

Could you elaborate more on that, please?

@Naghasan
Copy link
Contributor Author

"This bend the specs as it will allow types that should normally be rejected, but remains within the limits (https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec::device.copyable). "

This sounds a bit concerning. AFAIK, today compiler rejects non-device_copyable types and saves users from potential UB/bugs. It would be great to make non-standard behavior explicit and per-type to have "safe" default behavior.

setting specializing is_device_copyable_v is of no help.

Could you elaborate more on that, please?

That relates to the bit above actually.

if you have T that is not device copyable per spec rules, this will fail to compile

T k;
q.single_task([k]{});

But you can specialize is_device_copyable to by-pass this compiler check (the user is making a promise)

template <> struct is_device_copyable<T> : std::true_type {};

T k;
q.single_task([k]{});

Supporting the second case is what I'm after, I'm not advocating to change anything for the first case.

In our case T is basically something like:

struct T {
 some_device_copyable_type1 a;
 some_device_copyable_type2 b;
[...]
  std::string unused_on_device_yet_defined;
[...]
};

As the std::string is unused, we don't care if it is in a valid state or not (user specialized is_device_copyable, so his responsibility). The problem that comes later with DPC++ is the sycl functor dtor is called in the opencl kernel entry point (compiler generated). Because there is a std::string , this calls a non trivial destructor and uses ban features on device (a delete here).

So the proposal here is just to make sure this dtor call doesn't happen (which is fine spec wise) and mem copy argument to help optimizations but also ensure no Ctor/Dtor is being called (still fine spec wise).

I know this might look odd, but this help supporting framework not designed around SYCL. I guess @joeatodd or @masterleinad could provide more insight about the context if needed.

@masterleinad
Copy link
Contributor

We actually just merged kokkos/kokkos#4637 in https://github.com/Kokkos/Kokkos that uses some more tricks to make sycl::is_device_copyable actually work. In particular, we needed to wrap the functor into a union as also suggested above to prevent the compiler from calling the copy assignment, move assignment, copy construction and move construction operators and the destructor, see https://github.com/kokkos/kokkos/blob/20a090bf6a7268f6d46521b686e46da3b40e925a/core/src/SYCL/Kokkos_SYCL_Instance.hpp#L256-L290.
I very much agree that promising not to call these functions by the intel/llvm implementation would increase the usability of sycl::is_device_copyable quite a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
confirmed enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants