Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Driver] fsycl-device-only does not offload, target triple is used to define programing model #1814

Closed
Naghasan opened this issue Jun 4, 2020 · 7 comments
Assignees

Comments

@Naghasan
Copy link
Contributor

Naghasan commented Jun 4, 2020

The -fsycl-device-only flag does not mark the driver actions as offloading.
As shown by -ccc-print-phases, -fsycl-device-only dismisses SYCL offloading

./bin/clang -fsycl-device-only foo.cpp -ccc-print-phases
      +- 0: input, "foo.cpp", c++
   +- 1: preprocessor, {0}, c++-cpp-output
+- 2: compiler, {1}, ir
3: backend, {2}, ir

Whereas with fsycl (note the device-sycl not present before)

./bin/clang -fsycl foo.cpp -ccc-print-phases
                  +- 0: input, "foo.cpp", c++, (host-sycl)
               +- 1: preprocessor, {0}, c++-cpp-output, (host-sycl)
               |     +- 2: input, "foo.cpp", c++, (device-sycl)
               |  +- 3: preprocessor, {2}, c++-cpp-output, (device-sycl)
               |- 4: compiler, {3}, sycl-header, (device-sycl)
            +- 5: offload, "host-sycl (x86_64-unknown-linux-gnu)" {1}, "device-sycl (spir64-unknown-unknown-sycldevice)" {4}, c++-cpp-output
         +- 6: compiler, {5}, ir, (host-sycl)
      +- 7: backend, {6}, assembler, (host-sycl)
   +- 8: assembler, {7}, object, (host-sycl)
+- 9: linker, {8}, image, (host-sycl)
|           +- 10: compiler, {3}, ir, (device-sycl)
|        +- 11: linker, {10}, ir, (device-sycl)
|     +- 12: sycl-post-link, {11}, tempfiletable, (device-sycl)
|     |  +- 13: file-table-tform, {12}, tempfilelist, (device-sycl)
|     |- 14: llvm-spirv, {13}, tempfilelist, (device-sycl)
|  +- 15: file-table-tform, {12, 14}, tempfiletable, (device-sycl)
|- 16: clang-offload-wrapper, {15}, object, (device-sycl)
17: offload, "host-sycl (x86_64-unknown-linux-gnu)" {9}, "device-sycl (spir64-unknown-unknown-sycldevice)" {16}, image

The problem it raises is this makes the Clang::ConstructJob kind of convoluted and leads to the target triple being used to define the source language/programing model of the input file (due to UseSYCLTriple).

For instance,

$ cat foo.c
void foo() {}
$ ./bin/clang -target spir64-unknown-unknown foo.c -emit-llvm -c -S -o -
; ModuleID = 'foo.c'
source_filename = "foo.c"
target datalayout = "e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-n8:16:32:64"
target triple = "spir64-unknown-unknown"

; Function Attrs: noinline nounwind optnone
define dso_local spir_func void @foo() #0 {
entry:
  ret void
}

attributes #0 = { noinline nounwind optnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 11.0.0"}
$ ./bin/clang -target spir64-unknown-unknown-sycldevice foo.c -emit-llvm -c -S -o -
; ModuleID = 'foo.c'
source_filename = "foo.c"
target datalayout = "e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-n8:16:32:64"
target triple = "spir64-unknown-unknown-sycldevice"

!llvm.module.flags = !{!0}
!opencl.spir.version = !{!1}
!spirv.Source = !{!2}
!llvm.ident = !{!3}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 1, i32 2}
!2 = !{i32 4, i32 100000}
!3 = !{!"clang version 11.0.0"}
@bader bader assigned AGindinson and mdtoguchi and unassigned bader Jun 4, 2020
@bader
Copy link
Contributor

bader commented Jun 8, 2020

I'm not sure if it's the same issue, but I also notice the difference in how device code is emitted in regular mode (-fsycl) vs "device-only" mode (-fsycl -fsycl-device-only).

I wrote a simple SYCL app, which relies on CTAD and it doesn't compile in "device-only" mode, but compiles fine in regular mode.

#include <CL/sycl.hpp>
using namespace sycl;

#define SIZE 1024
int main() {
  queue q;
  buffer<int, 1> result{SIZE};
  q.submit([&](handler &cgh) {
    auto Accessor = result.get_access<access::mode::write>(cgh);
    cgh.parallel_for<class test_kernel>(range{SIZE}, [=](id<1> WIid) {
          Accessor[WIid] = 42;
        });
  });
  auto hostAccessor = result.get_access<access::mode::read>();
}
clang++ -fsycl -c /tmp/test.cpp -o /tmp/test.o
; okay

clang++ -fsycl -fsycl-device-only /tmp/test.cpp -c -o /tmp/test.o
/tmp/test.cpp:10:41: error: use of class template 'range' requires template arguments
    cgh.parallel_for<class test_kernel>(range{SIZE}, [=](id<1> WIid) {
                                        ^
include/sycl/CL/sycl/item.hpp:26:33: note: template is declared here
template <int dimensions> class range;
~~~~~~~~~~~~~~~~~~~~~~~~~       ^
1 error generated.

When I dump compiler command I see significant difference in how device compiler is invoked.

clang++ -fsycl -c /tmp/test.cpp -o /tmp/test.o -###
"clang++" "-cc1" "-triple" "spir64-unknown-unknown-sycldevice" "-fsycl" "-fsycl-is-device" "-fdeclare-spirv-builtins" "-aux-triple" "x86_64-unknown-linux-gnu" "-Wno-sycl-strict" "-sycl-std=2017" "-emit-llvm-bc" "-emit-llvm-uselists" "-disable-free" "-main-file-name" "test.cpp" "-mrelocation-model" "static" "-mthread-model" "posix" "-mframe-pointer=all" "-fmath-errno" "-fno-rounding-math" "-fno-verbose-asm" "-mconstructor-aliases" "-aux-target-cpu" "x86-64" "-dwarf-column-info" "-fno-split-dwarf-inlining" "-debugger-tuning=gdb" "-resource-dir" "build/linux_prod/lib/clang/11.0.0" "-internal-isystem" "../include/sycl" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/backward" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/backward" "-internal-isystem" "/usr/local/include" "-internal-isystem" "build/linux_prod/lib/clang/11.0.0/include" "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include" "-internal-isystem" "/usr/local/include" "-internal-isystem" "linux_prod/lib/clang/11.0.0/include" "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include" "-std=c++17" "-fdeprecated-macro" "-fdebug-compilation-dir" "llvm-project" "-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fcxx-exceptions" "-fexceptions" "-fcolor-diagnostics" "-faddrsig" "-o" "/tmp/test-dbf9e3.bc" "-x" "c++" "/tmp/test.cpp"

clang++ -fsycl -fsycl-device-only /tmp/test.cpp -c -o /tmp/test.o -###
"clang++" "-cc1" "-triple" "spir64-unknown-linux-sycldevice" "-fsycl" "-fsycl-is-device" "-fdeclare-spirv-builtins" "-aux-triple" "x86_64-unknown-linux-gnu" "-Wno-sycl-strict" "-emit-llvm-bc" "-emit-llvm-uselists" "-disable-free" "-main-file-name" "test.cpp" "-mrelocation-model" "static" "-mthread-model" "posix" "-mframe-pointer=all" "-fmath-errno" "-fno-rounding-math" "-fno-verbose-asm" "-no-integrated-as" "-mconstructor-aliases" "-dwarf-column-info" "-fno-split-dwarf-inlining" "-debugger-tuning=gdb" "-resource-dir" "lib/clang/11.0.0" "-internal-isystem" "include/sycl" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/spir64-unknown-linux-sycldevice/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/backward" "-internal-isystem" "/usr/local/include" "-internal-isystem" "lib/clang/11.0.0/include" "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include" "-fdeprecated-macro" "-fno-dwarf-directory-asm" "-fdebug-compilation-dir" "llvm-project" "-ferror-limit" "19" "-fgnuc-version=4.2.1" "-fcxx-exceptions" "-fexceptions" "-fcolor-diagnostics" "-o" "/tmp/test.o" "-x" "c++" "/tmp/test.cpp"

In "device-only" mode driver does not set C++ and SYCL standard versions.

@mdtoguchi, is this caused by the same issue reported by @Naghasan here or I should open another one?

@mdtoguchi
Copy link
Contributor

When using -fsycl-device-only vs -fsycl, The expectation is the device compilations match, and the issue described from @bader matches the original from @Naghasan

@Naghasan
Copy link
Contributor Author

Naghasan commented Jun 9, 2020

To add a littile bit, because the fsycl-device-only does not enter in an "offloading mode" (not exactly sure on how to call it), this creates kind of redundant code:

For instance here: https://github.com/intel/llvm/blob/sycl/clang/lib/Driver/ToolChains/Clang.cpp#L1225

  if (JA.isOffloading(Action::OFK_SYCL) ||
      Args.hasArg(options::OPT_fsycl_device_only))

IMO this really just be

 if (JA.isOffloading(Action::OFK_SYCL))

But what is really odd is that the result of

  bool IsSYCLOffloadDevice = JA.isDeviceOffloading(Action::OFK_SYCL);

(in https://github.com/intel/llvm/blob/sycl/clang/lib/Driver/ToolChains/Clang.cpp#L3945)
evaluate to false if you only pass fsycl-device-only. And because of that, there is a "workaround" a few line below

  bool IsSYCLDevice = (RawTriple.getEnvironment() == llvm::Triple::SYCLDevice);
  // Using just the sycldevice environment is not enough to determine usage
  // of the device triple when considering fat static archives.  The
  // compilation path requires the host object to be fed into the partial link
  // step, and being part of the SYCL tool chain causes the incorrect target.
  // FIXME - Is it possible to retain host environment when on a target
  // device toolchain.
  bool UseSYCLTriple = IsSYCLDevice && (!IsSYCL || IsSYCLOffloadDevice);

And UseSYCLTriple is then some how used as the condition to "are we setting up the device invocation" (like here https://github.com/intel/llvm/blob/sycl/clang/lib/Driver/ToolChains/Clang.cpp#L4077)

I think a more stable and general approach would be to follow the CUDA approach. Last time I looked into the code, the cuda mode is always building host and device actions and if only the device actions are required (e.g. using --cuda-device-only), it drops the host actions (vice versa if only the host is required). This makes the job building simpler as isOffloading always means the programming model enabled, and isDeviceOffloading always means is building an invocation for device side (regardless of whether the host is kept).

@AGindinson
Copy link
Contributor

AGindinson commented Jun 16, 2020

@Naghasan, I largely agree with the suggestions you've made in your latest comment; I believe unifying the "offloading"/"device offloading" semantics is both doable and useful. @pvchupin mentioned that your team's contributions are blocked by the existing approach; could you please elaborate on the particular difficulties caused/the desired timeline?

Myself, I should be able to work on this starting next week; so EO next week could be a realistic target for completion.

@Naghasan
Copy link
Contributor Author

Thanks for looking at the problem.

The binding with libclc is causing difficulties as we can't really trigger behavior specific to the sycl abi (image lowering, type management, extensions etc.). So not be able to specify the sycldevice in triple without changing the input language is problematic. We have hacks at the moment so we can make progress, but they have some side effects. I tried to prevent this from happening, but I quickly ran into this issue.

The other part is that we can't compile the device only to extract the ptx (or the llvm IR for the ptx target). This is slowing down many investigation as people have to rely on hacks to be able to inspect the output. Here moving to a clean host/device SYCL offloading action pipeline should, by construction, enable any targets for device only (now limited to SPIR targets). Or at least make this much more simpler implement for any target.

@AGindinson
Copy link
Contributor

#2713 by @mdtoguchi addresses the issues described here.

@mdtoguchi
Copy link
Contributor

#2713 has merged, closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants