[BACKEND] initial llvm codegen for amdgpu #402
Conversation
check your runtime as it reports rocm not enabled. Need to change src/runtime/module.cc to add a rocm enable check.
src/codegen/llvm/codegen_amdgpu.cc
Outdated
// add function as void return value
CodeGenLLVM::AddFunctionInternal(f, true);
// annotate as kernel function
module_->getOrInsertNamedMetadata("nvvm.annotations")
Likely you don't need this; you need to change to the AMD kernel annotations.
CHECK_EQ(info.scope.rank, 1)
    << "Can only allocate shared or local memory inside kernel";
// Shared memory: address space == 3
const unsigned shared_address_space = 3;
Check if the address space is consistent with the AMD GPU backend.
I have to make another pass on the codegen part as there are obvious differences between nvptx and amdgcn codegen. Is there a way I can see IR directly?
If you want to check the LLVM code, do module_->dump(); you have to insert it manually in the code though. Otherwise, implement GetSource in the hip module, which should give you the assembly.
I'll use LOG(WARNING) << module_->dump(); to see it.
Hi @tqchen, I am getting the following error:
$ python tests/python/unittest/test_codegen_device.py
Traceback (most recent call last):
File "tests/python/unittest/test_codegen_device.py", line 1, in <module>
import tvm
File "/home/aditya/tvm/python/tvm/__init__.py", line 5, in <module>
from . import tensor
File "/home/aditya/tvm/python/tvm/tensor.py", line 4, in <module>
from ._ffi.node import NodeBase, NodeGeneric, register_node, convert_to_node
File "/home/aditya/tvm/python/tvm/_ffi/node.py", line 8, in <module>
from .node_generic import NodeGeneric, convert_to_node, const
File "/home/aditya/tvm/python/tvm/_ffi/node_generic.py", line 7, in <module>
from .base import string_types
File "/home/aditya/tvm/python/tvm/_ffi/base.py", line 43, in <module>
_LIB, _LIB_NAME = _load_lib()
File "/home/aditya/tvm/python/tvm/_ffi/base.py", line 35, in _load_lib
lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)
File "/usr/lib/python2.7/ctypes/__init__.py", line 362, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /home/aditya/tvm/lib/libtvm.so: undefined symbol: _ZNK4llvm6Module4dumpEv
Usually I use module_->dump() without piping it to a stream, and it should work.
@tqchen I am getting the following error: $ python test_codegen_device.py
[22:30:06] /home/aditya/tvm/dmlc-core/include/dmlc/./logging.h:308: [22:30:06] src/runtime/module.cc:74: Module[hip] does not support GetSource
Stack trace returned 10 entries:
[bt] (0) /home/aditya/tvm/lib/libtvm.so(_ZN3tvm7runtime10ModuleNode9GetSourceERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x30f) [0x7f3ecfc6d16f]
[bt] (1) /home/aditya/tvm/lib/libtvm.so(+0x955fd2) [0x7f3ecfc6ffd2]
[bt] (2) /home/aditya/tvm/lib/libtvm.so(TVMFuncCall+0x5e) [0x7f3ecfc7961e]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f3ed4d10e40]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f3ed4d108ab]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f3ed4f203df]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x11d82) [0x7f3ed4f24d82]
[bt] (7) python(PyObject_Call+0x43) [0x4b0cb3]
[bt] (8) python(PyEval_EvalFrameEx+0x5faf) [0x4c9faf]
[bt] (9) python(PyEval_EvalCodeEx+0x255) [0x4c2765]
Traceback (most recent call last):
File "test_codegen_device.py", line 88, in <module>
test_add_pipeline()
File "test_codegen_device.py", line 85, in test_add_pipeline
check_target("rocm", host="llvm")
File "test_codegen_device.py", line 46, in check_target
code = mdev.get_source()
File "/home/aditya/tvm/python/tvm/module.py", line 34, in get_source
return _GetSource(self, fmt)
File "/home/aditya/tvm/python/tvm/_ffi/function.py", line 255, in my_api_func
return flocal(*args)
File "/home/aditya/tvm/python/tvm/_ffi/_ctypes/function.py", line 183, in __call__
ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
File "/home/aditya/tvm/python/tvm/_ffi/base.py", line 62, in check_call
raise TVMError(py_str(_LIB.TVMGetLastError()))
tvm._ffi.base.TVMError: [22:30:06] src/runtime/module.cc:74: Module[hip] does not support GetSource
src/codegen/llvm/codegen_amdgpu.cc
Outdated
" -mcpu=gfx900" + | ||
target.substr(5, target.length() - 5)); | ||
) >= 4 && | ||
target.substr(0, 4) == "rocm"); |
I would recommend using GetLLVMTargetMachine so further options can be passed here.
Fixed!
Implement GetSource here: https://github.com/dmlc/tvm/blob/master/src/runtime/rocm/rocm_module.cc#L46
Summarizing,
$ python test_codegen_device.py
[23:11:18] src/runtime/rocm/rocm_module.cc:64: HSACO
Bus error (core dumped)
I have to debug it more.
src/codegen/llvm/codegen_amdgpu.cc
Outdated
else {
CHECK_EQ(ts.rank, 0);
switch (ts.dim_index) {
case 0: intrin_id = ::llvm::Intrinsic::nvvm_read_ptx_sreg_ctaid_x; break;
You will need this; it corresponds to get_group_id in OpenCL.
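For reference, the nvvm intrinsics in the switch have direct amdgcn counterparts (workgroup id for blockIdx/get_group_id, workitem id for threadIdx/get_local_id). A minimal sketch of the mapping — AMDGPUIndexIntrinsic is a hypothetical helper, not TVM code, and the intrinsic names are the ones documented for the AMDGPU backend:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper mirroring the (rank, dim_index) switch in the codegen:
// rank 0 -> workgroup id (blockIdx / get_group_id),
// rank 1 -> workitem id (threadIdx / get_local_id).
std::string AMDGPUIndexIntrinsic(int rank, int dim_index) {
  static const char axis[] = "xyz";
  std::string base = (rank == 0) ? "llvm.amdgcn.workgroup.id."
                                 : "llvm.amdgcn.workitem.id.";
  return base + axis[dim_index];
}
```

In the real codegen these would be selected via ::llvm::Intrinsic::amdgcn_workgroup_id_x and friends rather than by name strings.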
Fixed it!
CHECK_EQ(info.scope.rank, 1)
    << "Can only allocate shared or local memory inside kernel";
// Shared memory: address space == 3
const unsigned shared_address_space = 3;
See https://llvm.org/docs/AMDGPUUsage.html#address-space-mapping. Change to the local memory address space.
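For reference, the address-space numbering from the linked LLVM docs can be sketched as an enum. These values are taken from the current AMDGPUUsage mapping (older LLVM releases used a different numbering, which is why the check matters):

```cpp
#include <cassert>

// AMDGPU address spaces per the LLVM AMDGPUUsage documentation
// (modern mapping; assumed values, verify against the LLVM version in use).
enum AMDGPUAddressSpace : unsigned {
  kFlat = 0,      // generic
  kGlobal = 1,
  kRegion = 2,    // GDS
  kLocal = 3,     // LDS: OpenCL "local" / CUDA "shared" memory
  kConstant = 4,
  kPrivate = 5    // scratch
};
```

Under this mapping the existing constant 3 happens to coincide with LDS, but that should be confirmed rather than assumed.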
Yes, I am aware of it. Once I get the LLVM IR dump, I can get a better understanding of what to change, or even add more functionality.
I see; it seems this is something we will need frequently. Let us also print out the LLVM IR and save it to the (optional) code field in the ROCMModule, so we can access it with module.get_source().
Updated PR with IR dump code. I see some places that need to be changed (which is why I am getting Bus Error).
@tqchen I don't expect the code to run, as we have a new feature coming in from HIP (https://github.com/adityaatluri/HIP/commit/8a7328fd9de7f1d174e5f4b75de734fb4032f5b6). I'll write a C++ test to check whether the generated IR is valid.
Can you elaborate a bit on what is expected? For example, does the problem lie in the additional argument packing, or in other parts? We might be able to change the runtime accordingly to solve this issue. It would be nice to get runnable code.
Specifically, we can pre-pack the arguments into a single buffer, if necessary, without going through the HIP CUDA-compatible API. For example, in the Metal runtime everything is packed into an array of ArgUnion, and the device code will receive packed arguments instead: https://github.com/dmlc/tvm/blob/master/src/runtime/metal/metal_module.mm#L202
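The Metal-style packing described above can be sketched in a few lines. This is an illustrative stand-in, not TVM's actual pack_args API: scalar arguments are copied into a contiguous buffer of fixed-size union slots that the kernel reads back positionally:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative ArgUnion: one fixed-size slot per non-pointer argument.
union ArgUnion {
  int32_t v_int32;
  uint32_t v_uint32;
  float v_float32;
};

// Hypothetical packer: floats first, then ints, into one contiguous buffer.
std::vector<ArgUnion> PackScalarArgs(const std::vector<float>& floats,
                                     const std::vector<int32_t>& ints) {
  std::vector<ArgUnion> packed;
  for (float f : floats) {
    ArgUnion u;
    u.v_float32 = f;
    packed.push_back(u);
  }
  for (int32_t i : ints) {
    ArgUnion u;
    u.v_int32 = i;
    packed.push_back(u);
  }
  return packed;
}
```

The real implementation must additionally respect the target's alignment requirements for each value, which is the open question at the end of the thread.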
src/codegen/llvm/codegen_llvm.cc
Outdated
@@ -113,7 +113,7 @@ void CodeGenLLVM::AddFunctionInternal(const LoweredFunc& f, bool ret_void) {
    ret_void ? t_void_ : t_int_, arg_type, false);
// setup the function.
function_ = llvm::cast<llvm::Function>(module_->getOrInsertFunction(f->name, ftype));
function_->setCallingConv(llvm::CallingConv::C);
function_->setCallingConv(dev_type == AMDGPU ? llvm::CallingConv::AMDGPU_KERNEL : llvm::CallingConv::C);
Likely you don't have to do this; we can override AddFunction and call setCallingConv after AddFunctionInternal.
I don't want to create multiple functions that do the same thing, and it is easier to read the code this way. Do you want me to overload the function?
We don't have to overload AddFunctionInternal. Simply overload AddFunction, which calls AddFunctionInternal, and then call function_->setCallingConv again.
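The suggested structure can be sketched without any LLVM dependency. These classes are illustrative stand-ins for the real CodeGenLLVM/CodeGenAMDGPU: the base keeps the single shared AddFunctionInternal, and the subclass overrides AddFunction to re-set the calling convention afterwards:

```cpp
#include <cassert>
#include <string>

// Stand-in for CodeGenLLVM: shared setup lives in AddFunctionInternal.
class CodeGenBase {
 public:
  virtual ~CodeGenBase() = default;
  virtual void AddFunction() { AddFunctionInternal(); }
  std::string calling_conv;

 protected:
  void AddFunctionInternal() { calling_conv = "C"; }  // shared default
};

// Stand-in for CodeGenAMDGPU: reuse the shared setup, then override the flag.
class CodeGenAMDGPUSketch : public CodeGenBase {
 public:
  void AddFunction() override {
    AddFunctionInternal();           // shared setup (sets default conv)
    calling_conv = "AMDGPU_KERNEL";  // re-set afterwards, per the review
  }
};
```

This keeps AddFunctionInternal target-agnostic while letting each backend adjust target-specific attributes in its own AddFunction.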
src/codegen/llvm/codegen_amdgpu.cc
Outdated
void AddFunction(const LoweredFunc& f) final {
// add function as void return value
CodeGenLLVM::AddFunctionInternal(f, true, AMDGPU);
// annotate as kernel function
I.e., simply add function_->setCallingConv here to override the old flag.
@tqchen I am able to see good IR now and to generate ISA from it (through llc): https://gist.github.com/adityaatluri/1ac1ff72b927e42fdd8a61f98176039a
Can you confirm the gap between this and an actual kernel test that runs? Thanks.
Can you explain a bit more about what you meant?
I mean directly run the test via the Rocm module and verify the correctness of the kernel.
Gotcha. It turns out the IR is not valid. These lines are causing bad output results: https://gist.github.com/adityaatluri/1ac1ff72b927e42fdd8a61f98176039a#file-tvm-amdgcn-ll-L10
This is a shift left, used for address calculation in the condition if (blockIdx.x * 256 + threadIdx.x < n) { ... }; blockIdx.x * 256 becomes
Do you know which code block generates this?
It should be due to LLVM's constant folder in IRBuilder, which automatically folds Mul(blockIdx, 256) into a left shift. Is the shift not supported by the AMD ISA?
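The folding in question is a strength reduction: multiplying by a power of two is equivalent to a left shift, so the index computation is unchanged either way. A minimal sketch of the two forms:

```cpp
#include <cassert>
#include <cstdint>

// Address calculation as written in the source: blockIdx.x * 256 + threadIdx.x.
uint32_t index_mul(uint32_t block_idx, uint32_t thread_idx) {
  return block_idx * 256 + thread_idx;
}

// The same calculation after the constant folder rewrites *256 as << 8.
uint32_t index_shl(uint32_t block_idx, uint32_t thread_idx) {
  return (block_idx << 8) + thread_idx;
}
```

Since the two are bit-for-bit identical for unsigned operands, the shift itself cannot be the source of wrong results; the bug has to be elsewhere in the IR, as the follow-up messages conclude.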
It does support shl/shr, but I don't think we need to multiply the workitem id by 8. Also, I didn't see the last i32 arg. Let me retest. Also, 1024 is good for AMD GPUs.
After retesting, the data output was validated.
Nice. Can we directly use RocmModule to run the test, instead of the current test that is decoupled from the compiler?
We need the new HIP, which the team is working on. Once it lands, it'll make it easier to launch kernels.
src/codegen/llvm/codegen_llvm.cc
Outdated
@@ -100,7 +101,7 @@ void CodeGenLLVM::AddFunctionInternal(const LoweredFunc& f, bool ret_void) {
Type t = arg.type();
if (t.is_handle() && f->handle_data_type.count(arg)) {
  arg_type.push_back(
      LLVMType(f->handle_data_type[arg].type())->getPointerTo());
      LLVMType(f->handle_data_type[arg].type())->getPointerTo(isTargetAMD ? 1 : 0));
Add a virtual function GetGlobalAddressSpace to CodeGenLLVM and override that in CodeGenAMDGPU
src/codegen/llvm/codegen_llvm.cc
Outdated
@@ -113,7 +114,8 @@ void CodeGenLLVM::AddFunctionInternal(const LoweredFunc& f, bool ret_void) {
    ret_void ? t_void_ : t_int_, arg_type, false);
// setup the function.
function_ = llvm::cast<llvm::Function>(module_->getOrInsertFunction(f->name, ftype));
function_->setCallingConv(llvm::CallingConv::C);
function_->setCallingConv(isTargetAMD ?
    llvm::CallingConv::AMDGPU_KERNEL : llvm::CallingConv::C);
Do this in the AddFunction override in CodeGenAMDGPU.
src/codegen/llvm/codegen_amdgpu.cc
Outdated
CodeGenLLVM::AddFunctionInternal(f, true);
// annotate as kernel function
/*
module_->getOrInsertNamedMetadata("nvvm.annotations")
remove these comments
src/codegen/llvm/codegen_amdgpu.cc
Outdated
@@ -0,0 +1,176 @@
/*!
 * Copyright (c) 2017 by Contributors
 * \file codegen_nvptx.cc
update the comments
llvm::Value* CreateStorageSync(const Call* op) final {
const std::string& sync = op->args[0].as<StringImm>()->value;
if (sync == "warp") {
  // TODO(tqchen) warp sync in CUDA9
Remove the comment here. Is there any need for a warp (wavefront) synchronizer on AMD GPUs?
There are sync commands for AMD GPUs, but if it is CUDA 9 specific, current-generation AMD GPUs don't support it.
src/codegen/llvm/codegen_amdgpu.cc
Outdated
};

runtime::Module BuildAMDGPU(Array<LoweredFunc> funcs, std::string target) {
CHECK(1) << target;
remove prints
I'll do it in the last commit.
I see what you mean by looking at the test code you provided. Actually, the metadata is already available in the TVM runtime, and we are using this to pack the data. So one possible way is to simply implement parameter packing in TVM. For example, https://github.com/dmlc/tvm/blob/master/src/runtime/pack_args.h#L150 packs non-pointer arguments into a continuous memory region of one buffer (ArgUnion). This is used by the Metal runtime, which requires non-pointer arguments to be packed as one buffer. If we know the parameter packing requirement (e.g. alignment of each value)
src/codegen/llvm/codegen_amdgpu.cc
Outdated
CHECK(tm->addPassesToEmitFile(
    pass, destAsm, llvm::TargetMachine::CGFT_AssemblyFile) == 0)
    << "Cannot emit target CGFT_AssemblyFile";
remove comments here
We may need this in the future; it is especially helpful for kernel debugging.
I know, but usually it is not good to keep debug code in production. What we should do is add it back later when there is a need for debugging.
src/codegen/llvm/codegen_amdgpu.cc
Outdated
};

runtime::Module BuildAMDGPU(Array<LoweredFunc> funcs, std::string target) {
CHECK(1) << target;
remove this
Last two comments, and we can merge this in. Thanks for the work to make this happen.
src/codegen/llvm/codegen_amdgpu.cc
Outdated
arr.data = &obj[0];
arr.size = obj.length();

std::string hsaco = (*f)(arr), ll;
You are using a comma expression here; is it intended? This will result in hsaco taking the value from ll.
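For reference, a statement of this shape in C++ is actually a declaration with two declarators rather than a comma expression: hsaco is initialized from the call, and ll is default-constructed. A minimal sketch, where produce() is a hypothetical stand-in for (*f)(arr):

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for (*f)(arr).
std::string produce() { return "hsaco-bytes"; }

int check_decl() {
  // Two declarators in one declaration, not a comma expression:
  // hsaco is initialized from produce(); ll is default-constructed (empty).
  std::string hsaco = produce(), ll;
  assert(hsaco == "hsaco-bytes");
  assert(ll.empty());
  return 0;
}
```

Even so, splitting the two declarations onto separate lines avoids exactly this kind of review confusion.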
destAsm.SetUnbuffered();
module->print(dest_ll, nullptr);
std::unique_ptr<llvm::Module> mAsm = llvm::CloneModule(module.get());
std::unique_ptr<llvm::Module> mObj = llvm::CloneModule(module.get());
Remove mObjFile and mAsmFile. We can consider holding two optional source strings in RocmModule, both ll and asm, and returning them when a different source suffix is requested; that might help you in debugging.
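The suggestion of holding both sources keyed by suffix can be sketched as follows. This is an illustrative stand-in, not TVM's actual RocmModule API:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Illustrative module holding optional source strings by format suffix,
// so module.get_source("ll") returns the IR and get_source("asm") the ISA.
class RocmModuleSketch {
 public:
  RocmModuleSketch(std::string ll, std::string asm_src)
      : source_{{"ll", std::move(ll)}, {"asm", std::move(asm_src)}} {}

  std::string GetSource(const std::string& fmt) const {
    auto it = source_.find(fmt);
    return it == source_.end() ? std::string() : it->second;
  }

 private:
  std::map<std::string, std::string> source_;
};
```

Returning an empty string for an unknown suffix keeps the debug path optional without affecting normal execution.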
src/codegen/llvm/codegen_amdgpu.cc
Outdated
std::unique_ptr<llvm::Module> mObjFile = llvm::CloneModule(module.get());
llvm::legacy::PassManager pass;
remove extra line here
Hi @adityaatluri, do I need a custom llvm + clang from AMD to use the rocm backend? My card is an R9 Nano, so I replaced gfx900 in this line with gfx803.
@masahi can you try LLVM 5.0? There are a few issues with the rocm runtime which will be fixed soon.
@adityaatluri Thanks for the quick response. I'll try llvm 5.0 after I am back from work. By the way, the opencl backend with rocm's opencl stack works fine on my Nano. I can pass all tests in https://github.com/dmlc/tvm/tree/master/topi/tests/python.
Thank you for trying it out.
@adityaatluri I built tvm with llvm 5.0 from the official Ubuntu package, but test_gemm.py still hangs my entire system. I think something is wrong with codegen. With the opencl backend, when I do f_opencl = tvm.build(s, [A, B, C], "opencl") I get a valid opencl kernel string. But for the rocm backend, f_rocm = tvm.build(s, [A, B, C], "rocm") just prints out '\x7fELF\x02\x01\x01@'. test_codegen_device.py fails for the same reason, but test_runtime_ndarray.py passes. Any ideas?
Can you do
Ok, I get this. I'm correctly linking against llvm 5.0.
Can you compile the IR to asm using
Ok, here is the output of llc-5.0 -march=amdgcn -mcpu=gfx803 myadd_kernel.ll (not gfx900; my card is an R9 Nano). I can also disassemble 'rocm_kernel.co', generated here.
The asm looks good to me. Are you still getting a runtime error?
Yes, I either get a 'Memory access fault' error, or no error but the output array is [0., 0., 0., ....]. When I print the value of packed_nbytes here, it says 28 or 20 (operator() is called twice; I don't know why).
@masahi Great! That is the bug we are seeing that I mentioned.
Do you have any interesting observations?
Nothing so far; working on it.
Can you join the dlpack slack channel? We can discuss more there.
Sure, but how can I join? I haven't used Slack before.
It is dlpack.slack.com
Ok, I'll ping @tqchen to send me an invite.
@masahi You can send an email to my UW email address.
* [Arith] Inverse affine map * Update iter_affine_map.h * Update iter_affine_map.h * Update iter_affine_map.py * Topology order visit * doc * fix * address comments
The test results: