Minimal reproducer for incorrect code generated for multiple barriers followed by multiple conditional ops in a kernel #906
Comments
An update:
|
@mingjie-intel The C++ reproducer does not use local memory, unlike the numba example. |
…atterns to `dpex.func` functions
Adding more fuel to the investigation, and a workaround that seems consistent. First of all, this is an even more minimal reproducer:

    import numba_dpex as dpex
    import dpctl.tensor as dpt
    import numpy as np

    dtype = np.float32

    @dpex.kernel
    def kernel(result):
        local_col_idx = dpex.get_local_id(0)
        local_values = dpex.local.array((1,), dtype=dtype)
        if (local_col_idx < 1):
            local_values[0] = 1
        dpex.barrier(dpex.LOCAL_MEM_FENCE)
        if (local_col_idx < 1):
            result[0] = 10

    result = dpt.zeros(sh=(1), dtype=dtype)
    kernel[32, 32](result)
    print(result)

It's the same as the initial minimal reproducer, but the first

What seems to happen is that either some work items (including the first work item) decide to abort right after the barrier, or the second

    @dpex.kernel
    def kernel(result):
        local_col_idx = dpex.get_local_id(0)
        local_values = dpex.local.array((1,), dtype=dtype)
        if (local_col_idx < 1):
            local_values[0] = 1
        dpex.barrier(dpex.LOCAL_MEM_FENCE)
        local_col_idx = dpex.get_local_id(0)
        if (local_col_idx < 1):
            result[0] = 10

i.e. redefining

Another interesting behavior: replacing

Now, here's the consistent workaround that seems to have solved the issue in the 4 different kernels where I've witnessed it so far (including the matmul from #892). It consists of moving the instructions that seem to have been miscompiled to a `dpex.func` function:

    @dpex.func
    def func(condition, idx, value, result):
        if condition:
            result[idx] = value

    @dpex.kernel
    def kernel(result):
        local_col_idx = dpex.get_local_id(0)
        local_values = dpex.local.array((1,), dtype=dtype)
        if (local_col_idx < 1):
            local_values[0] = 1
        dpex.barrier(dpex.LOCAL_MEM_FENCE)
        func((local_col_idx < 1), 0, 10, result)

The rule seems to be "move to a

    @dpex.kernel
    def kernel(result):
        local_col_idx = dpex.get_local_id(0)
        local_values = dpex.local.array((1,), dtype=dtype)
        func((local_col_idx < 1), 0, 1, local_values)
        dpex.barrier(dpex.LOCAL_MEM_FENCE)
        if (local_col_idx < 1):
            result[0] = 10

This trick also solves #892 and all the kernels I've had trouble with (which seems to confirm that it's the same bug everywhere, and that it also affects GPU runtimes). Also, instead of trying to guess how the compiler goes wrong without the

    @dpex.kernel
    def kernel(result):
        local_col_idx = dpex.get_local_id(0)
        local_values = dpex.local.array((1,), dtype=dtype)
        func((local_col_idx < 1), 0, 1, local_values)
        dpex.barrier(dpex.LOCAL_MEM_FENCE)
        func((local_col_idx < 1), 0, 10, result)

|
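For readers without a device at hand, the four kernel variants in the comment above can be modelled on the host in plain Python. This is only a sketch of the intended semantics, not numba_dpex code: `guarded_store` mirrors the body of `func`, while `emulate` and its two flags are hypothetical names introduced here, and the barrier is modelled simply as the boundary between two loops over the work items.

```python
def guarded_store(condition, idx, value, array):
    # Same body as `func` in the comment above, written as plain Python.
    if condition:
        array[idx] = value


def emulate(local_size, helper_before_barrier, helper_after_barrier):
    """Host-side model of one work-group running the reproducer."""
    local_values = [0.0]  # stands in for the local memory array
    result = [0.0]        # stands in for the global result array

    # Everything before the barrier, executed by every work item.
    for local_id in range(local_size):
        if helper_before_barrier:
            guarded_store(local_id < 1, 0, 1.0, local_values)
        elif local_id < 1:
            local_values[0] = 1.0

    # Barrier: no work item proceeds until all have finished the loop above.

    # Everything after the barrier, executed by every work item.
    for local_id in range(local_size):
        if helper_after_barrier:
            guarded_store(local_id < 1, 0, 10.0, result)
        elif local_id < 1:
            result[0] = 10.0
    return result


# All four combinations of inline store vs. helper call.
variants = {
    (before, after): emulate(32, before, after)
    for before in (False, True)
    for after in (False, True)
}
```

In this model every variant ends with `result[0]` equal to 10, so a device run where the variants disagree points at code generation rather than at the kernel logic.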
Forgot to add that the bug still occurs when replacing array setitems with atomic adds, and fortunately the same trick works. |
- Work around IntelPython/numba-dpex#906 by moving bugged instruction patterns to `dpex.func` functions
- Fix tolerance value
- Implement strict convergence checking
---------
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
|
Updating on the reproducer, it no longer fails with 100% certainty in |
@fcharras As you reported in #1152 (comment), I too am unable to reproduce the issue any more. Based on my findings in #892, I think this too might be down to LLVM compiler optimizations at
I am investigating further, and we also have a PR, #1158, to disallow the O3 optimization level and make O2 the default optimization level for the kernel LLVM IR module. Note that even dpcpp does not run all LLVM optimizations on kernel modules; further optimizations are pushed to a driver compiler (IGC or NVVM) that has a better understanding of the target device capabilities. |
Bumps of numba and llvmlite are not enough to explain the fix, since the issue is still reproducible with
Could you add a unit test with the reproducer? |
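Since the failure was reported as no longer happening on every run, a unit test built on the reproducer would probably need to repeat the launch several times. The helper below is a minimal sketch of that idea in plain Python; `failure_rate` and `run_once` are hypothetical names, and `run_once` is assumed to launch the reproducer kernel and return `float(result[0])`.

```python
import itertools


def failure_rate(run_once, expected, repeats=100):
    """Fraction of runs whose returned value differs from `expected`."""
    failures = sum(1 for _ in range(repeats) if run_once() != expected)
    return failures / repeats


# Stand-in runners for illustration only; a real test would launch the kernel.
always_correct = lambda: 10.0
_outcomes = itertools.cycle([10.0, 0.0])  # alternates correct and wrong values
flaky = lambda: next(_outcomes)

stable_rate = failure_rate(always_correct, 10.0, repeats=5)
flaky_rate = failure_rate(flaky, 10.0, repeats=4)
```

A test could then assert that the rate is exactly zero over a chosen number of repeats.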
From the latest commit on main, the reproducer does not showcase any issue anymore in any context for me at all. Looks like it has been fixed as a side effect of something else. |
minimal reproducer:
In this case, due to the conditional statements inside min_dpex, the values of the operands of min_dpex are reset to 0 even before the expression is evaluated. It works fine if we have a normal assignment before the barrier rather than conditional operations. Also only the value of operands for |
@fcharras In line with #892 (comment), the modified reproducer works without issues on all supported devices with

    import numba_dpex.experimental as dpex_exp
    import dpctl.tensor as dpt
    import numpy as np
    from numba_dpex import kernel_api as kapi

    dtype = np.float32

    @dpex_exp.kernel
    def kernel(nditem: kapi.NdItem, slm, result):
        local_col_idx = nditem.get_local_id(0)
        gr = nditem.get_group()
        kapi.group_barrier(gr)
        if local_col_idx < 1:
            slm[0] = 1
        kapi.group_barrier(gr)
        if local_col_idx < 1:
            result[0] = 10

    result = dpt.zeros(shape=(1), dtype=dtype)
    slm = kapi.LocalAccessor((1,), dtype=dtype)
    dpex_exp.call_kernel(kernel, kapi.NdRange((32,), (32,)), slm, result)
    print(result)

|
@roxx30198 Can you update your reproducer with the initialization for the kernels? I have updated your reproducer as well with the latest API. Also, one important thing to note is that

    import dpnp
    import numba_dpex.experimental as dpex_exp
    from numba_dpex import kernel_api as kapi

    @dpex_exp.device_func
    def min_dpex(a, b):
        t = a if a <= b else b
        return t

    @dpex_exp.kernel
    def _pathfinder_kernel(
        nditem: kapi.NdItem, prev, deviceWall, cols, iteration, cur_row, result
    ):
        current_element = nditem.get_global_id(0)
        left_ind = current_element - 1 if current_element >= 1 else current_element
        right_ind = current_element + 1 if current_element < cols - 1 else cols - 1
        up_ind = current_element
        gr = nditem.get_group()
        for i in range(iteration):
            kapi.group_barrier(gr)
            index = (cur_row + i) * cols + current_element
            left = prev[left_ind]
            up = prev[up_ind]
            right = prev[right_ind]
            shortest = min_dpex(left, up)
            shortest = min_dpex(shortest, right)
            kapi.group_barrier(gr)
            prev[current_element] = deviceWall[index] + shortest
            if i == iteration - 1:
                break
            kapi.group_barrier(gr)
        result[current_element] = prev[current_element]

    def pathfinder(data, rows, cols, pyramid_height, result):
        # create a temp list that holds the first row of data as first element
        # and an empty numpy array as second element
        device_dest = dpnp.array(data[:cols], dtype=dpnp.int64)  # first row
        device_wall = dpnp.array(data[cols:], dtype=dpnp.int64)
        t = 1
        while t < rows:
            iteration = min(pyramid_height, rows - t)
            dpex_exp.call_kernel(
                _pathfinder_kernel,
                kapi.NdRange((cols,), (cols,)),
                device_dest,
                device_wall,
                cols,
                iteration,
                t - 1,
                result,
            )
            device_dest = dpnp.array(result, dpnp.int64)
            t += pyramid_height

|
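To judge whether the kernel above returns the right values, it can help to compare against a host-side reference. The following is a minimal sketch in plain Python, assuming that `data` is a flat, row-major sequence of `rows * cols` integers and that a single work-group covers all columns, as in the `NdRange((cols,), (cols,))` launch above. `pathfinder_reference` is a hypothetical helper name, not part of numba_dpex.

```python
def pathfinder_reference(data, rows, cols):
    """Sequential reference for the pathfinder dynamic program.

    Each new row adds its own cost to the cheapest of the three neighbours
    (left, up, right) in the previous row, with indices clamped at the edges
    in the same way as left_ind and right_ind in the kernel.
    """
    prev = list(data[:cols])  # first row
    for r in range(1, rows):
        wall_row = data[r * cols:(r + 1) * cols]
        prev = [
            wall_row[j]
            + min(prev[max(j - 1, 0)], prev[j], prev[min(j + 1, cols - 1)])
            for j in range(cols)
        ]
    return prev


# Small worked example: a 3 x 3 grid stored row by row.
example = pathfinder_reference([1, 2, 3, 4, 5, 6, 7, 8, 9], rows=3, cols=3)
```

For this grid the intermediate row is `[5, 6, 8]` and the final row is `[12, 13, 15]`; the reference does not depend on `pyramid_height`, which only controls how many rows each kernel launch processes.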
Closing as the main issue with respect to |
The following snippet:
when run on CPU, prints:
but it should print:
I think it is a simpler instance of #892. The buggy pattern seems to be this particular sequence of instructions: barrier -> conditional op on local memory -> barrier -> conditional op on global memory. You can see that the correct result is printed when any one of these steps is altered.
Also, it does not occur for all group sizes. Here, [32, 32] triggers the bug but [16, 16] works (if you can't reproduce the issue, maybe try higher group sizes).
I can't reproduce it on GPU, but maybe there's a combination of group sizes that could make it fail too.
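Because the failure depends on the group size, one way to look for a failing configuration is to sweep several sizes and record which ones return the wrong value. The sketch below is plain Python; `failing_group_sizes` and `run_kernel` are hypothetical names, and `run_kernel(size)` is assumed to launch the snippet with global and local size both equal to `size` and return `float(result[0])`.

```python
def failing_group_sizes(run_kernel, group_sizes, expected):
    """Return the group sizes for which the kernel output is not `expected`."""
    return [size for size in group_sizes if run_kernel(size) != expected]


def reported_behaviour(size):
    # Stand-in that mimics what this issue reports on CPU:
    # size 16 gives the stored value, size 32 does not.
    return 10.0 if size < 32 else 0.0


bad_sizes = failing_group_sizes(reported_behaviour, [16, 32], expected=10.0)
```

On a real device the stand-in would be replaced by a function that launches the kernel, and the list of sizes could be extended upwards as suggested above.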