DeviceRadixSort fails when begin_bit = end_bit = 0 (for large inputs) #353
Comments
Hello, @benbarsdell! Thank you for reporting this. As stated in the documentation, key bits should be different:
I think that's the reason there are no tests for this particular case:
for (int begin_bit = 0; begin_bit <= 1; begin_bit++)
{
    // Iterate end bit
    for (int end_bit = begin_bit + 1;
As I understand it, you expect something like cudaMemcpyAsync to be performed in this case. I think we could generalize the API to cover this case at some point. Can you use a wrapper function until then?
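A wrapper of the kind suggested above might look roughly like the following. This is only a sketch of the control flow: the name SortKeysAllowEmptyBitRange and the two callables are hypothetical, with sort_fn standing in for the real sort call and copy_fn for a device-to-device copy, so that the logic compiles on the host without a GPU. It assumes the usual two-phase calling convention, where a null temporary-storage pointer means "only report the required size".

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical wrapper (not part of CUB). SortFn stands in for the real
// radix sort call and CopyFn for a device-to-device copy; both are assumed
// to return 0 on success.
template <typename KeyT, typename SortFn, typename CopyFn>
int SortKeysAllowEmptyBitRange(void* d_temp_storage,
                               std::size_t& temp_storage_bytes,
                               const KeyT* d_keys_in, KeyT* d_keys_out,
                               int num_items, int begin_bit, int end_bit,
                               SortFn sort_fn, CopyFn copy_fn) {
  assert(begin_bit <= end_bit);
  if (begin_bit != end_bit) {
    // Non-empty bit range: forward unchanged to the real sort.
    return sort_fn(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out,
                   num_items, begin_bit, end_bit);
  }
  if (d_temp_storage == nullptr) {
    // Size-query phase: report a small non-zero size so that callers who
    // allocate the buffer and call again keep working.
    temp_storage_bytes = 1;
    return 0;
  }
  // Empty bit range: all keys compare equal, so the sorted output is
  // simply a copy of the input.
  return copy_fn(d_keys_out, d_keys_in,
                 static_cast<std::size_t>(num_items) * sizeof(KeyT));
}
```

In real use, sort_fn would presumably forward to cub::DeviceRadixSort::SortKeys and copy_fn to cudaMemcpyAsync with cudaMemcpyDeviceToDevice on the same stream.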
Yes, I've worked around it, so it's not blocking us. I think it would be useful to generalize the API to support it. It's not actually clear to me that the docstring excludes this case, because it does not say that the subrange must be non-empty. We are calling it with
This seems reasonable. We should handle the case where
I opened #355 to track the missing log output.
#481 should fix this.
We had a painful issue in TensorFlow that turned out to be caused by passing begin_bit = end_bit = 0 (all keys were zero in our case). CUB failed with "Invalid configuration error", and debugging was difficult because the failing kernel launch was not logged even with debug_synchronous=true.
Some isolated testing shows that CUB succeeds and gives the correct answer for small inputs (i.e., single-block), but for large inputs it either produces the wrong value (CUDA <= 11.2) or returns "Invalid configuration error" (CUDA >= 11.3).
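To make "the correct answer" concrete, here is a small host-side reference model. It is not the attached reproducer and not CUB code. It assumes 32-bit unsigned keys and assumes the sort is stable on the selected bits [begin_bit, end_bit), so keys with equal selected bits keep their input order. Under those assumptions an empty bit range leaves the input unchanged, which is the copy-like behavior expected here. ReferenceSortKeys is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Host-side reference: stable sort of 32-bit keys on bits
// [begin_bit, end_bit). Only the selected bits take part in comparisons.
std::vector<std::uint32_t> ReferenceSortKeys(std::vector<std::uint32_t> keys,
                                             int begin_bit, int end_bit) {
  assert(0 <= begin_bit && begin_bit <= end_bit && end_bit <= 32);
  const int num_bits = end_bit - begin_bit;
  // Mask with num_bits low bits set; 32 is special-cased to avoid an
  // out-of-range shift. The mask is unused when num_bits == 0.
  const std::uint32_t mask =
      num_bits >= 32 ? ~0u : ((1u << num_bits) - 1u);
  auto digit = [=](std::uint32_t k) -> std::uint32_t {
    // With an empty bit range every key maps to the same digit.
    return num_bits == 0 ? 0u : ((k >> begin_bit) & mask);
  };
  std::stable_sort(keys.begin(), keys.end(),
                   [&](std::uint32_t a, std::uint32_t b) {
                     return digit(a) < digit(b);
                   });
  return keys;
}
```

A test for the large-input case could compare the device output against ReferenceSortKeys(keys, 0, 0), which returns the keys in their original order.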
Here is a minimal reproducer (remove the ".txt" suffix):
test_cub_bits_bug.cu.txt
It would be great if this could be fixed, and if logging could be added for all kernel launches (specifically in the Onesweep path).