DeviceRadixSort fails when begin_bit = end_bit = 0 (for large inputs) #353

benbarsdell · 2021-08-02T12:28:02Z

We had a painful issue in TensorFlow that turned out to be because we were passing begin_bit = end_bit = 0 (all keys were zero in our case). CUB failed with "Invalid configuration error", and debugging was difficult because the failing kernel launch was not logged even with debug_synchronous=true.

Some isolated testing shows that CUB succeeds and gives the correct answer for small inputs (i.e., the single-block path), but for large inputs it either produces the wrong value (CUDA <= 11.2) or returns "Invalid configuration error" (CUDA >= 11.3).

Here is a minimal reproducer (remove the ".txt" suffix):
test_cub_bits_bug.cu.txt

It would be great if this could be fixed, and if logging could be added for all kernel launches (specifically in the Onesweep path).

gevtushenko · 2021-08-02T12:53:34Z

Hello, @benbarsdell! Thank you for reporting this. As stated in the documentation, key bits should be different:

An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified.

I think that's the reason there are no tests for this particular case:

for (int begin_bit = 0; begin_bit <= 1; begin_bit++)
{
    // Iterate end bit
    for (int end_bit = begin_bit + 1;

As I understand, you expect something like cudaMemcpyAsync to be performed in this case. I think we could generalize API to this case at some point. Can you use some wrapper function until then?
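
To illustrate why a plain copy is the natural result for an empty bit range, here is a minimal host-side sketch. It is not CUB code: `reference_sort_bits` is a hypothetical helper that models a stable sort comparing only the key bits in [begin_bit, end_bit), under the assumption of 32-bit unsigned keys. When begin_bit == end_bit, every key has the same (empty) digit, so a stable sort leaves the input order unchanged.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference model (not part of CUB): stable sort of
// 32-bit keys that compares only the bits in [begin_bit, end_bit).
inline std::vector<std::uint32_t> reference_sort_bits(
    std::vector<std::uint32_t> keys, int begin_bit, int end_bit)
{
    assert(0 <= begin_bit && begin_bit <= end_bit && end_bit <= 32);
    const int num_bits = end_bit - begin_bit;

    // Build the mask in 64 bits so that num_bits == 32 is a valid shift.
    // An empty range (num_bits == 0) yields mask == 0.
    const std::uint64_t mask = (std::uint64_t{1} << num_bits) - 1;

    // Extract the digit formed by the selected bits; widen to 64 bits first
    // so that begin_bit == 32 is also a valid shift.
    auto digit = [=](std::uint32_t k) {
        return (std::uint64_t{k} >> begin_bit) & mask;
    };

    // With mask == 0 all digits are equal, so the stable sort is a no-op
    // and the result is a copy of the input.
    std::stable_sort(keys.begin(), keys.end(),
                     [&](std::uint32_t a, std::uint32_t b) {
                         return digit(a) < digit(b);
                     });
    return keys;
}
```

A device-side wrapper could follow the same reasoning: if begin_bit == end_bit, copy the input to the output (for example with cudaMemcpyAsync, as suggested above) and otherwise forward to the sort.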

benbarsdell · 2021-08-03T02:49:08Z

Yes, I've worked around it, so it's not blocking us.

I think it would be useful to generalize the API to support it. It's not actually clear to me that that docstring excludes this case, because it does not say that the subrange must be non-empty. We are calling it with end_bit = Log2Ceiling(N), for integer keys in the range [0, N), which results in end_bit = 0 when N = 1.
There is also the fact that it already works in the single-block path.
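
The N = 1 case can be checked with a small sketch. `log2_ceiling` below is a hypothetical stand-in written for illustration, not TensorFlow's actual Log2Ceiling implementation; it assumes n >= 1 and returns the smallest number of bits b such that 2^b >= n.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (illustration only): smallest b with (1 << b) >= n.
// Precondition: n >= 1.
inline int log2_ceiling(std::uint32_t n)
{
    assert(n >= 1);
    int bits = 0;
    // Compare in 64 bits so the shift stays valid up to bits == 32.
    while ((std::uint64_t{1} << bits) < n)
    {
        ++bits;
    }
    return bits;
}
```

For keys in [0, N) with N = 1, the only possible key is 0 and `log2_ceiling(1)` is 0, so a caller that passes begin_bit = 0 and end_bit = log2_ceiling(N) ends up with the empty range [0, 0).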

alliepiper · 2021-08-05T17:55:21Z

This seems reasonable. We should handle the case where begin_bit == end_bit to simplify generic use cases, and add tests for it.

I opened #355 to track the missing log output.

canonizer · 2022-05-12T20:31:38Z

#481 should fix this.

benbarsdell mentioned this issue Aug 2, 2021

Fix bug in GpuRadixSort for SparseSegmentReduceGrad tensorflow/tensorflow#51094

Merged

alliepiper added the "type: bug: functional" (Does not work as intended.) and "P2: nice to have" (Desired, but not necessary.) labels Aug 5, 2021

alliepiper added this to the 1.14.0 milestone Aug 5, 2021

alliepiper mentioned this issue Aug 5, 2021

Onesweep radix sort implementation should handle debug_synchronous logging #355

Closed

alliepiper modified the milestones: 1.14.0, 1.15.0 Aug 17, 2021

alliepiper modified the milestones: 1.15.0, 1.16.0 Oct 15, 2021

alliepiper modified the milestones: 1.16.0, 1.17.0 Feb 7, 2022

alliepiper assigned canonizer Apr 25, 2022

alliepiper modified the milestones: 1.17.0, 2.0.0 May 7, 2022

canonizer mentioned this issue May 12, 2022

Fix begin_bit == end_bit == 0 for device-wide and segmented sort #481

Merged

alliepiper linked a pull request May 13, 2022 that will close this issue

Fix begin_bit == end_bit == 0 for device-wide and segmented sort #481

Merged

alliepiper modified the milestones: 2.0.0, 2.1.0 Jul 25, 2022

gevtushenko closed this as completed in #481 Aug 9, 2022

jrhemstad added this to CCCL Aug 11, 2022

jrhemstad removed this from CCCL Aug 11, 2022
