
distinct_count_estimator aka HyperLogLog++ #429

Merged: 90 commits, merged Apr 4, 2024

Conversation

@sleeepyjack (Collaborator) commented Jan 24, 2024

This PR introduces a new data structure cuco::distinct_count_estimator which implements the famous HyperLogLog++ algorithm for accurately approximating the number of distinct items in a multiset (see paper). This PR will ultimately solve rapidsai/cudf#10652.

@sleeepyjack sleeepyjack added type: feature request New feature request In Progress Currently a work in progress labels Jan 24, 2024
@sleeepyjack sleeepyjack self-assigned this Jan 24, 2024
@jrhemstad (Collaborator)

Could you provide a high-level description of the implementation? I'd discussed some considerations about the implementation in rapidsai/cudf#10652 (comment) and I'm curious what you ended up doing.

@sleeepyjack (Collaborator, Author) commented Jan 24, 2024

Could you provide a high-level description of the implementation?

Sure! I'll also add the inline docs ASAP to make the PR easier to review and to get CI unblocked.

The add step is indeed straightforward: each thread takes an input item, hashes it, and splits the resulting hash value into the p lower bits, which give the register/bucket index, and the remaining high bits, which go into __clz to count the number of leading zero bits. The original implementation puts the p index bits at the MSB end and counts the remaining zeros starting from bit MSB-p; in that layout we can't use __clz directly and would have to issue additional instructions, so I decided to just flip it around. This of course assumes that the hash function distributes entropy equally well over the high and low bits.
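To illustrate, a minimal sketch of this per-thread add step could look like the following (not the actual cuco kernel; the Precision parameter, int register type, and the all-zero guard are assumptions):

#include <cuda/std/cstdint>

// Minimal sketch of the per-thread add step described above. The low
// `Precision` bits select the register; the remaining high bits feed the
// leading-zero count.
template <int Precision>
__device__ void hll_add(cuda::std::uint64_t hash, int* registers)
{
  constexpr auto index_mask = (cuda::std::uint64_t{1} << Precision) - 1;
  auto const index = hash & index_mask;   // register/bucket index (low bits)
  auto const data  = hash & ~index_mask;  // remaining high bits

  // With the index at the LSB end, __clzll on the masked hash directly yields
  // the number of leading zeros of the high bits. Guard the all-zero case.
  int const rank =
    (data == 0) ? (64 - Precision + 1) : (__clzll(static_cast<long long>(data)) + 1);

  // Each register keeps the maximum rank observed for its bucket.
  atomicMax(registers + index, rank);
}

In the actual kernel the atomicMax would first target the block-local shared-memory histogram described in the merge step below.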

The second step, merge, combines two local histograms using a point-wise max. In the current implementation, each block fills a shared-memory histogram and then merges it into a single global-memory histogram using atomicMax. This can lead to contention on the global array, so we want as few of these merge operations as possible by assigning more work per block, i.e., launching as few blocks as possible while still saturating the GPU.
Instead of this atomic reduction, we could employ a proper tree reduction scheme, which would require some auxiliary memory to store the intermediate histograms. This is on my performance-tuning TODO list.
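As a rough sketch of that merge (names and kernel shape are assumptions, not the actual cuco code):

// Minimal sketch of the block-level merge described above: each thread folds a
// strided slice of the block-local (shared-memory) histogram into the global
// one; atomicMax implements the point-wise max.
template <int NumRegisters>
__device__ void merge_block_histogram(int const* shared_regs, int* global_regs)
{
  for (int i = threadIdx.x; i < NumRegisters; i += blockDim.x) {
    atomicMax(global_regs + i, shared_regs[i]);
  }
}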

The last step, estimate, takes the final histogram, computes the harmonic-mean-based raw estimate (a simple + reduction over the registers) and then runs some single-threaded computations for HLL++'s bias correction.
Here we have several options for splitting the work between host and device. On the device it makes sense to assign this step to a single thread block so we can perform the reduction in shared memory; we could even use just a warp when the histograms are small (by default they have 2048 int registers). Using more threads doesn't really pay off even for larger histograms, since the additional global-memory traffic would kill performance.
I did compare host vs. device for this step, and surprisingly copying the final histogram to the host and running the step in a single host thread is ~20% faster than launching a kernel with one block that computes the estimate and then copying the final result back to the host. I still have to profile this, but I suspect kernel launch overhead is the culprit. Also, the CPU is simply better at the single-threaded bias-correction work than a single CUDA thread.
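For reference, a rough host-side sketch of that estimate computation (standard HLL constants, with the full HLL++ bias-correction table omitted and only the small-range linear-counting correction shown; this is not the cuco code):

#include <cmath>
#include <vector>

// Raw HyperLogLog estimate from the final histogram: alpha_m * m^2 / sum(2^-M[j]),
// followed by the small-range (linear counting) correction.
double estimate(std::vector<int> const& registers)
{
  auto const m = static_cast<double>(registers.size());
  double sum   = 0.0;
  int zeros    = 0;
  for (int reg : registers) {
    sum += std::exp2(-reg);  // empty registers (reg == 0) contribute 1
    zeros += (reg == 0);
  }
  double const alpha = 0.7213 / (1.0 + 1.079 / m);  // alpha_m for m >= 128
  double e           = alpha * m * m / sum;
  if (e <= 2.5 * m && zeros > 0) { e = m * std::log(m / zeros); }  // linear counting
  return e;
}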

@PointKernel (Member)

To unblock CI, could you create a CUCO_HAS_CG_REDUCE_UPDATE_ASYNC macro similar to

#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11010)
#define CUCO_HAS_CG_MEMCPY_ASYNC
#endif
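
For example, something along these lines (the CUDART_VERSION threshold below is only a placeholder assumption, not the final value):

#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11060)  // placeholder threshold
#define CUCO_HAS_CG_REDUCE_UPDATE_ASYNC
#endif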

@sleeepyjack (Collaborator, Author)

The finish line is in sight. The last thing on my TODO list (apart from addressing review comments) is to match Spark's unit tests to make sure we can use our implementation as a drop-in replacement for Spark's CPU implementation.

@sleeepyjack (Collaborator, Author)

It occurred to me that we might need an add_if function to deal with null/NaN values in Spark/cudf. Should we move this to a separate PR?

@revans2 commented Mar 26, 2024

It occurred to me that we might need an add_if function to deal with null/NaN values in Spark/cudf. Should we move this to a separate PR?

For Spark we definitely need null handling, but I am not hung up on how/when it gets in, so long as we can support it. Nulls are essentially ignored, but NaN values are not; they are treated as part of the distinct count.

Be aware that we might then end up with a HyperLogLog++ sketch with no entries in it, and we need to make sure it outputs 0 in that case.
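To make the proposal concrete, a hypothetical add_if call could look roughly like this (no such overload exists in cuco yet; the stencil/predicate shape simply mirrors cuco's other *_if bulk APIs, and the device lambda assumes --extended-lambda):

#include <cuco/distinct_count_estimator.cuh>
#include <thrust/device_vector.h>
#include <cstddef>

// Hypothetical add_if sketch: rows whose validity flag is false (nulls) are
// skipped, while NaN rows are valid and therefore contribute to the distinct
// count, matching Spark's rules. Names and signature are illustrative only.
std::size_t count_distinct_non_null(thrust::device_vector<double> const& values,
                                    thrust::device_vector<bool> const& validity)
{
  cuco::distinct_count_estimator<double> estimator{};
  estimator.add_if(values.begin(), values.end(),
                   validity.begin(),  // per-row null mask (stencil)
                   [] __device__(bool is_valid) { return is_valid; });
  // An estimator that never saw a valid row must report 0 here.
  return estimator.estimate();
}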

@PointKernel (Member) left a comment

Looks great. I went through the public API docs and they are accurate and well written.

add_if is a nice add-on but not critical, so it's probably best to target it in a separate PR.

@sleeepyjack (Collaborator, Author) commented Mar 27, 2024

@revans2 @PointKernel I added Spark parity tests in 3b0da20.

There are two tests that are currently failing: Spark applies special treatment to the different bit patterns of NaN and of +-0.0 so that each group is counted as a single item. Those values have distinct bit representations, though, so our XXHash64 treats them as distinct items. I'm not sure whether we should match Spark's behavior here or whether this should be addressed upstream by, e.g., providing a wrapper for the hasher that maps these values to the same bit representation before passing them to the XXHash64 function (see godbolt).
I'm also asking myself whether it would be the right thing to bake this behavior into all of our hashers: any zero or NaN representation should hash to the same value.
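For reference, the wrapper idea could look roughly like this (a sketch only; the exact cuco hasher interface and the choice of canonical values are assumptions):

#include <cuco/hash_functions.cuh>
#include <cuda/std/cmath>
#include <cuda/std/cstdint>
#include <cuda/std/limits>

// Sketch of a canonicalizing wrapper around cuco::xxhash_64: every NaN payload
// and both signed zeros are mapped to one canonical bit pattern before hashing,
// so Spark's "NaN == NaN" and "+0.0 == -0.0" semantics fall out of the hash.
struct canonicalizing_xxhash_64 {
  __host__ __device__ cuda::std::uint64_t operator()(double key) const
  {
    if (cuda::std::isnan(key)) { key = cuda::std::numeric_limits<double>::quiet_NaN(); }
    if (key == 0.0) { key = 0.0; }  // +0.0 == -0.0 compares equal, so this strips the sign bit
    return cuco::xxhash_64<double>{}(key);
  }
};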

@PointKernel (Member) commented Apr 2, 2024

Those values have distinct bit representations, though, so our XXHash64 treats them as distinct items.

A custom hasher provided by the user is the right way to go. I think it's fine for those two tests to fail in this case.

@sleeepyjack sleeepyjack requested a review from PointKernel April 3, 2024 01:00
sleeepyjack added a commit that referenced this pull request Apr 3, 2024
This PR renames `include/cuco/sentinel.cuh` to `include/cuco/types.cuh`
as preparation work for #429. The goal is to have a single header
containing all strong types used across cuco.

Note: Apparently Doxygen has become even pickier. I had to add some more
inline docs to headers that we haven't touched in a while to get the
pre-commit test to pass.
@PointKernel (Member) left a comment

Excellent work! Ship it.

README.md Outdated
Comment on lines 242 to 243
- [Host-bulk APIs](https://github.com/NVIDIA/cuCollections/blob/dev/examples/distinct_count_estimator/host_bulk_example.cu) (see [live example in godbolt](https://godbolt.org/z/EG7cMssxo))
- [Device-ref APIs](https://github.com/NVIDIA/cuCollections/blob/dev/examples/distinct_count_estimator/device_ref_example.cu) (see [live example in godbolt](https://godbolt.org/z/va8eE9dqb))

Is the note still valid?
