
distinct_count_estimator aka HyperLogLog++ #429

Merged: 90 commits, merged Apr 4, 2024

Conversation

@sleeepyjack (Collaborator) commented Jan 24, 2024

This PR introduces a new data structure cuco::distinct_count_estimator which implements the famous HyperLogLog++ algorithm for accurately approximating the number of distinct items in a multiset (see paper). This PR will ultimately solve rapidsai/cudf#10652.

@sleeepyjack sleeepyjack added type: feature request New feature request In Progress Currently a work in progress labels Jan 24, 2024
@sleeepyjack sleeepyjack self-assigned this Jan 24, 2024
@jrhemstad (Collaborator)

Could you provide a high-level description of the implementation? I'd discussed some considerations about the implementation in rapidsai/cudf#10652 (comment) and I'm curious what you ended up doing.

@sleeepyjack (Collaborator, Author) commented Jan 24, 2024

Could you provide a high-level description of the implementation?

Sure! I'll also add the inline docs ASAP to make the PR easier to review and to get CI unblocked.

The add step is indeed straightforward: each thread takes an input item, hashes it, and splits the resulting hash value into the p lower bits, which give the register/bucket index, and the remaining high bits, which go into __clz to count the number of leading zero bits. The original implementation puts the p index bits at the MSB end and counts the remaining zeros starting from bit MSB-p; in that layout we can't use __clz directly and would have to issue additional instructions, so I decided to just flip it around. This of course assumes that the hash function distributes entropy equally well over the high and low bits.
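To illustrate, a minimal sketch of this per-thread add step could look like the following (not the actual cuco kernel; the Precision parameter, int register type, and the all-zero guard are assumptions):

#include <cuda/std/cstdint>

// Minimal sketch of the per-thread add step described above. The low
// `Precision` bits select the register; the remaining high bits feed the
// leading-zero count.
template <int Precision>
__device__ void hll_add(cuda::std::uint64_t hash, int* registers)
{
  constexpr auto index_mask = (cuda::std::uint64_t{1} << Precision) - 1;
  auto const index = hash & index_mask;   // register/bucket index (low bits)
  auto const data  = hash & ~index_mask;  // remaining high bits

  // With the index at the LSB end, __clzll on the masked hash directly yields
  // the number of leading zeros of the high bits. Guard the all-zero case.
  int const rank =
    (data == 0) ? (64 - Precision + 1) : (__clzll(static_cast<long long>(data)) + 1);

  // Each register keeps the maximum rank observed for its bucket.
  atomicMax(registers + index, rank);
}

In the actual kernel the atomicMax would first target the block-local shared-memory histogram described in the merge step below.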

The second step, merge, combines two local histograms using a point-wise max. In the current implementation, each block fills a shared-memory histogram and then merges it into a single global-memory histogram using atomicMax. This can lead to contention on the global array, so we want as few of these merge operations as possible by assigning more work per block, i.e., launching as few blocks as possible while still saturating the GPU.
Instead of this atomic reduction, we could employ a proper tree reduction scheme, which would require some auxiliary memory to store the intermediate histograms. This is on my performance-tuning TODO list.
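As a rough sketch of that merge (names and kernel shape are assumptions, not the actual cuco code):

// Minimal sketch of the block-level merge described above: each thread folds a
// strided slice of the block-local (shared-memory) histogram into the global
// one; atomicMax implements the point-wise max.
template <int NumRegisters>
__device__ void merge_block_histogram(int const* shared_regs, int* global_regs)
{
  for (int i = threadIdx.x; i < NumRegisters; i += blockDim.x) {
    atomicMax(global_regs + i, shared_regs[i]);
  }
}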

The last step, estimate, takes the final histogram, computes the harmonic-mean-based raw estimate (a simple + reduction over the registers) and then runs some single-threaded computations for HLL++'s bias correction.
Here we have several options for splitting the work between host and device. On the device it makes sense to assign this step to a single thread block so we can perform the reduction in shared memory; we could even use just a warp when the histograms are small (by default they have 2048 int registers). Using more threads doesn't really pay off even for larger histograms, since the additional global-memory traffic would kill performance.
I did compare host vs. device for this step, and surprisingly copying the final histogram to the host and running the step in a single host thread is ~20% faster than launching a kernel with one block that computes the estimate and then copying the final result back to the host. I still have to profile this, but I suspect kernel launch overhead is the culprit. Also, the CPU is simply better at the single-threaded bias-correction work than a single CUDA thread.
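For reference, a rough host-side sketch of that estimate computation (standard HLL constants, with the full HLL++ bias-correction table omitted and only the small-range linear-counting correction shown; this is not the cuco code):

#include <cmath>
#include <vector>

// Raw HyperLogLog estimate from the final histogram: alpha_m * m^2 / sum(2^-M[j]),
// followed by the small-range (linear counting) correction.
double estimate(std::vector<int> const& registers)
{
  auto const m = static_cast<double>(registers.size());
  double sum   = 0.0;
  int zeros    = 0;
  for (int reg : registers) {
    sum += std::exp2(-reg);  // empty registers (reg == 0) contribute 1
    zeros += (reg == 0);
  }
  double const alpha = 0.7213 / (1.0 + 1.079 / m);  // alpha_m for m >= 128
  double e           = alpha * m * m / sum;
  if (e <= 2.5 * m && zeros > 0) { e = m * std::log(m / zeros); }  // linear counting
  return e;
}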

@PointKernel (Member)

To unblock CI, could you create a CUCO_HAS_CG_REDUCE_UPDATE_ASYNC macro similar to

#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11010)
#define CUCO_HAS_CG_MEMCPY_ASYNC
#endif
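
For example, something along these lines (the CUDART_VERSION threshold below is only a placeholder assumption, not the final value):

#if defined(CUDART_VERSION) && (CUDART_VERSION >= 11060)  // placeholder threshold
#define CUCO_HAS_CG_REDUCE_UPDATE_ASYNC
#endif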

@sleeepyjack (Collaborator, Author)

The finish line is in sight. The last thing on my TODO list (apart from addressing review comments) is to match Spark's unit tests to make sure we can use our implementation as a drop-in replacement for Spark's CPU implementation.

@sleeepyjack (Collaborator, Author)

It occurred to me that we might need an add_if function to deal with null/NaN values in Spark/cudf. Should we move this to a separate PR?

@revans2 commented Mar 26, 2024

It occurred to me that we might need an add_if function to deal with null/NaN values in Spark/cudf. Should we move this to a separate PR?

For Spark we definitely need null handling, but I am not hung up on how/when it gets in, so long as we can support it. Nulls are essentially ignored, but NaN values are not; they are treated as part of the distinct count.

Be aware that we might then end up with a HyperLogLog++ sketch with no entries in it, and we need to make sure it outputs 0 in that case.
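To make the proposal concrete, a hypothetical add_if call could look roughly like this (no such overload exists in cuco yet; the stencil/predicate shape simply mirrors cuco's other *_if bulk APIs, and the device lambda assumes --extended-lambda):

#include <cuco/distinct_count_estimator.cuh>
#include <thrust/device_vector.h>
#include <cstddef>

// Hypothetical add_if sketch: rows whose validity flag is false (nulls) are
// skipped, while NaN rows are valid and therefore contribute to the distinct
// count, matching Spark's rules. Names and signature are illustrative only.
std::size_t count_distinct_non_null(thrust::device_vector<double> const& values,
                                    thrust::device_vector<bool> const& validity)
{
  cuco::distinct_count_estimator<double> estimator{};
  estimator.add_if(values.begin(), values.end(),
                   validity.begin(),  // per-row null mask (stencil)
                   [] __device__(bool is_valid) { return is_valid; });
  // An estimator that never saw a valid row must report 0 here.
  return estimator.estimate();
}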

@PointKernel (Member) left a comment

Looks great. I went through the public API docs and they are accurate and well written.

add_if is a nice add-on but not critical, so it's probably best to target it in a separate PR.

@sleeepyjack (Collaborator, Author) commented Mar 27, 2024

@revans2 @PointKernel I added Spark parity tests in 3b0da20.

There are two tests that are currently failing: Spark applies special treatment to the different bit patterns of NaN and of +-0.0 so that each group is counted as a single item. Those values have distinct bit representations, though, so our XXHash64 treats them as distinct items. I'm not sure whether we should match Spark's behavior here or whether this should be addressed upstream by, e.g., providing a wrapper for the hasher that maps these values to the same bit representation before passing them to the XXHash64 function (see godbolt).
I'm also asking myself whether it would be the right thing to bake this behavior into all of our hashers: any zero or NaN representation should hash to the same value.
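For reference, the wrapper idea could look roughly like this (a sketch only; the exact cuco hasher interface and the choice of canonical values are assumptions):

#include <cuco/hash_functions.cuh>
#include <cuda/std/cmath>
#include <cuda/std/cstdint>
#include <cuda/std/limits>

// Sketch of a canonicalizing wrapper around cuco::xxhash_64: every NaN payload
// and both signed zeros are mapped to one canonical bit pattern before hashing,
// so Spark's "NaN == NaN" and "+0.0 == -0.0" semantics fall out of the hash.
struct canonicalizing_xxhash_64 {
  __host__ __device__ cuda::std::uint64_t operator()(double key) const
  {
    if (cuda::std::isnan(key)) { key = cuda::std::numeric_limits<double>::quiet_NaN(); }
    if (key == 0.0) { key = 0.0; }  // +0.0 == -0.0 compares equal, so this strips the sign bit
    return cuco::xxhash_64<double>{}(key);
  }
};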

@PointKernel (Member) commented Apr 2, 2024

Those values have distinct bit representations, though, so our XXHash64 treats them as distinct items.

A custom hasher provided by the user is the right way to go. I think it's fine for those two tests to fail in this case.

@sleeepyjack sleeepyjack requested a review from PointKernel April 3, 2024 01:00
sleeepyjack added a commit that referenced this pull request Apr 3, 2024
This PR renames `include/cuco/sentinel.cuh` to `include/cuco/types.cuh`
as preparation work for #429. The goal is to have a single header
containing all strong types used across cuco.

Note: Apparently Doxygen has become even pickier. I had to add some more
inline docs to headers that we haven't touched in a while to get the
pre-commit test to pass.
@PointKernel (Member) left a comment

Excellent work! Ship it.

README.md Outdated
Comment on lines 242 to 243
- [Host-bulk APIs](https://github.com/NVIDIA/cuCollections/blob/dev/examples/distinct_count_estimator/host_bulk_example.cu) (see [live example in godbolt](https://godbolt.org/z/EG7cMssxo))
- [Device-ref APIs](https://github.com/NVIDIA/cuCollections/blob/dev/examples/distinct_count_estimator/device_ref_example.cu) (see [live example in godbolt](https://godbolt.org/z/va8eE9dqb))

Is the note still valid?
