-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
distinct_count_estimator aka HyperLogLog++ #429
Conversation
Could you provide a high-level description of the implementation? I'd discussed some considerations about the implementation in rapidsai/cudf#10652 (comment) and I'm curious what you ended up doing. |
Sure! I'll also add the inline docs ASAP to make the PR better reviewable and also get CI unblocked. The The second The last step called |
To unblock CI, could create a cuCollections/include/cuco/detail/__config Lines 42 to 44 in 7404bd2
|
The finish line is in sight. Last thing on my TODO list (apart from addressing review comments) is to match Spark's unit test to make sure we can use our implementation as a drop-in replacement for Spark's CPU implementation. |
It occurred to me that we might need a |
For Spark we definitely need null handling, but I am not hung up on how/when it gets in, so long as we can support it. Nulls are essentially ignored, but NaN values are not, and are treated as a part of the unique count. Be aware that we might then have a HyperLogLog++ with no entries in it, and we would need to make sure that we will output 0 for that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. I went through public API docs and they are accurate and well written.
add_if
is a nice add-on but not critical thus probably a separate PR to target it.
@revans2 @PointKernel I added Spark parity tests in 3b0da20. There are two tests that are currently failing where Spark applies some special treatment for different values for NaN and +-0.0 so they are counted as a single item. Those values do have distinct bit representation though, thus our XXHash64 treats them as distinct items. I'm not sure if we should match Sparks behavior here or if this should be addressed upstream by, e.g., providing a wrapper for the hasher that makes sure these values are mapped to the same bit representation before passing them to the XXHash64 function (see godbolt). |
A custom hasher provided by the user is the right way to go. I think it's fine to fail in this case. |
This PR renames `include/cuco/sentinel.cuh` to `include/cuco/types.cuh` as preparation work for #429. The goal is to have a single header containing all strong types used across cuco. Note: Apparently Doxygen has become even pickier. I had to add some more inline docs to headers which we haven't touched in a while to get the pre-commit test to pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work! ship it.
README.md
Outdated
- [Host-bulk APIs](https://github.com/NVIDIA/cuCollections/blob/dev/examples/distinct_count_estimator/host_bulk_example.cu) (see [live example in godbolt](https://godbolt.org/z/EG7cMssxo)) | ||
- [Device-ref APIs](https://github.com/NVIDIA/cuCollections/blob/dev/examples/distinct_count_estimator/device_ref_example.cu) (see [live example in godbolt](https://godbolt.org/z/va8eE9dqb)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the note still valid?
This PR introduces a new data structure
cuco::distinct_count_estimator
which implements the famous HyperLogLog++ algorithm for accurately approximating the number of distinct items in a multiset (see paper). This PR will ultimately solve rapidsai/cudf#10652.