Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Add unique_count algorithm #1612

Closed
upsj opened this issue Feb 4, 2022 · 3 comments · Fixed by #1619
Closed

Add unique_count algorithm #1612

upsj opened this issue Feb 4, 2022 · 3 comments · Fixed by #1619

Comments

@upsj
Copy link
Contributor

upsj commented Feb 4, 2022

For the common count -> allocate -> fill pattern (e.g. count_if/copy_if) there are some fill algorithms without a corresponding count algorithm, mainly unique_copy, unique_by_key_copy and reduce_by_key. So I want to suggest adding a count_unique algorithm with basically the same interface as unique save the return type:

difference_type thrust::count_unique(
        Policy exec,
        ForwardIterator first,
        ForwardIterator last,
        BinaryPredicate binary_pred)

The name probably needs a bit of discussion, since it doesn't actually count unique elements, but runs of unique elements, but unique has the same issue, so there is some prior art for it.

@allisonvacanti suggested using cub::DeviceRunLengthEncode::Encode to implement it, so I wanted to check whether there are any other approaches to consider before starting work on a PR. Another simple approach might be

auto zip_it = zip(it, it + 1);
if (size > 0) {
    return 1 + count_if(zip_it, zip_it + size - 1, [](auto a) { !binary_pred(get<0>(a), get<1>(a)); });
} else {
    return 0;
}
@jrhemstad
Copy link
Collaborator

CC @codereport on naming bikeshedding. I'd advocate for unique_count so it sorts next to the other unique_* algorithms.

I suppose cub::DeviceRunLengthEncode::Encode/NonTrivialRuns with discard iterators for the unique values/counts would work. I'd benchmark it against just doing count_if.

@codereport
Copy link
Collaborator

Long story short, I agree with @jrhemstad, unique_count is the correct name.


More details: we sort of had the opposite issue at one point in RAPIDS. We had an algorithm/API called unique_count but it was really not the right name in the C++ sense because it was counting the number of "unique" (not in the C++ sense) elements. Basically the equivalent in Python is:

def unique_count(list):
   return len(set(list))

We ended up changing the name to distinct_count.

@upsj
Copy link
Contributor Author

upsj commented Feb 7, 2022

I will go with unique_count then. Should this be part of a new header or in unique/unique_by_key? Also it looks like head_flags already provides the necessary zip functionality wrapped neatly.

@upsj upsj changed the title Add count_unique algorithm Add unique_count algorithm Feb 13, 2022
upsj added a commit to upsj/thrust that referenced this issue Feb 13, 2022
Add a counting equivalent to unique_* algorithms
that can be used to allocate the correct amount of data
before actually filling it.

Addresses issue NVIDIA#1612
upsj added a commit to upsj/thrust that referenced this issue Feb 13, 2022
Add a counting equivalent to unique_* algorithms
that can be used to allocate the correct amount of data
before actually filling it.

Addresses issue NVIDIA#1612
upsj added a commit to upsj/thrust that referenced this issue Mar 11, 2022
Add a counting equivalent to unique_* algorithms
that can be used to allocate the correct amount of data
before actually filling it.

Addresses issue NVIDIA#1612
upsj added a commit to upsj/thrust that referenced this issue Apr 17, 2022
Add a counting equivalent to unique_* algorithms
that can be used to allocate the correct amount of data
before actually filling it.

Addresses issue NVIDIA#1612
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants