[Data] Add ds.distinct() API to get unique values in a column. #32984

Closed
woshiyyya opened this issue Mar 2, 2023 · 0 comments · Fixed by #36655
Assignees
Labels
data Ray Data-related issues enhancement Request for new feature and/or capability P1 Issue that should be fixed within a few weeks Ray-2.6

Comments

@woshiyyya
Member

Description

Currently, getting the distinct values in a column of a Ray dataset requires the following code:

ds.groupby(column).count().drop_columns(["count()"])
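The workaround above groups rows by the column's value, counts each group, and then throws the counts away. A minimal plain-Python sketch of that logic follows; it uses only the standard library rather than Ray, and the function name `distinct_via_group_count` and the sample rows are hypothetical, invented for this illustration.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def distinct_via_group_count(
    rows: Iterable[Dict[str, Any]], column: str
) -> List[Dict[str, Any]]:
    """Mimic "group by column, count, drop the count" in plain Python.

    Hypothetical helper for illustration; this is not Ray's implementation.
    """
    # Step 1: group rows by the column value and count rows per group.
    counts = Counter(row[column] for row in rows)
    # Step 2: discard the counts, keeping one row per distinct value.
    # Sorting is only for deterministic output in this sketch and
    # assumes the values are mutually comparable.
    return [{column: value} for value in sorted(counts)]


rows = [{"label": "cat"}, {"label": "dog"}, {"label": "cat"}]
result = distinct_via_group_count(rows, "label")
print(result)  # [{'label': 'cat'}, {'label': 'dog'}]
```

The counting step does work that is never used, which is part of why a dedicated API reads more cleanly.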

This is verbose and unintuitive for users. I propose adding a distinct() API, similar to what PySpark provides:

df.select_columns(['col_name']).distinct()

Use case

For example, given an image classification dataset, we may want to collect all the unique labels in it. Calling this distinct() method would do that directly.

@woshiyyya woshiyyya added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) P1 Issue that should be fixed within a few weeks data Ray Data-related issues labels Mar 2, 2023
@woshiyyya woshiyyya changed the title from "[Data] Add ds.distinct() API to get unique values for a column." to "[Data] Add ds.distinct() API to get unique values in a column." Mar 2, 2023
@woshiyyya woshiyyya added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Mar 2, 2023
@c21 c21 removed the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Mar 2, 2023
@pcmoritz pcmoritz added P1 Issue that should be fixed within a few weeks and removed P2 Important issue, but not time-critical labels Jun 16, 2023
@raulchen raulchen self-assigned this Jun 16, 2023
pcmoritz added a commit that referenced this issue Jun 26, 2023
It turns out that for the use cases in #32984, the previous `.distinct()` API was not quite the right one: the function is actually used to get the distinct labels or values in a dataset, so returning a list is the most convenient. This is in line with Ray Data being a last-mile data processing framework.

HuggingFace Datasets already has a good API for this, so we are implementing a similar one here; see https://huggingface.co/docs/datasets/v2.13.1/en/package_reference/main_classes#datasets.Dataset.unique

Co-authored-by: Eric Liang <ekhliang@gmail.com>
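
To illustrate the list-returning behaviour the commit message describes, here is a minimal plain-Python sketch. It does not call Ray or HuggingFace; `unique_values` and the sample rows are hypothetical names invented for this illustration, and the first-seen ordering is a choice made by the sketch, not a guarantee of either library.

```python
from typing import Any, Dict, Hashable, Iterable, List


def unique_values(rows: Iterable[Dict[str, Any]], column: str) -> List[Hashable]:
    """Return the distinct values of `column` as a plain list.

    Hypothetical helper for illustration only; assumes the values are hashable.
    """
    # dict.fromkeys de-duplicates while keeping insertion order (Python 3.7+).
    return list(dict.fromkeys(row[column] for row in rows))


# Toy stand-in for an image classification dataset.
samples = [
    {"image_id": 0, "label": "cat"},
    {"image_id": 1, "label": "dog"},
    {"image_id": 2, "label": "cat"},
    {"image_id": 3, "label": "bird"},
]
labels = unique_values(samples, "label")
print(labels)  # ['cat', 'dog', 'bird']
```

Returning a plain list means the caller can use the labels directly, for example to build a label-to-index mapping, without a further collection step.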
arvind-chandra pushed a commit to lmco/ray that referenced this issue Aug 31, 2023
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
4 participants