[data] Implement Dataset.distinct #36655

Merged: 10 commits, Jun 23, 2023

raulchen
Contributor

Why are these changes needed?

Implement Dataset.distinct. Currently this API only supports Datasets with a single column, because groupby doesn't support multiple columns yet.
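
To make the single-column restriction concrete, here is a minimal plain-Python sketch. It is not Ray code: rows are modeled as a list of dicts, and `select_columns` and `distinct` are hypothetical stand-ins that only mimic the behavior described above (reject multi-column input, so the caller selects one column first).

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def select_columns(rows: List[Row], cols: List[str]) -> List[Row]:
    """Keep only the named columns in every row (hypothetical helper)."""
    return [{c: row[c] for c in cols} for row in rows]


def distinct(rows: List[Row]) -> List[Row]:
    """Return unique rows, for single-column data only (hypothetical helper)."""
    columns = list(rows[0].keys()) if rows else []
    if len(columns) > 1:
        # Mirrors the restriction in the PR description: one column at most.
        raise ValueError(
            "distinct only supports a single column; select one column first."
        )
    if not columns:
        return []
    key = columns[0]
    # dict.fromkeys deduplicates while preserving first-seen order.
    seen = dict.fromkeys(row[key] for row in rows)
    return [{key: value} for value in seen]


rows = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}, {"a": 1, "b": "z"}]

# Calling distinct(rows) directly would raise ValueError (two columns).
unique_a = distinct(select_columns(rows, ["a"]))
print(unique_a)
```

The ordering of the result in this sketch follows first appearance; the real API may order its output differently.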

Related issue number

Closes #32984

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Hao Chen <chenh1024@gmail.com>
Contributor

pcmoritz left a comment

Thanks a lot for implementing this! :)

raulchen added 3 commits June 21, 2023 14:12
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Contributor

c21 left a comment

Thanks @raulchen! LGTM w/ minor comments.

"`distinct` currently only supports Datasets with one single column, "
"please apply `select_columns` before `distinct`."
)
return self.groupby(columns[0]).count().drop_columns(["count()"])
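
The return line above deduplicates by grouping on the single column, counting each group, and then discarding the counts. A rough plain-Python model of that idea follows; `groupby_count` and `drop_columns` are hypothetical helpers written for this sketch, not the Ray implementations, and the `"count()"` column name is taken from the snippet under review.

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]


def groupby_count(rows: List[Row], key: str) -> List[Row]:
    """Emit one row per distinct key value, plus a "count()" column."""
    counts = Counter(row[key] for row in rows)
    return [{key: value, "count()": n} for value, n in counts.items()]


def drop_columns(rows: List[Row], cols: List[str]) -> List[Row]:
    """Remove the named columns from every row."""
    return [{k: v for k, v in row.items() if k not in cols} for row in rows]


data = [{"a": 3}, {"a": 1}, {"a": 3}, {"a": 2}, {"a": 1}]

grouped = groupby_count(data, "a")            # one row per distinct value of "a"
deduped = drop_columns(grouped, ["count()"])  # keep only the key column
print(deduped)
```

Because grouping already yields exactly one row per distinct key, dropping the aggregate column is all that is left to do.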
Contributor

Shall we add a TODO to implement an aggregate function for distinct, so that we don't need to calculate the count?

Contributor

Instead of drop_columns(["count()"]), can we call select_columns(columns[0]), so that we don't rely on the implicit naming of count()?
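
The concern can be illustrated with a small plain-Python sketch. It assumes, purely for illustration, that the aggregate column is one day named `"count"` instead of `"count()"`. The helpers are hypothetical, and this toy `drop_columns` silently ignores unknown names, whereas a real implementation might raise instead.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def drop_columns(rows: List[Row], cols: List[str]) -> List[Row]:
    """Remove the named columns; unknown names are ignored in this toy version."""
    return [{k: v for k, v in row.items() if k not in cols} for row in rows]


def select_columns(rows: List[Row], cols: List[str]) -> List[Row]:
    """Keep only the named columns."""
    return [{c: row[c] for c in cols} for row in rows]


# Hypothetical grouped output after the aggregate column has been renamed.
grouped = [{"a": 1, "count": 2}, {"a": 2, "count": 1}]

via_drop = drop_columns(grouped, ["count()"])  # stale name: "count" is left behind
via_select = select_columns(grouped, ["a"])    # depends only on the key column

print(via_drop)
print(via_select)
```

Selecting by the key column names only what the caller already knows, so the result does not change if the aggregate's output name does.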

Contributor Author

> Shall we add a TODO to implement an aggregate function for distinct, so that we don't need to calculate the count?

I considered this initially, but since count is already very cheap, there is probably no big benefit to implementing a standalone distinct function.
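
As a rough single-process illustration of this trade-off, the sketch below compares a count-based distinct with a key-only one. Both function names are hypothetical, and the cost profile of a real distributed groupby may differ from this toy version.

```python
from typing import Any, Dict, Iterable, List


def distinct_via_count(values: Iterable[Any]) -> List[Any]:
    """Aggregate a count per key, then discard the counts."""
    counts: Dict[Any, int] = {}
    for v in values:
        # The only extra work compared with the key-only version: one integer add.
        counts[v] = counts.get(v, 0) + 1
    return list(counts)


def distinct_standalone(values: Iterable[Any]) -> List[Any]:
    """Track keys only, with no per-key counter."""
    seen: Dict[Any, None] = {}
    for v in values:
        seen.setdefault(v, None)
    return list(seen)


values = ["x", "y", "x", "z", "y"]
print(distinct_via_count(values))
print(distinct_standalone(values))
```

Both versions make a single pass and keep one entry per distinct key; the counting variant carries one extra integer per key.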

c21 commented Jun 21, 2023

Oh one more thing - please update the API reference doc to include this new API, thanks.

raulchen added 2 commits June 22, 2023 13:15
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
@raulchen
Contributor Author

> Oh one more thing - please update the API reference doc to include this new API, thanks.

@c21 I added a new item in doc/source/data/api/dataset.rst. Please let me know if I missed anything else.

raulchen and others added 4 commits June 22, 2023 13:21
Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
pcmoritz merged commit 0f9e9f9 into ray-project:master on Jun 23, 2023
raulchen deleted the distinct branch on June 23, 2023 17:13
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Implement `Dataset.distinct`. Currently this API only supports Datasets with one single column. This is because `groupby` doesn't support multiple columns yet.

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Development

Successfully merging this pull request may close these issues.

[Data] Add ds.distinct() API to get unique values in a column.
4 participants