-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[data] Implement Dataset.distinct #36655
Conversation
Signed-off-by: Hao Chen <chenh1024@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for implementing this! :)
Signed-off-by: Hao Chen <chenh1024@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @raulchen! LGTM w/ minor comments.
python/ray/data/dataset.py
Outdated
"`distinct` currently only suports Datasets with one single column, " | ||
"please apply `select_columns` before `distinct`." | ||
) | ||
return self.groupby(columns[0]).count().drop_columns(["count()"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we add a TODO to implement an aggregate function for distinct, so we don't need to calculate count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of drop_columns(["count()"])
, can we call select_columns(columns[0])
, so we don't rely on the implicit naming of count()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we add a TODO to implement an aggregate function for distinct, so we don't need to calculate count?
I considered this initially. but considering count is already very cheap, this is probably no big benefit to implement a standalone distinct function.
Oh one more thing - please update the API reference doc to include this new API, thanks. |
Signed-off-by: Hao Chen <chenh1024@gmail.com>
@c21 I added a new item in |
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Implement `Dataset.distinct`. Currently this API only supports Datasets with one single column. This is because `groupby` doesn't support multiple columns yet. Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Why are these changes needed?
Implement
Dataset.distinct
. Currently this API only supports Datasets with one single column. This is becausegroupby
doesn't support multiple columns yet.Related issue number
Closes #32984
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.