[data] Implement Dataset.distinct #36655

raulchen · 2023-06-21T17:43:37Z

Why are these changes needed?

Implement Dataset.distinct. Currently this API only supports Datasets with one single column. This is because groupby doesn't support multiple columns yet.
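
For illustration only, the intended semantics can be modelled in plain Python. This is a minimal sketch, not Ray's implementation: rows are assumed to be plain dicts, `distinct_rows` is a hypothetical helper name, and the choice of `NotImplementedError` for the multi-column case is an assumption. The sketch happens to keep first-seen order; nothing in this PR says the real API guarantees any ordering.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def distinct_rows(rows: List[Row]) -> List[Row]:
    """Hypothetical stand-in: unique values of a single-column dataset."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    if len(columns) != 1:
        # Mirrors the single-column restriction described above
        # (the exception type is an assumption).
        raise NotImplementedError(
            "`distinct` currently only supports Datasets with one single "
            "column, please apply `select_columns` before `distinct`."
        )
    col = columns[0]
    seen = set()
    unique: List[Row] = []
    for row in rows:
        value = row[col]
        if value not in seen:  # keep only the first occurrence
            seen.add(value)
            unique.append({col: value})
    return unique


result = distinct_rows([{"a": 1}, {"a": 2}, {"a": 1}])
```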

Related issue number

Closes #32984

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Hao Chen <chenh1024@gmail.com>

pcmoritz

Thanks a lot for implementing this! :)

c21

Thanks @raulchen! LGTM w/ minor comments.

c21 · 2023-06-21T22:27:29Z

python/ray/data/dataset.py

+                "`distinct` currently only supports Datasets with one single column, "
+                "please apply `select_columns` before `distinct`."
+            )
+        return self.groupby(columns[0]).count().drop_columns(["count()"])


shall we add a TODO to implement an aggregate function for distinct, so we don't need to calculate count?

instead of drop_columns(["count()"]), can we call select_columns(columns[0]), so we don't rely on the implicit naming of count()?

shall we add a TODO to implement an aggregate function for distinct, so we don't need to calculate count?

I considered this initially, but since count is already very cheap, there is probably no big benefit to implementing a standalone distinct function.
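
To make the `drop_columns` vs. `select_columns` point concrete, here is a small plain-Python model of the group, count, then project approach discussed in this thread. It is a sketch under stated assumptions, not Ray code: `groupby_count`, `drop_columns`, and `select_columns` below are hypothetical stand-ins operating on lists of dicts, and this `drop_columns` silently ignores unknown column names, which may differ from the real method.

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]


def groupby_count(rows: List[Row], key: str, count_name: str = "count()") -> List[Row]:
    """Hypothetical model of groupby(key).count(): one row per distinct key."""
    counts = Counter(row[key] for row in rows)
    return [{key: k, count_name: n} for k, n in counts.items()]


def drop_columns(rows: List[Row], cols: List[str]) -> List[Row]:
    """Remove the named columns; unknown names are ignored in this sketch."""
    return [{k: v for k, v in row.items() if k not in cols} for row in rows]


def select_columns(rows: List[Row], cols: List[str]) -> List[Row]:
    """Keep only the named columns."""
    return [{k: row[k] for k in cols} for row in rows]


rows = [{"a": 1}, {"a": 2}, {"a": 1}]

# With the aggregate column named "count()", both projections agree.
grouped = groupby_count(rows, "a")
via_drop = drop_columns(grouped, ["count()"])
via_select = select_columns(grouped, ["a"])

# If the implicit aggregate name ever changed (here: "count"), dropping by
# the old name would leave the count column in the output, while selecting
# the key column would be unaffected.
renamed = groupby_count(rows, "a", count_name="count")
leaky = drop_columns(renamed, ["count()"])
robust = select_columns(renamed, ["a"])
```

Selecting the key column depends only on a name the caller already knows, which is the motivation given in the review comment for preferring it over dropping `count()`.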

c21 · 2023-06-21T22:30:09Z

Oh one more thing - please update the API reference doc to include this new API, thanks.

raulchen · 2023-06-22T20:20:11Z

Oh one more thing - please update the API reference doc to include this new API, thanks.

@c21 I added a new item in doc/source/data/api/dataset.rst, please let me know if I missed something else.

Implement `Dataset.distinct`. Currently this API only supports Datasets with one single column. This is because `groupby` doesn't support multiple columns yet.
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>

Implement Dataset.distinct

fb067e8

Signed-off-by: Hao Chen <chenh1024@gmail.com>

raulchen requested review from ericl, scv119, c21, amogkam, scottjlee and bveeramani as code owners June 21, 2023 17:43

raulchen assigned ericl, pcmoritz and c21 Jun 21, 2023

pcmoritz approved these changes Jun 21, 2023

raulchen added 3 commits June 21, 2023 14:12

fix

2569947

Signed-off-by: Hao Chen <chenh1024@gmail.com>

Merge branch 'master' into distinct

75fadd8

refine doc

776d512

Signed-off-by: Hao Chen <chenh1024@gmail.com>

c21 approved these changes Jun 21, 2023

ericl approved these changes Jun 22, 2023

raulchen added 2 commits June 22, 2023 13:15

select_columns

3d59fbd

Signed-off-by: Hao Chen <chenh1024@gmail.com>

api doc

65d88ea

Signed-off-by: Hao Chen <chenh1024@gmail.com>

raulchen and others added 4 commits June 22, 2023 13:21

lint

5a80393

Signed-off-by: Hao Chen <chenh1024@gmail.com>

Merge branch 'master' into distinct

3d62654

fix small typo

1c382e0

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>

fix lint and formatting

0b72fa9

pcmoritz merged commit 0f9e9f9 into ray-project:master Jun 23, 2023

raulchen deleted the distinct branch June 23, 2023 17:13

akshay-anyscale mentioned this pull request Jul 21, 2023

Add service deployment instructions to stable diffusion template #37645

Closed

8 tasks
