[Datasets] Support select_columns to select subset of columns #27667

Closed
c21 opened this issue Aug 8, 2022 · 7 comments · Fixed by #29081

Comments

@c21
Contributor

c21 commented Aug 8, 2022

Description

Datasets have add_column() and drop_columns() to add or drop columns, but these are not flexible enough when a user wants to select a subset of existing columns. We can provide a new select_columns API for this, and also deprecate the existing add_column()/drop_columns() APIs.
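To illustrate the gap, here is a minimal sketch that uses a plain pandas DataFrame as a stand-in for a dataset batch. It is not the Ray Datasets API, and the column names are made up for the example. It shows that a drop-only interface forces the caller to know the full schema in order to keep a subset.

```python
import pandas as pd

# Hypothetical batch; the column names are illustrative only.
df = pd.DataFrame(
    {"id": [1, 2], "name": ["a", "b"], "score": [0.5, 0.9], "ts": [10, 20]}
)

wanted = ["id", "score"]

# Drop-only approach: the caller must enumerate every column that is NOT wanted,
# which requires knowing the full schema up front.
to_drop = [c for c in df.columns if c not in wanted]
via_drop = df.drop(columns=to_drop)

# Select approach: the caller names only the columns to keep.
via_select = df[wanted]
```

Both results hold the same two columns; the select form simply states the intent directly.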

Use case

Better UX when a user manipulates a subset of columns.

@c21 c21 added enhancement Request for new feature and/or capability P1 Issue that should be fixed within a few weeks data Ray Data-related issues labels Aug 8, 2022
@c21
Contributor Author

c21 commented Aug 8, 2022

In addition, as discussed, lazy-first execution plus indexing on column names would be a great UX boost too, but it's not urgent.

@c21
Contributor Author

c21 commented Oct 3, 2022

Just FYI, one more user request came in at https://ray-distributed.slack.com/archives/C02PHB3SQHH/p1664802732112309. This should be prioritized.

@chongxiaoc

Anyone who has bandwidth in the near future should feel free to grab this task.

@c21 c21 added the good-first-issue Great starter issue for someone just starting to contribute to Ray label Oct 4, 2022
@heyitsmui
Contributor

heyitsmui commented Oct 4, 2022

Took a quick look. @c21, for select_columns, can we just do something like ds.map_batches(lambda batch: batch.filter(items=[...])) to select the columns if batch_format==pandas.DataFrame, and ds.map_batches(lambda batch: batch.select(...)) if batch_format==pyarrow.Table? (We could address lazy-first separately.)

@jianoaix
Contributor

jianoaix commented Oct 4, 2022

This should now be even easier to implement with the Block API's select(): https://sourcegraph.com/github.com/ray-project/ray@master/-/blob/python/ray/data/block.py?L279

@heyitsmui
Contributor

Ah, so something like map_batches(lambda batch: BlockAccessor.for_block(batch).select(...)), and the block API should handle both batch formats.

@heyitsmui
Contributor

heyitsmui commented Oct 4, 2022

@jianoaix can you take a look at this draft PR: 11457e8? I have some quick questions on there for you as well, thanks!
