[Datasets] Support `select_columns` to select subset of columns #27667

c21 · 2022-08-08T21:54:27Z

Description

Datasets have add_column() and drop_columns() to add or drop columns, but these are not flexible enough when a user wants to select a subset of existing columns. We can provide a new select_columns API for this, and also deprecate the existing add/drop_column API.

Use case

Better UX when a user manipulates a subset of columns.


c21 · 2022-08-08T21:56:37Z

In addition, as discussed, lazy-first execution plus indexing on column names would be a great UX boost too, but it's not urgent.

c21 · 2022-10-03T15:32:05Z

Just FYI one more user request in https://ray-distributed.slack.com/archives/C02PHB3SQHH/p1664802732112309 . This should be prioritized.

chongxiaoc · 2022-10-03T20:03:39Z

Anyone who has bandwidth in the near future, feel free to grab this task.

heyitsmui · 2022-10-04T16:39:27Z

Took a quick look. @c21, for select_columns can we just do something like `ds.map_batches(lambda batch: batch.filter(items=[...]))` to select the columns if batch_format==pandas.DataFrame, and `ds.map_batches(lambda batch: batch.select(...))` if batch_format==pyarrow.Table? (And address lazy-first separately.)

jianoaix · 2022-10-04T17:05:07Z

This should now be even easier to implement with the Block API's select(): https://sourcegraph.com/github.com/ray-project/ray@master/-/blob/python/ray/data/block.py?L279

heyitsmui · 2022-10-04T18:04:56Z

Ah, so something like `map_batches(lambda batch: BlockAccessor.for_block(batch).select(...))`, and the Block API should handle both batch formats.

heyitsmui · 2022-10-04T21:58:30Z

@jianoaix can you take a look at this draft PR: 11457e8? I have some quick questions on there for you as well, thanks!

c21 added the enhancement, P1, and data labels Aug 8, 2022

jianoaix assigned chongxiaoc Sep 19, 2022

c21 added the good-first-issue label Oct 4, 2022

heyitsmui mentioned this issue Oct 5, 2022

[datasets] Add select_columns API to allow users to select a subset of columns #29081

Merged

7 tasks

jianoaix assigned heyitsmui and unassigned chongxiaoc Oct 14, 2022

clarkzinzow closed this as completed in #29081 Oct 26, 2022

jianoaix mentioned this issue Oct 26, 2022

Fix docstring bug in select_columns() #29728

Merged

7 tasks
