[Data] Support partition_cols in write_parquet #49411

gvspraveen · 2024-12-23T18:34:13Z

Why are these changes needed?

Supports writing Hive-style partitioned data in write_parquet.
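For readers unfamiliar with the layout, here is a minimal sketch of how a Hive-style partition directory is derived from a row. The helper name `hive_partition_dir` is hypothetical and is not part of this PR or of Ray; percent-encoding the values is an assumption made only so the example produces safe path segments.

```python
from urllib.parse import quote


def hive_partition_dir(base, partition_cols, row):
    """Build a Hive-style directory such as base/year=2024/month=12.

    Hypothetical helper for illustration only; not Ray's implementation.
    """
    # Each partition column contributes one "name=value" path segment,
    # in the order the columns are listed.
    parts = [f"{col}={quote(str(row[col]), safe='')}" for col in partition_cols]
    return "/".join([base.rstrip("/"), *parts])


row = {"year": 2024, "month": 12, "value": 3.5}
path = hive_partition_dir("/tmp/out", ["year", "month"], row)
print(path)
```

Files for this row would then live under that directory, and the partition columns themselves would typically be omitted from the file contents because the path already encodes them.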

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

richardliaw · 2024-12-23T18:38:14Z

Make sure to get your DCO set up properly and rebase, or else we can't merge.

bveeramani

LGTM.

@gvspraveen have we tested the performance of this parameter at a medium or large scale?

bveeramani · 2024-12-23T22:08:34Z

python/ray/data/_internal/datasource/parquet_datasink.py

+        output_schema = pa.schema(
+            [field for field in output_schema if field.name not in self.partition_cols]
+        )


Suggested change

-        output_schema = pa.schema(
-            [field for field in output_schema if field.name not in self.partition_cols]
-        )
+        output_schema = pa.schema(table_fields)

bveeramani · 2024-12-23T22:10:11Z

python/ray/data/_internal/datasource/parquet_datasink.py

+            values = [
+                groups.column(f"{col.name}_list")[i].values for col in table_fields
+            ]


I was confused because "col" refers to the string column name in non_partition_cols but to a field in this context.

Suggested change

-            values = [
-                groups.column(f"{col.name}_list")[i].values for col in table_fields
-            ]
+            values = [
+                groups.column(f"{field.name}_list")[i].values for field in table_fields
+            ]

bveeramani · 2024-12-23T22:11:36Z

python/ray/data/dataset.py

+            partition_cols: Column names by which to partition the dataset.
+                Files are writted in Hive partition style.


Nit: Use active voice (from our style guide: https://developers.google.com/style/voice)

Also, typo with "writted"

Suggested change

-            partition_cols: Column names by which to partition the dataset.
-                Files are writted in Hive partition style.
+            partition_cols: Column names by which to partition the dataset.
+                This method writes files in Hive partition style.

bveeramani · 2024-12-23T22:12:35Z

python/ray/data/datasource/file_datasink.py

@@ -79,6 +79,9 @@ def open_output_stream(self, path: str) -> "pyarrow.NativeFile":
        return self.filesystem.open_output_stream(path, **self.open_stream_args)

    def on_write_start(self) -> None:
+        self.has_created_dir = self._create_dir(self.path)
+
+    def _create_dir(self, dest) -> bool:


Do we use the bool return value anywhere? If not, should this just be None?

Suggested change

-    def _create_dir(self, dest) -> bool:
+    def _create_dir(self, dest) -> None:

schmidt-ai · 2025-02-12T19:17:50Z

Just out of curiosity, why not use pyarrow.parquet.write_to_dataset?

richardliaw · 2025-02-12T19:33:20Z

Hey @schmidt-ai -- the reason was primarily that it didn't support some of the other options we had exposed (some of the open_stream_args, for example). It was easier for us to support the partitioning manually -- if you run into issues, please help us by opening an issue!
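As a rough illustration of what "supporting the partitioning manually" involves, the sketch below groups rows by their partition-column values and drops those columns from the data to be written. It uses plain Python dictionaries rather than Arrow tables, and the function name `split_by_partition` is hypothetical; this is not the code merged in this PR.

```python
from collections import defaultdict


def split_by_partition(rows, partition_cols):
    """Group rows by partition values and strip the partition columns.

    Hypothetical, simplified stand-in for the grouping a partitioned
    writer performs before writing one file per group.
    """
    groups = defaultdict(list)
    for row in rows:
        # The tuple of partition values identifies the output directory.
        key = tuple(row[col] for col in partition_cols)
        # The remaining columns are what would be written into the file.
        data = {name: value for name, value in row.items() if name not in partition_cols}
        groups[key].append(data)
    return dict(groups)


rows = [
    {"country": "US", "id": 1},
    {"country": "DE", "id": 2},
    {"country": "US", "id": 3},
]
groups = split_by_partition(rows, ["country"])
for key, data in groups.items():
    print(key, data)
```

Doing the grouping by hand keeps the per-file write step under the writer's control, which is where options such as custom stream arguments would be applied.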

Signed-off-by: Puyuan Yao <williamyao034@gmail.com>

[Data] Support partition_cols in write_parquet

eade0f3

gvspraveen requested review from raulchen and richardliaw December 23, 2024 18:34

gvspraveen requested a review from a team as a code owner December 23, 2024 18:34

gvspraveen requested a review from bveeramani December 23, 2024 18:35

richardliaw approved these changes Dec 23, 2024

richardliaw added the "go" label (add ONLY when ready to merge, run all tests) Dec 23, 2024

gvspraveen added 2 commits December 23, 2024 11:08

Fix format

5e2ffa0

Merge branch 'master' into pg-partition-write

8851c27

bveeramani approved these changes Dec 23, 2024

Merge branch 'master' into pg-partition-write

3e35bd8

richardliaw merged commit 2f8d35c into master Dec 24, 2024
5 checks passed

richardliaw deleted the pg-partition-write branch December 24, 2024 04:50

richardliaw mentioned this pull request Dec 27, 2024

[Datasets] Write out directory-partitioned datasets #24879

Closed

srinathk10 pushed a commit that referenced this pull request Jan 3, 2025

[Data] Support partition_cols in write_parquet (#49411)

edbb1c2

anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025

[Data] Support partition_cols in write_parquet (ray-project#49411)

cd69dfd

Signed-off-by: Puyuan Yao <williamyao034@gmail.com>
