Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Support partition_cols in write_parquet #49411

Merged
merged 4 commits into from
Dec 24, 2024
Merged

Conversation

gvspraveen
Copy link
Contributor

@gvspraveen gvspraveen commented Dec 23, 2024

Why are these changes needed?

Supports hive styled partitioned data in write_parquet

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@gvspraveen gvspraveen requested a review from a team as a code owner December 23, 2024 18:34
@richardliaw richardliaw added the go add ONLY when ready to merge, run all tests label Dec 23, 2024
@richardliaw
Copy link
Contributor

richardliaw commented Dec 23, 2024

make sure to get your dco setup properly and rebase or else we cant merge

Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@gvspraveen have we tested the performance of this parameter at a medium or large scale?

Comment on lines +135 to +137
output_schema = pa.schema(
[field for field in output_schema if field.name not in self.partition_cols]
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output_schema = pa.schema(
[field for field in output_schema if field.name not in self.partition_cols]
)
output_schema = pa.schema(table_fields)

Comment on lines +153 to +155
values = [
groups.column(f"{col.name}_list")[i].values for col in table_fields
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was confused because "col" refers to the string column name in non_partition_cols but refers to a field in this context

Suggested change
values = [
groups.column(f"{col.name}_list")[i].values for col in table_fields
]
values = [
groups.column(f"{field.name}_list")[i].values for field in table_fields
]

Comment on lines +2970 to +2971
partition_cols: Column names by which to partition the dataset.
Files are writted in Hive partition style.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Use active voice (from our style guide: https://developers.google.com/style/voice)

Also, typo with "writted"

Suggested change
partition_cols: Column names by which to partition the dataset.
Files are writted in Hive partition style.
partition_cols: Column names by which to partition the dataset.
This methods writes files in Hive partition style.

@@ -79,6 +79,9 @@ def open_output_stream(self, path: str) -> "pyarrow.NativeFile":
return self.filesystem.open_output_stream(path, **self.open_stream_args)

def on_write_start(self) -> None:
self.has_created_dir = self._create_dir(self.path)

def _create_dir(self, dest) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use the bool return value anywhere? If not, should this just be None?

Suggested change
def _create_dir(self, dest) -> bool:
def _create_dir(self, dest) -> None:

@richardliaw richardliaw merged commit 2f8d35c into master Dec 24, 2024
5 checks passed
@richardliaw richardliaw deleted the pg-partition-write branch December 24, 2024 04:50
@schmidt-ai
Copy link

Just out of curiosity, why not use pyarrow.parquet.write_to_dataset?

@richardliaw
Copy link
Contributor

Hey @schmidt-ai -- reason was primarily because it didn't support some of the other functions we had exposed (some of the open_stream_args, for example). Was easier for us to support the partitioning manually -- if you run into issues, please help us open a issue!

anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025
Signed-off-by: Puyuan Yao <williamyao034@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants