[Data] Support partition_cols in write_parquet #49411
Conversation
Make sure to get your DCO set up properly and rebase, or else we can't merge.
LGTM.
@gvspraveen have we tested the performance of this parameter at a medium or large scale?
```python
output_schema = pa.schema(
    [field for field in output_schema if field.name not in self.partition_cols]
)
```

Suggested change:

```python
output_schema = pa.schema(table_fields)
```
```python
values = [
    groups.column(f"{col.name}_list")[i].values for col in table_fields
]
```
Was confused because `col` refers to the string column name in `non_partition_cols`, but refers to a field in this context.
Suggested change:

```python
values = [
    groups.column(f"{field.name}_list")[i].values for field in table_fields
]
```
```
partition_cols: Column names by which to partition the dataset.
    Files are writted in Hive partition style.
```
Nit: Use active voice (from our style guide: https://developers.google.com/style/voice)
Also, typo with "writted"
Suggested change:

```
partition_cols: Column names by which to partition the dataset.
    This method writes files in Hive partition style.
```
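To make "Hive partition style" concrete: partition values become `key=value` directory segments rather than columns inside the files. A minimal stdlib-only sketch of that path layout, with a hypothetical helper name and example values:

```python
import posixpath

def hive_partition_path(base, partition_cols, row):
    """Hypothetical helper: build a Hive-style directory path, e.g.
    base/year=2024/country=US, from a row's partition values."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return posixpath.join(base, *parts)

path = hive_partition_path("out", ["year", "country"], {"year": 2024, "country": "US"})
print(path)  # out/year=2024/country=US
```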
```diff
@@ -79,6 +79,9 @@ def open_output_stream(self, path: str) -> "pyarrow.NativeFile":
         return self.filesystem.open_output_stream(path, **self.open_stream_args)

+    def on_write_start(self) -> None:
+        self.has_created_dir = self._create_dir(self.path)
+
     def _create_dir(self, dest) -> bool:
```
Do we use the `bool` return value anywhere? If not, should this just be `None`?
Suggested change:

```python
def _create_dir(self, dest) -> None:
```
Just out of curiosity, why not use …
Hey @schmidt-ai -- the reason was primarily that it didn't support some of the other functions we had exposed (some of the open_stream_args, for example). It was easier for us to support the partitioning manually -- if you run into issues, please help us by opening an issue!
Signed-off-by: Puyuan Yao <williamyao034@gmail.com>
Why are these changes needed?
Supports Hive-style partitioned data in `write_parquet`.
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- For the method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.