
Support writing hive style partitioned files in DataFrame::write command #9237

Closed · alamb opened this issue Feb 15, 2024 · 3 comments · Fixed by #9316

Labels: enhancement (New feature or request)

Comments

alamb (Contributor) commented Feb 15, 2024

Is your feature request related to a problem or challenge?

@Omega359 asked on discord: https://discord.com/channels/885562378132000778/1166447479609376850/1207458257874984970

Q: Is there a way to write out a dataframe to parquet with hive-style partitioning without having to create a table provider? I am pretty sure that a ListingTableProvider or a custom table provider will work but that seems like a ton of config for this
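
For context, the table-provider route mentioned above looks roughly like the sketch below. This is a sketch only (exact signatures such as register_listing_table and with_table_partition_cols can vary by DataFusion version), but it illustrates how much setup is involved:

```rust
use std::sync::Arc;

use datafusion::arrow::datatypes::DataType;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::prelude::*;

async fn write_via_listing_table(
    ctx: &SessionContext,
    df: DataFrame,
) -> datafusion::error::Result<()> {
    // Declare the partition columns up front on a ListingTable over the
    // target directory...
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()))
        .with_table_partition_cols(vec![("col_a".to_string(), DataType::Utf8)]);
    ctx.register_listing_table("my_table", "/tmp/my_table", options, None, None)
        .await?;

    // ...and only then insert the DataFrame into the registered table.
    df.write_table("my_table", DataFrameWriteOptions::new()).await?;
    Ok(())
}
```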

Describe the solution you'd like

I would like to be able to use DataFrame::write_parquet and the other APIs to write partitioned files

I suggest adding the table_partition_cols from ListingOptions as one of the options on https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrameWriteOptions.html

The partition information would be specified as described in ListingOptions::with_table_partition_cols.

That would look something like this:

```rust
let options = DataFrameWriteOptions::new()
    .with_table_partition_cols(vec![
        ("col_a".to_string(), DataType::Utf8),
    ]);

// Write the DataFrame to parquet, producing files like:
// /tmp/my_table/col_a=foo/12345.parquet (data with 'foo' in column a)
// ...
// /tmp/my_table/col_a=zoo/12345.parquet (data with 'zoo' in column a)
df.write_parquet("/tmp/my_table", &options, None).await?;
```

Describe alternatives you've considered

No response

Additional context

Possibly related to #8493

alamb added the enhancement label Feb 15, 2024
devinjdangelo (Contributor) commented Feb 15, 2024

DataFrame::write_parquet and related methods use the COPY logical/physical plans under the hood, so if we knock out #8493, this ticket should come almost for free.
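
For illustration, the SQL-level equivalent once #8493 lands would be a partitioned COPY, roughly like this (a sketch; the exact clause spelling, e.g. STORED AS versus a FORMAT option, varied across DataFusion versions while this work was in flight):

```rust
// Hypothetical: a partitioned COPY issued through the SQL interface.
// `source_table` must already be registered with the SessionContext.
let results = ctx
    .sql("COPY source_table TO '/tmp/my_table' STORED AS PARQUET PARTITIONED BY (col_a)")
    .await?
    .collect()
    .await?;
```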

devinjdangelo (Contributor) commented
I went ahead and implemented this and #8493 in #9240. Let me know if it looks good to you @alamb .

alamb (Contributor, Author) commented Feb 19, 2024

@devinjdangelo implemented the code in #9240.

In order to close this ticket, we just need to add test coverage for writing partitioned parquet via DataFrame::write_parquet.

My suggestion is:

  1. Move the existing tests at https://github.com/apache/arrow-datafusion/blob/4d389c2590370d85bfe3af77f5243d5b40f5a222/datafusion/core/src/datasource/physical_plan/parquet/mod.rs#L2070 into the dataframe tests at https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs
  2. Add a new test in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs following the same model, to verify the parquet files were written:

The new test could basically do the same thing as the tests added in
https://github.com/apache/arrow-datafusion/pull/9240/files#diff-b7d6c89870d082cac4ecc6de05f2ec393559327472fc4a846986f02c812f661fR34 (see the sketch after this list):

  1. Write to a partitioned table
  2. Read back from the table to ensure all the data went there
  3. Read back from one of the partitions to ensure the data was actually partitioned
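
A minimal sketch of such a test, assuming the with_partition_by option that #9240 added to DataFrameWriteOptions; the test name, CSV path, and partition value 'foo' are illustrative only:

```rust
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::test]
async fn write_partitioned_parquet_roundtrip() -> Result<()> {
    let ctx = SessionContext::new();
    let df = ctx
        .read_csv("tests/data/example.csv", CsvReadOptions::new())
        .await?;

    // 1. Write to a partitioned table.
    let options =
        DataFrameWriteOptions::new().with_partition_by(vec!["col_a".to_string()]);
    df.write_parquet("/tmp/my_table", options, None).await?;

    // 2. Read back from the table root to ensure all the data went there.
    let total = ctx
        .read_parquet("/tmp/my_table", ParquetReadOptions::default())
        .await?
        .count()
        .await?;
    assert!(total > 0);

    // 3. Read back one partition to ensure the data was actually partitioned
    //    (assumes the input has more than one distinct value in col_a).
    let partition = ctx
        .read_parquet("/tmp/my_table/col_a=foo", ParquetReadOptions::default())
        .await?
        .count()
        .await?;
    assert!(partition < total);

    Ok(())
}
```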

tshauck added a commit to tshauck/arrow-datafusion that referenced this issue Feb 22, 2024
alamb pushed a commit that referenced this issue Feb 26, 2024:

* tests: adds tests associated with #9237
* style: clippy