
Support writing hive style partitioned files in DataFrame::write command #9237

Closed · alamb opened this issue Feb 15, 2024 · 3 comments · Fixed by #9316

Labels: enhancement (New feature or request)

Comments

alamb (Contributor) commented Feb 15, 2024

Is your feature request related to a problem or challenge?

@Omega359 asked on discord: https://discord.com/channels/885562378132000778/1166447479609376850/1207458257874984970

Q: Is there a way to write out a dataframe to parquet with hive-style partitioning without having to create a table provider? I am pretty sure that a ListingTableProvider or a custom table provider will work but that seems like a ton of config for this
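
For context, the table-provider route mentioned above looks roughly like the sketch below. This is a sketch only (exact signatures such as register_listing_table and with_table_partition_cols can vary by DataFusion version), but it illustrates how much setup is involved:

```rust
use std::sync::Arc;

use datafusion::arrow::datatypes::DataType;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::prelude::*;

async fn write_via_listing_table(
    ctx: &SessionContext,
    df: DataFrame,
) -> datafusion::error::Result<()> {
    // Declare the partition columns up front on a ListingTable over the
    // target directory...
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()))
        .with_table_partition_cols(vec![("col_a".to_string(), DataType::Utf8)]);
    ctx.register_listing_table("my_table", "/tmp/my_table", options, None, None)
        .await?;

    // ...and only then insert the DataFrame into the registered table.
    df.write_table("my_table", DataFrameWriteOptions::new()).await?;
    Ok(())
}
```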

Describe the solution you'd like

I would like to be able to use DataFrame::write_parquet and the other APIs to write partitioned files

I suggest adding the table_partition_cols from ListingOptions as one of the options on https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrameWriteOptions.html

The partition information would be specified as described in ListingOptions::with_table_partition_cols.

That would look something like this:

```rust
let options = DataFrameWriteOptions::new()
    .with_table_partition_cols(vec![
        ("col_a".to_string(), DataType::Utf8),
    ]);

// Write the DataFrame to parquet, producing files like:
// /tmp/my_table/col_a=foo/12345.parquet (data with 'foo' in column a)
// ...
// /tmp/my_table/col_a=zoo/12345.parquet (data with 'zoo' in column a)
df.write_parquet("/tmp/my_table", &options, None).await?;
```

Describe alternatives you've considered

No response

Additional context

Possibly related to #8493

alamb added the enhancement label Feb 15, 2024
devinjdangelo (Contributor) commented Feb 15, 2024

DataFrame::write_parquet and related methods use the COPY logical/physical plans under the hood, so if we knock out #8493, this ticket should come almost for free.
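
For illustration, the SQL-level equivalent once #8493 lands would be a partitioned COPY, roughly like this (a sketch; the exact clause spelling, e.g. STORED AS versus a FORMAT option, varied across DataFusion versions while this work was in flight):

```rust
// Hypothetical: a partitioned COPY issued through the SQL interface.
// `source_table` must already be registered with the SessionContext.
let results = ctx
    .sql("COPY source_table TO '/tmp/my_table' STORED AS PARQUET PARTITIONED BY (col_a)")
    .await?
    .collect()
    .await?;
```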

devinjdangelo (Contributor) commented
I went ahead and implemented this and #8493 in #9240. Let me know if it looks good to you @alamb .

alamb (Contributor, Author) commented Feb 19, 2024

@devinjdangelo implemented the code in #9240.

In order to close this ticket, we just need to add test coverage for writing partitioned parquet via DataFrame::write_parquet.

My suggestion is:

  1. Move the existing tests at https://github.com/apache/arrow-datafusion/blob/4d389c2590370d85bfe3af77f5243d5b40f5a222/datafusion/core/src/datasource/physical_plan/parquet/mod.rs#L2070 into the dataframe tests at https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs
  2. Add a new test in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/mod.rs following the same model, to verify the parquet files were written:

The new test could basically do the same thing as the tests added in
https://github.com/apache/arrow-datafusion/pull/9240/files#diff-b7d6c89870d082cac4ecc6de05f2ec393559327472fc4a846986f02c812f661fR34 (see the sketch after this list):

  1. Write to a partitioned table
  2. Read back from the table to ensure all the data went there
  3. Read back from one of the partitions to ensure the data was actually partitioned
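
A minimal sketch of such a test, assuming the with_partition_by option that #9240 added to DataFrameWriteOptions; the test name, CSV path, and partition value 'foo' are illustrative only:

```rust
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::test]
async fn write_partitioned_parquet_roundtrip() -> Result<()> {
    let ctx = SessionContext::new();
    let df = ctx
        .read_csv("tests/data/example.csv", CsvReadOptions::new())
        .await?;

    // 1. Write to a partitioned table.
    let options =
        DataFrameWriteOptions::new().with_partition_by(vec!["col_a".to_string()]);
    df.write_parquet("/tmp/my_table", options, None).await?;

    // 2. Read back from the table root to ensure all the data went there.
    let total = ctx
        .read_parquet("/tmp/my_table", ParquetReadOptions::default())
        .await?
        .count()
        .await?;
    assert!(total > 0);

    // 3. Read back one partition to ensure the data was actually partitioned
    //    (assumes the input has more than one distinct value in col_a).
    let partition = ctx
        .read_parquet("/tmp/my_table/col_a=foo", ParquetReadOptions::default())
        .await?
        .count()
        .await?;
    assert!(partition < total);

    Ok(())
}
```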

tshauck added a commit to tshauck/arrow-datafusion that referenced this issue Feb 22, 2024
alamb pushed a commit that referenced this issue Feb 26, 2024:

* tests: adds tests associated with #9237
* style: clippy