Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chore: refactor DataSink traits to avoid duplication #14121

Merged
merged 13 commits into from
Jan 15, 2025

Conversation

mertak-synnada
Copy link
Contributor

Which issue does this PR close?

Closes #.

This PR creates an abstraction for FileSinkers' generic behaviors and centralizes some of the functionalities. The main intention is to separate only file type-specific implementations and follow the same routes on FileSink operators.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Changes to the physical-expr crates core Core DataFusion crate proto Related to proto crate labels Jan 14, 2025
@@ -90,13 +109,23 @@ pub struct FileSinkConfig {
pub insert_op: InsertOp,
/// Controls whether partition columns are kept for the file
pub keep_partition_by_columns: bool,
/// File extension without a dot(.)
pub file_extension: String,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileSinkConfig did not carry any file-type related information before but giving file_extension explicitly to the helper functions also bothered us. That's why we've added this to the FileSinkConfig. Could not get a consensus on keeping FileSinkConfig file-type agnostic or adding this please let us know what do you think :) @alamb

CC: @berkaysynnada @ozankabak

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I poked around with this PR and since you need to pass the extension into the demuxer tasks I didn't see any better way to handle this

So TLDR is I think this is fine and I don't have any better suggestion

@ozankabak
Copy link
Contributor

Due to Rust's incomplete specialization implementation, we have some trivial duplication regarding write_all, but they will go away when Rust catches up

@alamb alamb changed the title Chore/single sink exec Chore: refactor DataSink traits to avoid duplication Jan 14, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a nice improvement to the API to me. I am sure we can improve it more but this seems like a stepforward to me

/// Retrieves the file sink configuration.
fn config(&self) -> &FileSinkConfig;

/// Spawns writer tasks and uses `tokio::join` to collect results.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be helpful to document what a Demux task is here on this trait

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, let's do this @mertak-synnada

@@ -90,13 +109,23 @@ pub struct FileSinkConfig {
pub insert_op: InsertOp,
/// Controls whether partition columns are kept for the file
pub keep_partition_by_columns: bool,
/// File extension without a dot(.)
pub file_extension: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I poked around with this PR and since you need to pass the extension into the demuxer tasks I didn't see any better way to handle this

So TLDR is I think this is fine and I don't have any better suggestion

@alamb alamb added the api change Changes the API exposed to users of the crate label Jan 14, 2025
@alamb
Copy link
Contributor

alamb commented Jan 14, 2025

Thank you @mertak-synnada

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this PR twice and sent some commits for further DRY. Let's enrich the demuxer comments per @alamb's suggestion and then merge this.

@berkaysynnada berkaysynnada merged commit c7c200a into apache:main Jan 15, 2025
25 checks passed
logan-keede added a commit to logan-keede/datafusion that referenced this pull request Jan 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate physical-expr Changes to the physical-expr crates proto Related to proto crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants