-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Conditionally allow to keep partition_by columns when using PARTITIONED BY enhancement #11107
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @hveiga -- I think this PR looks quite close.
I think we need should make sure this configuration setting matches the pattern of the other settings (it is somewhat special as it isn't a format specific option, and it doesn't have special DML syntax either...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @hveiga, this looks great! I think we should clean up the fact that "format." is being added as a prefix to this option and create a different name space for it instead. This could help avoid the TableOptions
raising an error for an unexpected format option as well.
One last thing we could add in this PR (or a ticket for a follow on) would be to add a session level configuration setting for this, so users could control the default behavior. |
I think I have addressed all the comments in the PR, the only lingering one is about flowing the new option from SQL when using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the updates @hveiga ! The new session parameter is a great addition. I do think we should avoid changing the CopyTo struct.
Updates to create external table can wait for a future PR in my view.
- separate options by prefix 'hive.' - add hive_options to CopyTo struct - add more documentation - add session execution flag to enable feature, false by default
Thanks for the review. I reverted the |
I took the liberty of pushing some commits to this PR to fix some CI errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @hveiga and @devinjdangelo and @berkaysynnada for the reviews
It would be great if @devinjdangelo and @berkaysynnada could give this another review, but I think it is good enough to merge now.
I agree we should file a follow on ticket to figure out how to support this as part of the CREATE EXTERNAL TABLE
syntax
datafusion/sql/src/statement.rs
Outdated
@@ -888,7 +888,15 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { | |||
} | |||
Some(v) => v, | |||
}; | |||
if !(&key.contains('.')) { | |||
|
|||
if key.to_lowercase().contains("keep_partition_by_columns") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this special case is unfortunate, but I don't have a great idea of how to make it better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After further consideration, IMO we should restrict this handling to format options. For other configurations, users must specify the prefix. Otherwise, this list will continue to grow longer, and using those prefixes would lose their meaning (which is why all these refactors were done to have a structured configuration).
Additionally, instead of using "hive," we need a more general term. Perhaps the "execution" prefix would be a better alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate on what you mean by "we should restrict this handling to format options"? I am unsure if you are suggesting using a different struct for these non-format.
options or continue using the HashMap with a mix of options and then extract out the logic to filter out the non-format.
options and do not pass those down through table_options.alter_with_string_hash_map
.
I agree on changing the prefix to execution.
which also aligns with the ExecutionOptions
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave a try to this in e91397a . Let me know if that's what you had in mind. Thanks for feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's exactly what I had in my mind. Thank you for the collaboration. I just would like to mention two more points:
- I guess you missed to add
execution
prefix to the example indatafusion/sql/src/parser.rs
.
let keep_partition_by_columns = source_option_tuples
.get("execution.keep_partition_by_columns")
.map(|v| v.trim() == "true")
.unwrap_or(...
If the user provides anything other than "true," it is interpreted as "false." It might be wiser to give an error if the value is neither "true" nor "false."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I just fixed that test case and also slightly modified the handling of the value to account for what we discussed:
- Give preference to what is explicitly provided.
- If what is provided is invalid, through a config error. Added a test in copy.slt for this.
- If not provided, fallback to
ExecutionOptions
value, false by default.
Hopefully e352203 is finally ready. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is a good first step towards adding settings other than format. Thanks for your effort, @hveiga. I think the PR is ready to be merged once conflicts are resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have one minor issue left. Once that is resolved, we can merge the PR. Thanks, @hveiga.
I merged up from main to resolve some conflicts (and also now so that CI will run automatically). Once CI passes I plan to merge this PR |
Thanks again @hveiga and @berkaysynnada |
Thanks everyone for the help to get this over the finish line! Excited to see this in the next release of Datafusion 🎉 |
🚀 |
…TITIONED BY enhancement (apache#11107) * feat: conditionally allow to keep partition_by columns * feat: add flag to file sink config, add tests * this commit contains: - separate options by prefix 'hive.' - add hive_options to CopyTo struct - add more documentation - add session execution flag to enable feature, false by default * do not add hive_options to CopyTo * npx prettier * fmt * change prefix to execution. , update override order for condition. * improve handling of flag, added test for config error * trying to make CI happier * prettier * Update test * update doc --------- Co-authored-by: Héctor Veiga Ortiz <hveigaortiz@tesla.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…TITIONED BY enhancement (apache#11107) * feat: conditionally allow to keep partition_by columns * feat: add flag to file sink config, add tests * this commit contains: - separate options by prefix 'hive.' - add hive_options to CopyTo struct - add more documentation - add session execution flag to enable feature, false by default * do not add hive_options to CopyTo * npx prettier * fmt * change prefix to execution. , update override order for condition. * improve handling of flag, added test for config error * trying to make CI happier * prettier * Update test * update doc --------- Co-authored-by: Héctor Veiga Ortiz <hveigaortiz@tesla.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #10971
What changes are included in this PR?
FileSinkConfig
to conditionally enable this feature. Disabled by default.Are these changes tested?
Test is included as part of
copy.slt
.Are there any user-facing changes?
No breaking changes.
Added some documentation for this new option.