-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce unified DataSourceExec
for provided datasources, remove ParquetExec
, CsvExec
, etc
#14224
Conversation
…ne DataSourceExec plan
fix csv_json example
# Conflicts: # datafusion/physical-plan/src/aggregates/mod.rs
# Conflicts: # datafusion/core/src/physical_optimizer/replace_with_order_preserving_variants.rs # datafusion/physical-plan/src/memory.rs
# Conflicts: # datafusion/core/src/datasource/file_format/arrow.rs # datafusion/core/src/datasource/file_format/csv.rs # datafusion/core/src/datasource/file_format/json.rs # datafusion/core/src/datasource/file_format/parquet.rs # datafusion/core/src/datasource/memory.rs # datafusion/core/src/test/mod.rs # datafusion/physical-plan/src/memory.rs
Something else I noticed is the import paths for ParquetExec are gone (so I got a lot of compile errors about missing ParquetExec):
Suggested action is reexport the file here --- a/datafusion/core/src/datasource/physical_plan/mod.rs
+++ b/datafusion/core/src/datasource/physical_plan/mod.rs
@@ -33,7 +33,7 @@ pub(crate) use self::json::plan_to_json;
#[cfg(feature = "parquet")]
pub use self::parquet::source::ParquetSource;
#[cfg(feature = "parquet")]
-pub use self::parquet::{ParquetFileMetrics, ParquetFileReaderFactory};
+pub use self::parquet::{ParquetFileMetrics, ParquetFileReaderFactory, ParquetExec}; |
To be clear, I think we can do these kinds of changes in a follow on PR too -- I haven't found any blockers to this PR being merged yet but I don't have our tests running yet to be sure. But all in all things are going well from my perspective |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got enough of our code to compile and tests running that i think this PR is ok to merge. Thank you @mertak-synnada @ozankabak and @berkaysynnada -- this is pretty epic
I believe this will be a non trivial change for any system that directly creates and manipulates ParquetExec
but I think the new structures seem to support everything we need.
It would be really great to write up a document (maybe a blog post) that explains this change and gives help for people upgrading. Specifically some examples of creating ParquetExec before and after (and something similar to CsvExec
and AvroExec
.
As I work through the rest of our upgrade, I may have additional ideas to make it easier.
I can't realistically review this PR in detail, bug I trust that other committers have done so 👍
Great, let's go ahead 🚀
This is a terrific idea -- we will be happy to write up a blog post that will serve a dual purpose: (1) help people with upgrading, and (2) brag about this epic PR 🙂
Great, we will be happy to collaborate on any follow-on PRs. |
I just merged the PR -- I will check back in half an hour to see if there are any problems |
Yes 100% that would be perfect. I think such a post would be find to put on the datafusion blog but content marketing wise if you write it putting it on a synnada blog would make total sense too |
Here is one small follow up from this PR: |
As I have been working through actually upgrading delta.rs with these changes it turns out it was more effort than expected (like everything in software) I have created a few PRs that I think would be nice to get in before we release this change downstream |
Also I started writing down an upgrade guide (not yet ready for review) |
Which issue does this PR close?
ExecutionPlan
Across AllTableProviders
#13838.MemoryExec
#14502fetch
limit for MemoryExec #14337Rationale for this change
This PR merges all Data sources into one Execution Plan, named DataSourceExec, and a single trait named DataSource which is inspired by DataSink. File-related shared behaviors are merged into FileSourceConfig, and a MemorySourceConfig is created. MemoryExec, ArrowExec, AvroExec, CsvExec, NdJsonExec, and ParquetExec are deprecated and merged into DataSourceExec.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?