Is your feature request related to a problem or challenge?
Currently, FileSinks (e.g. ParquetSink) output one file for each input partition: the FileSink::write_all method accepts a Vec<RecordBatchStream> and independently serializes and writes each RecordBatchStream to an ObjectStore. This setup is easy to implement and parallelizes well (it is trivial to spawn a task to process each RecordBatchStream in parallel), but there are a few drawbacks:
We provide a single_file_output option that forces a single output file, but there is no finer-grained control than that unless you explicitly repartition the plan (e.g. add a RoundRobinPartition(4) to get 4 output files).
It is also unclear in the current setup how FileSink can or should handle writes to hive-style partitioned tables (Allow Inserts to Partitioned Listing Table #7744), since we cannot know the correct number of output files up front and thus cannot construct a Vec<RecordBatchStream>.
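The current one-file-per-partition shape can be sketched in plain Rust. Here BatchStream and write_all are simplified stand-ins (batch row counts instead of RecordBatches, threads instead of async tasks), not DataFusion's actual types:

```rust
use std::thread;

// Stand-in for a RecordBatchStream: each inner Vec is one input
// partition's sequence of batch row counts.
type BatchStream = Vec<usize>;

// Mirrors the current FileSink::write_all shape: one output "file" per
// input partition, each written by its own independently spawned task.
fn write_all(partitions: Vec<BatchStream>) -> Vec<usize> {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|stream| thread::spawn(move || stream.iter().sum::<usize>()))
        .collect();
    // Each handle's result is the row count of one output file: the
    // number of files is fixed by the number of input partitions.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // Three input partitions -> exactly three output files.
    let files = write_all(vec![vec![10, 20], vec![5], vec![7, 7, 7]]);
    assert_eq!(files, vec![30, 5, 21]);
    println!("{files:?}");
}
```

The sketch makes the drawback visible: the file count is decided before execution starts, by the partitioning of the input.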
Describe the solution you'd like
I would like to provide users with options such as the following which will determine the number of output files:
Maximum rows per file
Maximum file size bytes
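A minimal sketch of honoring a maximum-rows-per-file option, rolling to a new file once the limit is reached. split_by_rows is a hypothetical helper that splits at batch granularity (it does not slice a batch in half); real code would track RecordBatch row counts as batches stream in:

```rust
// Hypothetical sketch: group streamed batch row counts into files,
// opening a new file whenever the current one has reached
// max_rows_per_file. Splitting happens at batch boundaries only.
fn split_by_rows(batch_rows: &[usize], max_rows_per_file: usize) -> Vec<Vec<usize>> {
    let mut files = vec![Vec::new()];
    let mut rows_in_current = 0;
    for &rows in batch_rows {
        if rows_in_current >= max_rows_per_file {
            // Current file is full: a new writer is created dynamically,
            // mid-execution, rather than up front.
            files.push(Vec::new());
            rows_in_current = 0;
        }
        files.last_mut().unwrap().push(rows);
        rows_in_current += rows;
    }
    files
}

fn main() {
    // 60 rows with a 25-row cap -> two files of 30 rows each.
    let files = split_by_rows(&[10, 20, 15, 15], 25);
    assert_eq!(files, vec![vec![10, 20], vec![15, 15]]);
}
```

A max-file-size-bytes option would follow the same pattern, accumulating serialized bytes instead of rows.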
To respect these settings, the execution plan will need to dynamically create new file writers as execution proceeds, rather than all up front. Enabling this is a challenge similar to the one discussed in #7744. Ultimately, the input signature of FileSinks will need to change. Perhaps an upstream execution plan (FileSinkRepartitionExec) could be responsible for dividing a single incoming RecordBatchStream into a dynamic number of output streams, i.e. Stream<Item = RecordBatchStream>. FileSink would then consume each stream as it arrives, spawning a new task to write each file.
FileSinkRepartitionExec could also have specialized logic for handling writes to hive style partitioned tables.
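The hive-partition demultiplexing idea could look roughly like the following. demux is a hypothetical stand-in, using channels for the dynamic per-partition streams and lazily spawning a writer task the first time a partition value appears, so partitions with no data never produce a file:

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Hypothetical sketch in the spirit of FileSinkRepartitionExec: route
// each (partition_value, row_count) batch to a per-partition channel,
// creating the channel and its writer task only when a new hive
// partition value is first seen.
fn demux(batches: Vec<(&'static str, usize)>) -> HashMap<String, usize> {
    let mut senders: HashMap<&'static str, mpsc::Sender<usize>> = HashMap::new();
    let mut writers: Vec<(String, thread::JoinHandle<usize>)> = Vec::new();
    for (part, rows) in batches {
        let tx = senders.entry(part).or_insert_with(|| {
            let (tx, rx) = mpsc::channel::<usize>();
            // A writer task is spawned lazily, so the number of output
            // files is discovered during execution, not up front.
            writers.push((part.to_string(), thread::spawn(move || rx.iter().sum::<usize>())));
            tx
        });
        tx.send(rows).unwrap();
    }
    drop(senders); // close all channels so the writer tasks finish
    writers.into_iter().map(|(p, h)| (p, h.join().unwrap())).collect()
}

fn main() {
    let files = demux(vec![
        ("date=2023-01-01", 10),
        ("date=2023-01-02", 5),
        ("date=2023-01-01", 7),
    ]);
    assert_eq!(files.len(), 2);
    assert_eq!(files["date=2023-01-01"], 17);
}
```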
Describe alternatives you've considered
FileSink could also be reworked to accept a single RecordBatchStream and handle repartitioning logic within its own execution plan, rather than creating a new upstream plan.
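The two signatures under discussion can be contrasted with a toy sink. DummySink, write_prepartitioned, and write_single are illustrative names only, not DataFusion's API:

```rust
// Stand-in for a RecordBatchStream as a sequence of batch row counts.
type BatchStream = Vec<usize>;

struct DummySink {
    max_rows_per_file: usize,
}

// Current-style API: the caller pre-partitions the input, and the
// number of output files equals the number of streams handed in.
fn write_prepartitioned(_sink: &DummySink, partitions: Vec<BatchStream>) -> usize {
    partitions.len()
}

// Alternative API: the sink receives a single stream and decides the
// file boundaries itself, based on its own options.
fn write_single(sink: &DummySink, input: BatchStream) -> usize {
    let mut files = 1;
    let mut rows = 0;
    for r in input {
        if rows >= sink.max_rows_per_file {
            files += 1;
            rows = 0;
        }
        rows += r;
    }
    files
}

fn main() {
    let sink = DummySink { max_rows_per_file: 25 };
    assert_eq!(write_prepartitioned(&sink, vec![vec![10], vec![20]]), 2);
    assert_eq!(write_single(&sink, vec![10, 20, 15, 15]), 2);
}
```

The tradeoff is where the repartitioning logic lives: in a reusable upstream plan node, or inside each sink implementation.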
Additional context
The proposed changes will likely reduce write performance somewhat. The efforts to parallelize individual file writing will help offset this impact, and ultimately the improved UX is worth a slight performance regression, in my opinion.
I would like to provide users with options such as the following which will determine the number of output files:
Maximum rows per file
Maximum file size bytes
I agree this makes a lot of sense
FileSinkRepartitionExec could also have specialized logic for handling writes to hive style partitioned tables.
I think this is what makes the most sense to me. Maybe we could combine some of the same logic to avoid writing files unless they actually have data.
FileSink could also be reworked to accept a single RecordBatchStream and handle repartitioning logic within its own execution plan, rather than creating a new upstream plan.
I remember @tustvold, @metesynnada, @ozankabak, and I discussed the various tradeoffs between where the write partitioning would be determined (in the plan or in the writer), and I believe the conclusion was "it depends"