Requirements for Async Parquet API #1473
#1474 is likely also related.
Thank you for writing this up. I agree with most of it. As validation of any new API / modification to an existing API, might I suggest we prototype using it (using DataFusion) to ensure we have at least one reference implementation? I am thinking especially of the interaction with the ...
Ok, an update on this. I implemented this proposal in #1509, along with a number of modifications to help improve its performance - e.g. avoiding copying buffers, etc. I then created an updated DataFusion branch to make use of this. Unfortunately the performance was still significantly worse than master 😢

Scheduler Bench

The DataFusion compile times were starting to grate, and there were far too many variables, so I created scheduler-bench, a reduced version of one of the queries exhibiting problematic runtime characteristics in the DataFusion benchmark. Running this we get the following:
Immediately there is an obvious disparity between the approaches. I therefore added a load of instrumentation and noticed something strange about the performance variance:
Similarly, I restricted the tokio worker pool to contain a single worker thread (note that spawn_blocking will still use a separate thread); a sketch of that configuration is below.
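For reference, a minimal sketch of how such a runtime can be constructed (the benchmark body itself is elided):

```rust
use tokio::runtime::Builder;

fn main() {
    // Multi-threaded runtime restricted to a single worker thread;
    // spawn_blocking tasks still run on tokio's separate blocking pool
    let runtime = Builder::new_multi_thread()
        .worker_threads(1)
        .enable_all()
        .build()
        .unwrap();

    runtime.block_on(async {
        // benchmark workload would be driven here
    });
}
```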
The performance across the board is actually better than when the worker pool had more threads, and the performance disparity between the approaches is largely eliminated. We can clearly see work ping-ponging between threads when more workers are available. Removing the second tokio task also results in the same improvements to workload variability, and therefore average runtime performance.
Manual Threading

So it certainly looks like tokio is not scheduling our tasks well, but perhaps there is something else at play. I therefore experimented with manually threading the workload, without using tokio. Here we can see the performance of a single-threaded execution against a file, and against data already in memory:
And for completeness, the performance of a blocking implementation using two threads (sketched below):
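This is a rough illustration of the shape of such a two-thread design, not the actual scheduler-bench code: one thread performs the blocking reads and decode, the other consumes batches over a bounded channel (all types here are placeholders):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Placeholder standing in for a decoded Arrow RecordBatch
type RecordBatch = Vec<u8>;

// Placeholder for blocking read + decode of the next batch from a file
fn decode_next() -> Option<RecordBatch> {
    None
}

fn main() {
    // Bounded channel provides backpressure between producer and consumer
    let (tx, rx) = sync_channel::<RecordBatch>(2);

    // Producer thread: blocking IO and decode
    let producer = thread::spawn(move || {
        while let Some(batch) = decode_next() {
            if tx.send(batch).is_err() {
                break; // consumer hung up
            }
        }
    });

    // Consumer (main thread): process batches as they arrive
    for batch in rx {
        let _ = batch.len(); // placeholder for downstream processing
    }

    producer.join().unwrap();
}
```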
Conclusions

I think this data is consistent with the following conclusions:
I will come back to this next week with fresh eyes, but if the above is correct it would have the following implications:
Thanks @tustvold for the great writeup!
I know it's not ready yet, but does https://github.com/tokio-rs/tokio-uring behave differently? (Given it's supposed to be a true async file-reading API)
I haven't tried tokio-uring. I suspect it might still suffer from poor thread-locality, just this time between the workers, but I haven't confirmed this.
Unfortunately, I cannot reproduce the test results on my local Mac.
Interesting, what hardware are you running on? I am very surprised to see those results. For comparison, here are some M1 benchmarks collected by @e-dard:
And an Intel Mac from @alamb:
Admittedly, if memory serves, these were collected before I made some further optimisations to the async reader, but I would be surprised if that changed the overall picture. Perhaps one of them might be kind enough to re-run the benchmarks 😄
Latest results from @e-dard have tokio_par_async_spawn_blocking_test performing roughly the same as par_sync_file_test 🎉
So I guess I did a better job making the async version faster than I thought I did 😅. Now to work out which machine is the special snowflake 😆
Ok, running on a c2-standard-16 on GCP I get:
We can see that the GCP results differ from those on my local machine. Just to check I haven't just been a muppet, I re-ran on my local machine and still get the same behaviour as before.
Unfortunately, comparing the parquet SQL benchmarks of DataFusion master against parquet-async-wip on the GCP instance, there is still a non-trivial performance hit, although it is less severe than on my local machine.
I need to think a bit further on this. Being able to separate IO from decode is pretty compelling on various levels if we can do it, but it is unfortunate if it comes with a performance regression...
Closing in favor of #1605
Background
In #1154 I added an `async` parquet API in the form of `ParquetRecordBatchStream`. This was maximally async, that is, it made use of tokio's async IO traits to be as generic as possible. However, having experimented with this, I'm not sure that this design is quite right.

In particular, apache/datafusion#1617 showed non-trivial performance regressions operating on local files. This is caused by three major factors, including:

- `tokio::fs::File` calling `spawn_blocking` for every IO operation

This last point is pretty important and touches on something I was not aware of: tokio does not use an IO reactor for file IO like, say, boost::asio; instead it just calls `tokio::task::spawn_blocking` for every IO call. This somewhat undermines the concept of async file IO, as all you're doing is moving where the `tokio::task::spawn_blocking` is called, and in fact you're moving it lower in the call chain, where its overheads are less amortized. The contrast is sketched below.
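As a hedged illustration of the two placements of `spawn_blocking` (not the crate's actual code): the first function below is roughly what `tokio::fs` does internally on each operation, while the second wraps the entire blocking read in a single `spawn_blocking`, amortizing its overhead across all the IO:

```rust
use std::io::Read;

// Per-call offloading: each IO operation on tokio::fs::File is
// internally dispatched to the blocking thread pool
async fn read_per_call(path: &str) -> std::io::Result<Vec<u8>> {
    use tokio::io::AsyncReadExt;
    let mut file = tokio::fs::File::open(path).await?;
    let mut buf = Vec::new();
    file.read_to_end(&mut buf).await?;
    Ok(buf)
}

// Amortized offloading: one spawn_blocking for the whole blocking read
async fn read_amortized(path: &str) -> std::io::Result<Vec<u8>> {
    let path = path.to_owned();
    tokio::task::spawn_blocking(move || {
        let mut buf = Vec::new();
        std::fs::File::open(path)?.read_to_end(&mut buf)?;
        Ok(buf)
    })
    .await
    .expect("spawn_blocking task panicked")
}
```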
As part of further exploring this design space I created #1472, which instead of using the tokio IO traits uses the non-async `ChunkReader` trait and `tokio::task::spawn_blocking`. Effectively this just upstreams logic from DataFusion's ParquetExec operator, and so perhaps unsurprisingly does not represent a performance regression.

This is still technically an `async` API; however, I am aware that a number of people expressed interest in an `async` version of `ChunkReader`, which suggests they want lower-level async-ness. It is also unclear that `ChunkReader` is quite right either - see #1163 and apache/datafusion#1905.

To further complicate matters, differing storage media have different trade-offs. In particular, when fetching from local disk or memory it may make sense to perform the most granular reads possible, potentially filtering out individual pages, columns, etc. However, when fetching data from object storage this is less clear cut: as each request has a cost and comes with non-trivial latency, there is likely a desire to coalesce proximate byte ranges into a single request, even if this results in reading more data than needed. As a result there is likely no general-purpose strategy for fetching data, and we therefore need the flexibility to allow this to be customized downstream. A sketch of such coalescing follows below.
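As a concrete illustration of the coalescing idea (an assumption about strategy, not existing parquet crate code):

```rust
use std::ops::Range;

/// Merge sorted, non-overlapping byte ranges whose gaps are at most
/// `max_gap` bytes; a larger `max_gap` trades extra bytes read for
/// fewer object store requests
fn coalesce_ranges(ranges: &[Range<u64>], max_gap: u64) -> Vec<Range<u64>> {
    let mut out: Vec<Range<u64>> = Vec::new();
    for range in ranges {
        match out.last_mut() {
            // Close enough to the previous range: extend it
            Some(last) if range.start <= last.end + max_gap => {
                last.end = last.end.max(range.end);
            }
            _ => out.push(range.clone()),
        }
    }
    out
}

fn main() {
    let ranges = vec![0..100, 110..200, 10_000..10_500];
    // With a 64 byte gap tolerance the first two ranges merge
    assert_eq!(coalesce_ranges(&ranges, 64), vec![0..200, 10_000..10_500]);
}
```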
Finally, there is an ongoing effort to introduce more parallelism into the parquet scan (apache/datafusion#1990), and whilst async is a concurrency primitive and not a parallelism primitive, the two concepts are closely related in practice.
Requirements

I think the requirements are therefore as follows:

- An `async` API that yields a stream of `Result<RecordBatch>`
- An `async` trait, allowing downstream customisation of the fetch strategy

Proposal
An intentionally vague proposal would be to extend apache/datafusion#1617, replacing the use of `ChunkReader` with a `Storage` trait that might look something like the sketch below.
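A minimal sketch of such a trait, assuming the method names `prefetch` and `read` used below; the exact signatures are illustrative, not a settled design:

```rust
use std::io::Result;
use std::ops::Range;

use bytes::Bytes; // cheap, reference-counted byte buffers

/// Hypothetical async storage abstraction; implementations decide how
/// to fetch bytes, e.g. granular local reads vs coalesced object
/// store requests (uses async-fn-in-trait, stable since Rust 1.75)
pub trait Storage: Send {
    /// Hint which byte ranges will subsequently be requested, allowing
    /// an implementation to coalesce proximate ranges into fewer requests
    async fn prefetch(&mut self, ranges: Vec<Range<u64>>) -> Result<()>;

    /// Fetch the bytes for a single byte range
    async fn read(&mut self, range: Range<u64>) -> Result<Bytes>;
}
```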
`ParquetRecordBatchStreamBuilder` would use this trait to first read the footer, and then as part of `build()` invoke `prefetch()` with the determined byte ranges to scan. Finally, `ParquetRecordBatchStream` would drive `Storage::read` with the individual column chunk ranges as needed by the stream.
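Consuming the resulting stream would then be independent of the fetch strategy; a small illustrative sketch, kept generic since the concrete batch and error types are not pinned down here:

```rust
use futures::{Stream, StreamExt};

// Drain any stream of Result<Batch, E>, counting the batches;
// ParquetRecordBatchStream would be one such stream
async fn consume<S, B, E>(mut stream: S) -> Result<usize, E>
where
    S: Stream<Item = Result<B, E>> + Unpin,
{
    let mut count = 0;
    while let Some(batch) = stream.next().await {
        let _batch = batch?; // propagate IO / decode errors
        count += 1;
    }
    Ok(count)
}
```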
This will likely require some minor alterations to `SerializedPageReader` in order to avoid copying the data returned from `Storage::read`, but I think this is worthwhile and will also benefit reading data already in memory.

FYI @rdettai @yjshen @alamb @sunchao