Requirements for Async Parquet API #1473

Closed
tustvold opened this issue Mar 22, 2022 · 11 comments
Labels
parquet Changes to the parquet crate

Comments

@tustvold
Contributor

Background

In #1154 I added an async parquet API in the form of ParquetRecordBatchStream. This was maximally async, that is, it made use of tokio's async IO traits to be as generic as possible. However, having experimented with this, I'm not sure that the design is quite right.

In particular, apache/datafusion#1617 showed non-trivial performance regressions when operating on local files. This is caused by three major factors:

  • Additional copying of buffers in tokio and the ParquetRecordBatchStream necessary to convert an async Read to a sync Read
  • Less parallelism compared with master, where the parquet decode takes place on a separate blocking thread
  • Overheads due to tokio::fs::File calling spawn_blocking for every IO operation

This last point is pretty important and touches on something I was not aware of: tokio does not use an IO reactor for file IO like, say, boost::asio; instead it just calls tokio::task::spawn_blocking for every IO call. This somewhat undermines the concept of async file IO, as all you're doing is moving where the tokio::task::spawn_blocking is called, and in fact you're moving it lower in the call chain, where its overheads are less well amortized.
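
To make the point concrete, here is a hedged sketch (not code from any of the linked PRs; the file name and the decode step are placeholders): tokio::fs::File routes every read through spawn_blocking internally, whereas wrapping the whole blocking unit of work in a single spawn_blocking amortizes that overhead.

use std::io::Read;

async fn read_then_decode() -> std::io::Result<usize> {
    // tokio::fs would dispatch *each* read below to the blocking pool;
    // instead, run the entire blocking read + decode as one blocking task.
    tokio::task::spawn_blocking(|| -> std::io::Result<usize> {
        let mut file = std::fs::File::open("data.parquet")?;
        let mut buf = Vec::new();
        file.read_to_end(&mut buf)?;
        // ... synchronous parquet decode of `buf` would happen here ...
        Ok(buf.len())
    })
    .await
    .expect("blocking task panicked")
}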

As part of further exploring this design space I created #1472, which, instead of using the tokio IO traits, uses the non-async ChunkReader trait and tokio::task::spawn_blocking. Effectively this just upstreams logic from DataFusion's ParquetExec operator, and so perhaps unsurprisingly it does not represent a performance regression.

This is still technically an async API; however, I am aware that a number of people expressed interest in an async version of ChunkReader, which suggests they want lower-level async-ness. It is also unclear that ChunkReader is quite right either - see #1163 and apache/datafusion#1905.

To further complicate matters, differing storage media have different trade-offs. In particular, when fetching from local disk or memory it may make sense to perform the most granular reads possible, potentially filtering out individual pages, columns, etc. However, when fetching data from object storage this is less clear cut: as each request has a cost and non-trivial latency, there is likely a desire to coalesce proximate byte ranges into a single request, even if this results in reading more data than needed. As a result there is likely no general-purpose strategy for fetching data, and we therefore need the flexibility to allow this to be customized downstream.
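
As an illustration of the sort of range coalescing an object-store-backed fetch strategy might perform (a sketch only; the 1 MiB gap threshold and the function name are illustrative assumptions, not part of any proposed API):

use std::ops::Range;

/// Merge byte ranges whose gap is at most `max_gap`, accepting that the bytes
/// in the gap are fetched and discarded in exchange for fewer requests.
fn coalesce_ranges(mut ranges: Vec<Range<usize>>, max_gap: usize) -> Vec<Range<usize>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<Range<usize>> = Vec::new();
    for r in ranges {
        match out.last_mut() {
            // Close enough to the previous range: extend it instead of
            // issuing a separate request.
            Some(prev) if r.start <= prev.end + max_gap => {
                prev.end = prev.end.max(r.end);
            }
            _ => out.push(r),
        }
    }
    out
}

// coalesce_ranges(vec![0..100, 150..200, 5_000_000..6_000_000], 1024 * 1024)
// => [0..200, 5_000_000..6_000_000]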

Finally, there is an ongoing effort to introduce more parallelism into the parquet scan (apache/datafusion#1990), and whilst async is a concurrency primitive and not a parallelism primitive, the two concepts are closely related in practice.

Requirements

I think the requirements are therefore as follows:

  1. Provide an async API that yields a stream of Result<RecordBatch>
  2. Use predicate and projection pushdown to filter the data to scan
  3. Separate identifying the byte ranges of column data to scan, from actually performing the scan
  4. Delegate fetching the corresponding byte ranges to an async trait, allowing downstream customisation of the fetch strategy
  5. Avoid copying the page data between buffers
  6. Avoid calling spawn_blocking where the read implementation will not block (e.g. already in-memory)
  7. Be extensible to support parallel column decoding (#TBD)
  8. Be extensible to support more advanced predicate pushdown (Parquet Scan Filter #1191)

Proposal

An intentionally vague proposal would be to extend apache/datafusion#1617, replacing the use of ChunkReader with a Storage trait that might look something like:

#[async_trait]
pub trait Storage {
  async fn prefetch(&mut self, ranges: Vec<std::ops::Range<usize>>) -> Result<()>;

  async fn read(&mut self, range: std::ops::Range<usize>) -> Result<ByteBufferPtr>;
}

ParquetRecordBatchStreamBuilder would use this trait to first read the footer, and then as part of build() invoke prefetch() with the determined byte ranges to scan. Finally ParquetRecordBatchStream would drive Storage::read with the individual column chunk ranges as needed by the stream.

This will likely require some minor alterations to SerializedPageReader in order to avoid copying the data returned from Storage::read, but I think this is worthwhile and will also benefit reading data that is already in memory.
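
As a minimal sketch of how such a trait could satisfy requirement 6, here is a hypothetical in-memory implementation. It uses bytes::Bytes in place of the experimental ByteBufferPtr (see the note on #1474 below) and a boxed error type; both substitutions are illustrative assumptions, not part of this proposal.

use std::ops::Range;
use async_trait::async_trait;
use bytes::Bytes;

type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

#[async_trait]
pub trait Storage {
    /// Hint the byte ranges that will be read, allowing an object store
    /// implementation to coalesce them into fewer requests.
    async fn prefetch(&mut self, ranges: Vec<Range<usize>>) -> Result<()>;

    /// Return the bytes for a single range.
    async fn read(&mut self, range: Range<usize>) -> Result<Bytes>;
}

/// Data already resident in memory: neither method blocks, so no
/// spawn_blocking is needed (requirement 6).
pub struct InMemoryStorage {
    data: Bytes,
}

#[async_trait]
impl Storage for InMemoryStorage {
    async fn prefetch(&mut self, _ranges: Vec<Range<usize>>) -> Result<()> {
        // Nothing to fetch: the whole file is already buffered.
        Ok(())
    }

    async fn read(&mut self, range: Range<usize>) -> Result<Bytes> {
        // `Bytes::slice` returns a cheap reference-counted view, so the page
        // data is not copied (requirement 5).
        Ok(self.data.slice(range))
    }
}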

FYI @rdettai @yjshen @alamb @sunchao

@tustvold added the enhancement label (Any new improvement worthy of an entry in the changelog) on Mar 22, 2022
@tustvold
Contributor Author

#1474 is likely also related, as ByteBufferPtr is currently an experimental API and it would be nice to not leak this into the public interface.

@alamb
Contributor

alamb commented Mar 22, 2022

Thank you for writing this up. I agree with most of it. As a validation of any new API / modification to an existing API, might I suggest we prototype using it (using datafusion) to ensure we have at least one reference implementation?

I am thinking especially that the interaction with the ObjectStore interface and Storage may be somewhat tricky / non-obvious, especially with the async nature of it all.

@tustvold
Contributor Author

tustvold commented Apr 1, 2022

Ok so an update on this. I implemented this proposal in #1509, along with a number of modifications to help improve its performance - e.g. avoid copying buffers, etc... I then created an updated DataFusion branch to make use of this. Unfortunately the performance was still significantly worse than master 😢

Scheduler Bench

The DataFusion compile times were starting to grate, and there were far too many variables, so I created scheduler-bench which is a reduced version of one of the queries that was exhibiting problematic runtime characteristics in the DataFusion benchmark.

Running this we get the following

tokio_sync_file_test (2048): min: 0.0534s, max: 0.0660s, avg: 0.0548s, p95: 0.0557s
tokio_par_async_spawn_blocking_test (2048): min: 0.0484s, max: 0.0813s, avg: 0.0585s, p95: 0.0754s
tokio_par_async_blocking_test (2048): min: 0.0563s, max: 0.0721s, avg: 0.0584s, p95: 0.0600s
tokio_par_sync_test (2048): min: 0.0536s, max: 0.0563s, avg: 0.0546s, p95: 0.0554s

tokio_sync_file_test runs a spawn_blocking task that decodes and sends RecordBatch over an mpsc channel, a separate tokio task then performs the rest of the query. tokio_par_sync_test is broadly similar, but uses a regular tokio task instead of spawn_blocking for the reader.

tokio_par_async_spawn_blocking_test and tokio_par_async_blocking_test use a tokio task to run the async ParquetRecordBatchStream, which communicates over an mpsc channel with a separate tokio task that performs the rest of the query. The difference is tokio_par_async_spawn_blocking uses a separate spawn_blocking to read the column chunk data into the in-memory buffers, whereas the other cheekily just uses blocking IO in a tokio worker thread.
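
For context, a rough sketch (not the actual scheduler-bench code; the channel payload is a stand-in for RecordBatch) of the spawn_blocking + mpsc pattern used by tokio_sync_file_test:

use tokio::sync::mpsc;

async fn run_query() {
    let (tx, mut rx) = mpsc::channel::<Vec<u64>>(8);

    // Decode on the blocking pool so the async worker threads stay free.
    let decode = tokio::task::spawn_blocking(move || {
        for i in 0..10u64 {
            // ... synchronous parquet read + decode would happen here ...
            let batch = vec![i; 1024];
            if tx.blocking_send(batch).is_err() {
                break; // receiver dropped, stop decoding
            }
        }
    });

    // The rest of the query runs as a normal tokio task.
    while let Some(batch) = rx.recv().await {
        let _sum: u64 = batch.iter().sum();
    }

    decode.await.expect("decode task panicked");
}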

Immediately there is an obvious disparity between tokio_par_async_spawn_blocking_test and the other methods. Initially I thought this was the overhead of spawn_blocking, and in earlier incarnations it certainly was, but with the switch to the Storage interface there is now only a single spawn_blocking per row group (of which there are 2 in this file).

I therefore added a load of instrumentation and noticed something strange: the performance variance in tokio_par_async_spawn_blocking_test is entirely in the synchronous section that decodes RecordBatch from the already-read byte arrays. This immediately suggested some sort of socket-locality funkiness, so I ran the test restricted to a single CPU core.

$  taskset 0x1 cargo run --release
tokio_sync_file_test (2048): min: 0.0593s, max: 0.0609s, avg: 0.0600s, p95: 0.0605s
tokio_par_async_spawn_blocking_test (2048): min: 0.0619s, max: 0.0653s, avg: 0.0627s, p95: 0.0633s
tokio_par_async_blocking_test (2048): min: 0.0619s, max: 0.0635s, avg: 0.0627s, p95: 0.0632s
tokio_par_sync_test (2048): min: 0.0588s, max: 0.0609s, avg: 0.0595s, p95: 0.0600s

Similarly, restricting the tokio worker pool to a single worker thread (note spawn_blocking will still use a separate thread):

tokio_sync_file_test (2048): min: 0.0495s, max: 0.0510s, avg: 0.0499s, p95: 0.0503s
tokio_par_async_spawn_blocking_test (2048): min: 0.0469s, max: 0.0618s, avg: 0.0491s, p95: 0.0495s
tokio_par_async_blocking_test (2048): min: 0.0522s, max: 0.0534s, avg: 0.0526s, p95: 0.0531s
tokio_par_sync_test (2048): min: 0.0497s, max: 0.0507s, avg: 0.0501s, p95: 0.0505s

The performance across the board is actually better than when the worker pool had more threads, and the performance disparity between the approaches is largely eliminated.
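
For reference, a sketch of how the worker pool can be restricted when building the runtime (assuming the benchmark constructs its own runtime rather than using #[tokio::main]); spawn_blocking still uses the separate blocking pool:

fn single_worker_runtime() -> tokio::runtime::Runtime {
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(1) // a single async worker thread
        .enable_all()      // enable timer and IO drivers
        .build()
        .expect("failed to build runtime")
}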

Whilst perf has to be taken with a grain of salt, as it doesn't properly understand tokio's userland threading, I captured a profile of a run that first performed tokio_par_async_spawn_blocking_test and then tokio_par_sync_test.

[perf profile screenshot: per-thread CPU activity for tokio_par_async_spawn_blocking_test followed by tokio_par_sync_test]

We can clearly see work ping-ponging between threads for tokio_par_async_spawn_blocking_test and transitioning to a significantly more stable pattern for tokio_par_sync_test, which perhaps unsurprisingly performs better.

Removing the second tokio task also results in the same improvements to workload variability and therefore average runtime performance.

tokio_async_spawn_blocking_test (2048): min: 0.0557s, max: 0.0587s, avg: 0.0565s, p95: 0.0582s
tokio_async_blocking_test (2048): min: 0.0576s, max: 0.0599s, avg: 0.0584s, p95: 0.0593s

Manual Threading

So it certainly looks like tokio is not scheduling our tasks well, but perhaps there is something else at play. I therefore experimented with dropping tokio entirely and manually threading the workload.

Here we can see the performance of a single-threaded execution against a file, and against data already in memory

sync_file_test (2048): min: 0.0541s, max: 0.0758s, avg: 0.0548s, p95: 0.0550s
sync_mem_test (2048): min: 0.0585s, max: 0.0616s, avg: 0.0598s, p95: 0.0609s

And for completeness, the performance of a blocking implementation using two threads (sketched below):

par_sync_file_test (2048): min: 0.0542s, max: 0.0584s, avg: 0.0563s, p95: 0.0572s
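
A rough sketch of the manually threaded two-thread variant (again not the benchmark's actual code; the payload is a stand-in for RecordBatch): one OS thread decodes and sends batches over a bounded std::sync::mpsc channel while the calling thread consumes them.

use std::sync::mpsc::sync_channel;
use std::thread;

fn par_sync() {
    let (tx, rx) = sync_channel::<Vec<u64>>(8);

    let decoder = thread::spawn(move || {
        for i in 0..10u64 {
            // ... synchronous parquet read + decode would happen here ...
            if tx.send(vec![i; 1024]).is_err() {
                break; // receiver dropped
            }
        }
    });

    // The "rest of the query" runs on the calling thread.
    for batch in rx {
        let _sum: u64 = batch.iter().sum();
    }

    decoder.join().expect("decode thread panicked");
}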

Conclusions

I think this data is consistent with the following conclusions:

  • The performance of this query is largely dominated by the CPU-bound task of decoding the parquet bytes
  • There are overheads associated with buffering up parquet data ahead of time, on the order of ~5ms
  • There are CPU-locality effects on the order of ~30ms that impact the performance of decoding the parquet bytes

I will come back to this next week with fresh eyes, but if the above is correct it would have the following implications:

  • The advantages of a fully async parquet implementation are likely to be undermined by the performance cost associated with the loss of thread-locality
  • Using tokio to schedule CPU-bound work within DataFusion likely carries a significantly larger cost than we anticipated, especially for stateful operators

@yjshen
Member

yjshen commented Apr 2, 2022

Thanks @tustvold for the great writeup!

@alippai
Contributor

alippai commented Apr 12, 2022

I know it's not ready yet, but does https://github.com/tokio-rs/tokio-uring behave differently? (Given it's supposed to be a true async file reading API)

@tustvold
Contributor Author

tustvold commented Apr 12, 2022

I haven't tried tokio-uring, I suspect it might still suffer from poor thread-locality, just this time between the workers, but I haven't confirmed this.

@mingmwang

Unfortunately, I cannot reproduce the test results on my local Mac.

tokio_sync_file_test (2048): min: 0.1052s, max: 0.1178s, avg: 0.1117s, p95: 0.1142s
tokio_par_async_spawn_blocking_test (2048): min: 0.0916s, max: 0.1064s, avg: 0.0963s, p95: 0.1006s
tokio_par_async_blocking_test (2048): min: 0.1070s, max: 0.1165s, avg: 0.1104s, p95: 0.1146s
tokio_par_sync_test (2048): min: 0.1080s, max: 0.1274s, avg: 0.1120s, p95: 0.1157s

@tustvold
Contributor Author

tustvold commented Apr 19, 2022

Interesting, what hardware are you running on? I am very surprised to see tokio_par_async_spawn_blocking_test performing better than tokio_sync_file_test.

For comparison some M1 benchmarks collected by @e-dard

sync_file_test (1024): min: 0.0746s, max: 0.0773s, avg: 0.0752s
sync_mem_test (1024): min: 0.0717s, max: 0.0731s, avg: 0.0727s
par_sync_file_test (1024): min: 0.0623s, max: 0.0898s, avg: 0.0634s
tokio_sync_file_test (1024): min: 0.0626s, max: 0.0667s, avg: 0.0644s
tokio_spawn_file_test (1024): min: 0.0899s, max: 0.1071s, avg: 0.0936s
tokio_spawn_file_buffer_test (1024): min: 0.1009s, max: 0.1082s, avg: 0.1018s
tokio_async_test (1024): min: 0.0915s, max: 0.0967s, avg: 0.0932s
tokio_par_async_test (1024): min: 0.0841s, max: 0.0974s, avg: 0.0874s
tokio_par_sync_test (1024): min: 0.0626s, max: 0.0835s, avg: 0.0668s
sync_file_test (2048): min: 0.0653s, max: 0.1107s, avg: 0.0694s
sync_mem_test (2048): min: 0.0622s, max: 0.0664s, avg: 0.0627s
par_sync_file_test (2048): min: 0.0542s, max: 0.0568s, avg: 0.0551s
tokio_sync_file_test (2048): min: 0.0547s, max: 0.0741s, avg: 0.0569s
tokio_spawn_file_test (2048): min: 0.0724s, max: 0.0852s, avg: 0.0759s
tokio_spawn_file_buffer_test (2048): min: 0.0907s, max: 0.1365s, avg: 0.0928s
tokio_async_test (2048): min: 0.0814s, max: 0.0879s, avg: 0.0835s
tokio_par_async_test (2048): min: 0.0761s, max: 0.0859s, avg: 0.0787s
tokio_par_sync_test (2048): min: 0.0547s, max: 0.0968s, avg: 0.0584s
sync_file_test (4096): min: 0.0605s, max: 0.0898s, avg: 0.0637s
sync_mem_test (4096): min: 0.0579s, max: 0.0753s, avg: 0.0595s
par_sync_file_test (4096): min: 0.0498s, max: 0.0596s, avg: 0.0543s
tokio_sync_file_test (4096): min: 0.0498s, max: 0.0623s, avg: 0.0537s
tokio_spawn_file_test (4096): min: 0.0639s, max: 0.0738s, avg: 0.0687s
tokio_spawn_file_buffer_test (4096): min: 0.0856s, max: 0.1042s, avg: 0.0869s
tokio_async_test (4096): min: 0.0759s, max: 0.0808s, avg: 0.0779s
tokio_par_async_test (4096): min: 0.0714s, max: 0.0793s, avg: 0.0743s
tokio_par_sync_test (4096): min: 0.0499s, max: 0.0610s, avg: 0.0552s
sync_file_test (8192): min: 0.0576s, max: 0.0650s, avg: 0.0604s
sync_mem_test (8192): min: 0.0544s, max: 0.0594s, avg: 0.0555s
par_sync_file_test (8192): min: 0.0473s, max: 0.0571s, avg: 0.0514s
tokio_sync_file_test (8192): min: 0.0475s, max: 0.0594s, avg: 0.0517s
tokio_spawn_file_test (8192): min: 0.0587s, max: 0.0701s, avg: 0.0600s
tokio_spawn_file_buffer_test (8192): min: 0.0821s, max: 0.1121s, avg: 0.0835s
tokio_async_test (8192): min: 0.0728s, max: 0.0781s, avg: 0.0746s
tokio_par_async_test (8192): min: 0.0686s, max: 0.0745s, avg: 0.0710s
tokio_par_sync_test (8192): min: 0.0476s, max: 0.0568s, avg: 0.0514s
sync_file_test (16384): min: 0.0566s, max: 0.0663s, avg: 0.0604s
sync_mem_test (16384): min: 0.0543s, max: 0.0889s, avg: 0.0555s
par_sync_file_test (16384): min: 0.0469s, max: 0.0603s, avg: 0.0509s
tokio_sync_file_test (16384): min: 0.0471s, max: 0.0566s, avg: 0.0513s
tokio_spawn_file_test (16384): min: 0.0624s, max: 0.0674s, avg: 0.0658s
tokio_spawn_file_buffer_test (16384): min: 0.0871s, max: 0.0916s, avg: 0.0887s
tokio_async_test (16384): min: 0.0726s, max: 0.0804s, avg: 0.0762s
tokio_par_async_test (16384): min: 0.0683s, max: 0.0826s, avg: 0.0714s
tokio_par_sync_test (16384): min: 0.0470s, max: 0.0567s, avg: 0.0510s

And an Intel Mac from @alamb

sync_file_test (1024): min: 0.0968s, max: 0.1207s, avg: 0.1027s
sync_mem_test (1024): min: 0.0941s, max: 0.1220s, avg: 0.0996s
par_sync_file_test (1024): min: 0.0830s, max: 0.1085s, avg: 0.0875s
tokio_sync_file_test (1024): min: 0.0835s, max: 0.1080s, avg: 0.0917s
tokio_spawn_file_test (1024): min: 0.1341s, max: 0.1747s, avg: 0.1460s
tokio_spawn_file_buffer_test (1024): min: 0.1478s, max: 0.2100s, avg: 0.1600s
tokio_async_test (1024): min: 0.1431s, max: 0.1804s, avg: 0.1526s
tokio_par_async_test (1024): min: 0.1352s, max: 0.1612s, avg: 0.1465s
tokio_par_sync_test (1024): min: 0.0880s, max: 0.1382s, avg: 0.1071s
sync_file_test (2048): min: 0.0948s, max: 0.1768s, avg: 0.1318s
sync_mem_test (2048): min: 0.0916s, max: 0.1666s, avg: 0.1206s
par_sync_file_test (2048): min: 0.0794s, max: 0.1630s, avg: 0.1145s
tokio_sync_file_test (2048): min: 0.1023s, max: 0.2510s, avg: 0.1902s
tokio_spawn_file_test (2048): min: 0.2702s, max: 0.4230s, avg: 0.3572s
tokio_spawn_file_buffer_test (2048): min: 0.2147s, max: 0.3401s, avg: 0.2680s
tokio_async_test (2048): min: 0.2072s, max: 0.3323s, avg: 0.2596s
tokio_par_async_test (2048): min: 0.1819s, max: 0.2990s, avg: 0.2395s
tokio_par_sync_test (2048): min: 0.1382s, max: 0.2506s, avg: 0.1960s
sync_file_test (4096): min: 0.1420s, max: 0.2399s, avg: 0.1813s
sync_mem_test (4096): min: 0.1311s, max: 0.2399s, avg: 0.1782s
par_sync_file_test (4096): min: 0.1079s, max: 0.1974s, avg: 0.1534s
tokio_sync_file_test (4096): min: 0.1147s, max: 0.2118s, avg: 0.1638s
tokio_spawn_file_test (4096): min: 0.2236s, max: 0.3454s, avg: 0.2782s
tokio_spawn_file_buffer_test (4096): min: 0.1838s, max: 0.2941s, avg: 0.2340s
tokio_async_test (4096): min: 0.1808s, max: 0.2900s, avg: 0.2255s
tokio_par_async_test (4096): min: 0.1526s, max: 0.2556s, avg: 0.2048s
tokio_par_sync_test (4096): min: 0.1141s, max: 0.2171s, avg: 0.1619s
sync_file_test (8192): min: 0.1505s, max: 0.3689s, avg: 0.2261s
sync_mem_test (8192): min: 0.1558s, max: 0.4034s, avg: 0.2375s
par_sync_file_test (8192): min: 0.1149s, max: 0.2036s, avg: 0.1571s
tokio_sync_file_test (8192): min: 0.1135s, max: 0.2022s, avg: 0.1614s
tokio_spawn_file_test (8192): min: 0.3430s, max: 0.5689s, avg: 0.4529s
tokio_spawn_file_buffer_test (8192): min: 0.1879s, max: 0.3810s, avg: 0.2620s
tokio_async_test (8192): min: 0.1600s, max: 2.4110s, avg: 0.3616s
tokio_par_async_test (8192): min: 0.1781s, max: 0.4306s, avg: 0.2548s
tokio_par_sync_test (8192): min: 0.1223s, max: 0.3303s, avg: 0.2035s
sync_file_test (16384): min: 0.1890s, max: 0.4439s, avg: 0.3216s
sync_mem_test (16384): min: 0.1603s, max: 0.4605s, avg: 0.2934s
par_sync_file_test (16384): min: 0.1184s, max: 0.3450s, avg: 0.2003s
tokio_sync_file_test (16384): min: 0.1197s, max: 0.3192s, avg: 0.1841s
tokio_spawn_file_test (16384): min: 0.2144s, max: 0.5118s, avg: 0.3662s
tokio_spawn_file_buffer_test (16384): min: 0.2080s, max: 0.5335s, avg: 0.3274s
tokio_async_test (16384): min: 0.1867s, max: 0.4569s, avg: 0.2526s
tokio_par_async_test (16384): min: 0.1711s, max: 0.2684s, avg: 0.2118s
tokio_par_sync_test (16384): min: 0.0868s, max: 0.2195s, avg: 0.1579s

Admittedly, if memory serves these were collected before I made some further optimisations to the async reader, but I would be surprised if that changed the overall picture. Perhaps one of them might be kind enough to re-run the benchmarks 😄

@tustvold
Contributor Author

tustvold commented Apr 19, 2022

The latest results from @e-dard have tokio_par_async_spawn_blocking_test performing roughly the same as par_sync_file_test 🎉

sync_file_test (2048): min: 0.0630s, max: 0.1333s, avg: 0.0660s, p95: 0.0702s
sync_mem_test (2048): min: 0.0599s, max: 0.1805s, avg: 0.0634s, p95: 0.0705s
par_sync_file_test (2048): min: 0.0544s, max: 0.0679s, avg: 0.0554s, p95: 0.0566s
tokio_sync_file_test (2048): min: 0.0545s, max: 0.0677s, avg: 0.0563s, p95: 0.0610s
tokio_spawn_file_test (2048): min: 0.0703s, max: 0.0988s, avg: 0.0727s, p95: 0.0815s
tokio_spawn_file_buffer_test (2048): min: 0.0892s, max: 0.2146s, avg: 0.0974s, p95: 0.1212s
tokio_async_spawn_blocking_test (2048): min: 0.0600s, max: 0.0786s, avg: 0.0640s, p95: 0.0715s
tokio_async_blocking_test (2048): min: 0.0633s, max: 0.0746s, avg: 0.0655s, p95: 0.0688s
tokio_par_async_spawn_blocking_test (2048): min: 0.0518s, max: 0.0552s, avg: 0.0528s, p95: 0.0540s
tokio_par_async_blocking_test (2048): min: 0.0552s, max: 0.0765s, avg: 0.0567s, p95: 0.0582s
tokio_par_sync_test (2048): min: 0.0546s, max: 0.0597s, avg: 0.0553s, p95: 0.0562s

So I guess I did a better job making the async version faster than I thought I did 😅.

Now to work out which machine is the special snowflake 😆

@tustvold
Contributor Author

Ok running on a c2-standard-16 on GCP I get

sync_file_test (2048): min: 0.0903s, max: 0.0995s, avg: 0.0912s, p95: 0.0916s
sync_mem_test (2048): min: 0.0922s, max: 0.0952s, avg: 0.0931s, p95: 0.0937s
par_sync_file_test (2048): min: 0.0852s, max: 0.0897s, avg: 0.0866s, p95: 0.0878s
tokio_sync_file_test (2048): min: 0.0858s, max: 0.0897s, avg: 0.0869s, p95: 0.0879s
tokio_spawn_file_test (2048): min: 0.1114s, max: 0.1144s, avg: 0.1125s, p95: 0.1136s
tokio_spawn_file_buffer_test (2048): min: 0.1641s, max: 0.1689s, avg: 0.1663s, p95: 0.1683s
tokio_async_spawn_blocking_test (2048): min: 0.0787s, max: 0.0810s, avg: 0.0794s, p95: 0.0798s
tokio_async_blocking_test (2048): min: 0.0887s, max: 0.0917s, avg: 0.0899s, p95: 0.0909s
tokio_par_async_spawn_blocking_test (2048): min: 0.0758s, max: 0.0792s, avg: 0.0769s, p95: 0.0778s
tokio_par_async_blocking_test (2048): min: 0.0863s, max: 0.0886s, avg: 0.0874s, p95: 0.0881s
tokio_par_sync_test (2048): min: 0.0859s, max: 0.0885s, avg: 0.0871s, p95: 0.0880s

We can see that tokio_async_spawn_blocking_test is performing slightly better than par_sync_file_test 🎉 This result was repeatable even when I moved the parquet file to a tmpfs-backed filesystem.

Just to check I haven't just been a muppet, I re-ran on my local machine and still get the same behaviour of tokio_par_async_spawn_blocking_test being significantly slower. I'm not really sure what is going on here; perhaps some CPU turbo shenanigans or something...

sync_file_test (2048): min: 0.0568s, max: 0.1135s, avg: 0.0584s, p95: 0.0592s
sync_mem_test (2048): min: 0.0582s, max: 0.0653s, avg: 0.0606s, p95: 0.0616s
par_sync_file_test (2048): min: 0.0558s, max: 0.0603s, avg: 0.0575s, p95: 0.0584s
tokio_sync_file_test (2048): min: 0.0559s, max: 0.0605s, avg: 0.0576s, p95: 0.0581s
tokio_spawn_file_test (2048): min: 0.0854s, max: 0.0881s, avg: 0.0864s, p95: 0.0874s
tokio_spawn_file_buffer_test (2048): min: 0.1163s, max: 0.1460s, avg: 0.1189s, p95: 0.1201s
tokio_async_spawn_blocking_test (2048): min: 0.0566s, max: 0.0598s, avg: 0.0576s, p95: 0.0592s
tokio_async_blocking_test (2048): min: 0.0599s, max: 0.0649s, avg: 0.0606s, p95: 0.0615s
tokio_par_async_spawn_blocking_test (2048): min: 0.0520s, max: 0.0876s, avg: 0.0584s, p95: 0.0780s
tokio_par_async_blocking_test (2048): min: 0.0543s, max: 0.0651s, avg: 0.0568s, p95: 0.0626s
tokio_par_sync_test (2048): min: 0.0509s, max: 0.0598s, avg: 0.0526s, p95: 0.0596s

Unfortunately, comparing the parquet SQL benchmarks of DataFusion master against parquet-async-wip on the GCP instance, there is still a non-trivial performance hit, although it is less severe than on my local machine.

select dict_10_optional from t                                                                            
                        time:   [9.4814 ms 9.5137 ms 9.5400 ms]
                        change: [+10.989% +11.387% +11.754%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_100_optional from t                                                                            
                        time:   [9.5785 ms 9.5907 ms 9.6034 ms]
                        change: [+8.6348% +8.8496% +9.0629%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_1000_optional from t                                                                            
                        time:   [9.6751 ms 9.6910 ms 9.7066 ms]
                        change: [+8.3536% +8.6366% +8.9063%] (p = 0.00 < 0.05)
                        Performance has regressed.

select count(*) from t where dict_10_required = 'prefix#0'                                                                            
                        time:   [8.3035 ms 8.3429 ms 8.3836 ms]
                        change: [+12.266% +14.359% +16.585%] (p = 0.00 < 0.05)
                        Performance has regressed.

select count(*) from t where dict_100_required = 'prefix#0'                                                                            
                        time:   [8.4855 ms 8.5323 ms 8.5800 ms]
                        change: [+16.358% +17.383% +18.456%] (p = 0.00 < 0.05)
                        Performance has regressed.

select count(*) from t where dict_1000_required = 'prefix#0'                                                                            
                        time:   [8.9166 ms 8.9591 ms 9.0018 ms]
                        change: [+16.238% +17.416% +18.634%] (p = 0.00 < 0.05)
                        Performance has regressed.

select i64_optional from t where dict_10_required = 'prefix#2' and dict_1000_required = 'prefix#10'                                                                            
                        time:   [22.442 ms 22.688 ms 22.946 ms]
                        change: [+21.372% +23.369% +25.354%] (p = 0.00 < 0.05)
                        Performance has regressed.

select i64_required from t where dict_10_required = 'prefix#2' and dict_1000_required = 'prefix#10'                                                                            
                        time:   [18.864 ms 19.025 ms 19.194 ms]
                        change: [+1.7532% +3.9070% +6.0920%] (p = 0.00 < 0.05)
                        Performance has regressed.

select string_optional from t where dict_10_required = 'prefix#1' and dict_1000_required = 'prefix#1...                                                                            
                        time:   [69.645 ms 70.503 ms 71.486 ms]
                        change: [-0.9622% +0.2143% +1.7061%] (p = 0.77 > 0.05)
                        No change in performance detected.

select string_required from t where dict_10_required = 'prefix#1' and dict_1000_required = 'prefix#1...                                                                            
                        time:   [103.78 ms 104.87 ms 106.12 ms]
                        change: [+1.0176% +2.2273% +3.5223%] (p = 0.00 < 0.05)
                        Performance has regressed.

select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_optional > 0                                                                            
                        time:   [26.913 ms 27.031 ms 27.150 ms]
                        change: [+20.687% +21.646% +22.596%] (p = 0.00 < 0.05)
                        Performance has regressed.

select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_optional > 0 #2                                                                            
                        time:   [26.962 ms 27.079 ms 27.197 ms]
                        change: [+21.371% +22.212% +23.070%] (p = 0.00 < 0.05)
                        Performance has regressed.

select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_required > 0                                                                            
                        time:   [24.484 ms 24.595 ms 24.707 ms]
                        change: [+17.505% +18.513% +19.520%] (p = 0.00 < 0.05)
                        Performance has regressed.

select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_required > 0 #2                                                                            
                        time:   [24.540 ms 24.645 ms 24.752 ms]
                        change: [+18.075% +18.954% +19.857%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_10_optional, count(*) from t group by dict_10_optional                                                                            
                        time:   [15.281 ms 15.336 ms 15.392 ms]
                        change: [+19.444% +19.861% +20.336%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_10_optional, dict_100_optional, count(*) from t group by dict_10_optional, dict_100_opti...                                                                            
                        time:   [35.443 ms 35.511 ms 35.581 ms]
                        change: [+19.094% +19.483% +19.885%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_10_optional, dict_100_optional, MIN(f64_required), MAX(f64_required), AVG(f64_required) ...                                                                            
                        time:   [56.518 ms 56.620 ms 56.723 ms]
                        change: [+9.8955% +10.367% +10.835%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_10_optional, dict_100_optional, MIN(f64_optional), MAX(f64_optional), AVG(f64_optional) ...                                                                            
                        time:   [62.324 ms 62.449 ms 62.579 ms]
                        change: [+9.8747% +10.485% +11.087%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64_optional) ...                                                                            
                        time:   [57.326 ms 57.434 ms 57.542 ms]
                        change: [+6.9184% +7.5079% +8.0877%] (p = 0.00 < 0.05)
                        Performance has regressed.

I need to think a bit further on this; being able to separate IO from decode is pretty compelling on various levels if we can do it, but it is unfortunate if it comes with a performance regression...

@tustvold
Contributor Author

Closing in favor of #1605
