Requirements for Async Parquet API #1473

Closed
tustvold opened this issue Mar 22, 2022 · 11 comments
Labels
parquet Changes to the parquet crate

Comments

@tustvold
Contributor

Background

In #1154 I added an async parquet API in the form of ParquetRecordBatchStream. This was maximally async, that is, it made use of tokio's async IO traits to be as generic as possible. However, having experimented with this, I'm not sure that the design is quite right.

In particular, apache/datafusion#1617 showed non-trivial performance regressions when operating on local files. This is caused by three major factors:

  • Additional copying of buffers in tokio and the ParquetRecordBatchStream necessary to convert an async Read to a sync Read
  • Less parallelism compared with master, where the parquet decode takes place on a separate blocking thread
  • Overheads due to tokio::fs::File calling spawn_blocking for every IO operation

This last point is pretty important and touches on something I was not aware of: tokio does not use an IO reactor for file IO like, say, boost::asio; instead it just calls tokio::task::spawn_blocking for every IO call. This somewhat undermines the concept of async file IO, as all you're doing is moving where the tokio::task::spawn_blocking is called, and in fact you're moving it lower in the call chain, where its overheads are less well amortized.
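
To make the point concrete, here is a hedged sketch (not code from any of the linked PRs; the file name and the decode step are placeholders): tokio::fs::File routes every read through spawn_blocking internally, whereas wrapping the whole blocking unit of work in a single spawn_blocking amortizes that overhead.

use std::io::Read;

async fn read_then_decode() -> std::io::Result<usize> {
    // tokio::fs would dispatch *each* read below to the blocking pool;
    // instead, run the entire blocking read + decode as one blocking task.
    tokio::task::spawn_blocking(|| -> std::io::Result<usize> {
        let mut file = std::fs::File::open("data.parquet")?;
        let mut buf = Vec::new();
        file.read_to_end(&mut buf)?;
        // ... synchronous parquet decode of `buf` would happen here ...
        Ok(buf.len())
    })
    .await
    .expect("blocking task panicked")
}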

As part of further exploring this design space I created #1472, which, instead of using the tokio IO traits, uses the non-async ChunkReader trait and tokio::task::spawn_blocking. Effectively this just upstreams logic from DataFusion's ParquetExec operator, and so perhaps unsurprisingly it does not represent a performance regression.

This is still technically an async API; however, I am aware that a number of people expressed interest in an async version of ChunkReader, which suggests they want lower-level async-ness. It is also unclear that ChunkReader is quite right either - see #1163 and apache/datafusion#1905.

To further complicate matters, differing storage media have different trade-offs. In particular, when fetching from local disk or memory it may make sense to perform the most granular reads possible, potentially filtering out individual pages, columns, etc. However, when fetching data from object storage this is less clear cut: as each request has a cost and non-trivial latency, there is likely a desire to coalesce proximate byte ranges into a single request, even if this results in reading more data than needed. As a result there is likely no general-purpose strategy for fetching data, and we therefore need the flexibility to allow this to be customized downstream.
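
As an illustration of the sort of range coalescing an object-store-backed fetch strategy might perform (a sketch only; the 1 MiB gap threshold and the function name are illustrative assumptions, not part of any proposed API):

use std::ops::Range;

/// Merge byte ranges whose gap is at most `max_gap`, accepting that the bytes
/// in the gap are fetched and discarded in exchange for fewer requests.
fn coalesce_ranges(mut ranges: Vec<Range<usize>>, max_gap: usize) -> Vec<Range<usize>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<Range<usize>> = Vec::new();
    for r in ranges {
        match out.last_mut() {
            // Close enough to the previous range: extend it instead of
            // issuing a separate request.
            Some(prev) if r.start <= prev.end + max_gap => {
                prev.end = prev.end.max(r.end);
            }
            _ => out.push(r),
        }
    }
    out
}

// coalesce_ranges(vec![0..100, 150..200, 5_000_000..6_000_000], 1024 * 1024)
// => [0..200, 5_000_000..6_000_000]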

Finally, there is an ongoing effort to introduce more parallelism into the parquet scan (apache/datafusion#1990), and whilst async is a concurrency primitive and not a parallelism primitive, the two concepts are closely related in practice.

Requirements

I think the requirements are therefore as follows:

  1. Provide an async API that yields a stream of Result<RecordBatch>
  2. Use predicate and projection pushdown to filter the data to scan
  3. Separate identifying the byte ranges of column data to scan, from actually performing the scan
  4. Delegate fetching the corresponding byte ranges to an async trait, allowing downstream customisation of the fetch strategy
  5. Avoid copying the page data between buffers
  6. Avoid calling spawn_blocking where the read implementation will not block (e.g. already in-memory)
  7. Be extensible to support parallel column decoding (#TBD)
  8. Be extensible to support more advanced predicate pushdown (Parquet Scan Filter #1191)

Proposal

An intentionally vague proposal would be to extend apache/datafusion#1617, replacing the use of ChunkReader with a Storage trait that might look something like:

#[async_trait]
pub trait Storage {
  async fn prefetch(&mut self, ranges: Vec<std::ops::Range<usize>>) -> Result<()>;

  async fn read(&mut self, range: std::ops::Range<usize>) -> Result<ByteBufferPtr>;
}

ParquetRecordBatchStreamBuilder would use this trait to first read the footer, and then as part of build() invoke prefetch() with the determined byte ranges to scan. Finally ParquetRecordBatchStream would drive Storage::read with the individual column chunk ranges as needed by the stream.

This will likely require some minor alterations to SerializedPageReader in order to avoid copying the data returned from Storage::read, but I think this is worthwhile and will also benefit reading data that is already in memory.
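
As a minimal sketch of how such a trait could satisfy requirement 6, here is a hypothetical in-memory implementation. It uses bytes::Bytes in place of the experimental ByteBufferPtr (see the note on #1474 below) and a boxed error type; both substitutions are illustrative assumptions, not part of this proposal.

use std::ops::Range;
use async_trait::async_trait;
use bytes::Bytes;

type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

#[async_trait]
pub trait Storage {
    /// Hint the byte ranges that will be read, allowing an object store
    /// implementation to coalesce them into fewer requests.
    async fn prefetch(&mut self, ranges: Vec<Range<usize>>) -> Result<()>;

    /// Return the bytes for a single range.
    async fn read(&mut self, range: Range<usize>) -> Result<Bytes>;
}

/// Data already resident in memory: neither method blocks, so no
/// spawn_blocking is needed (requirement 6).
pub struct InMemoryStorage {
    data: Bytes,
}

#[async_trait]
impl Storage for InMemoryStorage {
    async fn prefetch(&mut self, _ranges: Vec<Range<usize>>) -> Result<()> {
        // Nothing to fetch: the whole file is already buffered.
        Ok(())
    }

    async fn read(&mut self, range: Range<usize>) -> Result<Bytes> {
        // `Bytes::slice` returns a cheap reference-counted view, so the page
        // data is not copied (requirement 5).
        Ok(self.data.slice(range))
    }
}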

FYI @rdettai @yjshen @alamb @sunchao

@tustvold added the enhancement label (Any new improvement worthy of an entry in the changelog) on Mar 22, 2022
@tustvold
Contributor Author

#1474 is likely also related, as ByteBufferPtr is currently an experimental API and it would be nice to not leak this into the public interface.

@alamb
Contributor

alamb commented Mar 22, 2022

Thank you for writing this up. I agree with most of it. As a validation of any new API / modification to an existing API, might I suggest we prototype using it (using datafusion) to ensure we have at least one reference implementation?

I am thinking especially that the interaction with the ObjectStore interface and Storage may be somewhat tricky / non-obvious, especially with the async nature of it all.

@tustvold
Contributor Author

tustvold commented Apr 1, 2022

Ok so an update on this. I implemented this proposal in #1509, along with a number of modifications to help improve its performance - e.g. avoid copying buffers, etc... I then created an updated DataFusion branch to make use of this. Unfortunately the performance was still significantly worse than master 😢

Scheduler Bench

The DataFusion compile times were starting to grate, and there were far too many variables, so I created scheduler-bench which is a reduced version of one of the queries that was exhibiting problematic runtime characteristics in the DataFusion benchmark.

Running this we get the following

tokio_sync_file_test (2048): min: 0.0534s, max: 0.0660s, avg: 0.0548s, p95: 0.0557s
tokio_par_async_spawn_blocking_test (2048): min: 0.0484s, max: 0.0813s, avg: 0.0585s, p95: 0.0754s
tokio_par_async_blocking_test (2048): min: 0.0563s, max: 0.0721s, avg: 0.0584s, p95: 0.0600s
tokio_par_sync_test (2048): min: 0.0536s, max: 0.0563s, avg: 0.0546s, p95: 0.0554s

tokio_sync_file_test runs a spawn_blocking task that decodes and sends RecordBatch over an mpsc channel, a separate tokio task then performs the rest of the query. tokio_par_sync_test is broadly similar, but uses a regular tokio task instead of spawn_blocking for the reader.

tokio_par_async_spawn_blocking_test and tokio_par_async_blocking_test use a tokio task to run the async ParquetRecordBatchStream, which communicates over an mpsc channel with a separate tokio task that performs the rest of the query. The difference is tokio_par_async_spawn_blocking uses a separate spawn_blocking to read the column chunk data into the in-memory buffers, whereas the other cheekily just uses blocking IO in a tokio worker thread.
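
For context, a rough sketch (not the actual scheduler-bench code; the channel payload is a stand-in for RecordBatch) of the spawn_blocking + mpsc pattern used by tokio_sync_file_test:

use tokio::sync::mpsc;

async fn run_query() {
    let (tx, mut rx) = mpsc::channel::<Vec<u64>>(8);

    // Decode on the blocking pool so the async worker threads stay free.
    let decode = tokio::task::spawn_blocking(move || {
        for i in 0..10u64 {
            // ... synchronous parquet read + decode would happen here ...
            let batch = vec![i; 1024];
            if tx.blocking_send(batch).is_err() {
                break; // receiver dropped, stop decoding
            }
        }
    });

    // The rest of the query runs as a normal tokio task.
    while let Some(batch) = rx.recv().await {
        let _sum: u64 = batch.iter().sum();
    }

    decode.await.expect("decode task panicked");
}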

Immediately there is an obvious disparity between tokio_par_async_spawn_blocking_test and the other methods. Initially I thought this was the overhead of spawn_blocking, and in earlier incarnations it certainly was, but with the switch to the Storage interface there is now only a single spawn_blocking per row group (of which there are 2 in this file).

I therefore added a load of instrumentation and noticed something strange: the performance variance in tokio_par_async_spawn_blocking_test is entirely in the synchronous section that decodes RecordBatch from the already-read byte arrays. This immediately suggested some sort of socket-locality funkiness, so I ran the test restricted to a single CPU core.

$  taskset 0x1 cargo run --release
tokio_sync_file_test (2048): min: 0.0593s, max: 0.0609s, avg: 0.0600s, p95: 0.0605s
tokio_par_async_spawn_blocking_test (2048): min: 0.0619s, max: 0.0653s, avg: 0.0627s, p95: 0.0633s
tokio_par_async_blocking_test (2048): min: 0.0619s, max: 0.0635s, avg: 0.0627s, p95: 0.0632s
tokio_par_sync_test (2048): min: 0.0588s, max: 0.0609s, avg: 0.0595s, p95: 0.0600s

Similarly, restricting the tokio worker pool to a single worker thread (note spawn_blocking will still use a separate thread):

tokio_sync_file_test (2048): min: 0.0495s, max: 0.0510s, avg: 0.0499s, p95: 0.0503s
tokio_par_async_spawn_blocking_test (2048): min: 0.0469s, max: 0.0618s, avg: 0.0491s, p95: 0.0495s
tokio_par_async_blocking_test (2048): min: 0.0522s, max: 0.0534s, avg: 0.0526s, p95: 0.0531s
tokio_par_sync_test (2048): min: 0.0497s, max: 0.0507s, avg: 0.0501s, p95: 0.0505s

The performance across the board is actually better than when the worker pool had more threads, and the performance disparity between the approaches is largely eliminated.
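
For reference, a sketch of how the worker pool can be restricted when building the runtime (assuming the benchmark constructs its own runtime rather than using #[tokio::main]); spawn_blocking still uses the separate blocking pool:

fn single_worker_runtime() -> tokio::runtime::Runtime {
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(1) // a single async worker thread
        .enable_all()      // enable timer and IO drivers
        .build()
        .expect("failed to build runtime")
}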

Whilst perf has to be taken with a grain of salt, as it doesn't properly understand tokio's userland threading, I captured a profile of a run that first performed tokio_par_async_spawn_blocking_test and then tokio_par_sync_test.

[perf profile screenshot: per-thread CPU activity for tokio_par_async_spawn_blocking_test followed by tokio_par_sync_test]

We can clearly see work ping-ponging between threads for tokio_par_async_spawn_blocking_test and transitioning to a significantly more stable pattern for tokio_par_sync_test, which perhaps unsurprisingly performs better.

Removing the second tokio task also results in the same improvements to workload variability and therefore average runtime performance.

tokio_async_spawn_blocking_test (2048): min: 0.0557s, max: 0.0587s, avg: 0.0565s, p95: 0.0582s
tokio_async_blocking_test (2048): min: 0.0576s, max: 0.0599s, avg: 0.0584s, p95: 0.0593s

Manual Threading

So it certainly looks like tokio is not scheduling our tasks well, but perhaps there is something else at play. I therefore experimented with dropping tokio entirely and manually threading the workload.

Here we can see the performance of a single-threaded execution against a file, and against data already in memory

sync_file_test (2048): min: 0.0541s, max: 0.0758s, avg: 0.0548s, p95: 0.0550s
sync_mem_test (2048): min: 0.0585s, max: 0.0616s, avg: 0.0598s, p95: 0.0609s

And for completeness, the performance of a blocking implementation using two threads (sketched below):

par_sync_file_test (2048): min: 0.0542s, max: 0.0584s, avg: 0.0563s, p95: 0.0572s
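
A rough sketch of the manually threaded two-thread variant (again not the benchmark's actual code; the payload is a stand-in for RecordBatch): one OS thread decodes and sends batches over a bounded std::sync::mpsc channel while the calling thread consumes them.

use std::sync::mpsc::sync_channel;
use std::thread;

fn par_sync() {
    let (tx, rx) = sync_channel::<Vec<u64>>(8);

    let decoder = thread::spawn(move || {
        for i in 0..10u64 {
            // ... synchronous parquet read + decode would happen here ...
            if tx.send(vec![i; 1024]).is_err() {
                break; // receiver dropped
            }
        }
    });

    // The "rest of the query" runs on the calling thread.
    for batch in rx {
        let _sum: u64 = batch.iter().sum();
    }

    decoder.join().expect("decode thread panicked");
}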

Conclusions

I think this data is consistent with the following conclusions:

  • The performance of this query is largely dominated by the CPU-bound task of decoding the parquet bytes
  • There are overheads associated with buffering up parquet data ahead of time, on the order of ~5ms
  • There are CPU-locality effects on the order of ~30ms that impact the performance of decoding the parquet bytes

I will come back to this next week with fresh eyes, but if the above is correct it would have the following implications:

  • The advantages of a fully async parquet implementation are likely to be undermined by the performance cost associated with the loss of thread-locality
  • Using tokio to schedule CPU-bound work within DataFusion likely carries a significantly larger cost than we anticipated, especially for stateful operators

@yjshen
Member

yjshen commented Apr 2, 2022

Thanks @tustvold for the great writeup!

@alippai
Contributor

alippai commented Apr 12, 2022

I know it's not ready yet, but does https://github.com/tokio-rs/tokio-uring behave differently? (Given it's supposed to be a true async file reading API)

@tustvold
Contributor Author

tustvold commented Apr 12, 2022

I haven't tried tokio-uring, I suspect it might still suffer from poor thread-locality, just this time between the workers, but I haven't confirmed this.

@mingmwang

Unfortunately, I cannot reproduce the test results on my local Mac.

tokio_sync_file_test (2048): min: 0.1052s, max: 0.1178s, avg: 0.1117s, p95: 0.1142s
tokio_par_async_spawn_blocking_test (2048): min: 0.0916s, max: 0.1064s, avg: 0.0963s, p95: 0.1006s
tokio_par_async_blocking_test (2048): min: 0.1070s, max: 0.1165s, avg: 0.1104s, p95: 0.1146s
tokio_par_sync_test (2048): min: 0.1080s, max: 0.1274s, avg: 0.1120s, p95: 0.1157s

@tustvold
Contributor Author

tustvold commented Apr 19, 2022

Interesting, what hardware are you running on? I am very surprised to see tokio_par_async_spawn_blocking_test performing better than tokio_sync_file_test.

For comparison some M1 benchmarks collected by @e-dard

sync_file_test (1024): min: 0.0746s, max: 0.0773s, avg: 0.0752s
sync_mem_test (1024): min: 0.0717s, max: 0.0731s, avg: 0.0727s
par_sync_file_test (1024): min: 0.0623s, max: 0.0898s, avg: 0.0634s
tokio_sync_file_test (1024): min: 0.0626s, max: 0.0667s, avg: 0.0644s
tokio_spawn_file_test (1024): min: 0.0899s, max: 0.1071s, avg: 0.0936s
tokio_spawn_file_buffer_test (1024): min: 0.1009s, max: 0.1082s, avg: 0.1018s
tokio_async_test (1024): min: 0.0915s, max: 0.0967s, avg: 0.0932s
tokio_par_async_test (1024): min: 0.0841s, max: 0.0974s, avg: 0.0874s
tokio_par_sync_test (1024): min: 0.0626s, max: 0.0835s, avg: 0.0668s
sync_file_test (2048): min: 0.0653s, max: 0.1107s, avg: 0.0694s
sync_mem_test (2048): min: 0.0622s, max: 0.0664s, avg: 0.0627s
par_sync_file_test (2048): min: 0.0542s, max: 0.0568s, avg: 0.0551s
tokio_sync_file_test (2048): min: 0.0547s, max: 0.0741s, avg: 0.0569s
tokio_spawn_file_test (2048): min: 0.0724s, max: 0.0852s, avg: 0.0759s
tokio_spawn_file_buffer_test (2048): min: 0.0907s, max: 0.1365s, avg: 0.0928s
tokio_async_test (2048): min: 0.0814s, max: 0.0879s, avg: 0.0835s
tokio_par_async_test (2048): min: 0.0761s, max: 0.0859s, avg: 0.0787s
tokio_par_sync_test (2048): min: 0.0547s, max: 0.0968s, avg: 0.0584s
sync_file_test (4096): min: 0.0605s, max: 0.0898s, avg: 0.0637s
sync_mem_test (4096): min: 0.0579s, max: 0.0753s, avg: 0.0595s
par_sync_file_test (4096): min: 0.0498s, max: 0.0596s, avg: 0.0543s
tokio_sync_file_test (4096): min: 0.0498s, max: 0.0623s, avg: 0.0537s
tokio_spawn_file_test (4096): min: 0.0639s, max: 0.0738s, avg: 0.0687s
tokio_spawn_file_buffer_test (4096): min: 0.0856s, max: 0.1042s, avg: 0.0869s
tokio_async_test (4096): min: 0.0759s, max: 0.0808s, avg: 0.0779s
tokio_par_async_test (4096): min: 0.0714s, max: 0.0793s, avg: 0.0743s
tokio_par_sync_test (4096): min: 0.0499s, max: 0.0610s, avg: 0.0552s
sync_file_test (8192): min: 0.0576s, max: 0.0650s, avg: 0.0604s
sync_mem_test (8192): min: 0.0544s, max: 0.0594s, avg: 0.0555s
par_sync_file_test (8192): min: 0.0473s, max: 0.0571s, avg: 0.0514s
tokio_sync_file_test (8192): min: 0.0475s, max: 0.0594s, avg: 0.0517s
tokio_spawn_file_test (8192): min: 0.0587s, max: 0.0701s, avg: 0.0600s
tokio_spawn_file_buffer_test (8192): min: 0.0821s, max: 0.1121s, avg: 0.0835s
tokio_async_test (8192): min: 0.0728s, max: 0.0781s, avg: 0.0746s
tokio_par_async_test (8192): min: 0.0686s, max: 0.0745s, avg: 0.0710s
tokio_par_sync_test (8192): min: 0.0476s, max: 0.0568s, avg: 0.0514s
sync_file_test (16384): min: 0.0566s, max: 0.0663s, avg: 0.0604s
sync_mem_test (16384): min: 0.0543s, max: 0.0889s, avg: 0.0555s
par_sync_file_test (16384): min: 0.0469s, max: 0.0603s, avg: 0.0509s
tokio_sync_file_test (16384): min: 0.0471s, max: 0.0566s, avg: 0.0513s
tokio_spawn_file_test (16384): min: 0.0624s, max: 0.0674s, avg: 0.0658s
tokio_spawn_file_buffer_test (16384): min: 0.0871s, max: 0.0916s, avg: 0.0887s
tokio_async_test (16384): min: 0.0726s, max: 0.0804s, avg: 0.0762s
tokio_par_async_test (16384): min: 0.0683s, max: 0.0826s, avg: 0.0714s
tokio_par_sync_test (16384): min: 0.0470s, max: 0.0567s, avg: 0.0510s

And an Intel Mac from @alamb

sync_file_test (1024): min: 0.0968s, max: 0.1207s, avg: 0.1027s
sync_mem_test (1024): min: 0.0941s, max: 0.1220s, avg: 0.0996s
par_sync_file_test (1024): min: 0.0830s, max: 0.1085s, avg: 0.0875s
tokio_sync_file_test (1024): min: 0.0835s, max: 0.1080s, avg: 0.0917s
tokio_spawn_file_test (1024): min: 0.1341s, max: 0.1747s, avg: 0.1460s
tokio_spawn_file_buffer_test (1024): min: 0.1478s, max: 0.2100s, avg: 0.1600s
tokio_async_test (1024): min: 0.1431s, max: 0.1804s, avg: 0.1526s
tokio_par_async_test (1024): min: 0.1352s, max: 0.1612s, avg: 0.1465s
tokio_par_sync_test (1024): min: 0.0880s, max: 0.1382s, avg: 0.1071s
sync_file_test (2048): min: 0.0948s, max: 0.1768s, avg: 0.1318s
sync_mem_test (2048): min: 0.0916s, max: 0.1666s, avg: 0.1206s
par_sync_file_test (2048): min: 0.0794s, max: 0.1630s, avg: 0.1145s
tokio_sync_file_test (2048): min: 0.1023s, max: 0.2510s, avg: 0.1902s
tokio_spawn_file_test (2048): min: 0.2702s, max: 0.4230s, avg: 0.3572s
tokio_spawn_file_buffer_test (2048): min: 0.2147s, max: 0.3401s, avg: 0.2680s
tokio_async_test (2048): min: 0.2072s, max: 0.3323s, avg: 0.2596s
tokio_par_async_test (2048): min: 0.1819s, max: 0.2990s, avg: 0.2395s
tokio_par_sync_test (2048): min: 0.1382s, max: 0.2506s, avg: 0.1960s
sync_file_test (4096): min: 0.1420s, max: 0.2399s, avg: 0.1813s
sync_mem_test (4096): min: 0.1311s, max: 0.2399s, avg: 0.1782s
par_sync_file_test (4096): min: 0.1079s, max: 0.1974s, avg: 0.1534s
tokio_sync_file_test (4096): min: 0.1147s, max: 0.2118s, avg: 0.1638s
tokio_spawn_file_test (4096): min: 0.2236s, max: 0.3454s, avg: 0.2782s
tokio_spawn_file_buffer_test (4096): min: 0.1838s, max: 0.2941s, avg: 0.2340s
tokio_async_test (4096): min: 0.1808s, max: 0.2900s, avg: 0.2255s
tokio_par_async_test (4096): min: 0.1526s, max: 0.2556s, avg: 0.2048s
tokio_par_sync_test (4096): min: 0.1141s, max: 0.2171s, avg: 0.1619s
sync_file_test (8192): min: 0.1505s, max: 0.3689s, avg: 0.2261s
sync_mem_test (8192): min: 0.1558s, max: 0.4034s, avg: 0.2375s
par_sync_file_test (8192): min: 0.1149s, max: 0.2036s, avg: 0.1571s
tokio_sync_file_test (8192): min: 0.1135s, max: 0.2022s, avg: 0.1614s
tokio_spawn_file_test (8192): min: 0.3430s, max: 0.5689s, avg: 0.4529s
tokio_spawn_file_buffer_test (8192): min: 0.1879s, max: 0.3810s, avg: 0.2620s
tokio_async_test (8192): min: 0.1600s, max: 2.4110s, avg: 0.3616s
tokio_par_async_test (8192): min: 0.1781s, max: 0.4306s, avg: 0.2548s
tokio_par_sync_test (8192): min: 0.1223s, max: 0.3303s, avg: 0.2035s
sync_file_test (16384): min: 0.1890s, max: 0.4439s, avg: 0.3216s
sync_mem_test (16384): min: 0.1603s, max: 0.4605s, avg: 0.2934s
par_sync_file_test (16384): min: 0.1184s, max: 0.3450s, avg: 0.2003s
tokio_sync_file_test (16384): min: 0.1197s, max: 0.3192s, avg: 0.1841s
tokio_spawn_file_test (16384): min: 0.2144s, max: 0.5118s, avg: 0.3662s
tokio_spawn_file_buffer_test (16384): min: 0.2080s, max: 0.5335s, avg: 0.3274s
tokio_async_test (16384): min: 0.1867s, max: 0.4569s, avg: 0.2526s
tokio_par_async_test (16384): min: 0.1711s, max: 0.2684s, avg: 0.2118s
tokio_par_sync_test (16384): min: 0.0868s, max: 0.2195s, avg: 0.1579s

Admittedly, if memory serves these were collected before I made some further optimisations to the async reader, but I would be surprised if that changed the overall picture. Perhaps one of them might be kind enough to re-run the benchmarks 😄

@tustvold
Contributor Author

tustvold commented Apr 19, 2022

The latest results from @e-dard have tokio_par_async_spawn_blocking_test performing roughly the same as par_sync_file_test 🎉

sync_file_test (2048): min: 0.0630s, max: 0.1333s, avg: 0.0660s, p95: 0.0702s
sync_mem_test (2048): min: 0.0599s, max: 0.1805s, avg: 0.0634s, p95: 0.0705s
par_sync_file_test (2048): min: 0.0544s, max: 0.0679s, avg: 0.0554s, p95: 0.0566s
tokio_sync_file_test (2048): min: 0.0545s, max: 0.0677s, avg: 0.0563s, p95: 0.0610s
tokio_spawn_file_test (2048): min: 0.0703s, max: 0.0988s, avg: 0.0727s, p95: 0.0815s
tokio_spawn_file_buffer_test (2048): min: 0.0892s, max: 0.2146s, avg: 0.0974s, p95: 0.1212s
tokio_async_spawn_blocking_test (2048): min: 0.0600s, max: 0.0786s, avg: 0.0640s, p95: 0.0715s
tokio_async_blocking_test (2048): min: 0.0633s, max: 0.0746s, avg: 0.0655s, p95: 0.0688s
tokio_par_async_spawn_blocking_test (2048): min: 0.0518s, max: 0.0552s, avg: 0.0528s, p95: 0.0540s
tokio_par_async_blocking_test (2048): min: 0.0552s, max: 0.0765s, avg: 0.0567s, p95: 0.0582s
tokio_par_sync_test (2048): min: 0.0546s, max: 0.0597s, avg: 0.0553s, p95: 0.0562s

So I guess I did a better job making the async version faster than I thought I did 😅.

Now to work out which machine is the special snowflake 😆

@tustvold
Contributor Author

Ok running on a c2-standard-16 on GCP I get

sync_file_test (2048): min: 0.0903s, max: 0.0995s, avg: 0.0912s, p95: 0.0916s
sync_mem_test (2048): min: 0.0922s, max: 0.0952s, avg: 0.0931s, p95: 0.0937s
par_sync_file_test (2048): min: 0.0852s, max: 0.0897s, avg: 0.0866s, p95: 0.0878s
tokio_sync_file_test (2048): min: 0.0858s, max: 0.0897s, avg: 0.0869s, p95: 0.0879s
tokio_spawn_file_test (2048): min: 0.1114s, max: 0.1144s, avg: 0.1125s, p95: 0.1136s
tokio_spawn_file_buffer_test (2048): min: 0.1641s, max: 0.1689s, avg: 0.1663s, p95: 0.1683s
tokio_async_spawn_blocking_test (2048): min: 0.0787s, max: 0.0810s, avg: 0.0794s, p95: 0.0798s
tokio_async_blocking_test (2048): min: 0.0887s, max: 0.0917s, avg: 0.0899s, p95: 0.0909s
tokio_par_async_spawn_blocking_test (2048): min: 0.0758s, max: 0.0792s, avg: 0.0769s, p95: 0.0778s
tokio_par_async_blocking_test (2048): min: 0.0863s, max: 0.0886s, avg: 0.0874s, p95: 0.0881s
tokio_par_sync_test (2048): min: 0.0859s, max: 0.0885s, avg: 0.0871s, p95: 0.0880s

We can see that tokio_async_spawn_blocking_test is performing slightly better than par_sync_file_test 🎉 This result was repeatable even when I moved the parquet file to a tmpfs-backed filesystem.

Just to check I haven't just been a muppet, I re-ran on my local machine and still get the same behaviour of tokio_par_async_spawn_blocking_test being significantly slower. I'm not really sure what is going on here; perhaps some CPU turbo shenanigans or something...

sync_file_test (2048): min: 0.0568s, max: 0.1135s, avg: 0.0584s, p95: 0.0592s
sync_mem_test (2048): min: 0.0582s, max: 0.0653s, avg: 0.0606s, p95: 0.0616s
par_sync_file_test (2048): min: 0.0558s, max: 0.0603s, avg: 0.0575s, p95: 0.0584s
tokio_sync_file_test (2048): min: 0.0559s, max: 0.0605s, avg: 0.0576s, p95: 0.0581s
tokio_spawn_file_test (2048): min: 0.0854s, max: 0.0881s, avg: 0.0864s, p95: 0.0874s
tokio_spawn_file_buffer_test (2048): min: 0.1163s, max: 0.1460s, avg: 0.1189s, p95: 0.1201s
tokio_async_spawn_blocking_test (2048): min: 0.0566s, max: 0.0598s, avg: 0.0576s, p95: 0.0592s
tokio_async_blocking_test (2048): min: 0.0599s, max: 0.0649s, avg: 0.0606s, p95: 0.0615s
tokio_par_async_spawn_blocking_test (2048): min: 0.0520s, max: 0.0876s, avg: 0.0584s, p95: 0.0780s
tokio_par_async_blocking_test (2048): min: 0.0543s, max: 0.0651s, avg: 0.0568s, p95: 0.0626s
tokio_par_sync_test (2048): min: 0.0509s, max: 0.0598s, avg: 0.0526s, p95: 0.0596s

Unfortunately, comparing the parquet SQL benchmarks of DataFusion master against parquet-async-wip on the GCP instance, there is still a non-trivial performance hit, although it is less severe than on my local machine.

select dict_10_optional from t                                                                            
                        time:   [9.4814 ms 9.5137 ms 9.5400 ms]
                        change: [+10.989% +11.387% +11.754%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_100_optional from t                                                                            
                        time:   [9.5785 ms 9.5907 ms 9.6034 ms]
                        change: [+8.6348% +8.8496% +9.0629%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_1000_optional from t                                                                            
                        time:   [9.6751 ms 9.6910 ms 9.7066 ms]
                        change: [+8.3536% +8.6366% +8.9063%] (p = 0.00 < 0.05)
                        Performance has regressed.

select count(*) from t where dict_10_required = 'prefix#0'                                                                            
                        time:   [8.3035 ms 8.3429 ms 8.3836 ms]
                        change: [+12.266% +14.359% +16.585%] (p = 0.00 < 0.05)
                        Performance has regressed.

select count(*) from t where dict_100_required = 'prefix#0'                                                                            
                        time:   [8.4855 ms 8.5323 ms 8.5800 ms]
                        change: [+16.358% +17.383% +18.456%] (p = 0.00 < 0.05)
                        Performance has regressed.

select count(*) from t where dict_1000_required = 'prefix#0'                                                                            
                        time:   [8.9166 ms 8.9591 ms 9.0018 ms]
                        change: [+16.238% +17.416% +18.634%] (p = 0.00 < 0.05)
                        Performance has regressed.

select i64_optional from t where dict_10_required = 'prefix#2' and dict_1000_required = 'prefix#10'                                                                            
                        time:   [22.442 ms 22.688 ms 22.946 ms]
                        change: [+21.372% +23.369% +25.354%] (p = 0.00 < 0.05)
                        Performance has regressed.

select i64_required from t where dict_10_required = 'prefix#2' and dict_1000_required = 'prefix#10'                                                                            
                        time:   [18.864 ms 19.025 ms 19.194 ms]
                        change: [+1.7532% +3.9070% +6.0920%] (p = 0.00 < 0.05)
                        Performance has regressed.

select string_optional from t where dict_10_required = 'prefix#1' and dict_1000_required = 'prefix#1...                                                                            
                        time:   [69.645 ms 70.503 ms 71.486 ms]
                        change: [-0.9622% +0.2143% +1.7061%] (p = 0.77 > 0.05)
                        No change in performance detected.

select string_required from t where dict_10_required = 'prefix#1' and dict_1000_required = 'prefix#1...                                                                            
                        time:   [103.78 ms 104.87 ms 106.12 ms]
                        change: [+1.0176% +2.2273% +3.5223%] (p = 0.00 < 0.05)
                        Performance has regressed.

select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_optional > 0                                                                            
                        time:   [26.913 ms 27.031 ms 27.150 ms]
                        change: [+20.687% +21.646% +22.596%] (p = 0.00 < 0.05)
                        Performance has regressed.

select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_optional > 0 #2                                                                            
                        time:   [26.962 ms 27.079 ms 27.197 ms]
                        change: [+21.371% +22.212% +23.070%] (p = 0.00 < 0.05)
                        Performance has regressed.

select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_required > 0                                                                            
                        time:   [24.484 ms 24.595 ms 24.707 ms]
                        change: [+17.505% +18.513% +19.520%] (p = 0.00 < 0.05)
                        Performance has regressed.

select distinct dict_10_required from t where dict_1000_optional is not NULL and i64_required > 0 #2                                                                            
                        time:   [24.540 ms 24.645 ms 24.752 ms]
                        change: [+18.075% +18.954% +19.857%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_10_optional, count(*) from t group by dict_10_optional                                                                            
                        time:   [15.281 ms 15.336 ms 15.392 ms]
                        change: [+19.444% +19.861% +20.336%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_10_optional, dict_100_optional, count(*) from t group by dict_10_optional, dict_100_opti...                                                                            
                        time:   [35.443 ms 35.511 ms 35.581 ms]
                        change: [+19.094% +19.483% +19.885%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_10_optional, dict_100_optional, MIN(f64_required), MAX(f64_required), AVG(f64_required) ...                                                                            
                        time:   [56.518 ms 56.620 ms 56.723 ms]
                        change: [+9.8955% +10.367% +10.835%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_10_optional, dict_100_optional, MIN(f64_optional), MAX(f64_optional), AVG(f64_optional) ...                                                                            
                        time:   [62.324 ms 62.449 ms 62.579 ms]
                        change: [+9.8747% +10.485% +11.087%] (p = 0.00 < 0.05)
                        Performance has regressed.

select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64_optional) ...                                                                            
                        time:   [57.326 ms 57.434 ms 57.542 ms]
                        change: [+6.9184% +7.5079% +8.0877%] (p = 0.00 < 0.05)
                        Performance has regressed.

I need to think a bit further on this; being able to separate IO from decode is pretty compelling on various levels if we can do it, but it is unfortunate if it comes with a performance regression...

@tustvold
Contributor Author

Closing in favor of #1605
