This is a reproducer for apache/datafusion#14286 (comment).
- Run the server
cargo run
- Trigger the bug with
cargo run --bin client
in a different terminal. This defaults to 500 record batches, which triggers the bug on my system. You can send a different number of record batches with
cargo run --bin client <N_BATCHES>
- Can we trigger this without the networking object store? LocalFileSystem or even Mock? -> it doesn't seem so.
Results:
- the error is triggered in a flightsql server with a network-based object store and a sufficiently large input data set
- not triggered with MockIO object store even for large data set
- not triggered with real object store with small (single recordbatch) data set
- not triggered with real object store with multiple (~10) artificially small flight data
- the issue is triggered above a certain threshold of data or record batches... may be system dependent? Is it crossing a threshold for datafusion to start parallelizing? Is it crossing some threshold of executor usage such that tokio spawns new workers or moves things between workers or...?
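Since the threshold may differ between systems, one way to locate it is to bisect over the number of record batches. The following is only a rough sketch, not part of this repo: `find_threshold` and `client_fails` are hypothetical helpers, the search assumes that failure is monotonic in the batch count (which a flaky bug may violate), and `client_fails` assumes the client exits with a non-zero status when it hits the error.

```rust
use std::process::Command;

/// Returns the smallest `n` in `lo..=hi` for which `fails(n)` is true,
/// or `None` if the range is empty or even `hi` does not fail.
/// Assumption: if `n` fails, every larger value fails too.
fn find_threshold(mut lo: u64, mut hi: u64, mut fails: impl FnMut(u64) -> bool) -> Option<u64> {
    if lo > hi || !fails(hi) {
        return None;
    }
    // Invariant: `hi` is known to fail; tested values below `lo` passed.
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if fails(mid) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    Some(lo)
}

/// Hypothetical predicate: runs the client with `n_batches` and treats a
/// non-zero exit status (or a failure to launch) as "bug triggered".
/// Assumes the server is already running in another terminal.
#[allow(dead_code)]
fn client_fails(n_batches: u64) -> bool {
    let status = Command::new("cargo")
        .args(["run", "--bin", "client"])
        .arg(n_batches.to_string())
        .status();
    !matches!(status, Ok(s) if s.success())
}

fn main() {
    // Stand-in predicate so the sketch runs without the server: pretend
    // that 137 or more batches trigger the failure (a made-up number).
    let simulated = |n: u64| n >= 137;
    match find_threshold(1, 500, simulated) {
        Some(n) => println!("smallest failing batch count: {n}"),
        None => println!("no failure up to the upper bound"),
    }
}
```

To try this against the real server, pass `client_fails` instead of the simulated closure. Because the failure may be timing dependent, repeating each probe a few times before trusting a "pass" is probably wise.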
This application is instrumented for tokio console (cargo install --locked tokio-console). You need to put the following in .cargo/config.toml:
[build]
rustflags = ["--cfg", "tokio_unstable"]
Then, with the server running, run tokio-console in another terminal.
Incidentally, by disabling the dedicated executor, this repo also demonstrates the problem we're looking to solve in the first place:
- Disable the dedicated-executor feature (a default feature) on the server:
cargo run --no-default-features
- Run the client. On my machine, this survived much longer than with the dedicated executor, but consistently displayed a timeout between client and server with
cargo run --bin client 5000
- This even shows a failure with both server and client running in release mode:
cargo run --no-default-features --release
For release mode I needed to transfer a bit more data with
cargo run --release --bin client 50000
However, if this command succeeded it would result in a ~3.5GB parquet file in the object store, so nothing outrageous.
The symptoms in both cases are a client timeout while waiting for a server response and a failed upload to minio (complete data loss).