DataFusion dedicated executor investigation

Quickstart

Run the server: cargo run
Trigger the bug with cargo run --bin client in a different terminal. This defaults to 500 record batches, which triggers the bug on my system. You can send a different number of record batches with cargo run --bin client <N_BATCHES>

Can we trigger this without the network-based object store, using LocalFileSystem or even a mock? It doesn't seem so.

Results:

The error is triggered in a FlightSQL server with a network-based object store and a sufficiently large input data set.
Not triggered with the MockIO object store, even for a large data set.
Not triggered with a real object store and a small (single record batch) data set.
Not triggered with a real object store and multiple (~10) artificially small Flight data messages.
The issue is triggered above a certain threshold of data or record batches, which may be system dependent. Is it crossing a threshold at which DataFusion starts parallelizing? Is it crossing some threshold of executor usage such that tokio spawns new workers or moves tasks between workers, or something else?

This application is instrumented for tokio console (cargo install --locked tokio-console). You need to put the following in .cargo/config.toml:

[build]
rustflags = ["--cfg", "tokio_unstable"]

and, after starting the server, run tokio-console in another terminal.

Incidentally, by disabling the dedicated executor, this repo also demonstrates the problem we're looking to solve in the first place:

Disable the dedicated-executor feature (a default feature) on the server: cargo run --no-default-features
Run the client. On my machine, this survived much longer than with the dedicated executor, but consistently hit a timeout between client and server with cargo run --bin client 5000.
The failure also occurs with both server and client running in release mode: cargo run --no-default-features --release. In release mode I needed to transfer more data, with cargo run --release --bin client 50000. Had this command succeeded, it would have produced a ~3.5GB parquet file in the object store, so nothing outrageous.

The symptoms in both cases are a client timeout while waiting for a server response, and a failed upload to minio (complete data loss).
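For readers unfamiliar with the term, the general idea behind a dedicated executor can be sketched as follows. This is not the code in this repo (which presumably runs on tokio runtimes). It is a std-only illustration, and the DedicatedExecutor type, its spawn method, and the "cpu-worker" thread name are all hypothetical. CPU-heavy jobs are handed to a separate worker thread so that the calling thread, which stands in for the thread serving network I/O, is not blocked while they run.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for a dedicated executor: one worker thread that
// runs CPU-bound jobs so the caller's thread stays free for I/O.
type Job = Box<dyn FnOnce() + Send + 'static>;

struct DedicatedExecutor {
    sender: Option<mpsc::Sender<Job>>,
    worker: Option<thread::JoinHandle<()>>,
}

impl DedicatedExecutor {
    fn new(name: &str) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let worker = thread::Builder::new()
            .name(name.to_string())
            .spawn(move || {
                // Runs until every sender has been dropped.
                for job in receiver {
                    job();
                }
            })
            .expect("failed to spawn worker thread");
        Self {
            sender: Some(sender),
            worker: Some(worker),
        }
    }

    // Submit a job; the returned receiver yields its result.
    fn spawn<T, F>(&self, f: F) -> mpsc::Receiver<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // Ignore the error if the caller dropped the receiver.
            let _ = tx.send(f());
        });
        self.sender
            .as_ref()
            .expect("executor already shut down")
            .send(job)
            .expect("worker thread has exited");
        rx
    }
}

impl Drop for DedicatedExecutor {
    fn drop(&mut self) {
        // Closing the channel ends the worker loop; then wait for it.
        drop(self.sender.take());
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

fn main() {
    let exec = DedicatedExecutor::new("cpu-worker");
    // The caller could keep servicing I/O here while the job runs.
    let result = exec.spawn(|| (1..=1_000u64).sum::<u64>());
    println!("sum = {}", result.recv().unwrap());
}
```

In an async server the same separation would typically be made between two runtimes rather than two plain threads, but the intent is the same: long-running compute should not delay the tasks that keep client connections and object store uploads alive.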
