Report on timeout errors #1

Open · alamb opened this issue Mar 7, 2025 · 3 comments

alamb commented Mar 7, 2025

This issue contains some notes I took while looking at the code in this repo from @djanderson

TLDR: I think the timeout errors you are seeing come from the gRPC CLIENT -- they basically have nothing to do with how the server is configured. I didn't see any difference in behavior with or without the DedicatedExecutor.

Background

As background, gRPC uses HTTP requests / responses and doesn't rely on long-lived (TCP) connections. Typically gRPC clients, including tonic, have a maximum duration for any particular request. Even if the request is actively transferring data, once the timeout is reached the client will close the connection.
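For reference, here is a minimal sketch of where that per-request timeout comes from in a tonic client (the address and durations are placeholders; this just mirrors the general shape of the client in this repo):

```rust
use std::time::Duration;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let channel = Endpoint::from_static("http://localhost:50051")
        // How long to wait for the connection to be established.
        .connect_timeout(Duration::from_secs(20))
        // Maximum duration of any single request; once this elapses the
        // client cancels the call, even if data is still streaming.
        .timeout(Duration::from_secs(20))
        .connect()
        .await?;

    // ... build a Flight / FlightSQL client on top of `channel` ...
    let _ = channel;
    Ok(())
}
```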

Running example as is

Without any other modifications, I see this on my local machine:

cargo run --release --bin client -- 500000
error: Ipc error: Status { code: Cancelled, message: "Timeout expired", source: Some(tonic::transport::Error(Transport, TimeoutExpired(()))) }

This is the classic "client timed out" error from tonic (the Rust gRPC stack).
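If it helps, this is roughly how you could check for that case programmatically (a sketch only; in this repo the `tonic::Status` is wrapped in an Arrow IPC error, so you would first have to dig it out of that wrapper):

```rust
use tonic::{Code, Status};

/// Returns true if a gRPC status looks like tonic's client-side
/// per-request timeout (as opposed to an error reported by the server).
fn is_client_timeout(status: &Status) -> bool {
    status.code() == Code::Cancelled && status.message().contains("Timeout expired")
}
```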

Increased client timeout

When I cranked up the tonic / client timeout like this:

diff --git a/src/bin/client.rs b/src/bin/client.rs
index ca967d1..7a0f9e3 100644
--- a/src/bin/client.rs
+++ b/src/bin/client.rs
@@ -31,7 +31,7 @@ async fn main() {
     let endpoint = Endpoint::new("http://localhost:50051")
         .unwrap()
         .connect_timeout(Duration::from_secs(20))
-        .timeout(Duration::from_secs(20))
+        .timeout(Duration::from_secs(2000))
         .tcp_nodelay(true) // Disable Nagle's Algorithm since we don't want packets to wait
         .tcp_keepalive(Option::Some(Duration::from_secs(3600)))
         .http2_keep_alive_interval(Duration::from_secs(300))

When I then ran the client, it did eventually error, this time with an h2 error:

cargo run --release --bin client -- 500000
...
error: Ipc error: Status { code: Cancelled, message: "h2 protocol error: http2 error", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(1), CANCEL, Remote) }))) }

But looking at the server, I believe the problem is that the localstack container ran out of disk space. The server panics like this:

called `Result::unwrap()` on an `Err` value: Panic { msg: "called `Result::unwrap()` on an `Err` value: External(External(Generic { store: \"S3\", source: Reqwest { retries: 10, max_retries: 10, elapsed: 2.661978125s, retry_timeout: 180s, source: reqwest::Error { kind: Status(500), url: \"http://localhost:55207/warehouse/test.parquet?partNumber=271&uploadId=J6FMT9AUw8vDynB7KZhIqB76Ym2xDnX4nhJrPRZc1zsy5wwR293qVHXqP4TtChGEvAryN7Kd0i_4-Oag9g3AGA7jkoSA6QhUGSabpWukW2S5vGSABZ3COfkvKVDHQ1Dk\" } } }))" }
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed

And when I try to restart the server, it produces an error like:

essage: "mkdir /var/lib/docker/overlay2/b160374a6b94911559c76c3ee6a29e6cc2c8d7201094e9db99d48e32dc1ae837-init: no space left on device

thread 'main' panicked at src/localstack.rs:15:10:
called `Result::unwrap()` on an `Err` value: Client(CreateContainer(DockerResponseServerError { status_code: 500, message: "mkdir /var/lib/docker/overlay2/b160374a6b94911559c76c3ee6a29e6cc2c8d7201094e9db99d48e32dc1ae837-init: no space left on device" }))
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: parquet_sink_dedicated_exec_repro::localstack::localstack_container::{{closure}}
   4: parquet_sink_dedicated_exec_repro::main::{{closure}}
   5: tokio::runtime::park::CachedParkThread::block_on
   6: tokio::runtime::context::runtime::enter_runtime
   7: tokio::runtime::runtime::Runtime::block_on
   8: parquet_sink_dedicated_exec_repro::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
alamb (Author) commented Mar 7, 2025

BTW, I think the solution for the timeouts you are hitting is to make multiple put requests rather than trying to do the entire thing in a single call to put.

That might be more complicated on the server side if you need to track state or something. However, it would also make the upload more resilient to network errors during transport.
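As a sketch of the general idea (the chunk size and the `put_chunk` closure are placeholders -- the real RPC would be whatever this repo uses for ingest):

```rust
use std::future::Future;

/// Split an upload into independent calls so the client's per-request
/// timeout applies to each chunk rather than to the entire upload.
/// `put_chunk` is a stand-in for the real ingest RPC.
async fn upload_in_chunks<T, F, Fut, E>(
    items: Vec<T>,
    chunk_size: usize,
    mut put_chunk: F,
) -> Result<(), E>
where
    T: Clone,
    F: FnMut(Vec<T>) -> Fut,
    Fut: Future<Output = Result<(), E>>,
{
    for chunk in items.chunks(chunk_size) {
        // Each chunk is its own request; a timed-out chunk can be retried
        // without redoing everything that already succeeded.
        put_chunk(chunk.to_vec()).await?;
    }
    Ok(())
}
```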

djanderson (Owner) commented Mar 7, 2025

First, thanks for the second set of eyes, this is super helpful. I was previously seeing both client timeouts and timeouts between the object store and the server (more in line with executor starvation on the server), but I totally agree that I haven't seen the object store <-> server timeouts with the dedicated executor, so I was likely just attributing the client timeouts to what was causing them before.

Interesting idea about the multi-put calls. The major downside is that it would require re-inventing most of the bulk ingest capability of the Flight SQL API. I also understand that gRPC is supposed to be usable for both bulk and streaming paradigms, so likely I just need to configure the client more aggressively to handle longer-lived streams?

alamb (Author) commented Mar 7, 2025

> Interesting idea about the multi-put calls. The major downside is that it would require re-inventing most of the bulk ingest capability of the Flight SQL API. I also understand that gRPC is supposed to be usable for both bulk and streaming paradigms, so likely I just need to configure the client more aggressively to handle longer-lived streams?

Yes, I think if you can configure the client to allow longer timeouts, that would work well.

FWIW, we found that many of the gRPC stacks (like Envoy in k8s, as I recall) had the same aggressive 30-second timeouts, so we had to adjust timeouts not just in the Rust clients but in the golang ones as well.
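For the long-lived stream case, one option (a sketch only, not something this repo does today) is to skip tonic's per-request timeout entirely and rely on a connect timeout plus HTTP/2 keepalive to detect dead connections; any proxies in between (e.g. Envoy) would still need their own timeouts raised separately:

```rust
use std::time::Duration;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let channel = Endpoint::from_static("http://localhost:50051")
        // Still bound how long we wait to establish the connection...
        .connect_timeout(Duration::from_secs(20))
        // ...but set no `.timeout(...)`, so an individual streaming call is
        // not cancelled by the client after a fixed duration. Keepalive
        // pings detect connections that have actually gone dead.
        .http2_keep_alive_interval(Duration::from_secs(30))
        .keep_alive_timeout(Duration::from_secs(10))
        .keep_alive_while_idle(true)
        .connect()
        .await?;

    let _ = channel;
    Ok(())
}
```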
