Report on timeout errors #1

Open · alamb opened this issue Mar 7, 2025 · 3 comments

alamb commented Mar 7, 2025

This issue contains some notes I took while looking at the code in this repo from @djanderson

TLDR: I think the timeout errors you are seeing come from the gRPC CLIENT -- they basically have nothing to do with how the server is configured. I didn't see any difference in behavior with or without the DedicatedExecutor.

Background

As background, gRPC uses HTTP requests / responses and doesn't rely on long-lived (TCP) connections. Typically gRPC clients, including tonic, have a maximum duration for any particular request. Even if the request is actively transferring data, once the timeout is reached the client will close the connection.
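For reference, here is a minimal sketch of where that per-request timeout comes from in a tonic client (the address and durations are placeholders; this just mirrors the general shape of the client in this repo):

```rust
use std::time::Duration;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let channel = Endpoint::from_static("http://localhost:50051")
        // How long to wait for the connection to be established.
        .connect_timeout(Duration::from_secs(20))
        // Maximum duration of any single request; once this elapses the
        // client cancels the call, even if data is still streaming.
        .timeout(Duration::from_secs(20))
        .connect()
        .await?;

    // ... build a Flight / FlightSQL client on top of `channel` ...
    let _ = channel;
    Ok(())
}
```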

Running example as is

Without any other modifications, I see this on my local machine:

cargo run --release --bin client -- 500000
error: Ipc error: Status { code: Cancelled, message: "Timeout expired", source: Some(tonic::transport::Error(Transport, TimeoutExpired(()))) }

This is the classic "client timed out" error from tonic (the Rust gRPC stack).
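If it helps, this is roughly how you could check for that case programmatically (a sketch only; in this repo the `tonic::Status` is wrapped in an Arrow IPC error, so you would first have to dig it out of that wrapper):

```rust
use tonic::{Code, Status};

/// Returns true if a gRPC status looks like tonic's client-side
/// per-request timeout (as opposed to an error reported by the server).
fn is_client_timeout(status: &Status) -> bool {
    status.code() == Code::Cancelled && status.message().contains("Timeout expired")
}
```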

Increased client timeout

When I cranked up the tonic / client timeout like this:

diff --git a/src/bin/client.rs b/src/bin/client.rs
index ca967d1..7a0f9e3 100644
--- a/src/bin/client.rs
+++ b/src/bin/client.rs
@@ -31,7 +31,7 @@ async fn main() {
     let endpoint = Endpoint::new("http://localhost:50051")
         .unwrap()
         .connect_timeout(Duration::from_secs(20))
-        .timeout(Duration::from_secs(20))
+        .timeout(Duration::from_secs(2000))
         .tcp_nodelay(true) // Disable Nagle's Algorithm since we don't want packets to wait
         .tcp_keepalive(Option::Some(Duration::from_secs(3600)))
         .http2_keep_alive_interval(Duration::from_secs(300))

When I then ran the client, it did eventually error, this time with an h2 error:

cargo run --release --bin client -- 500000
...
error: Ipc error: Status { code: Cancelled, message: "h2 protocol error: http2 error", source: Some(tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: Reset(StreamId(1), CANCEL, Remote) }))) }

But looking at the server, I believe the problem is that the localstack container ran out of disk space. The server panics like this:

called `Result::unwrap()` on an `Err` value: Panic { msg: "called `Result::unwrap()` on an `Err` value: External(External(Generic { store: \"S3\", source: Reqwest { retries: 10, max_retries: 10, elapsed: 2.661978125s, retry_timeout: 180s, source: reqwest::Error { kind: Status(500), url: \"http://localhost:55207/warehouse/test.parquet?partNumber=271&uploadId=J6FMT9AUw8vDynB7KZhIqB76Ym2xDnX4nhJrPRZc1zsy5wwR293qVHXqP4TtChGEvAryN7Kd0i_4-Oag9g3AGA7jkoSA6QhUGSabpWukW2S5vGSABZ3COfkvKVDHQ1Dk\" } } }))" }
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed

And when I try to restart the server, it produces an error like:

essage: "mkdir /var/lib/docker/overlay2/b160374a6b94911559c76c3ee6a29e6cc2c8d7201094e9db99d48e32dc1ae837-init: no space left on device

thread 'main' panicked at src/localstack.rs:15:10:
called `Result::unwrap()` on an `Err` value: Client(CreateContainer(DockerResponseServerError { status_code: 500, message: "mkdir /var/lib/docker/overlay2/b160374a6b94911559c76c3ee6a29e6cc2c8d7201094e9db99d48e32dc1ae837-init: no space left on device" }))
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: parquet_sink_dedicated_exec_repro::localstack::localstack_container::{{closure}}
   4: parquet_sink_dedicated_exec_repro::main::{{closure}}
   5: tokio::runtime::park::CachedParkThread::block_on
   6: tokio::runtime::context::runtime::enter_runtime
   7: tokio::runtime::runtime::Runtime::block_on
   8: parquet_sink_dedicated_exec_repro::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
alamb (Author) commented Mar 7, 2025

BTW, I think the solution for the timeouts you are hitting is to make multiple put requests rather than trying to do the entire thing in a single call to put.

That might be more complicated on the server side if you need to track state or something. However, it would also make the upload more resilient to network errors during transport.
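As a sketch of the general idea (the chunk size and the `put_chunk` closure are placeholders -- the real RPC would be whatever this repo uses for ingest):

```rust
use std::future::Future;

/// Split an upload into independent calls so the client's per-request
/// timeout applies to each chunk rather than to the entire upload.
/// `put_chunk` is a stand-in for the real ingest RPC.
async fn upload_in_chunks<T, F, Fut, E>(
    items: Vec<T>,
    chunk_size: usize,
    mut put_chunk: F,
) -> Result<(), E>
where
    T: Clone,
    F: FnMut(Vec<T>) -> Fut,
    Fut: Future<Output = Result<(), E>>,
{
    for chunk in items.chunks(chunk_size) {
        // Each chunk is its own request; a timed-out chunk can be retried
        // without redoing everything that already succeeded.
        put_chunk(chunk.to_vec()).await?;
    }
    Ok(())
}
```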

djanderson (Owner) commented Mar 7, 2025

First, thanks for the second set of eyes, this is super helpful. I was previously seeing both client timeouts and timeouts between the object store and the server (more in line with executor starvation on the server), but I totally agree that I haven't seen the object store <-> server timeouts with the dedicated executor, so I was likely just attributing the client timeouts to what was causing them before.

Interesting idea about the multi-put calls. The major downside is that it would require re-inventing most of the bulk ingest capability of the Flight SQL API. I also understand that gRPC is supposed to be usable for both bulk and streaming paradigms, so likely I just need to configure the client more aggressively to handle longer-lived streams?

alamb (Author) commented Mar 7, 2025

> Interesting idea about the multi-put calls. The major downside is that it would require re-inventing most of the bulk ingest capability of the Flight SQL API. I also understand that gRPC is supposed to be usable for both bulk and streaming paradigms, so likely I just need to configure the client more aggressively to handle longer-lived streams?

Yes, I think if you can configure the client to allow longer timeouts, that would work well.

FWIW, we found that many of the gRPC stacks (like Envoy in k8s, as I recall) had the same aggressive 30-second timeouts, so we had to adjust timeouts not just in the Rust clients but in the golang ones as well.
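For the long-lived stream case, one option (a sketch only, not something this repo does today) is to skip tonic's per-request timeout entirely and rely on a connect timeout plus HTTP/2 keepalive to detect dead connections; any proxies in between (e.g. Envoy) would still need their own timeouts raised separately:

```rust
use std::time::Duration;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let channel = Endpoint::from_static("http://localhost:50051")
        // Still bound how long we wait to establish the connection...
        .connect_timeout(Duration::from_secs(20))
        // ...but set no `.timeout(...)`, so an individual streaming call is
        // not cancelled by the client after a fixed duration. Keepalive
        // pings detect connections that have actually gone dead.
        .http2_keep_alive_interval(Duration::from_secs(30))
        .keep_alive_timeout(Duration::from_secs(10))
        .keep_alive_while_idle(true)
        .connect()
        .await?;

    let _ = channel;
    Ok(())
}
```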
