test_cli_start_stop instability #9801
Comments
I thought this would be a race where either the pageserver or storage controller was shutting down, causing the request to be cancelled and dropped. But that doesn't appear to be the case. The pageserver was in the middle of starting up:
As was the storage controller, when it saw a failing request:
Ah, the storage controller hit a 1 second timeout when listing shards, and cancelled the request: neon/storage_controller/src/service.rs Lines 790 to 791 in aaee713
@jcsp Before doing anything here, I'd like to understand the motivation a bit better.
We should also improve the error message to say that the request failed because of a timeout.
It's meant to be fast -- should not have to wait for any async locks, and main cost should be serialization. The short timeout is a bit weird though: This is probably a combination of the timeout being very short and the test machines being overloaded. It's probably fine to bump the timeout.
It's not essential, but it helps make it more obvious what's happening if we see a request start and then never complete because the client went away. Otherwise, when reading a log, we might think it was stuck in flight.
reqwest doesn't include the source error when displaying errors. It used to, but that was removed in seanmonstar/reqwest#2199. That seems unfortunate; I'll add a custom formatter for it.
## Problem

Reqwest errors don't include details about the inner source error. This means that we get opaque errors like:

```
receive body: error sending request for url (http://localhost:9898/v1/location_config)
```

Instead of the more helpful:

```
receive body: error sending request for url (http://localhost:9898/v1/location_config): operation timed out
```

Touches #9801.

## Summary of changes

Include the source error for `reqwest::Error` wherever it's displayed.
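One way such a formatter could look is sketched below. This is not the code from the PR: `WithSources`, `Outer`, and `Inner` are hypothetical names, and the two stand-in error types only mimic an error that hides its source in `Display`, so the example avoids depending on `reqwest` itself. It relies only on the standard `std::error::Error::source` chain.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical wrapper (not from the neon codebase): displays an error
/// followed by every error in its `source()` chain, separated by ": ".
pub struct WithSources<'a>(pub &'a (dyn Error + 'static));

impl fmt::Display for WithSources<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Top-level message first.
        write!(f, "{}", self.0)?;
        // Then walk the chain of underlying causes.
        let mut source = self.0.source();
        while let Some(err) = source {
            write!(f, ": {}", err)?;
            source = err.source();
        }
        Ok(())
    }
}

/// Stand-in for the inner cause (assumption, for illustration only).
#[derive(Debug)]
pub struct Inner;

impl fmt::Display for Inner {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "operation timed out")
    }
}

impl Error for Inner {}

/// Stand-in for an error whose `Display` omits its source.
#[derive(Debug)]
pub struct Outer {
    pub source: Inner,
}

impl fmt::Display for Outer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "error sending request for url (http://localhost:9898/v1/location_config)"
        )
    }
}

impl Error for Outer {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}
```

Formatting `WithSources(&err)` for an `Outer` value appends the inner message after the outer one. A wrapper like this can repeat text if an error type already prints its source in its own `Display`, so it fits best with types that deliberately leave the source out.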
## Problem

The node shard scan timeout of 1 second is a bit too aggressive, and we've seen this cause test failures. The scans are performed in parallel across nodes, and the entire operation has a 15 second timeout.

Resolves #9801.

## Summary of changes

Increase the timeout to 5 seconds. This is still enough to time out on a network failure and retry successfully within 15 seconds.
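The timing argument above can be checked with a little arithmetic. The sketch below is illustrative only: `worst_case_attempts` is a hypothetical helper, not part of the storage controller, and it assumes every attempt runs until its timeout expires, with no backoff between attempts.

```rust
use std::time::Duration;

/// Hypothetical helper: how many attempts fit into the overall budget if
/// each attempt runs for the full per-attempt timeout (no backoff assumed).
pub fn worst_case_attempts(overall: Duration, per_attempt: Duration) -> u32 {
    if per_attempt.is_zero() {
        return 0;
    }
    // Integer division: only attempts that fit entirely are counted.
    (overall.as_millis() / per_attempt.as_millis()) as u32
}

/// True if one attempt can time out and a retry can still run to its own
/// timeout within the overall budget.
pub fn can_retry_after_timeout(overall: Duration, per_attempt: Duration) -> bool {
    worst_case_attempts(overall, per_attempt) >= 2
}
```

With a 15 second overall budget and a 5 second per-node timeout, three full-length attempts fit, so one timed-out attempt still leaves room for a retry. Any backoff between attempts would reduce that margin.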
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9797/11905853337/index.html#/testresult/dabeadf7e5796556
Something wrong with controller's behavior on startup?