
test_cli_start_stop instability #9801

Closed · jcsp opened this issue Nov 19, 2024 · 4 comments · Fixed by #10000
Labels: a/test (Area: related to testing) · t/bug (Issue Type: Bug)


jcsp commented Nov 19, 2024

First log error on pageserver_1:

```
(165, '2024-11-19T04:07:35.840135Z  WARN request{method=GET path=/v1/location_config request_id=2e46b7ae-30c8-4dec-8318-283fc0d8003e}: request was dropped before completing\n')
```

Hint: use `scripts/check_allowed_errors.sh` to test any new allowed_error you add.

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9797/11905853337/index.html#/testresult/dabeadf7e5796556

Something wrong with the controller's behavior on startup?

erikgrinaker commented

I thought this would be a race where either the pageserver or the storage controller was shutting down, causing the request to be cancelled and dropped. But that doesn't appear to be the case.

The pageserver was in the middle of starting up:

```
2024-11-19T04:07:35.829521Z  INFO initial_size_calculation{tenant_id=29bccf2cb5b632f310d8cf37d38c917d shard_id=0000 timeline_id=e26d5c943b7b063eb20ed1942ed6916e}:logical_size_calculation_task:tokio_epoll_uring_ext::thread_local_system{thread_local=0 attempt_no=0}: successfully launched system
2024-11-19T04:07:35.840135Z  WARN request{method=GET path=/v1/location_config request_id=2e46b7ae-30c8-4dec-8318-283fc0d8003e}: request was dropped before completing
2024-11-19T04:07:35.867943Z  INFO attach{tenant_id=29bccf2cb5b632f310d8cf37d38c917d shard_id=0000 gen=00000002}: Warm-up complete
2024-11-19T04:07:35.867994Z  INFO Remote part of initial load completed (1.168s since start) elapsed_ms=1168
2024-11-19T04:07:35.868009Z  INFO Initial load completed (1.168s since start) elapsed_ms=1168
2024-11-19T04:07:35.868030Z  INFO Starting background jobs (1.168s since start) elapsed_ms=1168
2024-11-19T04:07:35.868038Z  INFO Startup complete (1.168s since start) elapsed_ms=1168
```

As was the storage controller, when it saw a failing request:

```
2024-11-19T04:07:35.837141Z  INFO startup_reconcile: Call to node 1 (localhost:19543) management API failed, will retry (attempt 0): receive body: error sending request for url (http://localhost:19543/v1/location_config)
2024-11-19T04:07:35.840325Z  INFO startup_reconcile: Call to node 1 (localhost:19543) management API succeeded after 1 retries
2024-11-19T04:07:35.840394Z  INFO startup_reconcile: Received 1 shard statuses from pageserver 1
```


erikgrinaker commented Dec 3, 2024

Ah, the storage controller hit a 1-second timeout when listing shards, and cancelled the request:

```rust
tracing::info!("Scanning shards on node {node}...");
let timeout = Duration::from_secs(1);
```

```
2024-11-19T04:07:34.835113Z  INFO startup_reconcile: Scanning shards on node 1 (localhost)...
[...]
2024-11-19T04:07:35.837141Z  INFO startup_reconcile: Call to node 1 (localhost:19543) management API failed, will retry (attempt 0): receive body: error sending request for url (http://localhost:19543/v1/location_config)
```

@jcsp Before doing anything here, I'd like to understand the motivation a bit better.

  • Is a 1-second timeout too aggressive? Listing thousands of shards could take a while.
  • Why do we log warnings when the request was dropped because the client went away? That seems benign.

We should also improve the error message to say that the request failed because of a timeout.


jcsp commented Dec 3, 2024

> Is a 1-second timeout too aggressive? Listing thousands of shards could take a while.

It's meant to be fast -- it shouldn't have to wait on any async locks, and the main cost should be serialization. The short timeout is a bit odd, though: `scan_node_locations` has a deadline-driven wrapper that should avoid holding anything up if a node isn't responsive (which maybe didn't exist when that timeout was first set?).

This is probably a combination of the timeout being very short and the test machines being overloaded. It's probably fine to bump the timeout.
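
As a rough sketch of that deadline-driven shape (names and structure are illustrative, not the actual `scan_node_locations` code): each node is scanned concurrently under its own short timeout, while a global deadline bounds the whole startup scan.

```rust
use std::time::Duration;
use tokio::time::timeout;

// Illustrative sketch, not the storage controller's actual code: scan all
// nodes in parallel, each under a short per-node timeout, with a global
// deadline so one unresponsive node cannot hold up startup.
async fn scan_all_nodes(nodes: Vec<String>) {
    const PER_NODE_TIMEOUT: Duration = Duration::from_secs(1);
    const OVERALL_DEADLINE: Duration = Duration::from_secs(15);

    let scans = nodes.into_iter().map(|node| async move {
        match timeout(PER_NODE_TIMEOUT, scan_node(&node)).await {
            Ok(Ok(n)) => println!("node {node}: received {n} shard statuses"),
            Ok(Err(e)) => println!("node {node}: scan failed, will retry: {e}"),
            Err(_) => println!("node {node}: scan timed out, will retry"),
        }
    });

    // Bound the whole parallel scan; startup proceeds even if some nodes
    // never answered within the deadline.
    if timeout(OVERALL_DEADLINE, futures::future::join_all(scans))
        .await
        .is_err()
    {
        println!("startup scan hit the overall deadline; proceeding anyway");
    }
}

// Placeholder for the GET /v1/location_config call to one pageserver.
async fn scan_node(_node: &str) -> Result<usize, std::io::Error> {
    Ok(1)
}
```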

> Why do we log warnings when the request was dropped because the client went away? That seems benign.

It's not essential, but it makes it more obvious what's happening when we see a request start and then never complete because the client went away. Otherwise, when reading a log, we might think the request was still stuck in flight.
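
A minimal sketch of the idea behind that warning, assuming a hypothetical `RequestLogGuard` type (the pageserver's actual request-span machinery differs): if the client disconnects, the server drops the in-flight handler future, and the guard's `Drop` impl logs the warning so the request doesn't just vanish from the log.

```rust
// Hypothetical guard, for illustration only: logs a warning if the request
// handler future is dropped before it marks itself complete.
struct RequestLogGuard {
    completed: bool,
}

impl Drop for RequestLogGuard {
    fn drop(&mut self) {
        if !self.completed {
            // Fires when the client went away and the handler was cancelled.
            tracing::warn!("request was dropped before completing");
        }
    }
}

async fn handle_request() {
    let mut guard = RequestLogGuard { completed: false };
    // ... do the actual work; if the connection closes, this future is
    // dropped here and the guard fires the warning ...
    guard.completed = true;
}
```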

erikgrinaker commented

> We should also improve the error message to say that the request failed because of a timeout.

reqwest doesn't include the source error when displaying errors. It used to, but that was removed in seanmonstar/reqwest#2199. That seems unfortunate; I'll add a custom formatter for it.
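
A minimal sketch of such a formatter, walking the standard `std::error::Error` source chain (illustrative, not the exact code from the fix):

```rust
use std::error::Error as _;
use std::fmt::Write as _;

// Append each error in the source chain to the top-level message, so e.g.
// a timeout buried inside the client stack shows up in the displayed error.
fn format_with_sources(err: &reqwest::Error) -> String {
    let mut out = err.to_string();
    let mut source = err.source();
    while let Some(cause) = source {
        let _ = write!(out, ": {cause}");
        source = cause.source();
    }
    out
}
```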

github-merge-queue bot pushed a commit that referenced this issue Dec 4, 2024
## Problem

Reqwest errors don't include details about the inner source error. This
means that we get opaque errors like:

```
receive body: error sending request for url (http://localhost:9898/v1/location_config)
```

Instead of the more helpful:

```
receive body: error sending request for url (http://localhost:9898/v1/location_config): operation timed out
```

Touches #9801.

## Summary of changes

Include the source error for `reqwest::Error` wherever it's displayed.
github-merge-queue bot pushed a commit that referenced this issue Dec 5, 2024
## Problem

The node shard scan timeout of 1 second is a bit too aggressive, and we've seen it cause test failures. The scans are performed in parallel across nodes, and the entire operation has a 15-second timeout.

Resolves #9801.

## Summary of changes

Increase the timeout to 5 seconds. This is still short enough to time out on a network failure and retry successfully within the 15-second overall deadline.
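
For reference, against the snippet quoted earlier, the change would look roughly like this (illustrative; not the actual diff):

```rust
// Per-node scan timeout after the fix: 5 seconds tolerates loaded test
// machines, while a timed-out attempt can still be retried within the
// 15-second overall deadline.
let timeout = Duration::from_secs(5);
```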