
test_cli_start_stop instability #9801

Closed · jcsp opened this issue Nov 19, 2024 · 4 comments · Fixed by #10000
Labels: a/test (Area: related to testing) · t/bug (Issue Type: Bug)


jcsp commented Nov 19, 2024

First log error on pageserver_1:

```
(165, '2024-11-19T04:07:35.840135Z  WARN request{method=GET path=/v1/location_config request_id=2e46b7ae-30c8-4dec-8318-283fc0d8003e}: request was dropped before completing\n')
```

Hint: use `scripts/check_allowed_errors.sh` to test any new allowed_error you add.

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9797/11905853337/index.html#/testresult/dabeadf7e5796556

Something wrong with the controller's behavior on startup?

erikgrinaker commented

I thought this would be a race where either the pageserver or the storage controller was shutting down, causing the request to be cancelled and dropped. But that doesn't appear to be the case.

The pageserver was in the middle of starting up:

```
2024-11-19T04:07:35.829521Z  INFO initial_size_calculation{tenant_id=29bccf2cb5b632f310d8cf37d38c917d shard_id=0000 timeline_id=e26d5c943b7b063eb20ed1942ed6916e}:logical_size_calculation_task:tokio_epoll_uring_ext::thread_local_system{thread_local=0 attempt_no=0}: successfully launched system
2024-11-19T04:07:35.840135Z  WARN request{method=GET path=/v1/location_config request_id=2e46b7ae-30c8-4dec-8318-283fc0d8003e}: request was dropped before completing
2024-11-19T04:07:35.867943Z  INFO attach{tenant_id=29bccf2cb5b632f310d8cf37d38c917d shard_id=0000 gen=00000002}: Warm-up complete
2024-11-19T04:07:35.867994Z  INFO Remote part of initial load completed (1.168s since start) elapsed_ms=1168
2024-11-19T04:07:35.868009Z  INFO Initial load completed (1.168s since start) elapsed_ms=1168
2024-11-19T04:07:35.868030Z  INFO Starting background jobs (1.168s since start) elapsed_ms=1168
2024-11-19T04:07:35.868038Z  INFO Startup complete (1.168s since start) elapsed_ms=1168
```

As was the storage controller, when it saw a failing request:

```
2024-11-19T04:07:35.837141Z  INFO startup_reconcile: Call to node 1 (localhost:19543) management API failed, will retry (attempt 0): receive body: error sending request for url (http://localhost:19543/v1/location_config)
2024-11-19T04:07:35.840325Z  INFO startup_reconcile: Call to node 1 (localhost:19543) management API succeeded after 1 retries
2024-11-19T04:07:35.840394Z  INFO startup_reconcile: Received 1 shard statuses from pageserver 1
```


erikgrinaker commented Dec 3, 2024

Ah, the storage controller hit a 1-second timeout when listing shards, and cancelled the request:

```rust
tracing::info!("Scanning shards on node {node}...");
let timeout = Duration::from_secs(1);
```

```
2024-11-19T04:07:34.835113Z  INFO startup_reconcile: Scanning shards on node 1 (localhost)...
[...]
2024-11-19T04:07:35.837141Z  INFO startup_reconcile: Call to node 1 (localhost:19543) management API failed, will retry (attempt 0): receive body: error sending request for url (http://localhost:19543/v1/location_config)
```

@jcsp Before doing anything here, I'd like to understand the motivation a bit better.

  • Is a 1-second timeout too aggressive? Listing thousands of shards could take a while.
  • Why do we log warnings when the request was dropped because the client went away? That seems benign.

We should also improve the error message to say that the request failed because of a timeout.


jcsp commented Dec 3, 2024

> Is a 1-second timeout too aggressive? Listing thousands of shards could take a while.

It's meant to be fast -- it shouldn't have to wait on any async locks, and the main cost should be serialization. The short timeout is a bit odd, though: `scan_node_locations` has a deadline-driven wrapper that should avoid holding anything up if a node isn't responsive (which maybe didn't exist when that timeout was first set?).

This is probably a combination of the timeout being very short and the test machines being overloaded. It's probably fine to bump the timeout.
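
As a rough sketch of that deadline-driven shape (names and structure are illustrative, not the actual `scan_node_locations` code): each node is scanned concurrently under its own short timeout, while a global deadline bounds the whole startup scan.

```rust
use std::time::Duration;
use tokio::time::timeout;

// Illustrative sketch, not the storage controller's actual code: scan all
// nodes in parallel, each under a short per-node timeout, with a global
// deadline so one unresponsive node cannot hold up startup.
async fn scan_all_nodes(nodes: Vec<String>) {
    const PER_NODE_TIMEOUT: Duration = Duration::from_secs(1);
    const OVERALL_DEADLINE: Duration = Duration::from_secs(15);

    let scans = nodes.into_iter().map(|node| async move {
        match timeout(PER_NODE_TIMEOUT, scan_node(&node)).await {
            Ok(Ok(n)) => println!("node {node}: received {n} shard statuses"),
            Ok(Err(e)) => println!("node {node}: scan failed, will retry: {e}"),
            Err(_) => println!("node {node}: scan timed out, will retry"),
        }
    });

    // Bound the whole parallel scan; startup proceeds even if some nodes
    // never answered within the deadline.
    if timeout(OVERALL_DEADLINE, futures::future::join_all(scans))
        .await
        .is_err()
    {
        println!("startup scan hit the overall deadline; proceeding anyway");
    }
}

// Placeholder for the GET /v1/location_config call to one pageserver.
async fn scan_node(_node: &str) -> Result<usize, std::io::Error> {
    Ok(1)
}
```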

> Why do we log warnings when the request was dropped because the client went away? That seems benign.

It's not essential, but it makes it more obvious what's happening when we see a request start and then never complete because the client went away. Otherwise, when reading a log, we might think the request was still stuck in flight.
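
A minimal sketch of the idea behind that warning, assuming a hypothetical `RequestLogGuard` type (the pageserver's actual request-span machinery differs): if the client disconnects, the server drops the in-flight handler future, and the guard's `Drop` impl logs the warning so the request doesn't just vanish from the log.

```rust
// Hypothetical guard, for illustration only: logs a warning if the request
// handler future is dropped before it marks itself complete.
struct RequestLogGuard {
    completed: bool,
}

impl Drop for RequestLogGuard {
    fn drop(&mut self) {
        if !self.completed {
            // Fires when the client went away and the handler was cancelled.
            tracing::warn!("request was dropped before completing");
        }
    }
}

async fn handle_request() {
    let mut guard = RequestLogGuard { completed: false };
    // ... do the actual work; if the connection closes, this future is
    // dropped here and the guard fires the warning ...
    guard.completed = true;
}
```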

erikgrinaker commented

> We should also improve the error message to say that the request failed because of a timeout.

reqwest doesn't include the source error when displaying errors. It used to, but that was removed in seanmonstar/reqwest#2199. That seems unfortunate; I'll add a custom formatter for it.
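
A minimal sketch of such a formatter, walking the standard `std::error::Error` source chain (illustrative, not the exact code from the fix):

```rust
use std::error::Error as _;
use std::fmt::Write as _;

// Append each error in the source chain to the top-level message, so e.g.
// a timeout buried inside the client stack shows up in the displayed error.
fn format_with_sources(err: &reqwest::Error) -> String {
    let mut out = err.to_string();
    let mut source = err.source();
    while let Some(cause) = source {
        let _ = write!(out, ": {cause}");
        source = cause.source();
    }
    out
}
```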

github-merge-queue bot pushed a commit that referenced this issue Dec 4, 2024
## Problem

Reqwest errors don't include details about the inner source error. This
means that we get opaque errors like:

```
receive body: error sending request for url (http://localhost:9898/v1/location_config)
```

Instead of the more helpful:

```
receive body: error sending request for url (http://localhost:9898/v1/location_config): operation timed out
```

Touches #9801.

## Summary of changes

Include the source error for `reqwest::Error` wherever it's displayed.
github-merge-queue bot pushed a commit that referenced this issue Dec 5, 2024
## Problem

The node shard scan timeout of 1 second is a bit too aggressive, and we've seen it cause test failures. The scans are performed in parallel across nodes, and the entire operation has a 15-second timeout.

Resolves #9801.

## Summary of changes

Increase the timeout to 5 seconds. This is still short enough to time out on a network failure and retry successfully within the 15-second overall deadline.
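
For reference, against the snippet quoted earlier, the change would look roughly like this (illustrative; not the actual diff):

```rust
// Per-node scan timeout after the fix: 5 seconds tolerates loaded test
// machines, while a timed-out attempt can still be retried within the
// 15-second overall deadline.
let timeout = Duration::from_secs(5);
```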