Add on_disconnect scopeguard for cancelled requests #965

Merged

Conversation

deathbyknowledge (Contributor)

This PR should resolve #914 and remove the need for the change in my original PR as well (#958).
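
For context, here is a minimal sketch of the scopeguard pattern this change relies on (illustrative only, not the PR's actual code): the guard's closure runs when the guard is dropped, which is what happens if the enclosing future is cancelled, unless the guard is defused once the work finishes normally.

use scopeguard::{guard, ScopeGuard};

fn main() {
    // The closure runs when `on_disconnect` is dropped -- for example because
    // the surrounding task was cancelled -- unless we defuse it first.
    let on_disconnect = guard((), |_| {
        eprintln!("work was abandoned before completion");
    });

    // ... do the work that might be cancelled ...

    // Work finished normally: defuse the guard so the closure never runs.
    let _ = ScopeGuard::into_inner(on_disconnect);
}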

@davepacheco (Collaborator) left a comment


Thanks!

dropshot/src/server.rs (Outdated)
// In the case the client disconnects early, the scopeguard allows us
// to perform extra housekeeping before this task is dropped.
let on_disconnect = guard((), |_| {
warn!(request_log, "client disconnected before response returned");

davepacheco (Collaborator)

Being really nitty, but that's because these log messages are most often used when debugging really tricky issues so I want to make sure they're precise about what they say:

Suggested change
warn!(request_log, "client disconnected before response returned");
warn!(request_log, "request handling cancelled (client disconnected)");

Could you add latency_us to this log message as well? Similar to L840:

let latency_us = start_time.elapsed().as_micros();

and then include that in the log message.
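
A sketch of what the combined call might look like (slog-style key/value syntax; illustrative rather than the merged code, and it assumes start_time is in scope as in the surrounding handler):

// Cast to u64 just to keep the logged value a plain integer type.
let latency_us = start_time.elapsed().as_micros() as u64;
warn!(
    request_log,
    "request handling cancelled (client disconnected)";
    "latency_us" => latency_us,
);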

deathbyknowledge (Contributor Author)

Not nitty at all, makes perfect sense. Done 👍

let result = handler.handle_request(rqctx, request).await;

// If this send fails, our spawning task has been cancelled in
// the `rx.await` below; log such a result.
if tx.send(result).is_err() {

davepacheco (Collaborator)

I'd like to keep a log message here to help debug cross-system issues. (It can be useful to know when the server finished doing something, even if the client long ago gave up, so you can put together a timeline of events.) How about something like "request completed after handler was already cancelled"? I wonder if we can include the result here, too -- status code and/or error message?

deathbyknowledge (Contributor Author)

Sounds good, I've kept the log and added the status code and error messages to it.
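
Something along these lines would do it (a fragment rather than the merged code; it assumes tx is a tokio oneshot sender, which hands the value back when the receiver is gone, and that the handler result carries an HTTP response on its Ok side):

// Illustrative fragment: the oneshot send fails only if the receiving task
// is gone, and it hands the unsent result back so it can still be logged.
if let Err(result) = tx.send(result) {
    match &result {
        Ok(response) => warn!(
            request_log,
            "request completed after handler was already cancelled";
            "response_code" => response.status().as_u16(),
        ),
        Err(error) => warn!(
            request_log,
            "request completed after handler was already cancelled";
            // Debug-format the error; what exactly gets logged depends on the
            // error type in use.
            "error_message" => ?error,
        ),
    }
}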

@davepacheco (Collaborator)

Sorry for the delay on this -- I was out for a while and had a lot to catch up on. This PR looks promising and I think it obviates #958 so let's just do this one?

@davepacheco (Collaborator)

Okay, this is looking great. Now I'm just wondering about testing. We said it'd be hard to do automated tests for the logs and DTrace probes. Were you able to verify at least the log stuff by hand? Any chance you've been able to try out the DTrace probe?

@deathbyknowledge (Contributor Author)

Heyo! Sorry for the delay, I got busy debugging some of this.
I was able to test the log entries by hand with no problems at all; I had the USDT probes compiled out at the time, but I can test again.

Trying to test this programmatically from the tests (if only to make sure it works) made me realise that test_config_handler_task_mode_detached is not doing what's intended; it seems the request does not get cancelled (at least on my machine).
I understand the desired behaviour is to drop the connection by aborting the client task before it completes and then resume the handler through the mpsc. I found two issues with this:

  1. Currently client_task.abort() is called and, immediately afterwards, the handler is resumed with release_endpoint_tx.send(()).await.unwrap(). This does not guarantee that the task's cancellation has actually happened yet (see the sketch further down in this comment).

    Be aware that calls to JoinHandle::abort just schedule the task for cancellation, and will return before the cancellation has completed. To wait for cancellation to complete, wait for the task to finish by awaiting the JoinHandle. Similarly, the JoinHandle::is_finished method does not return true until the cancellation has finished.

    From the docs

  2. Ok, then: even if the client task has been cancelled and the connection dropped, if the handling task executes fast enough, hyper might not have had time to read from or write to the connection and notice it's dead, so the handler continues under normal conditions and only realises the connection is gone when it tries to write the response back.

This is very racy and I'm struggling to write a deterministic test for it. I don't mind trying to resolve this issue, but I suspect it may not be that important either.
So far, adding a 10-millisecond sleep to the test endpoint to fake some work and give hyper some extra breathing room seems to solve the issue.
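
A self-contained illustration of point 1 (hypothetical task names, not dropshot code): abort() returns before the cancellation has actually taken place, and awaiting the JoinHandle is what guarantees the task is gone.

use std::time::Duration;

#[tokio::main]
async fn main() {
    let client_task = tokio::spawn(async {
        tokio::time::sleep(Duration::from_secs(10)).await;
    });

    // abort() only schedules the cancellation and returns immediately;
    // the task may well still be running at this point.
    client_task.abort();

    // Awaiting the JoinHandle waits for the cancellation to complete: an
    // aborted task yields a JoinError with is_cancelled() == true.
    let result = client_task.await;
    assert!(result.unwrap_err().is_cancelled());

    // Only now is it safe to do anything that assumes the task is gone,
    // e.g. resuming the handler via the release channel in the test.
}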

Let me know what the preferred way of solving this (if any) is!

Hyper Logs:

Successful connection drop before handler completion:

2024-04-29T12:49:34.733615Z TRACE hyper::client::pool: checkout waiting for idle connection: ("http", localhost:59246)
2024-04-29T12:49:34.733643Z TRACE hyper::client::connect::http: Http::connect; scheme=Some("http"), host=Some("localhost"), port=Some(Port(59246))
2024-04-29T12:49:34.733682Z DEBUG hyper::client::connect::dns: resolving host="localhost"
2024-04-29T12:49:34.734426Z DEBUG hyper::client::connect::http: connecting to [::1]:59246
2024-04-29T12:49:34.734533Z TRACE hyper::client::connect::http: connect error for [::1]:59246: ConnectError("tcp connect error", Os { code: 61, kind: ConnectionRefused, message: "Connection refused" })
2024-04-29T12:49:34.734543Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:59246
2024-04-29T12:49:34.734691Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:59246
2024-04-29T12:49:34.734715Z TRACE hyper::client::conn: client handshake Http1
2024-04-29T12:49:34.734734Z TRACE hyper::client::client: handshake complete, spawning background dispatcher task
2024-04-29T12:49:34.734749Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: Init, keep_alive: Busy }
2024-04-29T12:49:34.734760Z TRACE hyper::client::pool: checkout dropped for ("http", localhost:59246)
2024-04-29T12:49:34.734811Z TRACE encode_headers: hyper::proto::h1::role: Client::encode method=GET, body=None
2024-04-29T12:49:34.734840Z DEBUG hyper::proto::h1::io: flushed 41 bytes
2024-04-29T12:49:34.734846Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: KeepAlive, keep_alive: Busy }
2024-04-29T12:49:34.734881Z TRACE hyper::proto::h1::conn: Conn::read_head
2024-04-29T12:49:34.734889Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: Init, writing: Init, keep_alive: Busy }
2024-04-29T12:49:34.734900Z TRACE hyper::proto::h1::conn: Conn::read_head
2024-04-29T12:49:34.734907Z TRACE hyper::proto::h1::io: received 41 bytes
2024-04-29T12:49:34.734917Z TRACE parse_headers: hyper::proto::h1::role: Request.parse bytes=41
2024-04-29T12:49:34.734929Z TRACE parse_headers: hyper::proto::h1::role: Request.parse Complete(41)
2024-04-29T12:49:34.734943Z DEBUG hyper::proto::h1::io: parsed 1 headers
2024-04-29T12:49:34.734948Z DEBUG hyper::proto::h1::conn: incoming body is empty
2024-04-29T12:49:34.734969Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: KeepAlive, writing: Init, keep_alive: Busy }
2024-04-29T12:49:34.734995Z TRACE hyper::proto::h1::dispatch: callback receiver has dropped
2024-04-29T12:49:34.735001Z TRACE hyper::proto::h1::dispatch: dispatch no longer receiving messages
2024-04-29T12:49:34.735006Z TRACE hyper::proto::h1::conn: State::close_read()
2024-04-29T12:49:34.735010Z TRACE hyper::proto::h1::conn: State::close_write()
2024-04-29T12:49:34.735014Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Closed, writing: Closed, keep_alive: Disabled }
2024-04-29T12:49:34.735023Z TRACE hyper::proto::h1::conn: shut down IO complete
2024-04-29T12:49:34.735051Z TRACE hyper::proto::h1::io: received 0 bytes
2024-04-29T12:49:34.735057Z TRACE hyper::proto::h1::conn: found unexpected EOF on busy connection: State { reading: KeepAlive, writing: Init, keep_alive: Busy }
2024-04-29T12:49:34.735061Z TRACE hyper::proto::h1::conn: State::close_read()
2024-04-29T12:49:34.735066Z DEBUG hyper::server::server::new_svc: connection error: connection closed before message completed
2024-04-29T12:49:34.735108Z DEBUG hyper::server::shutdown: signal received, starting graceful shutdown

Unsuccessful connection drop before handler completion:

2024-04-29T12:49:15.196579Z TRACE hyper::client::pool: checkout waiting for idle connection: ("http", localhost:59198)                                                                               
2024-04-29T12:49:15.196608Z TRACE hyper::client::connect::http: Http::connect; scheme=Some("http"), host=Some("localhost"), port=Some(Port(59198))                                                   
2024-04-29T12:49:15.196652Z DEBUG hyper::client::connect::dns: resolving host="localhost"                                                                                                            
2024-04-29T12:49:15.197332Z DEBUG hyper::client::connect::http: connecting to [::1]:59198                                                                                                            
2024-04-29T12:49:15.197455Z TRACE hyper::client::connect::http: connect error for [::1]:59198: ConnectError("tcp connect error", Os { code: 61, kind: ConnectionRefused, message: "Connection refused" })                                                                                                                                                                                                 
2024-04-29T12:49:15.197464Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:59198                                                                                                        
2024-04-29T12:49:15.197633Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:59198                                                                                                         
2024-04-29T12:49:15.197650Z TRACE hyper::client::conn: client handshake Http1                                                                                                                        
2024-04-29T12:49:15.197670Z TRACE hyper::client::client: handshake complete, spawning background dispatcher task                                                                                     
2024-04-29T12:49:15.197691Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: Init, keep_alive: Busy }                                                           
2024-04-29T12:49:15.197704Z TRACE hyper::client::pool: checkout dropped for ("http", localhost:59198)
2024-04-29T12:49:15.197766Z TRACE encode_headers: hyper::proto::h1::role: Client::encode method=GET, body=None
2024-04-29T12:49:15.197801Z DEBUG hyper::proto::h1::io: flushed 41 bytes
2024-04-29T12:49:15.197805Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: KeepAlive, keep_alive: Busy }
2024-04-29T12:49:15.197848Z TRACE hyper::proto::h1::conn: Conn::read_head
2024-04-29T12:49:15.197857Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: Init, writing: Init, keep_alive: Busy }
2024-04-29T12:49:15.197868Z TRACE hyper::proto::h1::conn: Conn::read_head
2024-04-29T12:49:15.197875Z TRACE hyper::proto::h1::io: received 41 bytes
2024-04-29T12:49:15.197883Z TRACE parse_headers: hyper::proto::h1::role: Request.parse bytes=41
2024-04-29T12:49:15.197896Z TRACE parse_headers: hyper::proto::h1::role: Request.parse Complete(41)
2024-04-29T12:49:15.197910Z DEBUG hyper::proto::h1::io: parsed 1 headers
2024-04-29T12:49:15.197915Z DEBUG hyper::proto::h1::conn: incoming body is empty
2024-04-29T12:49:15.197941Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: KeepAlive, writing: Init, keep_alive: Busy }
2024-04-29T12:49:15.197971Z TRACE hyper::proto::h1::dispatch: callback receiver has dropped
2024-04-29T12:49:15.197976Z TRACE hyper::proto::h1::dispatch: dispatch no longer receiving messages
2024-04-29T12:49:15.197981Z TRACE hyper::proto::h1::conn: State::close_read()
2024-04-29T12:49:15.197984Z TRACE hyper::proto::h1::conn: State::close_write()
2024-04-29T12:49:15.197988Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Closed, writing: Closed, keep_alive: Disabled }
2024-04-29T12:49:15.197996Z TRACE hyper::proto::h1::conn: shut down IO complete
2024-04-29T12:49:15.198058Z TRACE encode_headers: hyper::proto::h1::role: Server::encode status=200, body=Some(Known(4)), req_method=Some(GET) <<<<< RESPONSE COMPUTED ALREADY
2024-04-29T12:49:15.198076Z TRACE hyper::proto::h1::io: buffer.queue self.len=159 buf.len=4
2024-04-29T12:49:15.198091Z DEBUG hyper::proto::h1::io: flushed 163 bytes
2024-04-29T12:49:15.198094Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: Init, writing: Init, keep_alive: Idle }
2024-04-29T12:49:15.198108Z TRACE hyper::proto::h1::conn: Conn::read_head
2024-04-29T12:49:15.198113Z TRACE hyper::proto::h1::io: received 0 bytes
2024-04-29T12:49:15.198117Z TRACE hyper::proto::h1::io: parse eof
2024-04-29T12:49:15.198120Z TRACE hyper::proto::h1::conn: State::close_read()
2024-04-29T12:49:15.198124Z DEBUG hyper::proto::h1::conn: read eof
2024-04-29T12:49:15.198126Z TRACE hyper::proto::h1::conn: State::close_write()
2024-04-29T12:49:15.198129Z TRACE hyper::proto::h1::conn: State::close_read()
2024-04-29T12:49:15.198132Z TRACE hyper::proto::h1::conn: State::close_write()
2024-04-29T12:49:15.198136Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: Closed, writing: Closed, keep_alive: Disabled }
2024-04-29T12:49:15.198161Z DEBUG hyper::server::shutdown: signal received, starting graceful shutdown 

@davepacheco (Collaborator)

Trying to test this programmatically from the tests (if only to make sure it works) made me realise that test_config_handler_task_mode_detached is not doing what's intended; it seems the request does not get cancelled (at least on my machine).
I understand the desired behaviour is to drop the connection by aborting the client task before it completes and then resume the handler through the mpsc.

That test is exercising HandlerTaskMode::Detached, in which case the request should not get cancelled even if the client aborts.

I was not assuming that we would write automated tests for this, mostly because it's annoying to verify the log contents and USDT probes that get fired, but this wouldn't be a bad thing to do! If you did, I think you'd want to look at test_config_handler_task_mode_cancel, which is the one that tests the case where the handler is supposed to get cancelled. That one takes care to avoid races like the ones you mentioned. Neither of the tests assumes that the task is cancelled as soon as client_task.abort() returns; both end up waiting on endpoint_finished_rx.recv(). This causes the test task to block until either the handler task is cancelled (in which case the DropReporter explicitly sends a HandlerCompletionMode::Cancelled message) or it completes normally (in which case the handler sends HandlerCompletionMode::CompletedNormally directly). The second problem you mentioned is prevented by having the handler function wait on receiving a message from the test task itself, so if that task is not cancelled, it still cannot finish executing so quickly that we get the wrong result.
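
A minimal sketch of that drop-reporting pattern (the names follow the comment above; the actual test code in dropshot is more involved):

use tokio::sync::mpsc;

#[derive(Debug)]
enum HandlerCompletionMode {
    CompletedNormally,
    Cancelled,
}

// Sends Cancelled from Drop unless the handler reported normal completion
// first, so a cancelled handler task still produces a message.
struct DropReporter {
    tx: Option<mpsc::UnboundedSender<HandlerCompletionMode>>,
}

impl DropReporter {
    fn completed_normally(mut self) {
        if let Some(tx) = self.tx.take() {
            let _ = tx.send(HandlerCompletionMode::CompletedNormally);
        }
    }
}

impl Drop for DropReporter {
    fn drop(&mut self) {
        if let Some(tx) = self.tx.take() {
            let _ = tx.send(HandlerCompletionMode::Cancelled);
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel();
    let reporter = DropReporter { tx: Some(tx) };

    let handler_task = tokio::spawn(async move {
        // Stand-in for handler work that the test can cancel mid-flight.
        tokio::time::sleep(std::time::Duration::from_secs(60)).await;
        reporter.completed_normally();
    });

    handler_task.abort();

    // Blocks until either outcome is reported, so the test cannot race with
    // the cancellation itself.
    println!("handler outcome: {:?}", rx.recv().await.unwrap());
}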

@deathbyknowledge (Contributor Author)

That all makes sense; I did not mean to add any automated tests for this use case. I was making some changes locally to test whether the on_disconnect scopeguard fired or not (which it sometimes didn't in Detached mode, because there isn't enough time for the client to disconnect before the handler worker completes).

I was able to confirm that on_disconnect gets triggered every time, though. Are there any other changes I should make?

@davepacheco (Collaborator)

I did some quick checking by applying this diff:

dap@ivanova dropshot $ git diff
diff --git a/dropshot/examples/basic.rs b/dropshot/examples/basic.rs
index dbd7ea4..28720db 100644
--- a/dropshot/examples/basic.rs
+++ b/dropshot/examples/basic.rs
@@ -86,6 +86,8 @@ async fn example_api_get_counter(
 ) -> Result<HttpResponseOk<CounterValue>, HttpError> {
     let api_context = rqctx.context();
 
+    tokio::time::sleep(std::time::Duration::from_secs(5)).await;
+
     Ok(HttpResponseOk(CounterValue {
         counter: api_context.counter.load(Ordering::SeqCst),
     }))

I ran it on Helios like this:

$ cargo run -p dropshot --features=usdt-probes --example=basic
    Finished dev [unoptimized + debuginfo] target(s) in 0.13s
     Running `target/debug/examples/basic`
May 03 17:17:44.258 INFO listening, local_addr: 127.0.0.1:33212

I got the pid with pgrep -f example and ran this DTrace script to watch the probes:

$ pfexec dtrace -n 'dropshot$target:::{ trace(timestamp); }' -p 1086
dtrace: description 'dropshot$target:::' matched 2 probes
...

and I did two tests:

  1. Run curl -i http://127.0.0.1:33212/counter and wait the ~5 seconds for it to complete (normal request)
  2. Run curl -i http://127.0.0.1:33212/counter (the same thing) and ^C it after a second or so

I'd expect to see:

  1. For the one that I let complete, there should be one "request completed" log message with a latency of about 5s. There should be two DTrace probes (request-start and request-done) about 5s apart.
  2. For the one that I ^C'd early, there should be a "request handling cancelled" message with latency about a second or two (whenever I ^C'd it) and one "request completed after handler was already cancelled" message with a latency of about 5s.

Indeed, here's the log output:

$ cargo run -p dropshot --features=usdt-probes --example=basic
    Finished dev [unoptimized + debuginfo] target(s) in 0.13s
     Running `target/debug/examples/basic`
May 03 17:17:44.258 INFO listening, local_addr: 127.0.0.1:33212

May 03 17:18:16.634 INFO accepted connection, remote_addr: 127.0.0.1:60450, local_addr: 127.0.0.1:33212
May 03 17:18:21.636 INFO request completed, latency_us: 5001659, response_code: 200, uri: /counter, method: GET, req_id: cdf2daa8-4947-4e06-a123-f2f3184faa60, remote_addr: 127.0.0.1:60450, local_addr: 127.0.0.1:33212

May 03 17:18:32.306 INFO accepted connection, remote_addr: 127.0.0.1:57000, local_addr: 127.0.0.1:33212
May 03 17:18:33.599 WARN request handling cancelled (client disconnected), latency_us: 1293379, uri: /counter, method: GET, req_id: 9378b4ce-7537-46a9-8b7d-39e79ffec66f, remote_addr: 127.0.0.1:57000, local_addr: 127.0.0.1:33212
May 03 17:18:37.308 WARN request completed after handler was already cancelled, response_code: 200, uri: /counter, method: GET, req_id: 9378b4ce-7537-46a9-8b7d-39e79ffec66f, remote_addr: 127.0.0.1:57000, local_addr: 127.0.0.1:33212

and here's the DTrace output:

$ pfexec dtrace -n 'dropshot$target:::{ trace(timestamp); }' -p 1086
dtrace: description 'dropshot$target:::' matched 2 probes
CPU     ID                    FUNCTION:NAME
  3  91229 _ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17hcca25e74d9b91386E:request-start  4300032437849702
  3  91228 _ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17hcca25e74d9b91386E:request-done  4300037439430478
  4  91229 _ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17hcca25e74d9b91386E:request-start  4300048109238602
 11  91228 _ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17hcca25e74d9b91386E:request-done  4300049402627145

Now, that example uses HandlerTaskMode::Detached, so I ran the same tests with this diff:

dap@ivanova dropshot $ git diff
diff --git a/dropshot/examples/basic.rs b/dropshot/examples/basic.rs
index dbd7ea4..38b14e0 100644
--- a/dropshot/examples/basic.rs
+++ b/dropshot/examples/basic.rs
@@ -6,6 +6,7 @@ use dropshot::ApiDescription;
 use dropshot::ConfigDropshot;
 use dropshot::ConfigLogging;
 use dropshot::ConfigLoggingLevel;
+use dropshot::HandlerTaskMode;
 use dropshot::HttpError;
 use dropshot::HttpResponseOk;
 use dropshot::HttpResponseUpdatedNoContent;
@@ -24,7 +25,10 @@ async fn main() -> Result<(), String> {
     // since it's available and won't expose this server outside the host.  We
     // request port 0, which allows the operating system to pick any available
     // port.
-    let config_dropshot: ConfigDropshot = Default::default();
+    let config_dropshot = ConfigDropshot {
+        default_handler_task_mode: HandlerTaskMode::CancelOnDisconnect,
+        ..Default::default()
+    };
 
     // For simplicity, we'll configure an "info"-level logger that writes to
     // stderr assuming that it's a terminal.
@@ -86,6 +90,8 @@ async fn example_api_get_counter(
 ) -> Result<HttpResponseOk<CounterValue>, HttpError> {
     let api_context = rqctx.context();
 
+    tokio::time::sleep(std::time::Duration::from_secs(5)).await;
+
     Ok(HttpResponseOk(CounterValue {
         counter: api_context.counter.load(Ordering::SeqCst),
     }))

Here we get this output for the same two tests:

$ cargo run -p dropshot --features=usdt-probes --example=basic
    Finished dev [unoptimized + debuginfo] target(s) in 0.13s
     Running `target/debug/examples/basic`
May 03 17:26:45.982 INFO listening, local_addr: 127.0.0.1:38579

May 03 17:27:22.910 INFO accepted connection, remote_addr: 127.0.0.1:62633, local_addr: 127.0.0.1:38579
May 03 17:27:27.912 INFO request completed, latency_us: 5001519, response_code: 200, uri: /counter, method: GET, req_id: 06722176-6fdd-4f48-8854-c283d3a947dc, remote_addr: 127.0.0.1:62633, local_addr: 127.0.0.1:38579

May 03 17:27:32.245 INFO accepted connection, remote_addr: 127.0.0.1:62068, local_addr: 127.0.0.1:38579
May 03 17:27:34.038 WARN request handling cancelled (client disconnected), latency_us: 1791917, uri: /counter, method: GET, req_id: c8561550-a5fc-446f-af60-df55c7d2c35c, remote_addr: 127.0.0.1:62068, local_addr: 127.0.0.1:38579
$ pfexec dtrace -n 'dropshot$target:::{ trace(timestamp); }' -p 1361
dtrace: description 'dropshot$target:::' matched 2 probes
CPU     ID                    FUNCTION:NAME
  9  91229 _ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17hcca25e74d9b91386E:request-start  4300578715514130
  9  91228 _ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17hcca25e74d9b91386E:request-done  4300583716947592
  7  91229 _ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17hcca25e74d9b91386E:request-start  4300588050756554
  8  91228 _ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17hcca25e74d9b91386E:request-done  4300589842676223

I think all this matches what I'd expect so I'm going to land this.

@davepacheco merged commit f10e42a into oxidecomputer:main May 3, 2024
10 checks passed
@davepacheco (Collaborator)

Thanks again for doing this!

@deathbyknowledge (Contributor Author)

Thanks for taking the time to review this!
Those diffs closely match what I was using locally for testing. I'll make sure to share mine next time!

@deathbyknowledge deleted the disconnect-scopeguard branch May 3, 2024 17:35

Successfully merging this pull request may close these issues:

end-of-request USDT probes, log entries don't always happen