Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrade to hyper v1 #1028

Merged
merged 31 commits into from
Sep 26, 2024
Merged

upgrade to hyper v1 #1028

merged 31 commits into from
Sep 26, 2024

Conversation

seanmonstar
Copy link
Contributor

This defines a dropshot::Body type that is used for both requests and
responses, in place of the now-gone hyper::Body.

Most of the changes are just renames. The interesting files (at least) are:

  • body.rs: This contains the new multi-purpose Body type. It has a few constructors to make things nice. Otherwise, it's not that complex, but could be tweaked or optimized with more usage and/or measurements.
  • server.rs: hyper got rid of the Accept trait. So now an accept loop is in this file manually. It also changed how graceful shutdown is coordinated, with a less-magical helper. It allows for adding a "graceful period timeout" if you eventually want.

@jclulow
Copy link
Collaborator

jclulow commented May 28, 2024

I currently make fairly extensive use of:

Body::wrap_stream(tokio_stream::wrappers::ReceiverStream::new(rx))

Will that still work with the new code?

@seanmonstar
Copy link
Contributor Author

It'd be pretty straight forward to add such a constructor to dropshot::Body.

@seanmonstar seanmonstar force-pushed the hyper-v1-2.4-own-body branch 2 times, most recently from ffe15a3 to 6e1ed6d Compare June 4, 2024 18:39
Copy link
Collaborator

@ahl ahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking very promising. I'd love to get @sunshowers and @davepacheco to take a look. Can you please include the design document here? Also I would suggest taking a stab at migration content in the changelog as we try to help consumers figure out the path from previous versions. Thank you!

dropshot/src/body.rs Show resolved Hide resolved
Comment on lines +51 to +58
impl Default for Body {
fn default() -> Body {
Body::empty()
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may prefer people use ::empty() in order to be explicit about their intention... what do you think?

Copy link
Contributor

@adamchalmers adamchalmers Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a Default impl makes it easier to compose Body in other types that impl Default. For example, if you have this struct:

#[derive(Default)]
struct Bodies {
  inner: HashMap<i32, dropshot::Body>
}

You couldn't use the derive(Default) without the impl Default for Body here. You'd have to write a manual impl Default and set inner to HashMap::new().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah true: can you envision a use case for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also used by Client::get(url), where it creates a "default" body that is assumed to be empty. And I believe that was used in some tests or docs.

In hindsight, default doesn't necessarily mean empty.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I too prefer the explicitness of empty() over default() if the semantics are "an empty body". I went and looked at what would break if we just removed this Default impl and I see what you were saying @seanmonstar. It's stuff like:

error: could not compile `dropshot` (test "test_config") due to 9 previous errors
warning: build failed, waiting for other jobs to finish...
error[E0277]: the trait bound `dropshot::Body: Default` is not satisfied
   --> dropshot/tests/test_tls.rs:388:12
    |
388 |     client.get(uri.clone()).await.unwrap_err();
    |            ^^^ the trait `Default` is not implemented for `dropshot::Body`
    |
note: required by a bound in `hyper_util::client::legacy::Client::<C, B>::get`
   --> /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.5/src/client/legacy/client.rs:164:12
    |
162 |     pub fn get(&self, uri: Uri) -> ResponseFuture
    |            --- required by a bound in this associated function
163 |     where
164 |         B: Default,
    |            ^^^^^^^ required by this bound in `Client::<C, B>::get`

Since GET requests must have an empty body, hyper's already implicitly assuming that "default" means "empty", and I see that hyper::Body in 0.14 impl'd Default in the same way. So I don't love this but it's probably not worse than before.

If we did want to avoid impl'ing Default here, would there be an easy way to fix up the tests? I don't really see one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can convert all calls to .get() to .request(req_with_explicitly_empty_body).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's good to know. I could go either way on this -- @ahl what do you think?

dropshot/src/extractor/body.rs Show resolved Hide resolved
dropshot/src/body.rs Show resolved Hide resolved
dropshot/src/handler.rs Outdated Show resolved Hide resolved
dropshot/src/handler.rs Outdated Show resolved Hide resolved
dropshot/src/server.rs Outdated Show resolved Hide resolved
dropshot/src/server.rs Outdated Show resolved Hide resolved
dropshot/src/server.rs Show resolved Hide resolved
dropshot/src/server.rs Outdated Show resolved Hide resolved
@ahl ahl requested review from sunshowers and davepacheco June 6, 2024 17:38
dropshot/src/body.rs Outdated Show resolved Hide resolved
dropshot/src/http_util.rs Show resolved Hide resolved
let fr = maybefr?;
if let Ok(buf) = fr.into_data() {
nbytesread += buf.len();
}
}

// TODO-correctness why does the is_end_stream() assertion fail?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the commented-out assertion here was probably failing because, after discarding all the DATA frames, an HTTP/2 body was terminated by a TRAILERS, so the stream was not EOF. I wonder if this assertion can be put back, now that the new implementation will also read and discard any TRAILERS frame that's sent as well as the DATA frames.

dropshot/src/server.rs Outdated Show resolved Hide resolved
dropshot/tests/test_tls.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@sunshowers sunshowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your hard work on this!

dropshot/Cargo.toml Outdated Show resolved Hide resolved
dropshot/Cargo.toml Outdated Show resolved Hide resolved
dropshot/src/body.rs Outdated Show resolved Hide resolved
dropshot/src/server.rs Outdated Show resolved Hide resolved
dropshot/src/server.rs Show resolved Hide resolved
dropshot/src/server.rs Outdated Show resolved Hide resolved
dropshot/src/test_util.rs Outdated Show resolved Hide resolved
@ahl
Copy link
Collaborator

ahl commented Jun 12, 2024

@seanmonstar this is looking great. I would like @davepacheco to take a pass before we merge it, but, @davepacheco, feel free to interpret that with whatever degree of rigor you consider appropriate.

@davepacheco
Copy link
Collaborator

Sorry I'm late to this one. I haven't had a chance to look yet. If there are breaking changes here, as it sounds like there are, then I'd like to get the changelog update in this PR. Our general approach is to provide super clear instructions for people to understand how to know if they're affected by a breaking change and exactly what changes to make, as concretely as we can. There are some examples here, from release 0.9.0.

@ahl also mentioned a design doc? That might help review?

@davepacheco
Copy link
Collaborator

I forgot to add: for breaking changes, I usually like to do a test update of one or more of our important consumers. We've found a lot of issues by doing that. Have we converted any consumers? Ideally it'd be great to convert Omicron's consumers -- is that too much to ask?

@ahl
Copy link
Collaborator

ahl commented Jun 13, 2024

@ahl also mentioned a design doc? That might help review?

I followed up with dave privately with the document. We'll want to make sure it's recorded in this PR once we have appropriate ownership and permissions.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! I'm still working through server.rs but wanted to post these in the meantime.

dropshot/src/body.rs Outdated Show resolved Hide resolved
Comment on lines +51 to +58
impl Default for Body {
fn default() -> Body {
Body::empty()
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I too prefer the explicitness of empty() over default() if the semantics are "an empty body". I went and looked at what would break if we just removed this Default impl and I see what you were saying @seanmonstar. It's stuff like:

error: could not compile `dropshot` (test "test_config") due to 9 previous errors
warning: build failed, waiting for other jobs to finish...
error[E0277]: the trait bound `dropshot::Body: Default` is not satisfied
   --> dropshot/tests/test_tls.rs:388:12
    |
388 |     client.get(uri.clone()).await.unwrap_err();
    |            ^^^ the trait `Default` is not implemented for `dropshot::Body`
    |
note: required by a bound in `hyper_util::client::legacy::Client::<C, B>::get`
   --> /home/dap/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.5/src/client/legacy/client.rs:164:12
    |
162 |     pub fn get(&self, uri: Uri) -> ResponseFuture
    |            --- required by a bound in this associated function
163 |     where
164 |         B: Default,
    |            ^^^^^^^ required by this bound in `Client::<C, B>::get`

Since GET requests must have an empty body, hyper's already implicitly assuming that "default" means "empty", and I see that hyper::Body in 0.14 impl'd Default in the same way. So I don't love this but it's probably not worse than before.

If we did want to avoid impl'ing Default here, would there be an easy way to fix up the tests? I don't really see one.

dropshot/src/handler.rs Outdated Show resolved Hide resolved
dropshot/src/test_util.rs Outdated Show resolved Hide resolved
dropshot/src/websocket.rs Outdated Show resolved Hide resolved
dropshot/tests/test_tls.rs Outdated Show resolved Hide resolved
tokio::spawn(graceful)
loop {
tokio::select! {
Ok((sock, remote_addr)) = self.0.accept() => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if accept() returns an error? The docs aren't super explicit but I gather this could happen any time the underlying accept(2) syscall returns an error, which can include transient errors like ECONNABORTED (on illumos, anyway). EMFILE might also be considered transient. (The docs also say "These would terminate the stream if not handled in any way." I'm not sure what that means. This struct is not a Stream and I don't know how it could know whether you handled the error.)

If this happens, according to the tokio::select! docs:

Once an returns a value, attempt to apply the value to the provided , if the pattern matches, evaluate and return. If the pattern does not match, disable the current branch and for the remainder of the current call to select!. Continue from step 3.

Step 3 waits for another of the futures to complete. So my interpretation of all of this is: if we encounter a transient error from accept(), this program will stop accepting new connections. This would be very bad.

Besides fixing this, it makes me a little nervous that we're implementing something so delicate and hard-to-test here. Is there some existing implementation we could use here? Would it help to turn this into a https://docs.rs/tokio-stream/0.1.15/tokio_stream/wrappers/struct.TcpListenerStream.html?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, hyper did handle all those before. But as part of the removal of the Accept trait, due to all of it's subtle issues, that was removed as well. Something generic might appear again in hyper-util, and when it does, Dropshot could use it.

For now, I can inline the behavior that used to exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added an HttpAcceptor type here that does what hyper used to do.

TokioIo::new(sock),
self.1.make_svc(remote_addr),
);
let fut = graceful.watch(fut.into_owned());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time understanding the various futures involved here and how the new Graceful mechanism works. Would it be reasonable to add a comment describing this in more detail? And/or using more descriptive names?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments.

dropshot/src/server.rs Show resolved Hide resolved
dropshot/src/server.rs Show resolved Hide resolved
@seanmonstar
Copy link
Contributor Author

I believe I've addressed all the feedback. I added some breaking changes notes to the changelog. Upgrading Omicron is currently out of scope for me.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates.

Comment on lines +347 to +363
// These are errors on the individual socket that we
// tried to accept, and so can be ignored.
std::io::ErrorKind::ConnectionRefused
| std::io::ErrorKind::ConnectionAborted
| std::io::ErrorKind::ConnectionReset => (),

// This could EMFILE implying resource exhaustion.
// Sleep a little bit and try again.
_ => {
warn!(self.log, "accept error"; "error" => e);
tokio::time::sleep(std::time::Duration::from_millis(
100,
))
.await;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably okay but I'm a little nervous about the error handling. This is definitely an area where the standards and manual pages are good but operational experience is also relevant, which is why I'm still a little uneasy about implementing this behavior here instead of using a hardened library. But I guess hyper 0.x was doing this inline before? Is this the same logic it used?


The ECONNABORTED check is definitely right and it's one a lot of software misses so thanks for getting that. ✔️

I checked out the illumos manual page (our production systems), POSIX, and Linux. I think most of the errors fall into one of these buckets:

  • "shouldn't ever happen and if they did you probably don't want to accept any more": EBADF, EFAULT, ENODEV, ENOTSOCK, EOPNOTSUPP, EPROTO, EINVAL, EAGAIN/EWOULDBLOCK
  • "can totally happen and should just be retried": ECONNABORTED, EINTR
  • "probably shouldn't happen, but definitely can, and there's probably not a great response but we could retry, and maybe pausing would help": EMFILE, ENOMEM, ENOSR, ENOBUFS, EPERM? (Linux)

I don't see ECONNREFUSED or ECONNRESET as possible errors that can be produced here, but the Linux manual page does say any other error for the new socket can happen here. That seems weird to me, but okay.

It seems brittle to encode all of these, so I guess the logic here is saying: assume the "can't happen" stuff really can't happen, handle ECONNABORTED[/ECONNREFUSED/ECONNRESET] explicitly as transient errors that don't require pausing, and treat everything else as a transient error, with a pause to avoid getting stuck in too tight a loop if we're wrong about it being transient. Is that the idea?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, this is essentially the same code that hyper 0.14 was doing (minus some TCP keep-alive options stuff, which Dropshot didn't enable): https://github.com/hyperium/hyper/blob/0.14.x/src/server/tcp.rs#L211-L254

I do think it'd be nice to have this in hyper-util, so server frameworks don't all need a copy of this. There's a couple design questions to make it sufficiently generic (listener type, timer type, how to log, etc), such that I don't expect it to be ready to be depended on immediately.

dropshot/src/server.rs Outdated Show resolved Hide resolved
@@ -327,6 +332,39 @@ impl<C: ServerContext> InnerHttpServerStarter<C> {
}
}

/// This accepts connections more gracefully from a TcpListener.
struct HttpAcceptor {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we're incorporating this pretty tricky functionality into Dropshot, we should probably have some test coverage for it. Any ideas how to do that? It'd be really nice if we could exercise some of those error cases but that seems hard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not have a way to programmatically trigger those error kinds. Maybe if it was sufficiently generic, such that some ErroringListener could be provided. But that won't be available yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. How about the happy path? What about a test that starts a Dropshot server and makes 100 or 1000 connections to it in a row? Just thinking out loud about how best to test this automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I get that, I write a lot of tests and have multiple times refactored large test suites to make it easier to add more. In this case, I don't see much that could be tested. Especially since the accept code happy path is required for the rest of the unit tests to work, too. But I can add one accepting in a loop, like you mentioned.

dropshot/src/server.rs Outdated Show resolved Hide resolved
dropshot/src/server.rs Outdated Show resolved Hide resolved
CHANGELOG.adoc Outdated Show resolved Hide resolved
@davepacheco
Copy link
Collaborator

I think the last items here are:

  • any tests for the HttpAcceptor, if we can think of something useful
  • test-converting Omicron, which @ahl is working on

@seanmonstar seanmonstar force-pushed the hyper-v1-2.4-own-body branch from 127a57d to 796918e Compare June 24, 2024 18:24
@ahl
Copy link
Collaborator

ahl commented Jun 26, 2024

@seanmonstar it looks like Cargo.lock needs an upodate

@ahl
Copy link
Collaborator

ahl commented Jun 27, 2024

@seanmonstar tests are failing

@ahl
Copy link
Collaborator

ahl commented Jun 28, 2024

@seanmonstar looks like windows tests are hung?

@ahl
Copy link
Collaborator

ahl commented Jul 1, 2024

Eyyyy! Passing tests! I'm working on merging this into omicron and its vassal repos; on vacation right now, but hope to shake out any kinks in a week or so.

CHANGELOG.adoc Outdated Show resolved Hide resolved
@ahl ahl merged commit 37dc307 into oxidecomputer:main Sep 26, 2024
10 checks passed
@ahl
Copy link
Collaborator

ahl commented Sep 26, 2024

WOOT

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants