Switch lightway-server to a direct uring only i/o loop #123
base: main
Conversation
force-pushed from a3675d3 to 447ca81
Code coverage summary for 2cb8c0e: ✅ Region coverage 54% (passes)
force-pushed from df8bbd4 to 43fd992
Nice improvement!
I started going through the changes. Will sleep on it today to understand it better.
Overall the idea of using the same loop for all I/O seems good, and using fewer cores sounds nice!
As an experiment I tried the zero-copy (ZC) send. The ZC mechanism involves the possibility of a second CQE for each operation (basically so the buffer is kept in-situ until the packet is actually sent, including any TCP resends), so that might explain it. Since this is Tx not Rx it's not stopping us from adding more requests to the ring. Either way, I abandoned the experiment.
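For reference, with the io-uring crate the second CQE shows up as the MORE flag on the first completion; a minimal sketch of how such completions might be handled (the handler and buffer-release hook below are hypothetical, not code from this PR):

```rust
use io_uring::cqueue;

// Hypothetical handler for a zero-copy send completion. The first CQE carries
// the send result; if the MORE flag is set, a second "notification" CQE with
// the same user_data arrives once the kernel no longer needs the buffer
// (including any TCP retransmits), and only then is it safe to reuse it.
fn on_send_zc_completion(cqe: &cqueue::Entry, release_buffer: impl FnOnce()) {
    if cqueue::more(cqe.flags()) {
        // First of two CQEs: record cqe.result(), but keep the buffer alive.
        return;
    }
    // Single-CQE completion or the trailing notification: buffer can be reused.
    release_buffer();
}
```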
force-pushed from 8eb6038 to a25174f
I think ZC itself supposedly yields serious gains over larger frames, so perhaps our MTU is too small to see any benefit.
Yes, that's a plausible hypothesis.
force-pushed from a25174f to 37a3c48
}
// On exit dropping _io_handle will cause EPIPE to be delivered to
// io_cancel. This causes the corresponding read request on the
// ring to complete and signal the loop should exit.
Is this guaranteed? AFAIK this current flow will be running in the main thread, and as soon as it exits, the process will exit.
Or does the Tokio runtime join the worker threads and make it work, since we do not use threads directly and use spawn_blocking?
Before I added this I observed that even after I killed the main thread the process kept going, because the uring userspace thread was still running (actually IIRC it was blocked in the submit syscall).
I don't know if it is Tokio cleaning up (as part of the #[tokio::main] infra) or if this is a Linux/POSIX thing which won't actually destroy a process while a thread still exists (being blocked in the kernel might be a relevant factor too).
The FD is guaranteed to be closed at the point lightway_server::server returns, which happens (shortly) before whatever happens on return from main. That is enough to kick the thread out of the syscall, due to the arrival of EPIPE on the other end of the pipe, which lets the process exit.
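For illustration, a minimal self-contained sketch of the general pattern being discussed: a request queued on the ring against one end of a pipe completes when the other end is dropped, which is the cue for the loop to exit. The names, the toy read request and the EOF result below are assumptions for the sketch; the PR's actual arrangement (which end `_io_handle` and `io_cancel` hold, and EPIPE rather than EOF) differs in the details.

```rust
use std::io;
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

use io_uring::{opcode, types, IoUring};

fn main() -> io::Result<()> {
    // A plain pipe: the write end plays the role of the handle whose Drop
    // unblocks the ring (names and roles here are illustrative only).
    let mut fds = [0i32; 2];
    if unsafe { libc::pipe(fds.as_mut_ptr()) } != 0 {
        return Err(io::Error::last_os_error());
    }
    let (rx, tx) = unsafe { (OwnedFd::from_raw_fd(fds[0]), OwnedFd::from_raw_fd(fds[1])) };

    let mut ring = IoUring::new(8)?;
    let mut buf = [0u8; 1];

    // Queue a read on the pipe's read end; it stays pending in the kernel
    // until data arrives or the other end is closed.
    let read_e = opcode::Read::new(types::Fd(rx.as_raw_fd()), buf.as_mut_ptr(), 1)
        .build()
        .user_data(0x42);
    unsafe { ring.submission().push(&read_e).expect("submission queue full") };
    ring.submit()?;

    // Dropping the write end is the shutdown signal: the pending read now
    // completes (here with 0 bytes, i.e. EOF), so the loop knows to exit.
    drop(tx);

    ring.submit_and_wait(1)?;
    let cqe = ring.completion().next().expect("expected a completion");
    assert_eq!(cqe.user_data(), 0x42);
    assert_eq!(cqe.result(), 0); // EOF: time to tear the I/O loop down
    Ok(())
}
```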
) -> Result<()> {
    let res = io_uring_res(cqe.result()).with_context(|| "outside send completion")? as usize;

    // We use MSG_WAITALL so this should not happen
Sorry, my google fu failed.
I could not find a reference to the MSG_WAITALL flag for the send operation.
I can see it is available for recv and friends:
https://man7.org/linux/man-pages/man2/recv.2.html
Can you please post a link where it is referenced for send?
It's mentioned in https://man7.org/linux/man-pages/man3/io_uring_prep_send.3.html. It appears to be a uring specific extension: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/io_uring/net.c#n644
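For illustration, a hedged sketch of how such a send request might be built with the io-uring crate (the fd, buffer and user_data are placeholders, not this PR's code); MSG_WAITALL asks the kernel to retry short sends internally, which is what the "should not happen" comment relies on:

```rust
use io_uring::{opcode, squeue, types};
use std::os::fd::RawFd;

// Build a send SQE with MSG_WAITALL so a short send is retried in the kernel
// and the completion should report the full buffer length (barring errors).
// NB: the buffer must stay alive until the CQE arrives; the entry only stores
// a raw pointer to it.
fn prep_send_waitall(fd: RawFd, buf: &[u8], user_data: u64) -> squeue::Entry {
    opcode::Send::new(types::Fd(fd), buf.as_ptr(), buf.len() as u32)
        .flags(libc::MSG_WAITALL)
        .build()
        .user_data(user_data)
}
```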
force-pushed from 37a3c48 to 3a5de76
This is now rebased onto #130. I've also taken a look at the equivalent code paths after the changes here and tried to make them similarly robust.
force-pushed from 07a7485 to 5c32d9e
force-pushed from d662344 to 1e39a0b
force-pushed from acde8fb to 3ff957c
force-pushed from dcabfa7 to bc5fd2a
Unsure why this PR caused it, but clippy is complaining about some more "needless_lifetime" lints, which was something we addressed a bunch of when we updated to Rust 1.83.0. I've pushed the extra fixes.
Wrong tab -- I thought this was #137 -- it's clear that the new lints here are due to rebasing over the 1.83 upgrade...
Depending on the specific implementation of the trait they may or may not need an owned version of the buffer. Since we have one already in the core in some paths we can expose that through the stack and avoid some extra allocations.
In the UDP send path we can easily and cheaply freeze the `BytesMut` buffer to get an owned `Bytes` (with a little more overhead while `aggressive_send` is enabled). However in the TCP path the use of `SendBuffer` makes this harder (if not impossible) to achieve (cheaply at least). So add `CowBytes` which can contain either a `Bytes` or a `&[u8]` slice.
Note that right now every consumer still uses the byte slice.
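A minimal sketch of the shape described above, assuming the `bytes` crate (not necessarily the exact definition used in the PR):

```rust
use bytes::{Bytes, BytesMut};

/// Either an owned `Bytes` or a borrowed byte slice, so callers that already
/// hold an owned buffer can hand it over without copying, while everyone else
/// keeps passing plain slices.
pub enum CowBytes<'a> {
    Owned(Bytes),
    Borrowed(&'a [u8]),
}

impl AsRef<[u8]> for CowBytes<'_> {
    fn as_ref(&self) -> &[u8] {
        match self {
            CowBytes::Owned(b) => b.as_ref(),
            CowBytes::Borrowed(s) => s,
        }
    }
}

impl CowBytes<'_> {
    /// Get an owned `Bytes`, copying only in the borrowed case.
    pub fn into_owned(self) -> Bytes {
        match self {
            CowBytes::Owned(b) => b,
            CowBytes::Borrowed(s) => Bytes::copy_from_slice(s),
        }
    }
}

// e.g. a UDP-style send path can cheaply produce the owned variant:
fn owned_from_udp_buf(buf: BytesMut) -> CowBytes<'static> {
    CowBytes::Owned(buf.freeze())
}
```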
force-pushed from bc5fd2a to 2cb8c0e
Description
This does not consistently improve performance but reduces CPU overheads (by
around 50%-100%, i.e. half to one core) under heavy traffic, adding
perhaps a few hundred Mbps to a speedtest.net download test and making
negligible difference to the upload test. It also removes about 1ms from the
latency in the same tests. Finally, the STDEV across multiple test runs appears
to be lower.
This appears to be due to a combination of avoiding async runtime overheads, as
well as removing various channels/queues in favour of a more direct model of
interaction between the ring and the connections.
As well as those benefits we are now able to reach the same level of
performance with far fewer slots used for the TUN rx path: here we use 64 slots
(by default) and reach the same performance as using 1024 previously. The way
uring handles blocking vs async for tun devices seems to be non-optimal. In
blocking mode things are very slow. In async mode more and more time is spent
on bookkeeping and polling as the number of slots is increased, plus a high
level of EAGAIN results (due to a request timing out after multiple failed
polls [1]) which waste time requeueing. This is related to
axboe/liburing#886 and axboe/liburing#239.
For UDP/TCP sockets io uring behaves well with the socket in blocking mode
which avoids processing lots of EAGAIN results.
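For example, std sockets are blocking by default, so this is mostly a matter of not switching the fd to non-blocking before handing it to the ring (a trivial sketch, not the PR's setup code):

```rust
use std::net::UdpSocket;

fn bind_blocking_udp(addr: &str) -> std::io::Result<UdpSocket> {
    let socket = UdpSocket::bind(addr)?;
    // Explicitly keep the socket in blocking mode; with io_uring this avoids
    // the stream of EAGAIN completions a non-blocking fd would generate.
    socket.set_nonblocking(false)?;
    Ok(socket)
}
```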
Tuning the slots for each I/O path is a bit of an art (more is definitely not
always better) and the sweet spot varies depending on the I/O device, so
provide various tunables instead of just splitting the ring evenly. With this
there's no real reason to have a very large ring, it's the number of inflight
requests which matters.
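The tunables might look roughly like the following (the field names and defaults are placeholders rather than the actual lightway-server options):

```rust
/// Hypothetical per-path slot tunables; the ring only needs to be large
/// enough to hold the sum of inflight requests, so sizing follows from these.
#[derive(Debug, Clone)]
pub struct IoUringSlots {
    /// Slots for TUN rx (64 is the default figure quoted above).
    pub tun_rx: usize,
    /// Slots for UDP/TCP socket rx.
    pub socket_rx: usize,
    /// Slots for the tx paths.
    pub tx: usize,
}

impl IoUringSlots {
    /// A ring sized to the inflight requests, rounded up to a power of two.
    pub fn ring_entries(&self) -> u32 {
        (self.tun_rx + self.socket_rx + self.tx).next_power_of_two() as u32
    }
}
```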
This is specific to the server since it relies on kernel features and
correctness (lack of bugs) which may not be upheld on an arbitrary client
system (while it is assumed that server operators have more control over what
they run). It is also not portable to non-Linux systems. It is known to work
with Linux 6.1 (as found in Debian 12 AKA bookworm).
Note that this kernel version contains a bug which causes the `iou-sqp-*`
kernel thread to get stuck (unkillable) if the tun is in blocking mode,
therefore an option is provided. Enabling that option on a kernel which
contains [the fix][] allows equivalent performance with fewer slots on the
ring.
Motivation and Context
Attempting to improve performance and CPU usage (bang per buck)
How Has This Been Tested?
I tested with a pair of EC2 c7i.8xlarge nodes, one acting as a lightway server and the other as a client. The nodes have a fast interconnect (10 Gbps).
iperf3 tests were run from the client to the server, and speedtest from the client to a speedtest node via the server.
The iperf upload results are worse; however, this does not appear to be reflected in the speedtest upload results, which should better reflect real-world outcomes.
This applies some of the techniques from #88.
I'm undecided if/how to apply this to the client, since the I/O loop is so different it's not really possible to have an implementation which flips from uring to using the Tokio abstractions other than by having two different i/o loops and an if statement.
Footnotes
1. When data becomes available all requests are woken but only one will
find data; the rest will see EAGAIN, and after a certain number of such
events io_uring will propagate this back to userspace.
[the fix]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=438b406055cd21105aad77db7938ee4720b09bee