Switch lightway-server to a direct uring only i/o loop #123

Open
wants to merge 2 commits into base: main
Conversation

xv-ian-c
Contributor

Description

This does not consistently improve performance but reduces CPU overheads (by
around 50%-100%, i.e. half to one core) under heavy traffic, adding perhaps a
few hundred Mbps to a speedtest.net download test and making a negligible
difference to the upload test. It also removes about 1ms from the latency in
the same tests. Finally, the STDEV across multiple test runs appears to be
lower.

This appears to be due to a combination of avoiding async runtime overheads, as
well as removing various channels/queues in favour of a more direct model of
interaction between the ring and the connections.

As well as those benefits, we are now able to reach the same level of
performance with far fewer slots used for the TUN rx path: here we use 64 slots
(by default) and reach the same performance as 1024 did previously. The way
uring handles blocking vs async for tun devices seems to be non-optimal. In
blocking mode things are very slow. In async mode more and more time is spent
on bookkeeping and polling, as the number of slots is increased, plus a high
level of EAGAIN results (due to a request timing out after multiple failed
polls¹) which waste time requeueing. This is related to
axboe/liburing#886 and axboe/liburing#239.

For UDP/TCP sockets io_uring behaves well with the socket in blocking mode,
which avoids processing lots of EAGAIN results.

Tuning the slots for each I/O path is a bit of an art (more is definitely not
always better) and the sweet spot varies depending on the I/O device, so
provide various tunables instead of just splitting the ring evenly. With this
there's no real reason to have a very large ring, it's the number of inflight
requests which matters.
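To make the "inflight requests, not ring size" point concrete, here is a hypothetical sketch of how per-path slot tunables might be summed into a ring size instead of splitting one large ring evenly. The field names are illustrative (only the tx count mirrors the real `--iouring-tx-count` flag mentioned later); this is not the actual server code.

```rust
// Illustrative sketch: the ring only needs to hold the total number of
// inflight requests across all I/O paths, rounded up to a power of two
// as io_uring requires. Field names are hypothetical.
#[derive(Debug, Clone, Copy)]
struct RingConfig {
    tun_rx_count: u32,  // slots for TUN read requests (e.g. 64)
    sock_rx_count: u32, // slots for UDP/TCP receive requests
    tx_count: u32,      // slots for outside send requests
}

impl RingConfig {
    fn ring_entries(&self) -> u32 {
        let inflight = self.tun_rx_count + self.sock_rx_count + self.tx_count;
        inflight.next_power_of_two()
    }
}

fn main() {
    let cfg = RingConfig { tun_rx_count: 64, sock_rx_count: 64, tx_count: 128 };
    // 64 + 64 + 128 = 256 inflight requests -> a 256-entry ring suffices;
    // no need for a large ring sized "just in case".
    assert_eq!(cfg.ring_entries(), 256);
    println!("ring entries: {}", cfg.ring_entries());
}
```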

This is specific to the server since it relies on kernel features and
correctness (i.e. lack of bugs) which may not be upheld on an arbitrary client
system (while it is assumed that server operators have more control over what
they run). It is also not portable to non-Linux systems. It is known to work
with Linux 6.1 (as found in Debian 12, AKA bookworm).

Note that this kernel version contains a bug which causes the iou-sqp-*
kernel thread to get stuck (unkillable) if the TUN device is in blocking mode,
so an option is provided. Enabling that option on a kernel which contains
[the fix][] allows equivalent performance with fewer slots on the ring.

Motivation and Context

Attempting to improve performance and CPU usage (bang per buck)

How Has This Been Tested?

I tested with a pair of EC2 c7i.8xlarge nodes, one acting as a lightway server and the other as a client. The nodes have a fast interconnect (10Gbps)

iperf3 tests were run from the client to the server and speedtest from the client to a speedtest node via the server.

| connection | iperf download | iperf upload | speedtest download | speedtest upload | CPU use download¹ | CPU use upload |
| --- | --- | --- | --- | --- | --- | --- |
| UDP | +16% | -25% | +22% | -0.75% | ~200% -> ~140% | ~270% -> ~140% |
| TCP | +33% | -13% | +32% | -1% | ~200% -> ~140% | ~240% -> ~140% |

The iperf upload results are worse, however this does not appear to be reflected in the speedtest upload results which should better reflect real world outcomes.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Performance enhancement

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • The correct base branch is being used, if not main

This applies some of the techniques from #88.

I'm undecided if/how to apply this to the client: since the I/O loop is so different, it's not really possible to have an implementation which flips from uring to the Tokio abstractions, other than by having two different I/O loops and an if statement.

Footnotes

  1. When data becomes available all requests are woken but only one will
    find data; the rest will see EAGAIN, and after a certain number of such
    events io_uring will propagate this back to userspace.

[the fix]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=438b406055cd21105aad77db7938ee4720b09bee
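The wake-all behaviour in the footnote can be modelled in a few lines: with N read requests parked on one TUN fd, each arriving packet wakes all of them but satisfies only one, so the wasted wakeups grow linearly with the slot count. A toy model (not the real mechanism, just the scaling) of why 1024 slots performed no better than 64:

```rust
// Toy model of the thundering-herd EAGAIN behaviour described in the
// footnote: every arriving packet wakes all parked requests, exactly one
// finds data, the rest see EAGAIN. Wasted wakeups scale with slot count.
fn wasted_wakeups(slots: u64, packets: u64) -> u64 {
    packets * (slots - 1)
}

fn main() {
    // Per 1000 packets: 64 slots waste 63k wakeups, 1024 slots waste >1M.
    assert_eq!(wasted_wakeups(64, 1_000), 63_000);
    assert_eq!(wasted_wakeups(1024, 1_000), 1_023_000);
    println!("1024 slots waste ~16x more wakeups than 64");
}
```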

@xv-ian-c xv-ian-c force-pushed the CVPN-1452-server-uring-only branch from a3675d3 to 447ca81 on November 11, 2024 11:39

github-actions bot commented Nov 11, 2024

Code coverage summary for 2cb8c0e:

Filename                                                     Regions    Missed Regions     Cover   Functions  Missed Functions  Executed       Lines      Missed Lines     Cover    Branches   Missed Branches     Cover
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
lightway-app-utils/src/args/cipher.rs                              7                 7     0.00%           2                 2     0.00%           6                 6     0.00%           0                 0         -
lightway-app-utils/src/args/connection_type.rs                     7                 7     0.00%           2                 2     0.00%           6                 6     0.00%           0                 0         -
lightway-app-utils/src/args/duration.rs                           10                10     0.00%           7                 7     0.00%          15                15     0.00%           0                 0         -
lightway-app-utils/src/args/ip_map.rs                             11                11     0.00%           3                 3     0.00%          15                15     0.00%           0                 0         -
lightway-app-utils/src/args/logging.rs                            20                20     0.00%           4                 4     0.00%          18                18     0.00%           0                 0         -
lightway-app-utils/src/connection_ticker.rs                       39                 4    89.74%          17                 2    88.24%         156                 5    96.79%           0                 0         -
lightway-app-utils/src/dplpmtud_timer.rs                          63                 7    88.89%          22                 4    81.82%         282                11    96.10%           0                 0         -
lightway-app-utils/src/event_stream.rs                             5                 0   100.00%           3                 0   100.00%          13                 0   100.00%           0                 0         -
lightway-app-utils/src/iouring.rs                                165               144    12.73%          26                17    34.62%         309               281     9.06%           0                 0         -
lightway-app-utils/src/metrics.rs                                  2                 2     0.00%           2                 2     0.00%           4                 4     0.00%           0                 0         -
lightway-app-utils/src/net.rs                                     41                 3    92.68%           9                 0   100.00%         135                 0   100.00%           0                 0         -
lightway-app-utils/src/sockopt/ip_mtu_discover.rs                 23                23     0.00%           4                 4     0.00%          93                93     0.00%           0                 0         -
lightway-app-utils/src/sockopt/ip_pktinfo.rs                       4                 4     0.00%           1                 1     0.00%          16                16     0.00%           0                 0         -
lightway-app-utils/src/tun.rs                                     76                76     0.00%          22                22     0.00%          94                94     0.00%           0                 0         -
lightway-app-utils/src/utils.rs                                   13                13     0.00%           1                 1     0.00%          11                11     0.00%           0                 0         -
lightway-client/src/args.rs                                       35                35     0.00%          29                29     0.00%          36                36     0.00%           0                 0         -
lightway-client/src/io/inside/tun.rs                              39                39     0.00%           7                 7     0.00%          48                48     0.00%           0                 0         -
lightway-client/src/io/outside/tcp.rs                             48                48     0.00%          10                10     0.00%          45                45     0.00%           0                 0         -
lightway-client/src/io/outside/udp.rs                             60                60     0.00%          12                12     0.00%          61                61     0.00%           0                 0         -
lightway-client/src/keepalive.rs                                 246                33    86.59%          56                 8    85.71%         372                26    93.01%           0                 0         -
lightway-client/src/lib.rs                                       177               177     0.00%          20                20     0.00%         273               273     0.00%           0                 0         -
lightway-client/src/main.rs                                       48                48     0.00%           7                 7     0.00%         166               166     0.00%           0                 0         -
lightway-core/src/borrowed_bytesmut.rs                            87                 1    98.85%          24                 0   100.00%         196                 1    99.49%           0                 0         -
lightway-core/src/builder_predicates.rs                           20                10    50.00%           4                 2    50.00%          32                16    50.00%           0                 0         -
lightway-core/src/cipher.rs                                       10                 0   100.00%           4                 0   100.00%          14                 0   100.00%           0                 0         -
lightway-core/src/connection.rs                                  486               237    51.23%          48                19    60.42%         669               283    57.70%           0                 0         -
lightway-core/src/connection/builders.rs                          63                23    63.49%          18                 7    61.11%         226                54    76.11%           0                 0         -
lightway-core/src/connection/dplpmtud.rs                         767                90    88.27%          69                 1    98.55%         950                11    98.84%           0                 0         -
lightway-core/src/connection/fragment_map.rs                     139                11    92.09%          29                 0   100.00%         292                 7    97.60%           0                 0         -
lightway-core/src/connection/io_adapter.rs                       149                17    88.59%          42                 5    88.10%         327                23    92.97%           0                 0         -
lightway-core/src/connection/key_update.rs                        23                 7    69.57%           5                 0   100.00%          38                19    50.00%           0                 0         -
lightway-core/src/context.rs                                      93                25    73.12%          26                 7    73.08%         199                40    79.90%           0                 0         -
lightway-core/src/context/ip_pool.rs                               7                 3    57.14%           1                 0   100.00%           6                 0   100.00%           0                 0         -
lightway-core/src/context/server_auth.rs                          14                11    21.43%           4                 3    25.00%          24                20    16.67%           0                 0         -
lightway-core/src/io.rs                                           11                 7    36.36%           5                 4    20.00%          20                15    25.00%           0                 0         -
lightway-core/src/lib.rs                                           9                 6    33.33%           4                 1    75.00%          18                 9    50.00%           0                 0         -
lightway-core/src/metrics.rs                                       9                 9     0.00%           7                 7     0.00%          17                17     0.00%           0                 0         -
lightway-core/src/packet.rs                                       27                 7    74.07%           4                 1    75.00%          30                 6    80.00%           0                 0         -
lightway-core/src/plugin.rs                                       70                 8    88.57%          25                 5    80.00%         162                 9    94.44%           0                 0         -
lightway-core/src/utils.rs                                       111                19    82.88%          26                 0   100.00%         184                11    94.02%           0                 0         -
lightway-core/src/version.rs                                      49                 0   100.00%          24                 0   100.00%          94                 0   100.00%           0                 0         -
lightway-core/src/wire.rs                                        154                19    87.66%          39                 0   100.00%         263                 2    99.24%           0                 0         -
lightway-core/src/wire/auth_failure.rs                             9                 1    88.89%           3                 0   100.00%          19                 0   100.00%           0                 0         -
lightway-core/src/wire/auth_request.rs                           152                12    92.11%          37                 0   100.00%         290                 0   100.00%           0                 0         -
lightway-core/src/wire/auth_success_with_config_ipv4.rs           69                 4    94.20%          13                 0   100.00%         145                 0   100.00%           0                 0         -
lightway-core/src/wire/data.rs                                    21                 1    95.24%           7                 0   100.00%          43                 0   100.00%           0                 0         -
lightway-core/src/wire/data_frag.rs                               47                 1    97.87%          19                 0   100.00%          94                 0   100.00%           0                 0         -
lightway-core/src/wire/ping.rs                                    24                 3    87.50%           8                 0   100.00%          69                 0   100.00%           0                 0         -
lightway-core/src/wire/pong.rs                                    15                 2    86.67%           5                 0   100.00%          34                 0   100.00%           0                 0         -
lightway-core/src/wire/server_config.rs                           20                 2    90.00%           5                 0   100.00%          44                 0   100.00%           0                 0         -
lightway-server/src/args.rs                                       38                38     0.00%          36                36     0.00%          36                36     0.00%           0                 0         -
lightway-server/src/auth.rs                                      109                36    66.97%          21                 5    76.19%         150                28    81.33%           0                 0         -
lightway-server/src/connection.rs                                 50                50     0.00%          23                23     0.00%         107               107     0.00%           0                 0         -
lightway-server/src/connection_manager.rs                        103               103     0.00%          34                34     0.00%         217               217     0.00%           0                 0         -
lightway-server/src/connection_manager/connection_map.rs          90                 9    90.00%          27                 2    92.59%         274                 9    96.72%           0                 0         -
lightway-server/src/io.rs                                         75                75     0.00%          13                13     0.00%         135               135     0.00%           0                 0         -
lightway-server/src/io/ffi.rs                                      4                 4     0.00%           4                 4     0.00%          12                12     0.00%           0                 0         -
lightway-server/src/io/inside/tun.rs                              79                79     0.00%          13                13     0.00%         171               171     0.00%           0                 0         -
lightway-server/src/io/outside/tcp.rs                            211               211     0.00%          22                22     0.00%         360               360     0.00%           0                 0         -
lightway-server/src/io/outside/udp.rs                            140               140     0.00%          20                20     0.00%         341               341     0.00%           0                 0         -
lightway-server/src/io/outside/udp/cmsg.rs                        48                17    64.58%          15                 6    60.00%         236                59    75.00%           0                 0         -
lightway-server/src/io/tx.rs                                      39                39     0.00%          10                10     0.00%          67                67     0.00%           0                 0         -
lightway-server/src/ip_manager.rs                                 82                25    69.51%          19                 4    78.95%         254                28    88.98%           0                 0         -
lightway-server/src/ip_manager/ip_pool.rs                        101                 1    99.01%          32                 0   100.00%         305                 0   100.00%           0                 0         -
lightway-server/src/lib.rs                                        85                75    11.76%          13                10    23.08%         185               131    29.19%           0                 0         -
lightway-server/src/main.rs                                       65                65     0.00%           9                 9     0.00%         179               179     0.00%           0                 0         -
lightway-server/src/metrics.rs                                    83                81     2.41%          74                72     2.70%         192               188     2.08%           0                 0         -
lightway-server/src/statistics.rs                                 55                23    58.18%           9                 4    55.56%         104                42    59.62%           0                 0         -
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                                                           5251              2378    54.71%        1165               513    55.97%        9998              3883    61.16%           0                 0         -

✅ Region coverage 54% passes
✅ Line coverage 61% passes

@xv-ian-c xv-ian-c force-pushed the CVPN-1452-server-uring-only branch 2 times, most recently from df8bbd4 to 43fd992 on November 14, 2024 09:54
@xv-ian-c xv-ian-c marked this pull request as ready for review November 14, 2024 09:55
@xv-ian-c xv-ian-c requested a review from a team as a code owner November 14, 2024 09:55
Contributor

@kp-mariappan-ramasamy kp-mariappan-ramasamy left a comment


Nice improvement!

I started going through the changes. Will sleep on it today to understand it better.

Overall the idea of using the same loop for all I/O seems good, and using fewer cores sounds nice!

@xv-ian-c
Contributor Author

As an experiment I tried the Zc (zero copy) variants of the operations for outside send, opcode::SendMsgZc for UDP and opcode::SendZc for TCP. Counter-intuitively it actually made things worse -- UDP was about 15% worse on iperf download, for example. I didn't do a full suite of tests for TCP but those I did do were down a few % too.

The Zc mechanism involves the possibility of a second cqe for each operation (basically so the buffer is kept in situ until the packet is actually sent, including any TCP resends), so that might explain it. Since this is Tx not Rx it's not stopping us from adding more requests to the ring, although --iouring-tx-count 1024 does look to have mitigated things a bit, so maybe we are using far more tx slots concurrently than I expected we might be.

Either way, I abandoned the experiment.
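For readers unfamiliar with the two-CQE protocol mentioned above: a zero-copy send yields a first completion carrying the result (with `IORING_CQE_F_MORE` set), and a later notification completion (`IORING_CQE_F_NOTIF`) once the kernel has released the buffer. A std-only sketch of the bookkeeping this forces on the tx slots — the types here are illustrative, not the real io-uring crate API:

```rust
// Model of zero-copy send completions: the buffer slot can only be
// reused after the second (notification) CQE arrives. Flag values match
// the kernel's IORING_CQE_F_MORE / IORING_CQE_F_NOTIF bits.
const CQE_F_MORE: u32 = 1 << 1;  // more CQEs will follow for this request
const CQE_F_NOTIF: u32 = 1 << 3; // buffer-release notification CQE

struct Cqe { user_data: u64, result: i32, flags: u32 }

struct TxSlots { in_flight: std::collections::HashSet<u64> }

impl TxSlots {
    fn handle(&mut self, cqe: &Cqe) {
        if cqe.flags & CQE_F_NOTIF != 0 {
            // Second completion: the kernel is done with the buffer.
            self.in_flight.remove(&cqe.user_data);
        } else if cqe.flags & CQE_F_MORE != 0 {
            // First completion: bytes sent, but the buffer must stay pinned.
            assert!(cqe.result >= 0);
        } else {
            // A plain (non-Zc) send completes in a single CQE.
            self.in_flight.remove(&cqe.user_data);
        }
    }
}

fn main() {
    let mut slots = TxSlots { in_flight: std::collections::HashSet::from([7]) };
    slots.handle(&Cqe { user_data: 7, result: 1400, flags: CQE_F_MORE });
    assert!(slots.in_flight.contains(&7)); // still pinned after first CQE
    slots.handle(&Cqe { user_data: 7, result: 0, flags: CQE_F_NOTIF });
    assert!(slots.in_flight.is_empty()); // slot free for reuse
}
```

This doubling of completions per send is one plausible reason the Zc variants consumed tx slots faster than expected.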

@xv-ian-c xv-ian-c force-pushed the CVPN-1452-server-uring-only branch from 8eb6038 to a25174f on November 15, 2024 13:41
@xv-raihaan-m
Contributor

As an experiment I tried the Zc (zero copy) variants of the operations for outside send, opcode::SendMsgZc for UDP and opcode::SendZc for TCP. Counter intuitively it actually made things worse -- UDP was about 15% worse on iperf download for example. I didn't do a full suite of tests for TCP but those I did do were down a few % too.

The Zc mechanism involves the possibility of a second cqe for each operation (basically so the buffer is kept in-situ until the packet is actually sent, including any tcp resends), so that might explain it. Since this is Tx not Rx it's not stopping us from adding more requests to the ring, although --iouring-tx-count 1024 does look to have mitigate things a bit so maybe we are using far more tx slots concurrently than I expected we might be.

Either way, I abandoned the experiment.

I think ZC itself supposedly yields serious gains with larger frames, so perhaps our MTU is too small to see any benefit.

@xv-ian-c
Contributor Author

Yes, that's a plausible hypothesis.

@xv-ian-c xv-ian-c force-pushed the CVPN-1452-server-uring-only branch from a25174f to 37a3c48 on November 19, 2024 08:55
}
// On exit dropping _io_handle will cause EPIPE to be delivered to
// io_cancel. This causes the corresponding read request on the
// ring to complete and signal the loop should exit.
Contributor

Is this guaranteed? AFAIK this current flow will be running in the main thread, and as soon as it exits, the process will exit.

Or does the tokio runtime join the worker threads and make it work, since we do not use threads directly and use task_blocking?

Contributor Author

Before I added this I observed that even after I killed the main thread the process kept going, because the uring userspace thread was still running (actually, IIRC, it was blocked in the submit syscall).

I don't know if it is Tokio cleaning up (as part of the #[tokio::main] infra) or if this is a Linux/POSIX thing which won't actually destroy a process while a thread still exists (being blocked in the kernel might be a relevant factor too).

The FD is guaranteed to be closed at the point lightway_server::server returns, which happens (shortly) before whatever happens on return from main. That is enough to kick the thread out of the syscall, due to the arrival of EPIPE on the other end of the pipe, which lets the process exit.
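The shutdown pattern described here — dropping a handle so that a blocked reader observes an error and exits — has a std-only analogue using channels. The real code uses a pipe and EPIPE delivered to a read request on the ring; this sketch only shows the shape of the pattern:

```rust
use std::sync::mpsc;
use std::thread;

// Drop-to-signal shutdown: the worker blocks on recv(); dropping the
// sender makes recv() return Err, which the worker treats as "exit".
// In the server the same shape uses a pipe -- dropping the write end
// delivers EPIPE/EOF to the blocked read on the ring.
fn main() {
    let (tx, rx) = mpsc::channel::<()>();

    let worker = thread::spawn(move || {
        match rx.recv() {
            Ok(()) => "woken",
            Err(_) => "shutdown", // all senders dropped: time to exit
        }
    });

    drop(tx); // the analogue of closing the pipe's write end
    assert_eq!(worker.join().unwrap(), "shutdown");
    println!("worker exited cleanly");
}
```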

) -> Result<()> {
let res = io_uring_res(cqe.result()).with_context(|| "outside send completion")? as usize;

// We use MSG_WAITALL so this should not happen
Contributor

Sorry, my google fu failed.

I could not find a reference to the MSG_WAITALL flag for the send operation.
I can see it is available for recv and friends:
https://man7.org/linux/man-pages/man2/recv.2.html

Can you please post a link to where it is referenced for send?

Contributor Author

@xv-ian-c xv-ian-c force-pushed the CVPN-1452-server-uring-only branch from 37a3c48 to 3a5de76 on November 21, 2024 14:38
@xv-ian-c xv-ian-c changed the base branch from main to CVPN-1608-cleanup-tcp-conn November 21, 2024 14:38
@xv-ian-c
Copy link
Contributor Author

This is now rebased onto #130. I've also taken a look at the equivalent code paths after the changes here and tried to make them similarly robust.

@xv-ian-c xv-ian-c force-pushed the CVPN-1452-server-uring-only branch 6 times, most recently from 07a7485 to 5c32d9e on November 21, 2024 16:15
Base automatically changed from CVPN-1608-cleanup-tcp-conn to main November 22, 2024 08:21
@xv-ian-c xv-ian-c force-pushed the CVPN-1452-server-uring-only branch 3 times, most recently from d662344 to 1e39a0b on November 22, 2024 10:15
@xv-ian-c xv-ian-c force-pushed the CVPN-1452-server-uring-only branch 2 times, most recently from acde8fb to 3ff957c on December 2, 2024 09:11
@xv-ian-c xv-ian-c force-pushed the CVPN-1452-server-uring-only branch 2 times, most recently from dcabfa7 to bc5fd2a on December 5, 2024 10:11
@xv-ian-c
Contributor Author

xv-ian-c commented Dec 5, 2024

Unsure why this PR caused it, but clippy is complaining about some more "needless_lifetime" lints, which was something we addressed a bunch of when we updated to Rust 1.83.0. I've pushed the extra fixes.

@xv-ian-c
Contributor Author

xv-ian-c commented Dec 5, 2024

Wrong tab -- I thought this was #137 -- it's clear that the new lints here are due to rebasing over the 1.83 upgrade...

Depending on the specific implementation of the trait, they may or may not need
an owned version of the buffer. Since we already have one in the core on some
paths, we can expose that through the stack and avoid some extra allocations.

In the UDP send path we can easily and cheaply freeze the `BytesMut` buffer to
get an owned `Bytes` (with a little more overhead while `aggressive_send` is
enabled). However in the TCP path the use of `SendBuffer` makes this harder (if
not impossible) to achieve (cheaply at least).

So add `CowBytes` which can contain either a `Bytes` or a `&[u8]` slice.

Note that right now every consumer still uses the byte slice.
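A hypothetical sketch of the `CowBytes` idea described in this commit message: a type holding either an owned buffer or a borrowed slice, so each send path can pass whichever it already has without forcing an allocation. The real type wraps `bytes::Bytes`; `Vec<u8>` stands in here to keep the sketch std-only.

```rust
// Illustrative CowBytes: owned or borrowed bytes behind one type, so
// callers only pay for an allocation when they hold a borrow.
enum CowBytes<'a> {
    Owned(Vec<u8>),
    Borrowed(&'a [u8]),
}

impl<'a> CowBytes<'a> {
    fn as_slice(&self) -> &[u8] {
        match self {
            CowBytes::Owned(v) => v.as_slice(),
            CowBytes::Borrowed(s) => s,
        }
    }

    /// Get an owned buffer, copying only in the borrowed case.
    fn into_owned(self) -> Vec<u8> {
        match self {
            CowBytes::Owned(v) => v,             // free: already owned
            CowBytes::Borrowed(s) => s.to_vec(), // the allocation we'd like to avoid
        }
    }
}

fn main() {
    let owned = CowBytes::Owned(vec![1, 2, 3]);
    let data = [4u8, 5, 6];
    let borrowed = CowBytes::Borrowed(&data);
    assert_eq!(owned.as_slice(), &[1, 2, 3]);
    assert_eq!(borrowed.into_owned(), vec![4, 5, 6]);
}
```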
@xv-ian-c xv-ian-c force-pushed the CVPN-1452-server-uring-only branch from bc5fd2a to 2cb8c0e on December 5, 2024 10:30