lightway-server: Use i/o uring for all i/o, not just tun.

This does not consistently improve performance, but it reduces CPU overhead under heavy traffic (by around 50%-100%, i.e. half to one core), adding perhaps a few hundred Mbps to a speedtest.net download test and making a negligible difference to the upload test. It also removes about 1ms from the latency in the same tests. Finally, the standard deviation across multiple test runs appears to be lower.

This appears to be due to a combination of avoiding async runtime overheads and removing various channels/queues in favour of a more direct model of interaction between the ring and the connections.

As well as those benefits, we are now able to reach the same level of performance with far fewer slots used for the TUN rx path: here we use 64 slots (by default) and reach the same performance as with 1024 previously.

The way io_uring handles blocking vs async for tun devices seems to be non-optimal. In blocking mode things are very slow. In async mode more and more time is spent on bookkeeping and polling as the number of slots is increased, and there is a high rate of EAGAIN results (due to a request timing out after multiple failed polls[^0]), which wastes time requeueing. This is related to axboe/liburing#886 and axboe/liburing#239.

For UDP/TCP sockets io_uring behaves well with the socket in blocking mode, which avoids processing lots of EAGAIN results.

Tuning the slots for each I/O path is a bit of an art (more is definitely not always better) and the sweet spot varies depending on the I/O device, so provide various tunables instead of just splitting the ring evenly. With this there's no real reason to have a very large ring; it's the number of inflight requests which matters.

This is specific to the server since it relies on kernel features and correctness (or lack of bugs) which may not be upheld on an arbitrary client system (while it is assumed that server operators have more control over what they run). It is also not portable to non-Linux systems.

It is known to work with Linux 6.1 (as found in Debian 12 AKA bookworm). Note that this kernel version contains a bug which causes the `iou-sqp-*` kernel thread to get stuck (unkillable) if the tun is in blocking mode, therefore an option is provided. Enabling that option on a kernel which contains [the fix][] allows equivalent performance with fewer slots on the ring.

[^0]: When data becomes available _all_ requests are woken but only one will find data; the rest will see EAGAIN, and after a certain number of such events io_uring will propagate this back to userspace.

[the fix]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=438b406055cd21105aad77db7938ee4720b09bee
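
As a rough illustration of how the tunables relate to the ring size, the sketch below encodes the sizing rule from the doc comment on `iouring_entry_count` in `lightway-server/src/args.rs`. The names `RingTunables`, `Transport`, `required_slots` and `ring_is_large_enough` are hypothetical and are not part of this commit; only the field meanings and default values are taken from the diff.

```rust
/// Hypothetical mirror of the ring-related options in `Config` (args.rs).
/// Defaults are the `default_value_t` values from the diff.
#[derive(Debug, Clone, Copy)]
struct RingTunables {
    entry_count: usize, // iouring_entry_count
    tun_rx_count: u32,  // iouring_tun_rx_count
    udp_rx_count: u32,  // iouring_udp_rx_count
    tx_count: u32,      // iouring_tx_count
}

impl Default for RingTunables {
    fn default() -> Self {
        Self {
            entry_count: 1024,
            tun_rx_count: 64,
            udp_rx_count: 32,
            tx_count: 512,
        }
    }
}

/// Hypothetical transport selector; the TCP case needs the expected
/// maximum number of connections to budget per-connection slots.
enum Transport {
    Udp,
    Tcp { max_connections: usize },
}

/// Total that `iouring_entry_count` must exceed, per the doc comment.
fn required_slots(t: &RingTunables, transport: &Transport) -> usize {
    // TUN reads + shared TX slots + 1 cancellation request.
    let base = t.tun_rx_count as usize + t.tx_count as usize + 1;
    match transport {
        // UDP adds the pool of concurrent recvmsg requests.
        Transport::Udp => base + t.udp_rx_count as usize,
        // TCP budgets 2 slots per connection, as in the doc comment.
        Transport::Tcp { max_connections } => base + 2 * max_connections,
    }
}

/// The doc comment says the ring "must be larger than" the total,
/// so the comparison is strict.
fn ring_is_large_enough(t: &RingTunables, transport: &Transport) -> bool {
    t.entry_count > required_slots(t, transport)
}
```

Under these assumptions the default UDP configuration needs 64 + 32 + 512 + 1 = 609 slots, which fits comfortably in the default ring of 1024. For TCP the same defaults leave room for 223 connections before the ring size has to be raised.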
expressvpn · Nov 14, 2024 · 43fd992
1 parent 571b010
Showing 17 changed files with 1,692 additions and 441 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -52,3 +52,4 @@ tokio-util = "0.7.10"
 tracing = "0.1.37"
 tracing-subscriber = "0.3.17"
 twelf = { version = "0.15.0", default-features = false, features = ["env", "clap", "yaml"]}
+tun = { version = "0.7.1" }
diff --git a/lightway-app-utils/Cargo.toml b/lightway-app-utils/Cargo.toml
@@ -38,7 +38,7 @@ tokio-stream = { workspace = true, optional = true }
 tokio-util.workspace = true
 tracing.workspace = true
 tracing-subscriber = { workspace = true, features = ["json"] }
-tun = { version = "0.7", features = ["async"] }
+tun = { workspace = true, features = ["async"] }
 
 [[example]]
 name = "udprelay"

diff --git a/lightway-server/Cargo.toml b/lightway-server/Cargo.toml
@@ -9,9 +9,8 @@ license = "GPL-2.0-only"
 # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
 
 [features]
-default = ["io-uring"]
+default = []
 debug = ["lightway-core/debug"]
-io-uring = ["lightway-app-utils/io-uring"]
 
 [lints]
 workspace = true
@@ -26,6 +25,7 @@ clap.workspace = true
 ctrlc.workspace = true
 delegate.workspace = true
 educe.workspace = true
+io-uring = "0.7.0"
 ipnet.workspace = true
 jsonwebtoken = "9.3.0"
 libc.workspace = true
@@ -48,6 +48,7 @@ tokio-stream = { workspace = true, features = ["time"] }
 tracing.workspace = true
 tracing-log = "0.2.0"
 tracing-subscriber = { workspace = true, features = ["json"] }
+tun.workspace = true
 twelf.workspace = true
 
 [dev-dependencies]

diff --git a/lightway-server/src/args.rs b/lightway-server/src/args.rs
@@ -71,13 +71,25 @@ pub struct Config {
     #[clap(long, default_value_t)]
     pub enable_pqc: bool,
 
-    /// Enable IO-uring interface for Tunnel
-    #[clap(long, default_value_t)]
-    pub enable_tun_iouring: bool,
-
-    /// IO-uring submission queue count. Only applicable when
-    /// `enable_tun_iouring` is `true`
-    // Any value more than 1024 negatively impact the throughput
+    /// Total IO-uring submission queue count.
+    ///
+    /// Must be larger than the total of:
+    ///
+    /// UDP:
+    ///
+    ///   iouring_tun_rx_count + iouring_udp_rx_count +
+    ///   iouring_tx_count + 1 (cancellation request)
+    ///
+    /// TCP:
+    ///
+    ///   iouring_tun_rx_count + iouring_tx_count + 1 (cancellation
+    ///   request) + 2 * maximum number of connections.
+    ///
+    ///   Each connection actually uses up to 3 slots, a persistent
+    ///   recv request and on demand slots for TX and cancellation
+    ///   (teardown).
+    ///
+    /// There is no downside to setting this much larger.
     #[clap(long, default_value_t = 1024)]
     pub iouring_entry_count: usize,
 
@@ -87,6 +99,35 @@ pub struct Config {
     #[clap(long, default_value = "100ms")]
     pub iouring_sqpoll_idle_time: Duration,
 
+    /// Number of concurrent TUN device read requests to issue to
+    /// IO-uring. Setting this too large may negatively impact
+    /// performance.
+    #[clap(long, default_value_t = 64)]
+    pub iouring_tun_rx_count: u32,
+
+    /// Configure TUN device in blocking mode. This can allow
+    /// equivalent performance with fewer `iouring-tun-rx-count`
+    /// entries but can significantly harm performance on some kernels
+    /// where the kernel does not indicate that the tun device handles
+    /// `FMODE_NOWAIT`.
+    ///
+    /// If blocking mode is enabled then `iouring_tun_rx_count` may be
+    /// set much lower.
+    ///
+    /// This was fixed by <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=438b406055cd21105aad77db7938ee4720b09bee>
+    #[clap(long, default_value_t = false)]
+    pub iouring_tun_blocking: bool,
+
+    /// Number of concurrent UDP socket recvmsg requests to issue to
+    /// IO-uring.
+    #[clap(long, default_value_t = 32)]
+    pub iouring_udp_rx_count: u32,
+
+    /// Maximum number of concurrent UDP + TUN sendmsg/write requests
+    /// to issue to IO-uring.
+    #[clap(long, default_value_t = 512)]
+    pub iouring_tx_count: u32,
+
     /// Log format
     #[clap(long, value_enum, default_value_t = LogFormat::Full)]
     pub log_format: LogFormat,