Added Gen3 load balancing
johnousterhout committed Nov 10, 2023
1 parent 767640e commit bdd7633
Showing 11 changed files with 638 additions and 101 deletions.
114 changes: 114 additions & 0 deletions balance.txt
@@ -0,0 +1,114 @@
This file discusses the issue of load-balancing in Homa.

In order to keep up with fast networks, transport protocols must distribute
their processing across multiple cores. For outgoing packets this happens
naturally: sending threads run on different cores and packet processing
for outbound packets happens on the same core as the sending thread. Things
are more difficult for incoming packets. In general, an incoming packet
will pass through 3 cores:
* NAPI/GRO: the NIC distributes incoming packets across cores using RSS.
The number of incoming channels, and their association with cores, can
be configured in software. The NIC will then distribute packets across
those channels using a hash based on packet header fields. The device
driver receives packets as part of NAPI, then packets are collected into
batches using GRO and handed off to SoftIRQ.
* SoftIRQ processing occurs on a (potentially) different core from NAPI/GRO;
the network stack runs here, including Homa's main handlers for incoming
packets. The system default is to compute another hash function on packet
headers to select a SoftIRQ core for a batch, but it is possible for GRO
to make its own choice of core, and Homa does this.
* Once a complete message is received, it is handed off to an application
thread, which typically runs on a different core.

The load balancing challenge is to distribute load across multiple cores
without overloading any individual core ("hotspots"). This has proven
quite difficult, and hotspots are the primary source of tail latency in Homa.
The most common cause of hotspots is when 2 or more of the above tasks
are assigned to the same core. For example:
* Two batches from different NAPI/GRO cores might get assigned to the same
SoftIRQ core.
* A particular core might be very busy handling NAPI/GRO for a stream of
packets in a large message; this will prevent application threads from
making progress on that core. A short message might pass through other
cores for NAPI/GRO and SoftIRQ, but if its application is running on
the busy core, then it will not be able to process the short message.

Part of the problem is that core assignments are made independently by
3 different schedulers (RSS for the NAPI/GRO core, GRO or the system for
the SoftIRQ core, and the Linux scheduler for the application core),
so conflicts are likely to occur. Only one of these schedulers is under
control of the transport protocol.

It's also important to note that using more cores isn't always the best
approach. For example, if a node is lightly loaded, it would be best to
do all RX processing on a single core: using multiple cores causes extra
cache misses as data migrates from core to core, and it also adds latency
to pass control between cores. In an ideal world, the number of cores used for
protocol processing would be just enough to keep any of them from getting
overloaded. However, it appears to be hard to vary the number of cores
without risking overloads; except in a few special cases, Homa doesn't do
this.

Homa tries to use its control over SoftIRQ scheduling to minimize hotspots.
Several different approaches have been tried over time; this document
focuses on the two most recent ones, which are called "Gen2" and "Gen3".

Gen2 Load Balancing
-------------------
* Gen2 assumes that NAPI/GRO processing is occurring on all cores.
* When GRO chooses where to assign a batch of packets for SoftIRQ, it
considers the next several cores (in ascending circular core order
after the GRO core).
* GRO uses several criteria to try to find a "good" core for SoftIRQ, such
as avoiding a core that has done recent GRO processing, or one for which
there is already pending SoftIRQ work.
* Selection stops as soon as it finds a "good" core (see the sketch after
  this list).
* If no "good" core is found, then GRO will rotate among the successor
cores on a batch-by-batch basis.
* In some cases, Gen2 will bypass the SoftIRQ handoff mechanism and simply
run SoftIRQ immediately on its core. This is done in two cases: short
packets and grant packets. Bypass is particularly useful for grants
because it eliminates the latency associated with a handoff, and grant
turnaround time is important for overall performance.
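
Below is a minimal sketch, in plain C, of the Gen2 selection idea described
above. It is illustrative only: NUM_CORES, CORES_TO_CHECK, struct core_state,
and its fields are hypothetical stand-ins; the real logic lives in
homa_gro_gen2() and uses the per-core state in struct homa_core.

    /* Hypothetical model of Gen2-style SoftIRQ core selection (not the
     * actual Homa code): scan the next few cores after the GRO core and
     * take the first one that has been idle long enough.
     */
    #define NUM_CORES      16   /* hypothetical core count */
    #define CORES_TO_CHECK  4   /* hypothetical scan window */

    struct core_state {
        unsigned long long last_gro;      /* last GRO activity on this core */
        unsigned long long last_softirq;  /* last SoftIRQ handoff to this core */
    };

    static struct core_state cores[NUM_CORES];
    static int rotate;                    /* round-robin fallback counter */

    static int gen2_pick_softirq_core(int gro_core, unsigned long long now,
                                      unsigned long long busy_cycles)
    {
        int i, c;

        for (i = 1; i <= CORES_TO_CHECK; i++) {
            c = (gro_core + i) % NUM_CORES;

            /* A "good" core has neither recent GRO activity nor recent
             * SoftIRQ work (both approximated here by timestamps).
             */
            if (now - cores[c].last_gro < busy_cycles)
                continue;
            if (now - cores[c].last_softirq < busy_cycles)
                continue;
            return c;
        }

        /* No "good" core: rotate among the successors batch by batch. */
        rotate++;
        return (gro_core + 1 + (rotate % CORES_TO_CHECK)) % NUM_CORES;
    }

Note that the bypass cases (grants and short packets) are checked before this
selection runs: they skip the handoff entirely and run the SoftIRQ processing
directly on the GRO core.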

Gen2 has several problems:
* It doesn't do anything about the problem of application threads conflicting
with NAPI/GRO or SoftIRQ.
* A single core may be assigned both SoftIRQ and NAPI/GRO work at the
same time.
* The SoftIRQ core groups for different NAPI/GRO cores overlap, so it's
possible for multiple GROs to schedule batches to the same SoftIRQ core.
* When receiving packets from a large message, Gen2 tends to alternate between
2 or more SoftIRQ cores, which results in unnecessary cache coherency
traffic.
* If the NAPI/GRO core is overloaded, bypass can make things worse (especially
since grant processing results in transmitting additional packets, which
is fairly expensive).

Gen3 Load Balancing
-------------------
The Gen3 load-balancing mechanism is an attempt to solve the problems
associated with Gen2.
* The number of channels is reduced, so that only 1/4 of the cores do
NAPI/GRO processing. This appears to be sufficient capacity to avoid
overloads on any of the NAPI/GRO cores.
* Each NAPI/GRO core has 3 other cores (statically assigned) that it can use
for SoftIRQ processing. The SoftIRQ core groups for different NAPI/GRO
cores do not overlap. This means that SoftIRQ and GRO will never happen
simultaneously on the same core, and there will be no conflicts between
the SoftIRQ groups of different NAPI/GRO cores.
* Gen3 takes steps to avoid core conflicts between application threads and
NAPI/GRO and SoftIRQ processing, as described below.
* When an application thread is using Homa actively on a core, the core
is marked as "busy". When GRO selects a SoftIRQ core, it attempts to
avoid cores that are busy with application threads. If there is a choice
of un-busy cores, GRO will try to reuse a single SoftIRQ core over and
over (see the sketch after this list).
* Homa also keeps track of recent NAPI/GRO and SoftIRQ processing on each
core. When an incoming message becomes ready and there are multiple threads
waiting for messages, Homa tries to pick a thread whose core has not had
recent Homa activity.
* Between these two mechanisms, the hope is that SoftIRQ and application
work will adjust their core assignments to avoid conflicts.
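
Below is a corresponding sketch of the GRO side of Gen3, again in plain C
with hypothetical names (gen3_pick_softirq_core, struct gen3_core_state);
the real per-core state is in struct homa_core (gen3_softirq_cores,
last_app_active) and the real logic is in homa_gro_gen3().

    /* Hypothetical model of Gen3-style SoftIRQ core selection (not the
     * actual Homa code): each NAPI/GRO core owns a static group of cores
     * that does not overlap with any other group; prefer the first core
     * in the group that is not busy with an application thread.
     */
    #define NUM_GEN3_SOFTIRQ_CORES 3

    struct gen3_core_state {
        int softirq_cores[NUM_GEN3_SOFTIRQ_CORES];  /* -1 entries are ignored */
        unsigned long long last_app_active;         /* last app use of Homa here */
    };

    static int gen3_pick_softirq_core(const struct gen3_core_state *gro_core,
                                      const struct gen3_core_state *all_cores,
                                      unsigned long long now,
                                      unsigned long long busy_cycles)
    {
        int i, c, fallback = -1;

        for (i = 0; i < NUM_GEN3_SOFTIRQ_CORES; i++) {
            c = gro_core->softirq_cores[i];
            if (c < 0)
                continue;
            if (fallback < 0)
                fallback = c;       /* remember the primary choice */

            /* Prefer the first core in the group whose last application
             * activity is older than the busy threshold.
             */
            if (now - all_cores[c].last_app_active >= busy_cycles)
                return c;
        }

        /* Every candidate is busy with an application thread; fall back
         * to the primary core rather than leaving the group.
         */
        return fallback;
    }

The other half of the mechanism, picking a waiting thread whose core has not
seen recent Homa activity, is implemented by homa_choose_interest() in the
homa_incoming.c changes below.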

Gen3 was implemented in November of 2023; so far its performance appears to be
about the same as Gen2. We
71 changes: 65 additions & 6 deletions homa_impl.h
@@ -716,6 +716,12 @@ struct homa_interest {
*/
int locked;

/**
* @core: Core on which @thread was executing when it registered
* its interest. Used for load balancing (see balance.txt).
*/
int core;

/**
* @reg_rpc: RPC whose @interest field points here, or
* NULL if none.
@@ -746,6 +752,7 @@ static void inline homa_interest_init(struct homa_interest *interest)
interest->thread = current;
atomic_long_set(&interest->ready_rpc, 0);
interest->locked = 0;
interest->core = raw_smp_processor_id();
interest->reg_rpc = NULL;
interest->request_links.next = LIST_POISON1;
interest->response_links.next = LIST_POISON1;
@@ -1950,10 +1957,12 @@ struct homa {
* order for SoftIRQ (deprecated).
* HOMA_GRO_GEN2 Use the new mechanism for selecting an
* idle core for SoftIRQ.
* HOMA_GRO_FAST_GRANTS Pass all grant I can see immediately to
* HOMA_GRO_FAST_GRANTS Pass all grants immediately to
* homa_softirq during GRO.
* HOMA_GRO_SHORT_BYPASS Pass all short packets directly to
* homa_softirq during GR).
* homa_softirq during GRO.
* HOMA_GRO_GEN3 Use the "Gen3" mechanisms for load
* balancing.
*/
#define HOMA_GRO_BYPASS 1
#define HOMA_GRO_SAME_CORE 2
@@ -1962,14 +1971,15 @@ struct homa {
#define HOMA_GRO_GEN2 16
#define HOMA_GRO_FAST_GRANTS 32
#define HOMA_GRO_SHORT_BYPASS 64
#define HOMA_GRO_GEN3 128
#define HOMA_GRO_NORMAL (HOMA_GRO_SAME_CORE|HOMA_GRO_GEN2 \
|HOMA_GRO_SHORT_BYPASS)

/*
* @busy_usecs: if there has been activity on a core within the
* last @busy_usecs, it is considered to be busy and Homa will
* try to avoid scheduling other activities on the core. Set
* externally via sysctl.
* try to avoid scheduling other activities on the core. See
* balance.txt for more on load balancing. Set externally via sysctl.
*/
int busy_usecs;

@@ -2157,6 +2167,19 @@ struct homa_metrics {
*/
__u64 slow_wakeups;

/**
* @handoffs_thread_waiting: total number of times that an RPC
* was handed off to a waiting thread (vs. being queued).
*/
__u64 handoffs_thread_waiting;

/**
* @handoffs_alt_thread: total number of times that a thread other
* than the first on the list was chosen for a handoff (because the
* first thread was on a busy core).
*/
__u64 handoffs_alt_thread;

/**
* @poll_cycles: total time spent in the polling loop in
* homa_wait_for_message, as measured with get_cycles().
@@ -2593,6 +2616,19 @@ struct homa_metrics {
*/
__u64 dropped_data_no_bufs;

/**
* @gen3_handoffs: total number of handoffs from GRO to SoftIRQ made
* by Gen3 load balancer.
*/
__u64 gen3_handoffs;

/**
* @gen3_alt_handoffs: total number of GRO->SoftIRQ handoffs that
* didn't choose the primary SoftIRQ core because it was busy with
* app threads.
*/
__u64 gen3_alt_handoffs;

/** @temp: For temporary use during testing. */
#define NUM_TEMP_METRICS 10
__u64 temp[NUM_TEMP_METRICS];
@@ -2607,8 +2643,7 @@ struct homa_core {
/**
* @last_active: the last time (in get_cycles() units) that
* there was system activity, such as NAPI or SoftIRQ, on this
* core. Used to pick a less-busy core for assigning SoftIRQ
* handlers.
* core. Used for load balancing.
*/
__u64 last_active;

@@ -2634,6 +2669,23 @@ struct homa_core {
*/
int softirq_offset;

/**
* @gen3_softirq_cores: when the Gen3 load balancer is in use,
* GRO will arrange for SoftIRQ processing to occur on one of
* these cores; -1 values are ignored (see balance.txt for more
* on load balancing). This information is filled in via sysctl.
*/
#define NUM_GEN3_SOFTIRQ_CORES 3
int gen3_softirq_cores[NUM_GEN3_SOFTIRQ_CORES];

/**
* @last_app_active: the most recent time (get_cycles() units)
* when an application was actively using Homa on this core (e.g.,
* by sending or receiving messages). Used for load balancing
* (see balance.txt).
*/
__u64 last_app_active;

/**
* held_skb: last packet buffer known to be available for
* merging other packets into on this core (note: may not still
@@ -3058,6 +3110,9 @@ extern int homa_check_nic_queue(struct homa *homa, struct sk_buff *skb,
bool force);
extern struct homa_rpc
*homa_choose_fifo_grant(struct homa *homa);
extern struct homa_interest
*homa_choose_interest(struct homa *homa, struct list_head *head,
int offset);
extern int homa_choose_rpcs_to_grant(struct homa *homa,
struct homa_rpc **rpcs, int max_rpcs);
extern void homa_close(struct sock *sock, long timeout);
@@ -3095,6 +3150,8 @@ extern int homa_getsockopt(struct sock *sk, int level, int optname,
char __user *optval, int __user *option);
extern void homa_grant_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
extern int homa_gro_complete(struct sk_buff *skb, int thoff);
extern void homa_gro_gen2(struct sk_buff *skb);
extern void homa_gro_gen3(struct sk_buff *skb);
extern struct sk_buff
*homa_gro_receive(struct list_head *gro_list,
struct sk_buff *skb);
@@ -3229,6 +3286,8 @@ extern int homa_softirq(struct sk_buff *skb);
extern void homa_spin(int usecs);
extern char *homa_symbol_for_state(struct homa_rpc *rpc);
extern char *homa_symbol_for_type(uint8_t type);
extern int homa_sysctl_softirq_cores(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);
extern void homa_timer(struct homa *homa);
extern int homa_timer_main(void *transportInfo);
extern void homa_unhash(struct sock *sk);
59 changes: 51 additions & 8 deletions homa_incoming.c
@@ -1694,6 +1694,7 @@ struct homa_rpc *homa_wait_for_message(struct homa_sock *hsk, int flags,
INC_METRIC(poll_cycles, now - poll_start);

/* Now it's time to sleep. */
homa_cores[interest.core]->last_app_active = now;
set_current_state(TASK_INTERRUPTIBLE);
rpc = (struct homa_rpc *) atomic_long_read(&interest.ready_rpc);
if (!rpc && !hsk->shutdown) {
@@ -1779,7 +1780,43 @@ struct homa_rpc *homa_wait_for_message(struct homa_sock *hsk, int flags,
}

/**
* @homa_rpc_handoff: This function is called when the input message for
* @homa_choose_interest() - Given a list of interests for an incoming
* message, choose the best one to handle it (if any).
* @homa: Overall information about the Homa transport.
* @head: Head pointers for the list of interests: either
* hsk->request_interests or hsk->response_interests.
* @offset: Offset of "next" pointers in the list elements (either
* offsetof(request_links) or offsetof(response_links)).
* Return: An interest to use for the incoming message, or NULL if none
* is available. If possible, this function tries to pick an
* interest whose thread is running on a core that isn't
* currently busy doing Homa transport work.
*/
struct homa_interest *homa_choose_interest(struct homa *homa,
struct list_head *head, int offset)
{
struct homa_interest *backup = NULL;
struct list_head *pos;
struct homa_interest *interest;
__u64 busy_time = get_cycles() - homa->busy_cycles;

list_for_each(pos, head) {
interest = (struct homa_interest *) (((char *) pos) - offset);
if (homa_cores[interest->core]->last_active < busy_time) {
if (backup != NULL)
INC_METRIC(handoffs_alt_thread, 1);
return interest;
}
if (backup == NULL)
backup = interest;
}

/* All interested threads are on busy cores; return the first. */
return backup;
}

/**
* @homa_rpc_handoff() - This function is called when the input message for
* an RPC is ready for attention from a user thread. It either notifies
* a waiting reader or queues the RPC.
* @rpc: RPC to handoff; must be locked. The caller must
@@ -1803,17 +1840,17 @@ void homa_rpc_handoff(struct homa_rpc *rpc)

/* Second, check the interest list for this type of RPC. */
if (homa_is_client(rpc->id)) {
interest = list_first_entry_or_null(
interest = homa_choose_interest(hsk->homa,
&hsk->response_interests,
struct homa_interest, response_links);
offsetof(struct homa_interest, response_links));
if (interest)
goto thread_waiting;
list_add_tail(&rpc->ready_links, &hsk->ready_responses);
INC_METRIC(responses_queued, 1);
} else {
interest = list_first_entry_or_null(
interest = homa_choose_interest(hsk->homa,
&hsk->request_interests,
struct homa_interest, request_links);
offsetof(struct homa_interest, request_links));
if (interest)
goto thread_waiting;
list_add_tail(&rpc->ready_links, &hsk->ready_requests);
@@ -1838,8 +1875,17 @@ void homa_rpc_handoff(struct homa_rpc *rpc)
*/
atomic_or(RPC_HANDING_OFF, &rpc->flags);
interest->locked = 0;
INC_METRIC(handoffs_thread_waiting, 1);
tt_record3("homa_rpc_handoff handing off id %d to pid %d on core %d",
rpc->id, interest->thread->pid,
task_cpu(interest->thread));
atomic_long_set_release(&interest->ready_rpc, (long) rpc);

/* Update the last_app_active time for the thread's core, so Homa
* will try to avoid doing any work there.
*/
homa_cores[interest->core]->last_app_active = get_cycles();

/* Clear the interest. This serves two purposes. First, it saves
* the waking thread from acquiring the socket lock again, which
* reduces contention on that lock. Second, it ensures that
@@ -1854,9 +1900,6 @@ void homa_rpc_handoff(struct homa_rpc *rpc)
if (interest->response_links.next != LIST_POISON1)
list_del(&interest->response_links);
wake_up_process(interest->thread);
tt_record3("homa_rpc_handoff handed off id %d to pid %d on core %d",
rpc->id, interest->thread->pid,
task_cpu(interest->thread));
}

/**
