Added Gen3 load balancing
johnousterhout committed Nov 10, 2023
1 parent 767640e commit bdd7633
Showing 11 changed files with 638 additions and 101 deletions.
114 changes: 114 additions & 0 deletions balance.txt
@@ -0,0 +1,114 @@
This file discusses the issue of load-balancing in Homa.

In order to keep up with fast networks, transport protocols must distribute
their processing across multiple cores. For outgoing packets this happens
naturally: sending threads run on different cores and packet processing
for outbound packets happens on the same core as the sending thread. Things
are more difficult for incoming packets. In general, an incoming packet
will pass through 3 cores:
* NAPI/GRO: the NIC distributes incoming packets across cores using RSS.
The number of incoming channels, and their association with cores, can
be configured in software. The NIC will then distribute packets across
those channels using a hash based on packet header fields. The device
driver receives packets as part of NAPI, then packets are collected into
batches using GRO and handed off to SoftIRQ.
* SoftIRQ processing occurs on a (potentially) different core from NAPI/GRO;
the network stack runs here, including Homa's main handlers for incoming
packets. The system default is to compute another hash function on packet
headers to select a SoftIRQ core for a batch, but it is possible for GRO
to make its own choice of core, and Homa does this.
* Once a complete message is received, it is handed off to an application
thread, which typically runs on a different core.

The load balancing challenge is to distribute load across multiple cores
without overloading any individual core ("hotspots"). This has proven
quite difficult, and hotspots are the primary source of tail latency in Homa.
The most common cause of hotspots is when 2 or more of the above tasks
are assigned to the same core. For example:
* Two batches from different NAPI/GRO cores might get assigned to the same
SoftIRQ core.
* A particular core might be very busy handling NAPI/GRO for a stream of
packets in a large message; this will prevent application threads from
making progress on that core. A short message might pass through other
cores for NAPI/GRO and SoftIRQ, but if its application is running on
the busy core, then it will not be able to process the short message.

Part of the problem is that core assignments are made independently by
3 different schedulers (RSS for the NAPI/GRO core, GRO or the system for
the SoftIRQ core, and the Linux scheduler for the application core),
so conflicts are likely to occur. Only one of these schedulers is under
control of the transport protocol.

It's also important to note that using more cores isn't always the best
approach. For example, if a node is lightly loaded, it would be best to
do all RX processing on a single core: using multiple cores causes extra
cache misses as data migrates from core to core, and it also adds latency
to pass control between cores. In an ideal world, the number of cores used for
protocol processing would be just enough to keep any of them from getting
overloaded. However, it appears to be hard to vary the number of cores
without risking overloads; except in a few special cases, Homa doesn't do
this.

Homa tries to use its control over SoftIRQ scheduling to minimize hotspots.
Several different approaches have been tried over time; this document
focuses on the two most recent ones, which are called "Gen2" and "Gen3".

Gen2 Load Balancing
-------------------
* Gen2 assumes that NAPI/GRO processing is occurring on all cores.
* When GRO chooses where to assign a batch of packets for SoftIRQ, it
considers the next several cores (in ascending circular core order
after the GRO core).
* GRO uses several criteria to try to find a "good" core for SoftIRQ, such
as avoiding a core that has done recent GRO processing, or one for which
there is already pending SoftIRQ work.
* Selection stops as soon as it finds a "good" core (see the sketch after
  this list).
* If no "good" core is found, then GRO will rotate among the successor
cores on a batch-by-batch basis.
* In some cases, Gen2 will bypass the SoftIRQ handoff mechanism and simply
run SoftIRQ immediately on its core. This is done in two cases: short
packets and grant packets. Bypass is particularly useful for grants
because it eliminates the latency associated with a handoff, and grant
turnaround time is important for overall performance.
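
Below is a minimal sketch, in plain C, of the Gen2 selection idea described
above. It is illustrative only: NUM_CORES, CORES_TO_CHECK, struct core_state,
and its fields are hypothetical stand-ins; the real logic lives in
homa_gro_gen2() and uses the per-core state in struct homa_core.

    /* Hypothetical model of Gen2-style SoftIRQ core selection (not the
     * actual Homa code): scan the next few cores after the GRO core and
     * take the first one that has been idle long enough.
     */
    #define NUM_CORES      16   /* hypothetical core count */
    #define CORES_TO_CHECK  4   /* hypothetical scan window */

    struct core_state {
        unsigned long long last_gro;      /* last GRO activity on this core */
        unsigned long long last_softirq;  /* last SoftIRQ handoff to this core */
    };

    static struct core_state cores[NUM_CORES];
    static int rotate;                    /* round-robin fallback counter */

    static int gen2_pick_softirq_core(int gro_core, unsigned long long now,
                                      unsigned long long busy_cycles)
    {
        int i, c;

        for (i = 1; i <= CORES_TO_CHECK; i++) {
            c = (gro_core + i) % NUM_CORES;

            /* A "good" core has neither recent GRO activity nor recent
             * SoftIRQ work (both approximated here by timestamps).
             */
            if (now - cores[c].last_gro < busy_cycles)
                continue;
            if (now - cores[c].last_softirq < busy_cycles)
                continue;
            return c;
        }

        /* No "good" core: rotate among the successors batch by batch. */
        rotate++;
        return (gro_core + 1 + (rotate % CORES_TO_CHECK)) % NUM_CORES;
    }

Note that the bypass cases (grants and short packets) are checked before this
selection runs: they skip the handoff entirely and run the SoftIRQ processing
directly on the GRO core.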

Gen2 has several problems:
* It doesn't do anything about the problem of application threads conflicting
with NAPI/GRO or SoftIRQ.
* A single core may be assigned both SoftIRQ and NAPI/GRO work at the
same time.
* The SoftIRQ core groups for different NAPI/GRO cores overlap, so it's
possible for multiple GROs to schedule batches to the same SoftIRQ core.
* When receiving packets from a large message, Gen2 tends to alternate between
2 or more SoftIRQ cores, which results in unnecessary cache coherency
traffic.
* If the NAPI/GRO core is overloaded, bypass can make things worse (especially
since grant processing results in transmitting additional packets, which
is fairly expensive).

Gen3 Load Balancing
-------------------
The Gen3 load-balancing mechanism is an attempt to solve the problems
associated with Gen2.
* The number of channels is reduced, so that only 1/4 of the cores do
NAPI/GRO processing. This appears to be sufficient capacity to avoid
overloads on any of the NAPI/GRO cores.
* Each NAPI/GRO core has 3 other cores (statically assigned) that it can use
for SoftIRQ processing. The SoftIRQ core groups for different NAPI/GRO
cores do not overlap. This means that SoftIRQ and GRO will never happen
simultaneously on the same core, and there will be no conflicts between
the SoftIRQ groups of different NAPI/GRO cores.
* Gen3 takes steps to avoid core conflicts between application threads and
NAPI/GRO and SoftIRQ processing, as described below.
* When an application thread is using Homa actively on a core, the core
is marked as "busy". When GRO selects a SoftIRQ core, it attempts to
avoid cores that are busy with application threads. If there is a choice
of un-busy cores, GRO will try to reuse a single SoftIRQ core over and
over (see the sketch after this list).
* Homa also keeps track of recent NAPI/GRO and SoftIRQ processing on each
core. When an incoming message becomes ready and there are multiple threads
waiting for messages, Homa tries to pick a thread whose core has not had
recent Homa activity.
* Between these two mechanisms, the hope is that SoftIRQ and application
work will adjust their core assignments to avoid conflicts.
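
Below is a corresponding sketch of the GRO side of Gen3, again in plain C
with hypothetical names (gen3_pick_softirq_core, struct gen3_core_state);
the real per-core state is in struct homa_core (gen3_softirq_cores,
last_app_active) and the real logic is in homa_gro_gen3().

    /* Hypothetical model of Gen3-style SoftIRQ core selection (not the
     * actual Homa code): each NAPI/GRO core owns a static group of cores
     * that does not overlap with any other group; prefer the first core
     * in the group that is not busy with an application thread.
     */
    #define NUM_GEN3_SOFTIRQ_CORES 3

    struct gen3_core_state {
        int softirq_cores[NUM_GEN3_SOFTIRQ_CORES];  /* -1 entries are ignored */
        unsigned long long last_app_active;         /* last app use of Homa here */
    };

    static int gen3_pick_softirq_core(const struct gen3_core_state *gro_core,
                                      const struct gen3_core_state *all_cores,
                                      unsigned long long now,
                                      unsigned long long busy_cycles)
    {
        int i, c, fallback = -1;

        for (i = 0; i < NUM_GEN3_SOFTIRQ_CORES; i++) {
            c = gro_core->softirq_cores[i];
            if (c < 0)
                continue;
            if (fallback < 0)
                fallback = c;       /* remember the primary choice */

            /* Prefer the first core in the group whose last application
             * activity is older than the busy threshold.
             */
            if (now - all_cores[c].last_app_active >= busy_cycles)
                return c;
        }

        /* Every candidate is busy with an application thread; fall back
         * to the primary core rather than leaving the group.
         */
        return fallback;
    }

The other half of the mechanism, picking a waiting thread whose core has not
seen recent Homa activity, is implemented by homa_choose_interest() in the
homa_incoming.c changes below.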

Gen3 was implemented in November of 2023; so far its performance appears to be
about the same as Gen2. We
71 changes: 65 additions & 6 deletions homa_impl.h
@@ -716,6 +716,12 @@ struct homa_interest {
*/
int locked;

/**
* @core: Core on which @thread was executing when it registered
* its interest. Used for load balancing (see balance.txt).
*/
int core;

/**
* @reg_rpc: RPC whose @interest field points here, or
* NULL if none.
@@ -746,6 +752,7 @@ static void inline homa_interest_init(struct homa_interest *interest)
interest->thread = current;
atomic_long_set(&interest->ready_rpc, 0);
interest->locked = 0;
interest->core = raw_smp_processor_id();
interest->reg_rpc = NULL;
interest->request_links.next = LIST_POISON1;
interest->response_links.next = LIST_POISON1;
@@ -1950,10 +1957,12 @@ struct homa {
* order for SoftIRQ (deprecated).
* HOMA_GRO_GEN2 Use the new mechanism for selecting an
* idle core for SoftIRQ.
* HOMA_GRO_FAST_GRANTS Pass all grant I can see immediately to
* HOMA_GRO_FAST_GRANTS Pass all grants immediately to
* homa_softirq during GRO.
* HOMA_GRO_SHORT_BYPASS Pass all short packets directly to
* homa_softirq during GR).
* homa_softirq during GRO.
* HOMA_GRO_GEN3 Use the "Gen3" mechanisms for load
* balancing.
*/
#define HOMA_GRO_BYPASS 1
#define HOMA_GRO_SAME_CORE 2
@@ -1962,14 +1971,15 @@ struct homa {
#define HOMA_GRO_GEN2 16
#define HOMA_GRO_FAST_GRANTS 32
#define HOMA_GRO_SHORT_BYPASS 64
#define HOMA_GRO_GEN3 128
#define HOMA_GRO_NORMAL (HOMA_GRO_SAME_CORE|HOMA_GRO_GEN2 \
|HOMA_GRO_SHORT_BYPASS)

/*
* @busy_usecs: if there has been activity on a core within the
* last @busy_usecs, it is considered to be busy and Homa will
* try to avoid scheduling other activities on the core. Set
* externally via sysctl.
* try to avoid scheduling other activities on the core. See
* balance.txt for more on load balancing. Set externally via sysctl.
*/
int busy_usecs;

@@ -2157,6 +2167,19 @@ struct homa_metrics {
*/
__u64 slow_wakeups;

/**
* @handoffs_thread_waiting: total number of times that an RPC
* was handed off to a waiting thread (vs. being queued).
*/
__u64 handoffs_thread_waiting;

/**
* @handoffs_alt_thread: total number of times that a thread other
* than the first on the list was chosen for a handoff (because the
* first thread was on a busy core).
*/
__u64 handoffs_alt_thread;

/**
* @poll_cycles: total time spent in the polling loop in
* homa_wait_for_message, as measured with get_cycles().
@@ -2593,6 +2616,19 @@ struct homa_metrics {
*/
__u64 dropped_data_no_bufs;

/**
* @gen3_handoffs: total number of handoffs from GRO to SoftIRQ made
* by Gen3 load balancer.
*/
__u64 gen3_handoffs;

/**
* @gen3_alt_handoffs: total number of GRO->SoftIRQ handoffs that
* didn't choose the primary SoftIRQ core because it was busy with
* app threads.
*/
__u64 gen3_alt_handoffs;

/** @temp: For temporary use during testing. */
#define NUM_TEMP_METRICS 10
__u64 temp[NUM_TEMP_METRICS];
@@ -2607,8 +2643,7 @@ struct homa_core {
/**
* @last_active: the last time (in get_cycles() units) that
* there was system activity, such as NAPI or SoftIRQ, on this
* core. Used to pick a less-busy core for assigning SoftIRQ
* handlers.
* core. Used for load balancing.
*/
__u64 last_active;

@@ -2634,6 +2669,23 @@ struct homa_core {
*/
int softirq_offset;

/**
* @gen3_softirq_cores: when the Gen3 load balancer is in use,
* GRO will arrange for SoftIRQ processing to occur on one of
* these cores; -1 values are ignored (see balance.txt for more
* on load balancing). This information is filled in via sysctl.
*/
#define NUM_GEN3_SOFTIRQ_CORES 3
int gen3_softirq_cores[NUM_GEN3_SOFTIRQ_CORES];

/**
* @last_app_active: the most recent time (get_cycles() units)
* when an application was actively using Homa on this core (e.g.,
* by sending or receiving messages). Used for load balancing
* (see balance.txt).
*/
__u64 last_app_active;

/**
* held_skb: last packet buffer known to be available for
* merging other packets into on this core (note: may not still
@@ -3058,6 +3110,9 @@ extern int homa_check_nic_queue(struct homa *homa, struct sk_buff *skb,
bool force);
extern struct homa_rpc
*homa_choose_fifo_grant(struct homa *homa);
extern struct homa_interest
*homa_choose_interest(struct homa *homa, struct list_head *head,
int offset);
extern int homa_choose_rpcs_to_grant(struct homa *homa,
struct homa_rpc **rpcs, int max_rpcs);
extern void homa_close(struct sock *sock, long timeout);
@@ -3095,6 +3150,8 @@ extern int homa_getsockopt(struct sock *sk, int level, int optname,
char __user *optval, int __user *option);
extern void homa_grant_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
extern int homa_gro_complete(struct sk_buff *skb, int thoff);
extern void homa_gro_gen2(struct sk_buff *skb);
extern void homa_gro_gen3(struct sk_buff *skb);
extern struct sk_buff
*homa_gro_receive(struct list_head *gro_list,
struct sk_buff *skb);
@@ -3229,6 +3286,8 @@ extern int homa_softirq(struct sk_buff *skb);
extern void homa_spin(int usecs);
extern char *homa_symbol_for_state(struct homa_rpc *rpc);
extern char *homa_symbol_for_type(uint8_t type);
extern int homa_sysctl_softirq_cores(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);
extern void homa_timer(struct homa *homa);
extern int homa_timer_main(void *transportInfo);
extern void homa_unhash(struct sock *sk);
59 changes: 51 additions & 8 deletions homa_incoming.c
@@ -1694,6 +1694,7 @@ struct homa_rpc *homa_wait_for_message(struct homa_sock *hsk, int flags,
INC_METRIC(poll_cycles, now - poll_start);

/* Now it's time to sleep. */
homa_cores[interest.core]->last_app_active = now;
set_current_state(TASK_INTERRUPTIBLE);
rpc = (struct homa_rpc *) atomic_long_read(&interest.ready_rpc);
if (!rpc && !hsk->shutdown) {
@@ -1779,7 +1780,43 @@ struct homa_rpc *homa_wait_for_message(struct homa_sock *hsk, int flags,
}

/**
* @homa_rpc_handoff: This function is called when the input message for
* @homa_choose_interest() - Given a list of interests for an incoming
* message, choose the best one to handle it (if any).
* @homa: Overall information about the Homa transport.
* @head: Head pointers for the list of interests: either
* hsk->request_interests or hsk->response_interests.
* @offset: Offset of "next" pointers in the list elements (either
* offsetof(request_links) or offsetof(response_links)).
* Return: An interest to use for the incoming message, or NULL if none
* is available. If possible, this function tries to pick an
* interest whose thread is running on a core that isn't
* currently busy doing Homa transport work.
*/
struct homa_interest *homa_choose_interest(struct homa *homa,
struct list_head *head, int offset)
{
struct homa_interest *backup = NULL;
struct list_head *pos;
struct homa_interest *interest;
__u64 busy_time = get_cycles() - homa->busy_cycles;

list_for_each(pos, head) {
interest = (struct homa_interest *) (((char *) pos) - offset);
if (homa_cores[interest->core]->last_active < busy_time) {
if (backup != NULL)
INC_METRIC(handoffs_alt_thread, 1);
return interest;
}
if (backup == NULL)
backup = interest;
}

/* All interested threads are on busy cores; return the first. */
return backup;
}

/**
* @homa_rpc_handoff() - This function is called when the input message for
* an RPC is ready for attention from a user thread. It either notifies
* a waiting reader or queues the RPC.
* @rpc: RPC to handoff; must be locked. The caller must
@@ -1803,17 +1840,17 @@ void homa_rpc_handoff(struct homa_rpc *rpc)

/* Second, check the interest list for this type of RPC. */
if (homa_is_client(rpc->id)) {
interest = list_first_entry_or_null(
interest = homa_choose_interest(hsk->homa,
&hsk->response_interests,
struct homa_interest, response_links);
offsetof(struct homa_interest, response_links));
if (interest)
goto thread_waiting;
list_add_tail(&rpc->ready_links, &hsk->ready_responses);
INC_METRIC(responses_queued, 1);
} else {
interest = list_first_entry_or_null(
interest = homa_choose_interest(hsk->homa,
&hsk->request_interests,
struct homa_interest, request_links);
offsetof(struct homa_interest, request_links));
if (interest)
goto thread_waiting;
list_add_tail(&rpc->ready_links, &hsk->ready_requests);
@@ -1838,8 +1875,17 @@ void homa_rpc_handoff(struct homa_rpc *rpc)
*/
atomic_or(RPC_HANDING_OFF, &rpc->flags);
interest->locked = 0;
INC_METRIC(handoffs_thread_waiting, 1);
tt_record3("homa_rpc_handoff handing off id %d to pid %d on core %d",
rpc->id, interest->thread->pid,
task_cpu(interest->thread));
atomic_long_set_release(&interest->ready_rpc, (long) rpc);

/* Update the last_app_active time for the thread's core, so Homa
* will try to avoid doing any work there.
*/
homa_cores[interest->core]->last_app_active = get_cycles();

/* Clear the interest. This serves two purposes. First, it saves
* the waking thread from acquiring the socket lock again, which
* reduces contention on that lock. Second, it ensures that
@@ -1854,9 +1900,6 @@ void homa_rpc_handoff(struct homa_rpc *rpc)
if (interest->response_links.next != LIST_POISON1)
list_del(&interest->response_links);
wake_up_process(interest->thread);
tt_record3("homa_rpc_handoff handed off id %d to pid %d on core %d",
rpc->id, interest->thread->pid,
task_cpu(interest->thread));
}

/**
