Example work stealing chunked task allocator for issue #291 #2026

Draft
wants to merge 7 commits into base: master

Conversation


@mqy (Contributor) commented Jun 27, 2023

Background

#1507
ggerganov/ggml#291

The current master splits node computation at the per-thread level. @ggerganov wants threads to compete for smaller chunks, but is concerned about hurting cache locality or NUMA behavior. I quickly implemented this PR for:

  • demonstrating an explicit work-stealing task allocator.
  • investigating how to make it cache friendly. I'm not an expert in this domain, but I think the problem deserves a try.

Design

Firstly, I assume a node runner that wants to parallelize its computation must know how to split its workload. This is true on the master branch: given ith and nth, most workloads are split into rows.

In this example, n_multiplier defines how many chunks each worker's share of the workload is split into.
Every chunk is identified by an index into a per-worker task queue, and consecutive chunk ids map to consecutive rows (contiguous memory).

Chunk groups are assigned to workers, and other workers are allowed to steal work from the end.
Suppose we assign two logical chunk groups to two workers:

1 2 3   <-- worker A's task queue
4 5 6   <-- worker B's task queue

Each worker takes chunks from the head of its own queue; a stealer takes chunks from the tail once it has finished its own tasks. This design at least guarantees that a chunk owner processes its data sequentially (as long as no chunks have been stolen yet).

Example

void ggml_compute_forward_xxx(struct params params, struct ggml_tensor * node) {
    const int ith = params.ith;
    int chunk_idx;
    int n_chunks;

    while (true) {
        // take the next chunk: from this worker's own queue, or
        // stolen from another worker's tail once ours is empty
        allocate_chunk(params.task_allocator, ith, &chunk_idx, &n_chunks);
        if (chunk_idx < 0 || n_chunks <= 0) {
            break; // no work left anywhere
        }

        const int nr = ...;                            // total rows for this node
        const int dr = (nr + n_chunks - 1) / n_chunks; // rows per chunk, rounded up
        const int ir0 = dr * chunk_idx;                // first row of this chunk
        const int ir1 = MIN(ir0 + dr, nr);             // one past the last row
        // compute rows [ir0, ir1) ...
    }
}

@mqy mqy force-pushed the ggml_chunked_allocator branch from 438ff02 to fef9eac Compare June 27, 2023 20:38
@mqy mqy marked this pull request as draft June 27, 2023 20:40
*chunk_idx = -1;
*n_chunks = total_chunks;

while (atomic_fetch_add(&a->lock, 1) != 0) { // lock
mqy (Contributor, Author) commented:
This is a bad design, because:

  1. an individual thread MUST NOT block on its own queue.
  2. we should avoid spin locks in user space: think about a thread that takes the lock and then gets scheduled out.
  3. atomic_flag may be cheaper than the combination of atomic_fetch_add and atomic_fetch_sub.
    But since the contention is light, perhaps a mutex is the best choice?
