Example work stealing chunked task allocator for issue #291 #2026

Draft
wants to merge 7 commits into base: master

Conversation


@mqy (Contributor) commented Jun 27, 2023

Background

#1507
ggerganov/ggml#291

The current master splits node computation at the per-thread level. @ggerganov wants threads to compete for smaller chunks, but is concerned about hurting cache locality or NUMA behavior. I quickly implemented this PR for:

  • demonstrating an explicit work-stealing task allocator.
  • investigating how to make it cache friendly. I'm not an expert in this domain, but I think the problem deserves a try.

Design

Firstly, I assume a node runner that wants to parallelize its computation must know how to split its workload. This is true on the master branch: given ith and nth, most workloads are split into rows.

In this example, n_multiplier defines how many chunks each worker's share of the workload is split into.
Every chunk is identified by an index into a per-worker task queue, and consecutive chunk ids map to consecutive rows (contiguous memory).

Chunk groups are assigned to workers, and other workers are allowed to steal work from the end.
Suppose we assign two logical chunk groups to two workers:

1 2 3   <-- worker A's task queue
4 5 6   <-- worker B's task queue

Each worker takes chunks from the head of its own queue; a stealer takes chunks from the tail once it has finished its own tasks. This design at least guarantees that a chunk owner processes its data sequentially (as long as no chunks have been stolen yet).

Example

void ggml_compute_forward_xxx(struct params params, struct ggml_tensor * node) {
    const int ith = params.ith;
    int chunk_idx;
    int n_chunks;

    while (true) {
        // take the next chunk: from this worker's own queue, or
        // stolen from another worker's tail once ours is empty
        allocate_chunk(params.task_allocator, ith, &chunk_idx, &n_chunks);
        if (chunk_idx < 0 || n_chunks <= 0) {
            break; // no work left anywhere
        }

        const int nr = ...;                            // total rows for this node
        const int dr = (nr + n_chunks - 1) / n_chunks; // rows per chunk, rounded up
        const int ir0 = dr * chunk_idx;                // first row of this chunk
        const int ir1 = MIN(ir0 + dr, nr);             // one past the last row
        // compute rows [ir0, ir1) ...
    }
}

@mqy mqy force-pushed the ggml_chunked_allocator branch from 438ff02 to fef9eac Compare June 27, 2023 20:38
@mqy mqy marked this pull request as draft June 27, 2023 20:40
*chunk_idx = -1;
*n_chunks = total_chunks;

while (atomic_fetch_add(&a->lock, 1) != 0) { // lock
mqy (Contributor, Author) commented:
This is a bad design, because:

  1. an individual thread MUST NOT block on its own queue.
  2. we should avoid spin locks in user space: think about a thread that takes the lock and then gets scheduled out.
  3. atomic_flag may be cheaper than the combination of atomic_fetch_add and atomic_fetch_sub.
    But since the contention is light, perhaps a mutex is the best choice?
