ggml : add optional CPU backend context, support reusing threads, async compute #721

slaren · 2024-02-01T15:50:23Z

As recently seen in llama.cpp (ggerganov/llama.cpp#5226), the cost of starting the threads of the CPU backend is not insignificant. To address this, I propose adding a new CPU context object that holds the threads and can reuse them between invocations. Additionally, this CPU context would behave as an asynchronous queue, so that multiple graph evaluations could be queued into the object. This would enable the implementation of pipeline parallelism with the CPU and GPU backends (ref: ggerganov/llama.cpp#4918 (comment)).

Possible API:

ggml_compute_context_t ggml_compute_context_init(int n_threads);
void ggml_graph_compute_async(ggml_compute_context_t context, struct ggml_cgraph * graph);
void ggml_synchronize(ggml_compute_context_t context);

ggerganov · 2024-02-01T16:29:47Z

Would the threads wait on a condition variable while not running? I've done some testing in the past to maintain a global pool of threads and wake them when there is work (ggerganov/whisper.cpp#343). This didn't seem to help the performance much, but it's possible that the implementation was not done ideally

Regardless if there is performance gain or not, the rest of the functionality that would be enabled is worth it alone

slaren · 2024-02-01T16:38:10Z

Yes, the threads would wait on a condition variable or something to the same effect. In linux and possibly macOS, the overhead of creating a thread and waking a blocked thread is probably close enough that for large graphs it wouldn't make much difference, but for very small graphs like the ones used often by ggml_backend_sched it may be significant.

slaren · 2024-06-19T21:18:56Z

I don't think it is worth exploring the idea of pipeline parallelism between the CPU and GPU anymore because the current implementation of partial offloading with ggml_backend_sched doesn't use the CPU at all for large batches, and this is likely always going to be faster than running anything on the CPU. It is no longer necessary to implementing async computation support in the CPU backend, since this was the main motivation.

Thread pools have been implemented using OpenMP, and it significantly reduces the overhead under linux and windows. On macOS, OpenMP is not supported by default. I have tried a few different thread pool implementations, including using futexes with undocumented APIs, and they were all significantly slower than starting new threads on every evaluation. It is possible to install the OpenMP library with brew and force the Apple clang compiler to use it, but that is also significantly slower than creating new threads. Implementing an efficient thread pool for macOS does not seem to be trivial, and the thread launch overhead seems to be low enough that it is probably not necessary.

Therefore I think that this issue can be considered completed.

ggerganov added the enhancement New feature or request label Feb 1, 2024

ggerganov added this to ggml : roadmap Feb 1, 2024

ggerganov moved this to Todo in ggml : roadmap Feb 1, 2024

slaren closed this as completed Jun 19, 2024

ggerganov moved this from Todo to Done in ggml : roadmap Jun 20, 2024

slaren mentioned this issue Aug 27, 2024

Threadpool: take 2 ggerganov/llama.cpp#8672

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ggml : add optional CPU backend context, support reusing threads, async compute #721

ggml : add optional CPU backend context, support reusing threads, async compute #721

slaren commented Feb 1, 2024 •

edited

Loading

ggerganov commented Feb 1, 2024

slaren commented Feb 1, 2024

slaren commented Jun 19, 2024

ggml : add optional CPU backend context, support reusing threads, async compute #721

ggml : add optional CPU backend context, support reusing threads, async compute #721

Comments

slaren commented Feb 1, 2024 • edited Loading

ggerganov commented Feb 1, 2024

slaren commented Feb 1, 2024

slaren commented Jun 19, 2024

slaren commented Feb 1, 2024 •

edited

Loading