Large sched_yield() and threading overhead (+25-40% perf boost) #2134
The initial implementation indeed used only busy loops on atomic variables, without sched_yield(). I just redid the tests and I don't see any performance difference with or without it. I'm surprised by the amount of time you see spent there. Regarding your point about a thread pool - I did try to implement a thread pool some time ago: it didn't provide any improvement, but I'm open to revisiting this idea today.
I can confirm this. Let me list some of the changes to ggml threading that I know of:
The actual behavior of sched_yield() depends on the OS and its scheduler.
Hm, I don't think there is a case where removing it can degrade performance. It could only potentially increase the energy consumption, maybe.
Is there a reason not to use pthread condition variables here?
A typical compute graph contains more than a thousand tensor nodes to compute; most per-node compute times are a few microseconds (often less than 1 us), while a mul_mat for a large prompt may take more than 1 ms. A typical pthread cond wait or notify/broadcast takes at least several microseconds, at worst tens of microseconds.
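To put rough numbers on that claim for a given machine, a tiny ping-pong micro-benchmark such as the sketch below (a standalone illustration, not part of ggml) measures the cond signal/wait round-trip cost:

```c
// Hypothetical micro-benchmark: measures pthread cond signal/wait round-trip cost.
// Build: cc -O2 -pthread condlat.c -o condlat
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int turn = 0;                 // 0: main's turn, 1: worker's turn
#define ITERS 100000

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&mtx);
        while (turn != 1)
            pthread_cond_wait(&cv, &mtx);   // wait for main to hand over the turn
        turn = 0;
        pthread_cond_signal(&cv);           // hand the turn back
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    struct timespec t0, t1;
    pthread_create(&t, NULL, worker, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&mtx);
        turn = 1;
        pthread_cond_signal(&cv);
        while (turn != 0)
            pthread_cond_wait(&cv, &mtx);   // wait for the worker's reply
        pthread_mutex_unlock(&mtx);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg cond round trip: %.1f ns\n", ns / ITERS);
    return 0;
}
```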
I might have found such a case: there is no point in running so many threads for generation, since it's RAM-bound anyway. When running only 6 threads, there was no difference between yielding and not yielding. The text generation speed was also roughly 12% faster than in the previous test with 32 threads. The prompt processing was of course slower, independently of yielding, because it scales with the number of cores. The energy consumption appeared surprisingly similar between the two versions. I didn't measure it precisely, but a big difference would have stood out. Maybe others who have machines with a high core count can check whether they also observe that effect on their systems?
It highly depends on what OS you're running on. If that's Linux, then that's good feedback; if it's Windows, then sched_yield() just reduces to Sleep(0), so it doesn't mean much.
In any case, pauses and sched_yield() ought to only be called after a failure, and never before the first attempt, as sched_yield() is extremely expensive, being a syscall (count ~400 cycles). In other code, I'm used to using incremental exponential backoff: loop without any pause a few thousand times, then with an increasing count of PAUSE instructions a few tens of thousands of times, then only switch to sched_yield() as a last resort. I ran some tests here a few months ago and gained less than 1% with this, the contention being elsewhere, but eventually I should redo some tests.
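A minimal sketch of that kind of incremental backoff, with illustrative names such as spin_pause() and wait_for_change() (this is not the actual ggml code), could look like:

```c
// Sketch of the incremental backoff described above; thresholds are arbitrary.
#include <sched.h>
#include <stdatomic.h>

#if defined(__x86_64__)
#  include <immintrin.h>
#  define spin_pause() _mm_pause()
#elif defined(__aarch64__)
// "dmb ish" as in the patch below; "isb" or "yield" are common alternatives
#  define spin_pause() __asm__ __volatile__("dmb ish" ::: "memory")
#else
#  define spin_pause() ((void)0)
#endif

// Spin until *flag moves away from 'last', escalating the cost of each wait.
static void wait_for_change(const atomic_int *flag, int last) {
    unsigned tries = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) == last) {
        tries++;
        if (tries < 4096) {
            // phase 1: plain re-check, cheapest when the wait is very short
        } else if (tries < 65536) {
            // phase 2: growing bursts of pause instructions
            for (unsigned i = 0; i < (1u << ((tries >> 12) & 7)); i++)
                spin_pause();
        } else {
            // phase 3: last resort, hand the CPU back to the scheduler
            sched_yield();
        }
    }
}
```

The exact thresholds don't matter much; the point is only the escalation order (plain spin, then pause, then yield).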
Ouch! The removal is catastrophic here (80-core ARM Ampere Altra / Neoverse N1). Before (commit ed9a54e, with sched_yield()):
After (commit 1d16309 with sched_yield() commented out):
It got 15 times slower! With the sched_yield() moved after the test:
The performance is back (and admittedly the context-switch count is huge). Let me check with pause instructions first.
With 10 dummy loops before triggering sched_yield() I'm already getting this, which is better than the original, with context switches cut in half:
Now with a real pause instruction (dmb ish on this aarch64 machine):
Increasing the sched_yield() trigger threshold to 50 iterations yields even more gains:
So basically we were at 4.52 tok/s with sched_yield(), 0.32 tok/s without, and 5.25 with it slightly rearranged, hence 16% faster than the original and 16 times faster than the current master. The patch should be cut in two parts, one to define a pause() instruction and one to use it. I've only done x86_64 and aarch64, but armv7 could benefit from it as well. I cannot test on Windows and I don't know if Windows on ARM is supported by the project, because the asm statement might differ there. First patch, to define ggml_pause():

diff --git a/ggml.c b/ggml.c
index c10877a..fcc9a5e 100644
--- a/ggml.c
+++ b/ggml.c
@@ -15887,6 +15887,14 @@ typedef pthread_t ggml_thread_t;
#endif
+#if defined(__x86_64__) || (defined(_MSC_VER) && defined(_M_AMD64))
+#define ggml_pause() _mm_pause()
+#elif defined(__aarch64__)
+#define ggml_pause() __asm__("dmb ish")
+#else
+#define ggml_pause() __asm__("")
+#endif
+

Second one, to use it:

diff --git a/ggml.c b/ggml.c
index c10877a..fcc9a5e 100644
--- a/ggml.c
+++ b/ggml.c
@@ -16045,10 +16053,19 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
} else {
// wait for other threads to finish
const int last = node_n;
- do {
- //sched_yield();
+ unsigned int retries = 0;
+ unsigned int loops;
+
+ while (1) {
node_n = atomic_load(&state->shared->node_n);
- } while (node_n == last);
+ if (node_n != last)
+ break;
+ retries++;
+ if (retries > 50)
+ sched_yield();
+ for (loops = 0; loops < (1U << ((retries-1) & 0x7)); loops++)
+ ggml_pause();
+ }
}
 // check if we should stop
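For illustration, in this patch the burst of pause instructions grows as 1 << ((retries-1) & 0x7), i.e. 1, 2, 4, ... 128 and then wraps, while sched_yield() only kicks in after 50 failed checks. A tiny standalone program (it just replays the formulas above) makes the schedule visible:

```c
// Standalone illustration of the backoff schedule used in the patch above.
#include <stdio.h>

int main(void) {
    for (unsigned int retries = 1; retries <= 60; retries++) {
        unsigned int pauses = 1U << ((retries - 1) & 0x7); // 1,2,4,...,128, then wraps
        printf("retry %2u: %3u pause insns%s\n",
               retries, pauses, retries > 50 ? " + sched_yield()" : "");
    }
    return 0;
}
```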
marl uses nop + sched_yield(). Since we are not aiming at energy savings and pause may slow down the CPU frequency, I prefer nop.
The problem doesn't have anything to do with energy savings but with the extreme hammering of a shared cache line that every other thread tries to get exclusive access to, so nobody can work: each finishing thread has to wait for all the other ones to give that cache line back for exclusive access just to perform the atomic_fetch_sub(). I already tried to split that shared state. Also, on ARM the "dmb ish" (equivalent to pause) is much more efficient than a series of NOPs because it only flushes the instruction queue (not the cache), so it effectively adds a hole in the pipeline during which a core doesn't hinder the other ones. On x86 pause is very interesting because it relinquishes the execution ports to the sibling thread of the same core for a few tens to hundreds of cycles, which is great as well (better performance on the other one, no contention). I can try to have a look at this later.
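To make the cache-line argument concrete, here is a generic sketch (field names are illustrative, not ggml's actual layout) of the two usual mitigations: keeping hot counters on separate cache lines, and letting waiters poll with plain loads so they keep the line in shared state, while only the finishing thread's atomic_fetch_sub() needs exclusive ownership:

```c
// Generic sketch: avoid false sharing and avoid needless exclusive ownership.
#include <stdatomic.h>

struct shared_state {
    _Alignas(64) atomic_int n_active; // threads still working on the current node
    _Alignas(64) atomic_int node_n;   // index of the node being computed
};

// A worker that finished its slice: one RMW, which does need the line exclusively.
static void finish_slice(struct shared_state *s) {
    atomic_fetch_sub_explicit(&s->n_active, 1, memory_order_acq_rel);
}

// A waiter: read-only polling keeps the cache line shared across cores,
// instead of forcing it to bounce for exclusive ownership on every check.
static int wait_for_next_node(struct shared_state *s, int last_node) {
    int n;
    while ((n = atomic_load_explicit(&s->node_n, memory_order_acquire)) == last_node) {
        // back off here: ggml_pause() and eventually sched_yield(), as above
    }
    return n;
}
```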
A simple demo.
Found this: in Rust, the spin hint on ARM is a bit more complicated than on x86-64: https://github.com/rust-lang/stdarch/blob/a4cde4a0ec9dcd05247eff5b2a23d695e46a6850/crates/core_arch/src/arm_shared/hints.rs#L62
I'm sorry, I don't understand what you're trying to illustrate. This task-allocator.c file is not the one we're working on. And this Rust thing has nothing of interest here. Regarding "dmb ish": while "dmb" is a data memory barrier, here it's used in conjunction with the instruction stream and has the effect of only flushing the queue of decoded instructions (think of self-modifying code; it's more or less equivalent to the old jmp $+2 we used to do on x86 several decades ago, except that the equivalent
Interesting, it seems that different systems require different code to achieve the best token generation performance, at least with the current threading approach. The diff with the retries seems to be most beneficial when all cores (can) run at the same speed. On the AMD 7950X3D the cores always have different frequencies by design. The data is a bit noisy, so I made 20 runs per variant and sorted the data.
But there is more: running with just 6 threads is faster than running with 32. @wtarreau maybe you can try to reproduce this effect on your system? Repeat your test with just 10 threads and a short prompt to save time when testing the token generation, like "Write a poem consisting of three paragraphs".
On some machines (e.g. a 24-core AMD 74F3) I already found that halving the threads yielded better results by spending way less time waiting on the threads, but that was about 2 months ago, before the changes to threading; the perf top outputs were horrible. I have not retried on that machine since, as I got much better performance on an 8-core Xeon at a higher frequency (that's classic with thread contention). I wanted to invest more time in this when I have some available, but for now I can't. Just a point: in your case, if the time increases with successive runs, it certainly means your CPU has exhausted the time allowed at the maximum frequency bins to stay under the power envelope, and is progressively shrinking the frequency. This does have an effect, as with fewer cores running you'll stay longer (even forever) at the max frequency. That's especially true when AVX2 is being used. I've rerun a few tests for you (good idea for the poem as a short prompt BTW, I'm doing that with a fixed seed), with a slight improvement of the retries above (in fact I stole the EBO code from my progressive locks and tailored it for this test), and run them at different thread counts:
By comparison, the previous "optimal" code (the one hammering on sched_yield) at commit ed9a54e showed 156 ms at 30 threads. Please note that this machine has "only" 6 memory channels and that they will still quickly become a bottleneck. Also, in an ideal case we should probably mark these memory areas non-cacheable so that reading from them does not cause any cache eviction. Finally, another point I'd like to study is the amount of time a thread spends spinning, waiting for the last thread to finish its work in ggml_compute_thread(). Couldn't we try to pull in more work so that a ready thread restarts working instead of all of them busy-waiting? Maybe it's not easy (very likely), but if we managed to alternate between two sets of data to compute, it would be possible to have some threads working on one set while the other ones attack the new set, and only when the first set is done would the first ready thread prepare that data again. This way there would be almost no time lost waiting for other threads. But I don't know the data model nor the memory access patterns, so I could be wrong of course.
I just realized I forgot to attach the updated patch. It might not necessarily be optimal on x86 (not tried there), though the original code it's derived from was optimized on a large x86 system, so YMMV.

diff --git a/ggml.c b/ggml.c
index c10877a..537fa6b 100644
--- a/ggml.c
+++ b/ggml.c
@@ -16045,10 +16053,28 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
} else {
// wait for other threads to finish
const int last = node_n;
- do {
- //sched_yield();
+ unsigned int wait_time = 1;
+
+ while (1) {
+ unsigned int loops = wait_time;
+
node_n = atomic_load(&state->shared->node_n);
- } while (node_n == last);
+ if (node_n != last)
+ break;
+
+ if (loops >= 8192) {
+ for (; loops >= 65536; loops -= 32768)
+ sched_yield();
+
+ for (; loops >= 8192; loops -= 256)
+ ggml_pause();
+ }
+
+ for (; loops >= 100; loops -= 100)
+ __asm__("");
+
+ wait_time = ((wait_time + (wait_time >> 1)) + 2) & 0x3ffff;
+ }
}
// check if we should stop
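For readers following the escalation in this version: wait_time grows by roughly 1.5x per failed check and wraps at 18 bits; any budget above 65536 is spent on sched_yield(), the 8192-65536 range on ggml_pause(), and the remainder on empty asm statements. A small standalone program that just replays the patch's formulas shows how quickly it escalates:

```c
// Standalone illustration of the escalation schedule in the patch above.
#include <stdio.h>

int main(void) {
    unsigned int wait_time = 1;
    for (int i = 0; i < 30; i++) {
        unsigned int loops = wait_time, yields = 0, pauses = 0, nops = 0;
        if (loops >= 8192) {
            for (; loops >= 65536; loops -= 32768) yields++;
            for (; loops >= 8192;  loops -= 256)   pauses++;
        }
        for (; loops >= 100; loops -= 100) nops++;
        printf("iter %2d: wait_time=%6u -> %u sched_yield, %3u pause, %4u nop\n",
               i, wait_time, yields, pauses, nops);
        wait_time = ((wait_time + (wait_time >> 1)) + 2) & 0x3ffff; // ~1.5x growth, 18-bit wrap
    }
    return 0;
}
```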
The updated patch performs worse on my system than the simple "Sleep after 50" without _mm_pause. When running with 32 threads it's just a tiny bit slower than the previous variants with and without _mm_pause.
#2026 does that. I think it came up due to the E-Cores where the effect would be the most noticeable.
Yes, that's what happens during prompt processing. During token generation my CPU stays way below the thermal and power limits. It seems my "sorted the data" remark was not clear enough. The original data of the interleaved runs, with pauses between runs and batches, was noisy with no visible pattern. So I sorted the measurements for the graph to provide a better overview, instead of squeezing all 20 measurements into a single candlestick per variant with error bars. In any case, we can see that there can be small performance differences between the approaches depending on the system, yet the removal of sched_yield() doesn't seem to do any good, aside from the M2 Pro case.
OK, thanks for testing! As often, optimizing for the last drop on a specific arch makes things worse on another one.
Good point indeed! That would be the same on big.LITTLE ARM configurations.
Ah, got it now; indeed I didn't understand that you had sorted it. Yeah, that does help to compare, and it makes sense.
Given that it divides the whole performance by 16 on some machines, it's a significant problem. We'd need to figure out a simple, balanced solution that's reasonably good everywhere and from there try to do better with pipelined work.
On my Ryzen 1700X system using 7 threads without sched_yield I get 4 t/s on a 30B 4_0 quant using all layers on the GPU with cuBLAS. This was tested on Windows while troubleshooting a speed regression on Koboldcpp.
Not surprised: same problem, high contention between threads. And Ryzens do not easily give up on their cache lines. Could you try replacing the spin loop with the patch above?
For now we decided to ship a build to our users with it restored, after multiple users complained about a 50% loss in performance on Windows, both AMD and Intel. If this is a very platform-dependent case, would it not make sense to only exclude it for the platforms it benefits and leave it in for everything else? Because unless I missed something, my understanding is that it is only a benefit on Mac.
I suspect that even on Mac it's just one loss versus another loss, and that a better approach would give better results. I don't know how often that part is called; I'd like to have a look at it, maybe this weekend. I suspect that switching to a cond_wait() after several iterations of pause() could further improve the situation, but it's generally tricky to mix these together.
Trying with pthread_cond_wait() unfortunately gives horrible performance. Just calling the pthread API there is expensive, since the loop is fast: I measured 774k calls/s with 80 threads on a 7B model and 338k on a 13B one. But the real problem is how to wake all the threads up; in practice we wake them one at a time, and that's too slow. Even with pthread_cond_broadcast(), since all threads must acquire the mutex when waking up, their wakeup is heavily serialized. I thought about trying with a file descriptor instead (e.g. a blocking recv(MSG_PEEK)), but any syscall I add here makes everything much worse; we don't sleep long enough in this loop. So it looks like the approach with ggml_pause() first and sched_yield() next remains the lightest.
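For reference, the condition-variable variant would look roughly like the sketch below (a paraphrase with simplified names, not the exact code that was tried); the serialization comes from the fact that every waiter woken by pthread_cond_broadcast() must re-acquire the same mutex before it can leave pthread_cond_wait():

```c
// Rough sketch of the condvar approach described above; names are illustrative.
#include <pthread.h>
#include <stdatomic.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    atomic_int      node_n;
} shared_t;

static shared_t shared = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };

// Waiter side: sleeps until node_n moves past 'last'.
static int wait_for_node(shared_t *s, int last) {
    pthread_mutex_lock(&s->mutex);
    while (atomic_load(&s->node_n) == last)
        pthread_cond_wait(&s->cond, &s->mutex);   // sleeps; re-acquires mutex on wakeup
    int n = atomic_load(&s->node_n);
    pthread_mutex_unlock(&s->mutex);
    return n;
}

// Publisher side: advances node_n and wakes everyone.
static void publish_node(shared_t *s, int n) {
    pthread_mutex_lock(&s->mutex);
    atomic_store(&s->node_n, n);
    pthread_cond_broadcast(&s->cond);  // all waiters wake, but leave the wait one by one
    pthread_mutex_unlock(&s->mutex);
}
```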
There is no point in using so many threads. The memory bandwidth is easily saturated on pretty much all hardware with just 8 threads. I would be surprised if you see any performance gain from using more than 8 threads. I don't think this change degrades the performance - show me a specific example. |
It is, and there is thus not much point in using more threads for token generation. Running with 6 threads is 0.15% slower than before the change (Sleep 0 vs None in my graph above). That's indeed not too bad. The situation is different for prompt processing though. Not using all the cores and sticking to 6 or 8 threads for optimal token generation speed means having not even half of the possible prompt processing speed. So, it could be useful to be able to specify a different number of threads for prompt processing and token generation, to achieve the highest possible performance on both. |
In fact the performance grows up to 30-40 threads and then stays flat. However, I can gain significantly in total performance by using more processes in parallel, indicating that the memory bandwidth is not that bad (there are 6 DDR4 channels). Here are the numbers I'm measuring on the same test with 8 to 80 threads:
So the limit sits between 24 and 32 threads. However, if I start several processes in parallel, I can almost double the performance, indicating that up to half of the time in the current situation is probably spent synchronizing threads. I've split the machine into 1..4 groups of 20 threads, 1..3 groups of 26, and 5 groups of 16 to compare:
Here, at 4x20 we're almost twice as fast as the fastest that a single process delivers. I know pretty well that many workloads cannot scale linearly, for various reasons ranging from data dependencies to the cost of distributing the work. I'm not trying to squeeze the last ounce out of it, just to make sure we use the hardware as efficiently as we can. In any case, here, 8 threads is far too low. With sched_yield() removed, I saw the performance totally ruined (divided by 16) due to extreme contention between the threads. I don't reproduce it anymore at the moment, but I know that it depends a lot on where the data is mapped in memory with respect to the various channels. At the very least, a pause instruction is definitely needed to make sure we don't hammer the other cores' caches while they're trying to use them. Also, I don't know how this affects prompt processing; I'm particularly interested in the speed of prompt processing to analyse contents.
And here are some measurements on a large prompt (I'm using the engine to analyze Git commit messages). The table below shows execution times in seconds at 20, 40 and 80 threads. Note that at 80 threads on the master without sched_yield() I had to press Ctrl-C after several minutes, as it was producing something like one word every 3 seconds. (Note: ed9a54e is before the removal of sched_yield, master is 1d16309, after the removal, and pause_only is a single pause instruction in the loop.)
So in this specific use case (which is my target), the 80 threads are almost twice as fast as 20 when avoiding a spinning loop, but are not usable at all anymore in the current master. If you agree I can send you a PR for the two simple patches above, but I'd first like to make sure we agree that the removal of sched_yield() here definitely causes a regression in such use cases.
Actually the test above was run with the machine configured in NUMA mode. In default (unified) mode we have:
What it shows is that supporting 80 threads is absolutely mandatory on this platform, and that the performance is not as bad when NUMA is not involved (or conversely, as I suspected, the loss of sched_yield() really impacts certain memory access patterns). The sched_yield() version is better in every respect here, but the other ones involving a pause are not as good anymore. It gives me a better target to analyze now; I need to figure out how to recover the original performance without degrading other platforms. I've also measured output generation per thread count at commit ed9a54e:
So it actually scales pretty well and does not really collapse past the inflection point, indicating that it's worth making an effort to maintain good thread scalability.
I'd tend to agree.
Maybe we can choose the fastest way at runtime? E.g. MAB (multi-armed bandit optimization). IIRC it was used in ClickHouse. Edit: I found an article, https://clickhouse.com/blog/lz4-compression-in-clickhouse. Scroll down to "Testing on different CPUs", but I suggest reading it from top to bottom.
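A very rough sketch of what such runtime selection could look like (a simple epsilon-greedy chooser rather than a full bandit algorithm; the strategy names and reward signal are made up for illustration):

```c
// Hypothetical epsilon-greedy selection between wait strategies, rewarded by
// per-graph compute time. Purely illustrative; not part of ggml.
#include <stdlib.h>

enum wait_strategy { WAIT_YIELD, WAIT_PAUSE, WAIT_SPIN, WAIT_COUNT };

static double avg_time[WAIT_COUNT];   // running average of graph time per strategy
static int    n_tries[WAIT_COUNT];

static enum wait_strategy pick_strategy(void) {
    if (rand() % 10 == 0)                        // 10% of the time: explore
        return (enum wait_strategy)(rand() % WAIT_COUNT);
    int best = 0;                                // otherwise: lowest average time so far
    for (int i = 1; i < WAIT_COUNT; i++)
        if (n_tries[i] && (!n_tries[best] || avg_time[i] < avg_time[best]))
            best = i;
    return (enum wait_strategy)best;
}

static void record_result(enum wait_strategy s, double graph_time_ms) {
    n_tries[s]++;
    avg_time[s] += (graph_time_ms - avg_time[s]) / n_tries[s]; // incremental mean
}
```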
This issue was closed because it has been inactive for 14 days since being marked as stale.
Platform: macOS / Apple M2 Pro
Currently, the thread finalisation / synchronisation logic in ggml_graph_compute_thread relies on sched_yield() to spin idly while waiting for other threads to complete:
https://github.com/ggerganov/llama.cpp/blob/master/ggml.c#L16045
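For reference, the loop in question looks roughly like this (paraphrased with simplified names from the diffs quoted elsewhere in this thread; the exact code at the link may differ):

```c
// Paraphrase of the wait loop in ggml_graph_compute_thread(); names simplified.
#include <sched.h>
#include <stdatomic.h>

static int wait_for_other_threads(atomic_int *shared_node_n, int node_n) {
    const int last = node_n;
    do {
        sched_yield();                        // yields to the scheduler on every check
        node_n = atomic_load(shared_node_n);  // re-read the shared progress counter
    } while (node_n == last);
    return node_n;
}
```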
The problem with that is that it causes an absolutely gigantic amount of overhead, due to context switching and falling back into the kernel with no known time as to when the thread will come back to execution.
When profiling time on an M2 Pro:
We see that, in terms of time spent, a huge amount of time is wasted in swtch_pri, the system library call for thread yielding, which further switches into kernel mode. Simply commenting out sched_yield():
Results in a massive performance boost:
The new profile looks like this:
In the above interactive Terminal command-line example we see a 43% performance boost. The percentage will differ depending on where you run this from, due to macOS thread QoS (over ssh it's around +25%).
This seems like a gigantic performance issue on the operating systems which do actually yield back to the scheduler.
My solution of just commenting out the yield and looping probably isn't the best for power efficiency; you'd probably want to implement some kind of tight atomic lock / wait across the threads. Up to you what to do here, but anything is better than sched_yield().
Another issue is thread creation: as the threads get created and joined for each layer/operation, the cumulative overhead becomes extremely large:
It's likely we could see another 7% performance boost here by switching the work dispatch architecture to a thread pool.
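A minimal sketch of such a persistent pool (generic pthread code with made-up names, not a drop-in for ggml's dispatch; shutdown handling omitted) might look like this:

```c
// Minimal persistent worker-pool sketch: workers are created once and reused
// for every batch, instead of being created and joined per layer/operation.
#include <pthread.h>
#include <stdatomic.h>

#define N_WORKERS 8

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    void          (*task)(int worker_id, void *arg); // current work item
    void           *arg;
    int             generation;     // bumped for every new batch of work
    atomic_int      n_remaining;    // workers still running the current batch
} pool_t;

static pool_t pool = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

static void *worker_main(void *p) {
    const int id = (int)(long)p;
    int seen = 0;
    for (;;) {
        pthread_mutex_lock(&pool.mutex);
        while (pool.generation == seen)          // sleep until new work arrives
            pthread_cond_wait(&pool.cond, &pool.mutex);
        seen = pool.generation;
        pthread_mutex_unlock(&pool.mutex);

        pool.task(id, pool.arg);                 // run this worker's share
        atomic_fetch_sub(&pool.n_remaining, 1);  // report completion
    }
    return NULL;
}

static void pool_start(void) {
    for (long i = 0; i < N_WORKERS; i++) {
        pthread_t t;
        pthread_create(&t, NULL, worker_main, (void *)i);
        pthread_detach(t);
    }
}

// Dispatch one batch to all workers and wait for completion; the completion
// wait is exactly where the pause/yield discussion above applies.
static void pool_run(void (*task)(int, void *), void *arg) {
    pthread_mutex_lock(&pool.mutex);
    pool.task = task;
    pool.arg  = arg;
    atomic_store(&pool.n_remaining, N_WORKERS);
    pool.generation++;
    pthread_cond_broadcast(&pool.cond);
    pthread_mutex_unlock(&pool.mutex);
    while (atomic_load(&pool.n_remaining) != 0)
        ;                                        // all workers done
}
```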
Thank you for the work on the otherwise great implementation!