caffe + openblas not using cpu to the max #923
Comments
What is your hardware, and how many threads are you letting OpenBLAS create? If this is on a CPU that sports hyperthreading, using more threads than your number of actual full-featured cores will only hold everything up.
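For reference, the thread count can be capped without rebuilding, either via the OPENBLAS_NUM_THREADS environment variable or from the calling program; a minimal sketch (openblas_set_num_threads() is OpenBLAS's own extension API, and the count of 2 assumes the 2-core box discussed below):

    /* Sketch: cap OpenBLAS at one thread per physical core to rule
     * out hyperthread oversubscription. */
    #include <cblas.h>   /* OpenBLAS's cblas.h declares the extension */

    void cap_blas_threads(void) {
        openblas_set_num_threads(2);   /* 2 physical cores on this box */
    }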
This is a normal Broadwell system; 2 cores (so 4 threads). (I don't agree with your characterization of hyperthreading; the CPU can do 2 vector FMAs per cycle, which is rather hard to feed if you don't use hyperthreading.)
Even when running just 2 threads, the 2nd core is never more than 63% utilized.
Somewhere in the AlexNet description: even if you multiply everything out, the data sample is quite small...
Sorry for wasting your time with that suggestion then; it could be that I underestimated the capabilities of Broadwell (and beyond) based on experience with older hardware (also reflected in the OpenBLAS FAQ).
I'll certainly try current git if there are things there that should help. (Does the OpenBLAS team have sufficient access to recent Intel hardware, or is that something I should try to help fix?)
"Machine list" in the wiki has Haswell as the most modern in Xianyi's pool, but I suspect the primary limiting factor to be spare time of experienced developers (Note that I am just a user trying to provide some pointers in the absence of the true wizards) |
Just tried top of git (437c7d6): same CPU idle-time pattern and same overall performance.
Sticking a debugger on it: so it's not a huge matrix by itself.
(There are also some larger ones, fwiw: M = 384, N = 169, K = 2304, and M = 10, N = 4096, K = 9216.)
I tried to repeat what you did. There is a problem with the caffe build system: it links against the file name libopenblas.so and picks up a system-wide, older OpenBLAS if that is in the library paths, even if you insist in the configuration on using your library. My source was here: Approximate result on a 12-core Ivy Bridge; I will try to draw up better summaries at the next run.
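One way to verify which OpenBLAS the dynamic linker actually resolved (a hedged sketch; openblas_get_config() is a real OpenBLAS extension that reports the build configuration, so a stale system-wide copy shows up immediately):

    /* Sketch: print the configuration string of whichever
     * libopenblas.so got loaded. */
    #include <stdio.h>

    extern char *openblas_get_config(void);  /* OpenBLAS extension */

    int main(void) {
        printf("loaded OpenBLAS: %s\n", openblas_get_config());
        return 0;
    }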
(Interesting that you use PTS; I added the OpenBLAS support to that.) The problem you are seeing is that the OpenBLAS worker threads will just spin the CPU if there is no useful work for them to do, so the time command will show them as busy. (Even if one does not like usleep(), yield() is sort of a worst-case thing to do; asm("rep nop") would be nicer for both the kernel and a hyperthreading scenario.)

--- OpenBLAS-0.2.16/common.h~ 2016-03-15 18:49:10.000000000 +0000
#ifndef YIELDING /***
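The quoted patch is truncated above; what follows is only a sketch of the kind of change under discussion (the YIELDING macro and its sched_yield() default are real OpenBLAS code in common.h of that era, but this replacement is one of the alternatives floated in this thread, not the actual patch):

    /* common.h (sketch): replace the sched_yield() wait hint with a
     * "rep nop" (pause) spin, cheaper for the kernel and friendlier
     * to a hyperthread sibling sharing the core. */
    #ifndef YIELDING
    /* old default: #define YIELDING sched_yield() */
    #define YIELDING  __asm__ __volatile__ ("rep; nop" ::: "memory")
    #endif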
#731 already had some criticism of using sched_yield() for busy waiting instead of synchronization primitives, but nobody wanted to go there, and only the stopgap solution of avoiding multithreading for small matrix sizes was implemented.
From my previous dayjob as a kernel person, I can say that sched_yield() is just about the worst possible thing you can do. Its semantics are horrific and extremely expensive to implement. It really is better to either just do asm("rep nop") or a short udelay like the above.
Hi, on some platforms sched_yield() is very slow. On those, OpenBLAS uses
    asm volatile ("nop;nop;nop;nop;nop;nop;nop;nop;\n");
Please see common.h of OpenBLAS. Best regards
On x86, that should really be "pause", also known as "rep nop" (or a series of those).
Speedup is missing for me (superspin is gone). Variants tested:

    YIELDING nanosleep(0,0) // 100 loops
    YIELDING {}
    YIELDING pthread_yield()
    YIELDING asm volatile ("nop;nop;nop;nop;nop;nop;nop;nop;\n");
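A self-contained harness for timing such variants in isolation might look like this (a hedged sketch, not the benchmark actually run here; the iteration count and the set of variants are illustrative):

    /* Sketch: time candidate YIELDING implementations.
     * Build: cc -O2 yield_bench.c -o yield_bench (Linux/x86) */
    #include <stdio.h>
    #include <time.h>
    #include <sched.h>

    #define ITERS 1000000L

    static void bench(void (*yielder)(void), const char *name) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (long i = 0; i < ITERS; i++) yielder();
        clock_gettime(CLOCK_MONOTONIC, &b);
        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("%-12s %8.1f ns/iter\n", name, ns / ITERS);
    }

    static void y_sched(void)  { sched_yield(); }
    static void y_pause(void)  { __asm__ __volatile__ ("rep; nop"); }
    static void y_nops(void)   { __asm__ __volatile__ ("nop;nop;nop;nop;nop;nop;nop;nop;"); }
    static void y_nsleep(void) { struct timespec ts = {0, 0}; nanosleep(&ts, NULL); }

    int main(void) {
        bench(y_sched,  "sched_yield");
        bench(y_pause,  "pause");
        bench(y_nops,   "8x nop");
        bench(y_nsleep, "nanosleep0");
        return 0;
    }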
Interesting; if it's easy to test, can you also try usleep(1) and "rep nop" instead of just "nop"?
Tomorrow around the same time...
To keep the chat alive: did you have a problem with libopenblas.so.0 in the ld.so paths? Did you manage to spin up more than one core?
On the ld.so paths: I'm the OS builder, so I just update the .so in the OS to avoid the problem. More cores do spin up, that works; they just are only 50% used.
I have a feeling it is similar to my usleep(0) measurement.
Ermm, I think everything assembly-based is wrong: threads may move between CPUs all the time. usleep is the winner unless nanosleep shows some wonder.
The patch seems to show "no harm" at least.
I tested on an AMD Piledriver with 2 MB of L2 cache per core... I got mostly ones (which almost matches 12 cores on Intel), with something like 1 in 20 bigger numbers...
Ah, I tested on an Intel CPU with 256 KB L2 but a huge L3.
The problem is transfers between core-exclusive caches, which may run at the speed of the slower, higher-level shared cache.
I managed to get time on a bigger system (28 cores / 56 threads), and your patch improved performance by over 30%.
That is reasonable CPU usage...
YIELDING nanosleep(0,1) - high system CPU consumption
BTW, playing with your patch some, it may have an off-by-one/rounding error; the divide will round down, so if there's 1.9 cores' worth of work, it will still schedule 1 core, not 2.
Yes, that's intentional; there is no performance in cache-to-cache transfer.
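For illustration (a hypothetical sketch, not the actual OpenBLAS scheduling code): the behavior described is plain truncating integer division when turning a work estimate into a thread count, versus the ceiling division that would round up:

    /* Sketch: truncating vs. ceiling division for a thread count.
     * With "1.9 cores" worth of work, truncation schedules 1 thread
     * (the intentional choice described above); ceiling would give 2. */
    int threads_floor(long work, long work_per_core) {
        return (int)(work / work_per_core);
    }

    int threads_ceil(long work, long work_per_core) {
        return (int)((work + work_per_core - 1) / work_per_core);
    }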
So I see you went from kicking sched_yield to fiddling with the GEMM threading magic. :)
usleep(0) and usleep(1) measured best. There is no documented drawback to 0, though it looks/feels suspicious.
usleep is probably Linux-specific(?), and if I followed the thread correctly, it seems one would want to use asm("pause") when NO_AFFINITY is not set (so as to allow rescheduling of threads using the same core)?
Per the CONFORMING TO section of its man page, usleep is pretty universal (and on Linux it internally maps to nanosleep).
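A sketch of that equivalence (usleep(3) is widely available but deprecated in POSIX in favor of nanosleep(2); the wrapper below is illustrative, not glibc's actual implementation):

    /* Sketch: usleep(usec) expressed via nanosleep, roughly what
     * glibc does internally on Linux. */
    #include <time.h>

    int my_usleep(unsigned int usec) {   /* hypothetical helper */
        struct timespec ts = {
            .tv_sec  = usec / 1000000,
            .tv_nsec = (long)(usec % 1000000) * 1000
        };
        return nanosleep(&ts, NULL);
    }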
What's weird is that most sleep functions account for enormous kernel-time consumption.
Going into and out of the scheduler is not a cheap operation for the OS. yield has that behavior AND nasty semantics ("run anyone but me", which is expensive to implement).
When using the combination of caffe + openblas, it turns out that all CPU cores except cpu 0 are only used about 50% (AlexNet benchmark); the rest is idle time.
(This is sort of hidden, if you're not careful, due to the polling nature of the default OpenBLAS config, but it shows up if you fix that, or if you use the OpenMP setup. In the default polling setup you can see it by how many cycles are spent in the polling loop.)
profile below:
63.28% libopenblas_haswellp-r0.2.16.so [.] sgemm_kernel
8.78% libopenblas_haswellp-r0.2.16.so [.] sgemm_incopy
6.10% libcaffe.so.1.0.0-rc3 [.] caffe::im2col_cpu
4.55% libopenblas_haswellp-r0.2.16.so [.] sgemm_itcopy
2.59% libopenblas_haswellp-r0.2.16.so [.] sgemm_oncopy
1.32% libcaffe.so.1.0.0-rc3 [.] caffe::PoolingLayer::Forward_cpu
1.27% libgomp.so.1.0.0 [.] gomp_barrier_wait_end
1.22% libm-2.23.so [.] __powf
(and then a tail of very small things)
I'm kind of lost in terms of how to diagnose this further; increasing the number of worker threads does nothing: each thread does proportionally less work, but the total CPU use does not change significantly.
It feels (subjectively) like something is either not feeding the worker threads enough, or there is a sequential step going on that is not threaded but takes about 50% of wallclock time.
Is this a known behavior? Any suggestions on how I can diagnose this further?
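One way to take caffe out of the picture entirely (a hedged sketch; the dimensions are the GEMM sizes reported from the debugger earlier in this thread) is to time a bare sgemm of the same shape and watch per-core utilization:

    /* Sketch: isolate OpenBLAS by benchmarking an sgemm of the shape
     * seen in the AlexNet run (M=384, N=169, K=2304).
     * Build: cc -O2 gemm_bench.c -lopenblas */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    int main(void) {
        const int M = 384, N = 169, K = 2304, reps = 1000;
        float *A = calloc((size_t)M * K, sizeof *A);
        float *B = calloc((size_t)K * N, sizeof *B);
        float *C = calloc((size_t)M * N, sizeof *C);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%.1f GFLOP/s over %d reps\n",
               2.0 * M * N * K * reps / s / 1e9, reps);
        free(A); free(B); free(C);
        return 0;
    }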