caffe + openblas not using cpu to the max #923

Closed
fenrus75 opened this issue Jul 13, 2016 · 37 comments

Comments

@fenrus75
Contributor

When using the combination of caffe + OpenBLAS, it turns out that all CPU cores except cpu 0 are only about 50% used (AlexNet benchmark); the rest is idle time.

(This is sort of hidden, if you're not careful, by the polling nature of the default OpenBLAS config, but it shows up if you fix that, or if you use the OpenMP setup. In the default polling setup you can see it by how many cycles are spent in the polling loop.)

profile below:

63.28% libopenblas_haswellp-r0.2.16.so [.] sgemm_kernel
8.78% libopenblas_haswellp-r0.2.16.so [.] sgemm_incopy
6.10% libcaffe.so.1.0.0-rc3 [.] caffe::im2col_cpu
4.55% libopenblas_haswellp-r0.2.16.so [.] sgemm_itcopy
2.59% libopenblas_haswellp-r0.2.16.so [.] sgemm_oncopy
1.32% libcaffe.so.1.0.0-rc3 [.] caffe::PoolingLayer::Forward_cpu
1.27% libgomp.so.1.0.0 [.] gomp_barrier_wait_end
1.22% libm-2.23.so [.] __powf

(and then a tail of very small things)

I'm kind of lost in terms of how to diagnose this further; increasing the number of worker threads does nothing; each thread does proportionally less work, but the total CPU use does not change significantly.

It feels (subjectively) like something is either not feeding the worker threads enough, or there is a sequential step going on that is not threaded but takes about 50% of the wallclock time.
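
(If it is the latter, Amdahl's law would roughly fit the numbers: with a serial fraction s ≈ 0.5, speedup is capped at 1/(s + (1-s)/n) < 2 regardless of the thread count n, which would match "more threads, no more total CPU use".)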

Is this a known behavior? Any suggestions on how I can diagnose this further?

@martin-frbg
Collaborator

martin-frbg commented Jul 13, 2016

What is your hardware, and how many threads are you letting OpenBLAS create? If this is on a CPU that sports hyperthreading, using more threads than your number of actual full-featured cores will only hold everything up.

@fenrus75
Contributor Author

This is a normal Broadwell system; 2 cores (so 4 threads).

(I'll not agree with your characterization of hyperthreading; the CPU can do 2 vector FMAs per cycle... that's slightly hard to feed if you don't use hyperthreading.)

@fenrus75
Contributor Author

Even with running just 2 threads, the 2nd core is not more than 63% utilized.

@brada4
Contributor

brada4 commented Jul 13, 2016

Somewhere in the AlexNet description:

shape {
      dim: 128
      dim: 3
      dim: 224
      dim: 224
    }

Even if you multiply all the dimensions together, the data sample is quite small...

@martin-frbg
Collaborator

Sorry for wasting your time with that suggestion then; could be I underestimated the capabilities of Broadwell (and beyond) based on experience with older hardware (also reflected in the OpenBLAS FAQ).
Unfortunately most of the multithreading code dates back to K. Goto's original implementation, i.e. it reflects the state of commonly available hardware ten years ago. Some issues have been addressed recently though, and I see now that you are still using 0.2.16, so it might be worthwhile to try the current development branch to get incremental improvements like jeromerobert's fix for #873.

@fenrus75
Contributor Author

I'll certainly try current git if there are things there that should help.

(does the openblas team have sufficient access to recent Intel hardware, or is that something that I should try to help fix?)

@martin-frbg
Collaborator

martin-frbg commented Jul 13, 2016

"Machine list" in the wiki has Haswell as the most modern in Xianyi's pool, but I suspect the primary limiting factor to be spare time of experienced developers (Note that I am just a user trying to provide some pointers in the absence of the true wizards)

@fenrus75
Contributor Author

Just tried the top of git (437c7d6); same CPU idle time pattern and same overall performance.

@fenrus75
Contributor Author

Sticking a debugger on it:

M = 128
N = 729
K = 1200

so it's not a huge matrix by itself.
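
(Back-of-the-envelope, assuming single precision: that GEMM is about 2*M*N*K = 2*128*729*1200 ≈ 0.22 GFLOP, i.e. roughly a couple of milliseconds of work for a single Haswell-class core near peak, so any fixed per-call thread wakeup/synchronization cost is a noticeable fraction of the runtime.)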

@fenrus75
Contributor Author

fenrus75 commented Jul 14, 2016

(There are also some larger ones, fwiw: M = 384, N = 169, K = 2304; or M = 10, N = 4096, K = 9216; and a pile of 4k x 4k.)

@brada4
Contributor

brada4 commented Jul 14, 2016

I tried to repeat what you did.
Mine runs on 10 (real) cores just fine.

There is a problem with the caffe build system: it links against the file name libopenblas.so and picks up a system-wide (older) OpenBLAS if that is in the library paths, even if you insist in the configuration on using your own library.
You might need to LD_PRELOAD the shiny fresh build of libopenblas.so.

My source was here:
https://openbenchmarking.org/innhold/9fea910968ca453c51d63fee105de22dd328efb9

Approximate result on a 12-core Ivy Bridge; I will try to draw better summaries on the next run:

LD_PRELOAD=../OpenBLAS-0.2.18/libopenblas.so /usr/bin/time ./build-cpu/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt -iterations 1000
blah blah blah
I0714 11:23:47.323763 81367 caffe.cpp:381] Average Forward-Backward: 886.298 ms.
I0714 11:23:47.323801 81367 caffe.cpp:383] Total Time: 886298 ms.
I0714 11:23:47.323838 81367 caffe.cpp:384] *** Benchmark ends ***
3531.13user 5938.36system 14:47.39elapsed 1067%CPU (0avgtext+0avgdata 647472maxresident)k
9344inputs+0outputs (5major+6846581minor)pagefaults 0swaps

@fenrus75
Contributor Author

(Interesting that you use PTS; I added the OpenBLAS support to that.)

The problem you are seeing is that OpenBLAS worker threads will just spin the CPU if there is no useful work for them to do, so the time command shows them as busy.
You can either look at "perf top" during the run and see that the bulk of the time is spent in this spinner, or you can apply a small patch (below) that turns the spin into a short sleep, which then accounts the idle time properly.

(Even if one does not like usleep(), sched_yield() is sort of a worst-case thing to do; asm("rep nop") would be nicer for both the kernel and a hyperthreading scenario.)

--- OpenBLAS-0.2.16/common.h~ 2016-03-15 18:49:10.000000000 +0000
+++ OpenBLAS-0.2.16/common.h 2016-06-12 16:12:41.864694505 +0000
@@ -349,7 +349,7 @@
*/

#ifndef YIELDING
-#define YIELDING sched_yield()
+#define YIELDING usleep(1)
#endif

/***
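
(For reference, an untested sketch of what the "rep nop" variant mentioned above might look like in common.h:)

#ifndef YIELDING
/* "rep; nop" is the x86 "pause" instruction: no syscall, and it hints
   a sibling hyperthread that this thread is just spinning */
#define YIELDING __asm__ __volatile__ ("rep; nop" ::: "memory")
#endif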

@martin-frbg
Collaborator

#731 already had some criticism of using sched_yield() with busy waiting instead of synchronization primitives, but nobody wanted to go there and only the stopgap solution of avoiding multithreading for small matrix sizes was implemented.

@fenrus75
Contributor Author

From my previous dayjob as a kernel person, I can say that sched_yield() is just about the worst possible thing you can do. Its semantics are horrific and extremely expensive to implement. It really is better to just do asm("rep nop") or a short usleep like above.
Either of those gave me roughly a 4% gain on the pts/caffe load.

@wernsaar
Contributor

Hi,

On some platforms, sched_yield() is very slow.
Also, usleep has some overhead and its delay is fixed.
I found that the best solution on the tested platforms is:

asm volatile ("nop;nop;nop;nop;nop;nop;nop;nop;\n");

Please see common.h of OpenBLAS.

Best regards
Werner


@fenrus75
Contributor Author

On x86, that should really be "pause", also known as "rep nop" (or a series of those); the difference between a "pure" nop and pause is that pause acts as a hyperthreading "yield" inside the CPU.
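
(A compiler-intrinsic spelling of the same instruction, sketched here as a hypothetical drop-in for the YIELDING define:)

#include <immintrin.h>

#ifndef YIELDING
/* _mm_pause() compiles to the x86 "pause" (i.e. "rep nop") instruction */
#define YIELDING _mm_pause()
#endif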

@brada4
Contributor

brada4 commented Jul 14, 2016

The speedup is missing for me (the superspin is gone, though):
YIELDING usleep(0)

I0714 17:47:31.689795 51224 caffe.cpp:381] Average Forward-Backward: 1005.42 ms.
I0714 17:47:31.689807 51224 caffe.cpp:383] Total Time: 1.00542e+06 ms.
I0714 17:47:31.689821 51224 caffe.cpp:384] *** Benchmark ends ***
3401.25user 39.70system 16:46.37elapsed 341%CPU (0avgtext+0avgdata 647480maxresident)k
0inputs+8outputs (0major+4319038minor)pagefaults 0swaps

YIELDING nanosleep(0,0) // 100 loops (iteration count reduced to 100 from here on)

I0714 17:59:03.957892  3577 caffe.cpp:381] Average Forward-Backward: 895.11 ms.
I0714 17:59:03.957908  3577 caffe.cpp:383] Total Time: 89511 ms.
I0714 17:59:03.957923  3577 caffe.cpp:384] *** Benchmark ends ***
264.77user 693.81system 1:30.45elapsed 1059%CPU (0avgtext+0avgdata 647432maxresident)k
0inputs+0outputs (0major+1866537minor)pagefaults 0swaps

YIELDING {}

I0714 18:04:35.693410 47561 caffe.cpp:381] Average Forward-Backward: 879.67 ms.
I0714 18:04:35.693423 47561 caffe.cpp:383] Total Time: 87967 ms.
I0714 18:04:35.693435 47561 caffe.cpp:384] *** Benchmark ends ***
938.65user 7.66system 1:28.85elapsed 1065%CPU (0avgtext+0avgdata 647480maxresident)k
0inputs+0outputs (0major+3071869minor)pagefaults 0swaps

YIELDING pthread_yield()

I0714 18:12:05.685952 95161 caffe.cpp:381] Average Forward-Backward: 884.89 ms.
I0714 18:12:05.685964 95161 caffe.cpp:383] Total Time: 88489 ms.
I0714 18:12:05.685976 95161 caffe.cpp:384] *** Benchmark ends ***
343.83user 608.28system 1:29.35elapsed 1065%CPU (0avgtext+0avgdata 647484maxresident)k
0inputs+0outputs (0major+3299519minor)pagefaults 0swaps

YIELDING asm volatile ("nop;nop;nop;nop;nop;nop;nop;nop;\n");

I0714 18:18:04.776260 39520 caffe.cpp:381] Average Forward-Backward: 889.02 ms.
I0714 18:18:04.776273 39520 caffe.cpp:383] Total Time: 88902 ms.
I0714 18:18:04.776284 39520 caffe.cpp:384] *** Benchmark ends ***
949.99user 3.86system 1:29.75elapsed 1062%CPU (0avgtext+0avgdata 647428maxresident)k
0inputs+8outputs (0major+1552663minor)pagefaults 0swaps

@fenrus75
Contributor Author

Interesting; if it's easy to test, can you also try usleep(1) and "rep nop" instead of just "nop"?

@brada4
Contributor

brada4 commented Jul 14, 2016

Tomorrow around the same time...

  • pause; (rumor has it that it costs 10-100 cycles, but it certainly reconfigures hyperthreading for the better)
  • usleep(1)
  • nanosleep(0,1)

(A rough standalone harness for timing these in isolation is sketched below.)
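
(My own sketch, not anything from the OpenBLAS tree; numbers will vary with kernel, timer slack and CPU:)

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Times ITERS calls of each candidate YIELDING primitive in isolation */
#define ITERS 100000

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    struct timespec t0, t1, ts;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        __asm__ __volatile__ ("rep; nop");              /* pause */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("pause:          %10.2f ms\n", elapsed_ms(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        usleep(1);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("usleep(1):      %10.2f ms\n", elapsed_ms(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++) {
        ts.tv_sec = 0;
        ts.tv_nsec = 1;
        nanosleep(&ts, NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("nanosleep(0,1): %10.2f ms\n", elapsed_ms(t0, t1));

    return 0;
}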

To keep the chat alive: did you have a problem with libopenblas.so.0 in the ld.so paths? Did you manage to spin up more than one core?

@fenrus75
Contributor Author

On the ld.so paths: I'm the OS builder... so I just update the .so in the OS to avoid the problem.

More cores do spin up, that works; they are just only ~50% used.

@brada4
Contributor

brada4 commented Jul 14, 2016

I have a feeling it is similar to my usleep(0) measurement.

@brada4
Contributor

brada4 commented Jul 14, 2016

Ermm, I think all the assembly options are wrong: threads may move between CPUs all the time, so usleep is the winner unless nanosleep shows some wonder.
@martin-frbg don't count it as a code contribution. @xianyi I think this ultimately fixes all the caffe/OpenCV woes.
@fenrus75 this is a low-quality patch that improves sgemm (which you use) and dgemm (which you don't), and adds some misfortune to the complex-number workalikes (which means it would be nice if you don't use it beyond LD_PRELOAD-ing it just for caffe).
Basic theory: prevent one core's L2 from being accessed by another core. The correct threshold there has to be a bit bigger than L2_SIZE, but I did not pursue perfection. Basically it spends significantly less CPU time by using fewer than all of the CPUs, but completes in the same time as the all-or-one default calibration.
It was generated by subversion; patch < gemm-patch.txt probably applies it, then just run make and it will rebuild the changed pieces. (It is the same story as with small data samples that cannot utilize too many cores.)
You can uncomment the diagnostic printf and rebuild again to see how many cores each BLAS call uses (the gist of the heuristic is sketched below).
gemm-patch.txt
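
(The gist of the heuristic, as a hypothetical sketch; the real change lives in the GEMM driver and all names here are made up:)

#include <stddef.h>

#define L2_SIZE (256 * 1024)   /* per-core L2 in bytes; CPU-specific */

/* Cap the GEMM thread count so each thread's share of the working set
   roughly fills its own L2, instead of splitting a small problem
   across every core and paying cache-to-cache transfer costs. */
static int gemm_thread_hint(size_t m, size_t n, size_t k,
                            size_t elem_size, int max_threads)
{
    size_t bytes = (m * k + k * n + m * n) * elem_size;  /* A + B + C */
    int hint = (int)(bytes / L2_SIZE);                   /* rounds down */

    if (hint < 1)
        hint = 1;
    if (hint > max_threads)
        hint = max_threads;
    return hint;
}

(For the M = 128, N = 729, K = 1200 sgemm above, this works out to roughly 4.5 MB of working set, i.e. a hint of about 17 threads with a 256K L2, which would match the later observation that the hint is usually larger than 4.)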

@fenrus75
Contributor Author

fenrus75 commented Jul 14, 2016

The patch seems to show "no harm" at least.
It also does not seem to help, though, so I am now compiling with the print added. It turns out that 4 cores are not enough to generally trigger your code; most of the time the "hint" is larger than 4. For 10 to 22 cores it will make a difference.
(If nothing else, with fewer cores active, the clock frequency of the active ones will be higher.)

@brada4
Contributor

brada4 commented Jul 14, 2016

I tested on an AMD Piledriver with 2 MB of L2 cache per core... I got mostly ones (which almost matches 12 cores on Intel), with maybe 1 in 20 being bigger numbers...

@fenrus75
Contributor Author

Ah, I tested on an Intel CPU with a 256K L2 but a huge L3.

@brada4
Contributor

brada4 commented Jul 14, 2016

The problem is transfers between core-exclusive caches, which may run only at the speed of the slower, higher-level shared cache.

@fenrus75
Contributor Author

I managed to get time on a bigger system (28 cores/56 threads) and your patch improved performance by over 30%.
The system is still VERY idle ... but I'll take 30% ;-)

@brada4
Contributor

brada4 commented Jul 15, 2016

That is reasonable CPU usage...

@brada4
Contributor

brada4 commented Jul 15, 2016

YIELDING nanosleep(0,1): high system CPU consumption.
usleep(1): almost the same as usleep(0), but a bit higher kernel time.

I0715 10:04:34.495175 68074 caffe.cpp:381] Average Forward-Backward: 979.99 ms.
I0715 10:04:34.495189 68074 caffe.cpp:383] Total Time: 97999 ms.
I0715 10:04:34.495203 68074 caffe.cpp:384] *** Benchmark ends ***
343.27user 11.23system 1:38.83elapsed 358%CPU (0avgtext+0avgdata 647480maxresident)k
0inputs+8outputs (0major+1530046minor)pagefaults 0swaps

@fenrus75
Contributor Author

BTW, playing with your patch some: it may have an off-by-one/rounding error. The divide will round down, so if there is 1.9 cores' worth of work, it will still schedule 1 core, not 2...
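
(Continuing the hypothetical sketch from above, the rounding-up variant would be:)

/* Round up instead of down, so 1.9 cores' worth of work gets 2 threads */
int hint = (int)((bytes + L2_SIZE - 1) / L2_SIZE);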

@brada4
Contributor

brada4 commented Jul 17, 2016

Yes, that's intentional; there is no performance to be gained in cache-to-cache transfers.

@martin-frbg
Collaborator

So I see you went from kicking sched_yield to fiddling with the GEMM threading magic. :)
Do I get it that the conclusion for the first issue was to go with
#define YIELDING asm volatile("pause;\n");
on Intel processors, or was usleep(0) the winner (where usleep is available)? Or is there no conclusion yet, except that the current sched_yield() is the worst option of them all?

@brada4
Contributor

brada4 commented Aug 2, 2016

usleep(0) and usleep(1) measured best. There is no documented drawback for 0, though it looks/feels suspicious.
I think the asm sleeps must be #ifdef'd on affinity... ermmm...

@martin-frbg
Collaborator

usleep is probably Linux-specific (?), and if I followed the thread correctly it seems one would want to use asm(pause) when NO_AFFINITY is not set (so as to allow rescheduling of threads using the same core)?

@fenrus75
Contributor Author

fenrus75 commented Aug 3, 2016

From the usleep man page:

    CONFORMING TO
        4.3BSD, POSIX.1-2001. POSIX.1-2001 declares this function obsolete;
        use nanosleep(2) instead. POSIX.1-2008 removes the specification
        of usleep().

usleep is pretty universal (and on Linux it internally maps to nanosleep).
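
(If POSIX conformance is a worry, a nanosleep-based define would be a straightforward sketch, untested:)

#include <time.h>

#ifndef YIELDING
/* Same idea as usleep(1), expressed with the non-obsolete nanosleep(2) */
#define YIELDING do { \
        struct timespec ts_ = { 0, 1000 }; /* 1000 ns, ~usleep(1) */ \
        nanosleep(&ts_, NULL); \
    } while (0)
#endif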


@brada4
Contributor

brada4 commented Aug 4, 2016

What is weird is that most of the sleep functions account for enormous kernel time consumption.

@fenrus75
Contributor Author

fenrus75 commented Aug 4, 2016

Going into and out of the scheduler is not a cheap operation for the OS (saying that as a kernel dev).

yield has that behavior AND nasty semantics ("run anyone but me", which requires some global locks), which makes it doubly expensive.

