caffe + openblas not using cpu to the max #923

Closed
fenrus75 opened this issue Jul 13, 2016 · 37 comments

Comments

@fenrus75
Contributor

When using the combination of caffe + OpenBLAS, it turns out that all CPU cores except cpu 0 are only about 50% used (AlexNet benchmark); the rest is idle time.

(This is sort of hidden, if you're not careful, by the polling nature of the default OpenBLAS config, but it shows up if you fix that, or if you use the OpenMP setup. In the default polling setup you can see it by how many cycles are spent in the polling loop.)

profile below:

63.28% libopenblas_haswellp-r0.2.16.so [.] sgemm_kernel
8.78% libopenblas_haswellp-r0.2.16.so [.] sgemm_incopy
6.10% libcaffe.so.1.0.0-rc3 [.] caffe::im2col_cpu
4.55% libopenblas_haswellp-r0.2.16.so [.] sgemm_itcopy
2.59% libopenblas_haswellp-r0.2.16.so [.] sgemm_oncopy
1.32% libcaffe.so.1.0.0-rc3 [.] caffe::PoolingLayer::Forward_cpu
1.27% libgomp.so.1.0.0 [.] gomp_barrier_wait_end
1.22% libm-2.23.so [.] __powf

(and then a tail of very small things)

I'm kind of lost in terms of how to diagnose this further; increasing the number of worker threads does nothing; each thread does proportionally less work, but the total CPU use does not change significantly.

It feels (subjectively) like something is either not feeding the worker threads enough, or there is a sequential step going on that is not threaded but takes about 50% of the wallclock time.
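
(If it is the latter, Amdahl's law would roughly fit the numbers: with a serial fraction s ≈ 0.5, speedup is capped at 1/(s + (1-s)/n) < 2 regardless of the thread count n, which would match "more threads, no more total CPU use".)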

Is this a known behavior? Any suggestions on how I can diagnose this further?

@martin-frbg
Collaborator

martin-frbg commented Jul 13, 2016

What is your hardware, and how many threads are you letting OpenBLAS create? If this is on a CPU that sports hyperthreading, using more threads than your number of actual full-featured cores will only hold everything up.

@fenrus75
Contributor Author

This is a normal Broadwell system; 2 cores (so 4 threads).

(I'll not agree with your characterization of hyperthreading; the CPU can do 2 vector FMAs per cycle... that's slightly hard to feed if you don't use hyperthreading.)

@fenrus75
Contributor Author

Even with running just 2 threads, the 2nd core is not more than 63% utilized.

@brada4
Contributor

brada4 commented Jul 13, 2016

Somewhere in the AlexNet description:

shape {
      dim: 128
      dim: 3
      dim: 224
      dim: 224
    }

Even if you multiply all the dimensions together, the data sample is quite small...

@martin-frbg
Collaborator

Sorry for wasting your time with that suggestion then; could be I underestimated the capabilities of Broadwell (and beyond) based on experience with older hardware (also reflected in the OpenBLAS FAQ).
Unfortunately most of the multithreading code dates back to K. Goto's original implementation, i.e. it reflects the state of commonly available hardware ten years ago. Some issues have been addressed recently though, and I see now that you are still using 0.2.16, so it might be worthwhile to try the current development branch to get incremental improvements like jeromerobert's fix for #873.

@fenrus75
Contributor Author

I'll certainly try current git if there are things there that should help.

(does the openblas team have sufficient access to recent Intel hardware, or is that something that I should try to help fix?)

@martin-frbg
Collaborator

martin-frbg commented Jul 13, 2016

"Machine list" in the wiki has Haswell as the most modern in Xianyi's pool, but I suspect the primary limiting factor to be spare time of experienced developers (Note that I am just a user trying to provide some pointers in the absence of the true wizards)

@fenrus75
Contributor Author

Just tried the top of git (437c7d6); same CPU idle time pattern and same overall performance.

@fenrus75
Contributor Author

Sticking a debugger on it:

M = 128
N = 729
K = 1200

so it's not a huge matrix by itself.
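
(Back-of-the-envelope, assuming single precision: that GEMM is about 2*M*N*K = 2*128*729*1200 ≈ 0.22 GFLOP, i.e. roughly a couple of milliseconds of work for a single Haswell-class core near peak, so any fixed per-call thread wakeup/synchronization cost is a noticeable fraction of the runtime.)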

@fenrus75
Contributor Author

fenrus75 commented Jul 14, 2016

(There are also some larger ones, fwiw: M = 384, N = 169, K = 2304; or M = 10, N = 4096, K = 9216; and a pile of 4k x 4k.)

@brada4
Contributor

brada4 commented Jul 14, 2016

I tried to repeat what you did.
Mine runs on 10 (real) cores just fine.

There is a problem with the caffe build system: it links against the file name libopenblas.so and picks up a system-wide (older) OpenBLAS if that is in the library paths, even if you insist in the configuration on using your own library.
You might need to LD_PRELOAD the shiny fresh build of libopenblas.so.

My source was here:
https://openbenchmarking.org/innhold/9fea910968ca453c51d63fee105de22dd328efb9

Approximate result on a 12-core Ivy Bridge; I will try to draw better summaries on the next run:

LD_PRELOAD=../OpenBLAS-0.2.18/libopenblas.so /usr/bin/time ./build-cpu/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt -iterations 1000
blah blah blah
I0714 11:23:47.323763 81367 caffe.cpp:381] Average Forward-Backward: 886.298 ms.
I0714 11:23:47.323801 81367 caffe.cpp:383] Total Time: 886298 ms.
I0714 11:23:47.323838 81367 caffe.cpp:384] *** Benchmark ends ***
3531.13user 5938.36system 14:47.39elapsed 1067%CPU (0avgtext+0avgdata 647472maxresident)k
9344inputs+0outputs (5major+6846581minor)pagefaults 0swaps

@fenrus75
Contributor Author

(Interesting that you use PTS; I added the OpenBLAS support to that.)

The problem you are seeing is that OpenBLAS worker threads will just spin the CPU if there is no useful work for them to do, so the time command shows them as busy.
You can either look at "perf top" during the run and see that the bulk of the time is spent in this spinner, or you can apply a small patch (below) that turns the spin into a short sleep, which then accounts the idle time properly.

(Even if one does not like usleep(), sched_yield() is sort of a worst-case thing to do; asm("rep nop") would be nicer for both the kernel and a hyperthreading scenario.)

--- OpenBLAS-0.2.16/common.h~ 2016-03-15 18:49:10.000000000 +0000
+++ OpenBLAS-0.2.16/common.h 2016-06-12 16:12:41.864694505 +0000
@@ -349,7 +349,7 @@
*/

#ifndef YIELDING
-#define YIELDING sched_yield()
+#define YIELDING usleep(1)
#endif

/***
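
(For reference, an untested sketch of what the "rep nop" variant mentioned above might look like in common.h:)

#ifndef YIELDING
/* "rep; nop" is the x86 "pause" instruction: no syscall, and it hints
   a sibling hyperthread that this thread is just spinning */
#define YIELDING __asm__ __volatile__ ("rep; nop" ::: "memory")
#endif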

@martin-frbg
Collaborator

#731 already had some criticism of using sched_yield() with busy waiting instead of synchronization primitives, but nobody wanted to go there and only the stopgap solution of avoiding multithreading for small matrix sizes was implemented.

@fenrus75
Contributor Author

From my previous dayjob as a kernel person, I can say that sched_yield() is just about the worst possible thing you can do. Its semantics are horrific and extremely expensive to implement. It really is better to just do asm("rep nop") or a short usleep like above.
Either of those gave me roughly a 4% gain on the pts/caffe load.

@wernsaar
Contributor

Hi,

On some platforms, sched_yield() is very slow.
Also, usleep has some overhead and its delay is fixed.
I found that the best solution on the tested platforms is:

asm volatile ("nop;nop;nop;nop;nop;nop;nop;nop;\n");

Please see common.h of OpenBLAS.

Best regards
Werner


@fenrus75
Contributor Author

On x86, that should really be "pause", also known as "rep nop" (or a series of those); the difference between a "pure" nop and pause is that pause acts as a hyperthreading "yield" inside the CPU.
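
(A compiler-intrinsic spelling of the same instruction, sketched here as a hypothetical drop-in for the YIELDING define:)

#include <immintrin.h>

#ifndef YIELDING
/* _mm_pause() compiles to the x86 "pause" (i.e. "rep nop") instruction */
#define YIELDING _mm_pause()
#endif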

@brada4
Contributor

brada4 commented Jul 14, 2016

The speedup is missing for me (the superspin is gone, though):
YIELDING usleep(0)

I0714 17:47:31.689795 51224 caffe.cpp:381] Average Forward-Backward: 1005.42 ms.
I0714 17:47:31.689807 51224 caffe.cpp:383] Total Time: 1.00542e+06 ms.
I0714 17:47:31.689821 51224 caffe.cpp:384] *** Benchmark ends ***
3401.25user 39.70system 16:46.37elapsed 341%CPU (0avgtext+0avgdata 647480maxresident)k
0inputs+8outputs (0major+4319038minor)pagefaults 0swaps

YIELDING nanosleep(0,0) // 100 loops (iteration count reduced to 100 from here on)

I0714 17:59:03.957892  3577 caffe.cpp:381] Average Forward-Backward: 895.11 ms.
I0714 17:59:03.957908  3577 caffe.cpp:383] Total Time: 89511 ms.
I0714 17:59:03.957923  3577 caffe.cpp:384] *** Benchmark ends ***
264.77user 693.81system 1:30.45elapsed 1059%CPU (0avgtext+0avgdata 647432maxresident)k
0inputs+0outputs (0major+1866537minor)pagefaults 0swaps

YIELDING {}

I0714 18:04:35.693410 47561 caffe.cpp:381] Average Forward-Backward: 879.67 ms.
I0714 18:04:35.693423 47561 caffe.cpp:383] Total Time: 87967 ms.
I0714 18:04:35.693435 47561 caffe.cpp:384] *** Benchmark ends ***
938.65user 7.66system 1:28.85elapsed 1065%CPU (0avgtext+0avgdata 647480maxresident)k
0inputs+0outputs (0major+3071869minor)pagefaults 0swaps

YIELDING pthread_yield()

I0714 18:12:05.685952 95161 caffe.cpp:381] Average Forward-Backward: 884.89 ms.
I0714 18:12:05.685964 95161 caffe.cpp:383] Total Time: 88489 ms.
I0714 18:12:05.685976 95161 caffe.cpp:384] *** Benchmark ends ***
343.83user 608.28system 1:29.35elapsed 1065%CPU (0avgtext+0avgdata 647484maxresident)k
0inputs+0outputs (0major+3299519minor)pagefaults 0swaps

YIELDING asm volatile ("nop;nop;nop;nop;nop;nop;nop;nop;\n");

I0714 18:18:04.776260 39520 caffe.cpp:381] Average Forward-Backward: 889.02 ms.
I0714 18:18:04.776273 39520 caffe.cpp:383] Total Time: 88902 ms.
I0714 18:18:04.776284 39520 caffe.cpp:384] *** Benchmark ends ***
949.99user 3.86system 1:29.75elapsed 1062%CPU (0avgtext+0avgdata 647428maxresident)k
0inputs+8outputs (0major+1552663minor)pagefaults 0swaps

@fenrus75
Contributor Author

Interesting; if it's easy to test, can you also try usleep(1) and "rep nop" instead of just "nop"?

@brada4
Contributor

brada4 commented Jul 14, 2016

Tomorrow around the same time...

  • pause; (rumor has it that it costs 10-100 cycles, but it certainly reconfigures hyperthreading for the better)
  • usleep(1)
  • nanosleep(0,1)

(A rough standalone harness for timing these in isolation is sketched below.)
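
(My own sketch, not anything from the OpenBLAS tree; numbers will vary with kernel, timer slack and CPU:)

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Times ITERS calls of each candidate YIELDING primitive in isolation */
#define ITERS 100000

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    struct timespec t0, t1, ts;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        __asm__ __volatile__ ("rep; nop");              /* pause */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("pause:          %10.2f ms\n", elapsed_ms(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        usleep(1);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("usleep(1):      %10.2f ms\n", elapsed_ms(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++) {
        ts.tv_sec = 0;
        ts.tv_nsec = 1;
        nanosleep(&ts, NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("nanosleep(0,1): %10.2f ms\n", elapsed_ms(t0, t1));

    return 0;
}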

To keep the chat alive: did you have a problem with libopenblas.so.0 in the ld.so paths? Did you manage to spin up more than one core?

@fenrus75
Contributor Author

On the ld.so paths: I'm the OS builder... so I just update the .so in the OS to avoid the problem.

More cores do spin up, that works; they are just only ~50% used.

@brada4
Contributor

brada4 commented Jul 14, 2016

I have a feeling it is similar to my usleep(0) measurement.

@brada4
Contributor

brada4 commented Jul 14, 2016

Ermm, I think all the assembly options are wrong: threads may move between CPUs all the time, so usleep is the winner unless nanosleep shows some wonder.
@martin-frbg don't count it as a code contribution. @xianyi I think this ultimately fixes all the caffe/OpenCV woes.
@fenrus75 this is a low-quality patch that improves sgemm (which you use) and dgemm (which you don't), and adds some misfortune to the complex-number workalikes (which means it would be nice if you don't use it beyond LD_PRELOAD-ing it just for caffe).
Basic theory: prevent one core's L2 from being accessed by another core. The correct threshold there has to be a bit bigger than L2_SIZE, but I did not pursue perfection. Basically it spends significantly less CPU time by using fewer than all of the CPUs, but completes in the same time as the all-or-one default calibration.
It was generated by subversion; patch < gemm-patch.txt probably applies it, then just run make and it will rebuild the changed pieces. (It is the same story as with small data samples that cannot utilize too many cores.)
You can uncomment the diagnostic printf and rebuild again to see how many cores each BLAS call uses (the gist of the heuristic is sketched below).
gemm-patch.txt
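
(The gist of the heuristic, as a hypothetical sketch; the real change lives in the GEMM driver and all names here are made up:)

#include <stddef.h>

#define L2_SIZE (256 * 1024)   /* per-core L2 in bytes; CPU-specific */

/* Cap the GEMM thread count so each thread's share of the working set
   roughly fills its own L2, instead of splitting a small problem
   across every core and paying cache-to-cache transfer costs. */
static int gemm_thread_hint(size_t m, size_t n, size_t k,
                            size_t elem_size, int max_threads)
{
    size_t bytes = (m * k + k * n + m * n) * elem_size;  /* A + B + C */
    int hint = (int)(bytes / L2_SIZE);                   /* rounds down */

    if (hint < 1)
        hint = 1;
    if (hint > max_threads)
        hint = max_threads;
    return hint;
}

(For the M = 128, N = 729, K = 1200 sgemm above, this works out to roughly 4.5 MB of working set, i.e. a hint of about 17 threads with a 256K L2, which would match the later observation that the hint is usually larger than 4.)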

@fenrus75
Contributor Author

fenrus75 commented Jul 14, 2016

The patch seems to show "no harm" at least.
It also does not seem to help, though, so I am now compiling with the print added. It turns out that 4 cores are not enough to generally trigger your code; most of the time the "hint" is larger than 4. For 10 to 22 cores it will make a difference.
(If nothing else, with fewer cores active, the clock frequency of the active ones will be higher.)

@brada4
Contributor

brada4 commented Jul 14, 2016

I tested on an AMD Piledriver with 2 MB of L2 cache per core... I got mostly ones (which almost matches 12 cores on Intel), with maybe 1 in 20 being bigger numbers...

@fenrus75
Contributor Author

Ah, I tested on an Intel CPU with a 256K L2 but a huge L3.

@brada4
Contributor

brada4 commented Jul 14, 2016

The problem is transfers between core-exclusive caches, which may run only at the speed of the slower, higher-level shared cache.

@fenrus75
Contributor Author

I managed to get time on a bigger system (28 cores/56 threads) and your patch improved performance by over 30%.
The system is still VERY idle ... but I'll take 30% ;-)

@brada4
Contributor

brada4 commented Jul 15, 2016

That is reasonable CPU usage...

@brada4
Contributor

brada4 commented Jul 15, 2016

YIELDING nanosleep(0,1): high system CPU consumption.
usleep(1): almost the same as usleep(0), but a bit higher kernel time.

I0715 10:04:34.495175 68074 caffe.cpp:381] Average Forward-Backward: 979.99 ms.
I0715 10:04:34.495189 68074 caffe.cpp:383] Total Time: 97999 ms.
I0715 10:04:34.495203 68074 caffe.cpp:384] *** Benchmark ends ***
343.27user 11.23system 1:38.83elapsed 358%CPU (0avgtext+0avgdata 647480maxresident)k
0inputs+8outputs (0major+1530046minor)pagefaults 0swaps

@fenrus75
Contributor Author

BTW, playing with your patch some: it may have an off-by-one/rounding error. The divide will round down, so if there is 1.9 cores' worth of work, it will still schedule 1 core, not 2...
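
(Continuing the hypothetical sketch from above, the rounding-up variant would be:)

/* Round up instead of down, so 1.9 cores' worth of work gets 2 threads */
int hint = (int)((bytes + L2_SIZE - 1) / L2_SIZE);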

@brada4
Contributor

brada4 commented Jul 17, 2016

Yes, that's intentional; there is no performance to be gained in cache-to-cache transfers.

@martin-frbg
Collaborator

So I see you went from kicking sched_yield to fiddling with the GEMM threading magic. :)
Do I get it that the conclusion for the first issue was to go with
#define YIELDING asm volatile("pause;\n");
on Intel processors, or was usleep(0) the winner (where usleep is available)? Or is there no conclusion yet, except that the current sched_yield() is the worst option of them all?

@brada4
Contributor

brada4 commented Aug 2, 2016

usleep(0) and usleep(1) measured best. There is no documented drawback for 0, though it looks/feels suspicious.
I think the asm sleeps must be #ifdef'd on affinity... ermmm...

@martin-frbg
Collaborator

usleep is probably Linux-specific (?), and if I followed the thread correctly it seems one would want to use asm(pause) when NO_AFFINITY is not set (so as to allow rescheduling of threads using the same core)?

@fenrus75
Contributor Author

fenrus75 commented Aug 3, 2016

From the usleep man page:

    CONFORMING TO
        4.3BSD, POSIX.1-2001. POSIX.1-2001 declares this function obsolete;
        use nanosleep(2) instead. POSIX.1-2008 removes the specification
        of usleep().

usleep is pretty universal (and on Linux it internally maps to nanosleep).
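
(If POSIX conformance is a worry, a nanosleep-based define would be a straightforward sketch, untested:)

#include <time.h>

#ifndef YIELDING
/* Same idea as usleep(1), expressed with the non-obsolete nanosleep(2) */
#define YIELDING do { \
        struct timespec ts_ = { 0, 1000 }; /* 1000 ns, ~usleep(1) */ \
        nanosleep(&ts_, NULL); \
    } while (0)
#endif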


@brada4
Contributor

brada4 commented Aug 4, 2016

What is weird is that most of the sleep functions account for enormous kernel time consumption.

@fenrus75
Contributor Author

fenrus75 commented Aug 4, 2016

Going into and out of the scheduler is not a cheap operation for the OS (saying that as a kernel dev).

yield has that behavior AND nasty semantics ("run anyone but me", which requires some global locks), which makes it doubly expensive.

