
openblas threads slow for small matrices #3965

Closed
guolas opened this issue Aug 6, 2013 · 28 comments
Labels
performance Must go faster upstream The issue is with an upstream dependency, e.g. LLVM


@guolas

guolas commented Aug 6, 2013

Hi,

I have just downloaded the latest version available on GitHub, and I have compiled it on two different systems:

  • Mac OS X, Intel Core i5 @ 1.7 GHz, with 4 GB of memory.
  • Linux, Intel Core i7-3960X @ 3.30 GHz, with 32 GB of memory.

On both systems I have compiled the following version:

Version 0.2.0-prerelease+3005
Commit 0d39c66 2013-08-06 17:44:39 UTC

After executing the same Julia code, the execution times are as follows:

  • ~100s on the Mac.
  • ~135s on the Linux.

Does that really make any sense?

I mean, the Linux computer is way more powerful, with 12 cores, each running at almost double the clock speed. And while it is executing, I can see using top that all the CPUs are busy on the Linux machine, and still it takes longer to finish.

I have tried this several times, and it consistently yields the same execution times.

Thank you for any comment on this.

Regards,

Juanjo.

@Keno
Member

Keno commented Aug 6, 2013

Without knowing the code you're running, we can't possibly make an assessment.

@guolas
Author

guolas commented Aug 6, 2013

The code I am using can be found in:

julia_test

in case that helps in understanding what happens.

@pao
Member

pao commented Aug 6, 2013

@staticfloat
Member

First, I would try using the profiler to figure out which lines of code are sucking up most of the CPU time. It's possible there's some operation that is just taking longer on Linux than OSX for some strange reason, and the profiler will help track that down.

If that doesn't solve the mystery, could you post the output of versioninfo() from both of your systems? It's possible we're comparing apples and oranges here: the BLAS implementations could make a significant difference, as we've seen in some (admittedly linear algebra-heavy) tests. In any case, that should show up in the profiler as well, since it will show the BLAS/LAPACK functions taking up all the CPU time.

@guolas
Author

guolas commented Aug 7, 2013

I can provide the versioninfo() output now. For the Mac (which runs faster):

julia> versioninfo()
Julia Version 0.2.0-prerelease+3005
Commit 0d39c66 2013-08-06 17:44:39 UTC
Platform Info:
  System: Darwin (x86_64-apple-darwin12.4.0)
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

For the Linux:

julia> versioninfo()
Julia Version 0.2.0-prerelease+3005
Commit 0d39c66 2013-08-06 17:44:39 UTC
Platform Info:
  System: Linux (x86_64-redhat-linux)
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

I guess this does not help much, since the versions are the same on both systems. I will post the profiling information once I have it.

Thank you.

@ViralBShah
Member

Yes, knowing what the profiler says will be useful. For example, is it a case of openblas tuning, or is it a julia compiler issue? We certainly need to chase this one down.

@JeffBezanson
Member

It would be interesting to try this with export OPENBLAS_NUM_THREADS=1. I get:

[jeff ~/src/julia]$ env|grep BLAS
OPENBLAS_NUM_THREADS=1
[jeff ~/src/julia]$ ./julia i3966.jl
elapsed time: 58.775533131 seconds
[jeff ~/src/julia]$ export OPENBLAS_NUM_THREADS=2
[jeff ~/src/julia]$ ./julia i3966.jl
elapsed time: 77.535673898 seconds
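The experiment above can be scripted so that each run gets the thread count pinned in its environment before the BLAS library loads. A minimal sketch in Python (illustrative only; the child command here is a stand-in for `./julia i3966.jl` from the transcript):

```python
# Time the same workload under different OPENBLAS_NUM_THREADS settings.
# The env var must be set before the child process loads OpenBLAS, so we
# pass it via the child's environment rather than setting it afterwards.
import os
import subprocess
import sys
import time

def timed_run(num_threads, cmd):
    """Run cmd with OPENBLAS_NUM_THREADS pinned; return wall-clock seconds."""
    env = dict(os.environ, OPENBLAS_NUM_THREADS=str(num_threads))
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start

# Stand-in workload: a child process that just reads the variable back.
# In the thread above this would be ["./julia", "i3966.jl"].
cmd = [sys.executable, "-c",
       "import os; print(os.environ['OPENBLAS_NUM_THREADS'])"]

for n in (1, 2):
    print(f"{n} thread(s): {timed_run(n, cmd):.3f}s")
```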

@xianyi

@xianyi

xianyi commented Aug 7, 2013

Does it call GEMM or other BLAS functions? Could you give me a list?

@JeffBezanson
Member

I believe it is mostly dgemm, zgemm, and whatever zgesdd calls.

@xianyi

xianyi commented Aug 7, 2013

I see. It is an old issue with the switch-over between the single-threaded and multi-threaded code paths.

@JeffBezanson
Member

More info: most of the time is spent in many zgesdd calls on 18x30 matrices. It can be reproduced by:

  @time for i=1:10000; svd(rand(18,30),false); end

@staticfloat
Member

@guolas, can you tell how many threads are being started up in each case? Perhaps via htop or something....

@ViralBShah
Member

@xianyi Why not use multithreading only for problems bigger than something like 200x200? We can run some experiments to find these thresholds.

@guolas
Author

guolas commented Aug 7, 2013

I have tried limiting the number of threads as @JeffBezanson suggested, and I get this:

For export OPENBLAS_NUM_THREADS=1:

  • Mac: elapsed time: 64.820198286 seconds
  • Linux: elapsed time: 59.656547279 seconds

For export OPENBLAS_NUM_THREADS=2:

  • Mac: elapsed time: 101.788748919 seconds
  • Linux: elapsed time: 83.597288193 seconds

And the time keeps increasing on the Linux machine as I add threads, all the way up to the 12 threads it can use.

When I don't set anything, top reports 1200% CPU, which probably means it is using 12 threads.

@ViralBShah
Member

@blakejohnson In your experiments, how does MKL fare on small problems, compared to openblas?

@blakejohnson
Contributor

I just got around to running the BLAS tests in test/perf/blas on my 32-bit Ubuntu 12.04LTS machine where I have access to MKL. Surprisingly, OpenBLAS is actually faster in all the level-1 tests, and only struggles in small level-2 and level-3 instances.

BLAS results

@blakejohnson
Contributor

Since it is probably relevant, the above results were run on a quad core Intel Xeon E5410.

@xianyi

xianyi commented Aug 19, 2013

Hi @blakejohnson ,

I didn't know GEMV had such a significant performance gap before.
Thank you for the test.

Xianyi

@ViralBShah
Member

Cc @nutsiepully

@blakejohnson
Contributor

This is perhaps the wrong place to post it, but I've also looked at the LAPACK performance of MKL vs OpenBLAS. The general results are all over the map. It is easier to see things when they are separated into two groups. First, tests where MKL does well (the x-axis is performance relative to MKL; plotting results > 1.1):
lapack-results-good

There are also places where MKL does poorly (results < 0.7):
lapack-results-bad

Some comments
These tests make it clear that there is room to optimize the threading thresholds in OpenBLAS, since on small problems single-threaded OpenBLAS is often the fastest. In some cases the margin is quite large, as in schurtest_medium.

Also interesting: there are some tests where OpenBLAS absolutely crushes MKL, such as medium-to-huge LU and QR factorizations. I find this surprising, since I was under the impression that OpenBLAS used a reference implementation of LAPACK, whereas Intel has made attempts to accelerate some of these methods.

The unfortunate thing is that this means there is no universal "best" choice of BLAS/LAPACK library. My work relies heavily on Hermitian eigenvalue problems, where MKL is better, but for other users OpenBLAS may make more sense.

@xianyi

xianyi commented Aug 28, 2013

Thank you for the report.

GotoBLAS/OpenBLAS optimized the LU and Cholesky factorizations.

@ViralBShah
Member

@xianyi How come gemv with openblas is so much slower than MKL? See the first plot in this issue.

@xianyi

xianyi commented Nov 29, 2013

@ViralBShah ,
For gemv, OpenBLAS uses one kernel for all matrix sizes. Evidently, MKL uses different implementations for different sizes.

@JeffBezanson
Member

See update here:
OpenMathLib/OpenBLAS#103 (comment)

Looks like we will have to patch OpenBLAS's Makefile.rule.

@ViralBShah
Member

We do pass GEMM_MULTITHREADING_THRESHOLD=50 now, so perhaps this is no longer an issue. Please reopen if it still is.

@axsk
Contributor

axsk commented Dec 28, 2015

I was experiencing a similar problem: I ran some code first on my PC and later on a cluster, launching 32 Julia instances on 32 cores, and was surprised by how much slower (per process) the cluster was.

Weeks later I discovered this issue, and indeed forcing export OPENBLAS_NUM_THREADS=1 led to a 3x speed-up on the cluster.

Maybe this should be mentioned somewhere in the documentation.

@ViralBShah
Member

I believe it is mentioned, but if not, we certainly should.
