
openblas threads slow for small matrices #3965

Closed
guolas opened this issue Aug 6, 2013 · 28 comments
Labels
performance Must go faster upstream The issue is with an upstream dependency, e.g. LLVM


@guolas

guolas commented Aug 6, 2013

Hi,

I have just downloaded the latest version available on GitHub, and I have compiled it on two different systems:

  • Mac OS X, Intel Core i5 @ 1.7 GHz, with 4 GB of memory.
  • Linux, Intel Core i7-3960X @ 3.30 GHz, with 32 GB of memory.

On both systems I have compiled the following version:

Version 0.2.0-prerelease+3005
Commit 0d39c66 2013-08-06 17:44:39 UTC

After executing the same Julia code, the execution times are as follows:

  • ~100s on the Mac.
  • ~135s on the Linux.

Does that really make any sense?

I mean, the Linux computer is way more powerful, with 12 cores, each running at almost double the clock speed. And while it is executing, I can see using top that all the CPUs are busy on the Linux machine, and still it takes longer to finish.

I have tried this several times, and it consistently yields the same execution times.

Thank you for any comment on this.

Regards,

Juanjo.

@Keno
Member

Keno commented Aug 6, 2013

Without knowing the code you're running, we can't possibly make an assessment.

@guolas
Author

guolas commented Aug 6, 2013

The code I am using can be found in:

julia_test

in case that helps in understanding what happens.

@pao
Member

pao commented Aug 6, 2013

@staticfloat
Member

First, I would try using the profiler to figure out which lines of code are sucking up most of the CPU time. It's possible there's some operation that is just taking longer on Linux than OSX for some strange reason, and the profiler will help track that down.

If that doesn't solve the mystery, could you post the output of versioninfo() from both of your systems? It's possible we're comparing apples and oranges here: the BLAS implementations could make a significant difference, as we've seen in some (admittedly linear algebra-heavy) tests. In any case, that should show up in the profiler as well, since it will show the BLAS/LAPACK functions taking up all the CPU time.

@guolas
Author

guolas commented Aug 7, 2013

I can provide the versioninfo() output now. For the Mac (which runs faster):

julia> versioninfo()
Julia Version 0.2.0-prerelease+3005
Commit 0d39c66 2013-08-06 17:44:39 UTC
Platform Info:
  System: Darwin (x86_64-apple-darwin12.4.0)
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

For the Linux:

julia> versioninfo()
Julia Version 0.2.0-prerelease+3005
Commit 0d39c66 2013-08-06 17:44:39 UTC
Platform Info:
  System: Linux (x86_64-redhat-linux)
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

I guess this does not help much, since the versions are the same on both systems. I will post the profiling information once I have it.

Thank you.

@ViralBShah
Member

Yes, knowing what the profiler says will be useful. For example, is it a case of openblas tuning, or is it a julia compiler issue? We certainly need to chase this one down.

@JeffBezanson
Member

It would be interesting to try this with export OPENBLAS_NUM_THREADS=1. I get:

[jeff ~/src/julia]$ env|grep BLAS
OPENBLAS_NUM_THREADS=1
[jeff ~/src/julia]$ ./julia i3966.jl
elapsed time: 58.775533131 seconds
[jeff ~/src/julia]$ export OPENBLAS_NUM_THREADS=2
[jeff ~/src/julia]$ ./julia i3966.jl
elapsed time: 77.535673898 seconds
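The experiment above can be scripted so that each run gets the thread count pinned in its environment before the BLAS library loads. A minimal sketch in Python (illustrative only; the child command here is a stand-in for `./julia i3966.jl` from the transcript):

```python
# Time the same workload under different OPENBLAS_NUM_THREADS settings.
# The env var must be set before the child process loads OpenBLAS, so we
# pass it via the child's environment rather than setting it afterwards.
import os
import subprocess
import sys
import time

def timed_run(num_threads, cmd):
    """Run cmd with OPENBLAS_NUM_THREADS pinned; return wall-clock seconds."""
    env = dict(os.environ, OPENBLAS_NUM_THREADS=str(num_threads))
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start

# Stand-in workload: a child process that just reads the variable back.
# In the thread above this would be ["./julia", "i3966.jl"].
cmd = [sys.executable, "-c",
       "import os; print(os.environ['OPENBLAS_NUM_THREADS'])"]

for n in (1, 2):
    print(f"{n} thread(s): {timed_run(n, cmd):.3f}s")
```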

@xianyi

@xianyi

xianyi commented Aug 7, 2013

Does it call GEMM or other BLAS functions? Could you give me a list?

@JeffBezanson
Member

I believe it is mostly dgemm, zgemm, and whatever zgesdd calls.

@xianyi

xianyi commented Aug 7, 2013

I see. It is an old issue with the switch-over between the single-threaded and multi-threaded code paths.

@JeffBezanson
Member

More info: most of the time is spent in many zgesdd calls on 18x30 matrices. It can be reproduced by:

  @time for i=1:10000; svd(rand(18,30),false); end

@staticfloat
Member

@guolas, can you tell how many threads are being started up in each case? Perhaps via htop or something....

@ViralBShah
Member

@xianyi Why not use multithreading only for problems bigger than something like 200x200? We can run some experiments to find these thresholds.

@guolas
Author

guolas commented Aug 7, 2013

I have tried limiting the number of threads as @JeffBezanson suggested, and I get this:

For export OPENBLAS_NUM_THREADS=1:

  • Mac: elapsed time: 64.820198286 seconds
  • Linux: elapsed time: 59.656547279 seconds

For export OPENBLAS_NUM_THREADS=2:

  • Mac: elapsed time: 101.788748919 seconds
  • Linux: elapsed time: 83.597288193 seconds

And the time keeps increasing on the Linux machine as I add threads, all the way up to the 12 threads it can use.

When I don't set anything, top reports 1200% CPU, which probably means it is using 12 threads.

@ViralBShah
Member

@blakejohnson In your experiments, how does MKL fare on small problems, compared to openblas?

@blakejohnson
Contributor

I just got around to running the BLAS tests in test/perf/blas on my 32-bit Ubuntu 12.04LTS machine where I have access to MKL. Surprisingly, OpenBLAS is actually faster in all the level-1 tests, and only struggles in small level-2 and level-3 instances.

BLAS results

@blakejohnson
Contributor

Since it is probably relevant, the above results were run on a quad core Intel Xeon E5410.

@xianyi

xianyi commented Aug 19, 2013

Hi @blakejohnson ,

I didn't know GEMV had such a significant performance gap before.
Thank you for the test.

Xianyi

@ViralBShah
Member

Cc @nutsiepully

@blakejohnson
Contributor

This is perhaps the wrong place to post it, but I've also looked at the LAPACK performance of MKL vs OpenBLAS. The general results are all over the map. It is easier to see things when they are separated into two groups. First, tests where MKL does well (the x-axis is performance relative to MKL; plotting results > 1.1):
lapack-results-good

There are also places where MKL does poorly (results < 0.7):
lapack-results-bad

Some comments
These tests make it clear that there is room to optimize the threading thresholds in OpenBLAS, since on small problems single-threaded OpenBLAS is often the fastest. In some cases the margin is quite large, as in schurtest_medium.

Also interesting: there are some tests where OpenBLAS absolutely crushes MKL, such as medium-to-huge LU and QR factorizations. I find this surprising, since I was under the impression that OpenBLAS used a reference implementation of LAPACK, whereas Intel has made attempts to accelerate some of these methods.

The unfortunate thing is that this means there is no universal "best" choice of BLAS/LAPACK library. My work relies heavily on Hermitian eigenvalue problems, where MKL is better, but for other users OpenBLAS may make more sense.

@xianyi

xianyi commented Aug 28, 2013

Thank you for the report.

GotoBLAS/OpenBLAS optimized the LU and Cholesky factorizations.

@ViralBShah
Member

@xianyi How come gemv with openblas is so much slower than MKL? See the first plot in this issue.

@xianyi

xianyi commented Nov 29, 2013

@ViralBShah ,
For gemv, OpenBLAS uses one kernel for all matrix sizes. Evidently, MKL uses different implementations for different sizes.

@JeffBezanson
Member

See update here:
OpenMathLib/OpenBLAS#103 (comment)

Looks like we will have to patch OpenBLAS's Makefile.rule.

@ViralBShah
Member

We do pass GEMM_MULTITHREADING_THRESHOLD=50 now, so perhaps this is no longer an issue. Please reopen if it still is.

@axsk
Contributor

axsk commented Dec 28, 2015

I was experiencing a similar problem: I ran some code first on my PC and later on a cluster, launching 32 Julia instances on 32 cores, and was surprised by how much slower (per process) the cluster was.

Weeks later I discovered this issue, and indeed forcing export OPENBLAS_NUM_THREADS=1 led to a 3x speed-up on the cluster.

Maybe this should be mentioned somewhere in the documentation.

@ViralBShah
Member

I believe it is mentioned, but if not, we certainly should.
