openblas threads slow for small matrices #3965
Comments
Without knowing the code you're running, we can't possibly make an assessment.
The code I am using can be found at the link below, in case that helps in understanding what happens.
Direct link to the source: https://github.com/guolas/julia_test/blob/master/complete.jl
First, I would try using the profiler to figure out which lines of code are sucking up most of the CPU time. It's possible there's some operation that is just taking longer on Linux than OS X for some strange reason, and the profiler will help track that down. If that doesn't solve the mystery, could you post the output of `versioninfo()`?
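As a hedged illustration of that suggestion (not taken from the thread; on current Julia the profiler lives in the `Profile` stdlib, while the 2013-era Julia had it in Base):

```julia
# Minimal profiling sketch for the linked script (complete.jl from the repo above).
using Profile

include("complete.jl")            # first run, so compilation time is excluded
Profile.clear()
@profile include("complete.jl")   # second run, collected by the sampling profiler
Profile.print(format = :flat, sortedby = :count)   # flat listing, sorted by sample count
```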
The versions turn out to be the same on OS X and on the Linux machine, so I guess this does not help much. I will post the profiling information when I get it. Thank you.
Yes, knowing what the profiler says will be useful. For example, is it a case of OpenBLAS tuning, or is it a Julia compiler issue? We certainly need to chase this one down.
It would be interesting to try this with
Does it call GEMM or other BLAS functions? Could you give me a list?
I believe it is mostly dgemm, zgemm, and whatever zgesdd calls.
I see. It is an old single-thread / multi-threading adjustment issue.
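For reference, the routines named above can be exercised directly from Julia through the BLAS/LAPACK wrappers; a hedged sketch, written against current Julia where these live in the LinearAlgebra stdlib:

```julia
using LinearAlgebra

Ad, Bd = randn(18, 18), randn(18, 18)
Az, Bz = randn(ComplexF64, 18, 18), randn(ComplexF64, 18, 18)

BLAS.gemm('N', 'N', 1.0, Ad, Bd)               # dgemm: real matrix-matrix product
BLAS.gemm('N', 'N', 1.0 + 0.0im, Az, Bz)       # zgemm: complex matrix-matrix product
LAPACK.gesdd!('A', randn(ComplexF64, 18, 30))  # zgesdd: complex SVD (overwrites its input)
```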
More info: most of the time is spent in many zgesdd calls of size 18x30. It can be reproduced by timing those calls directly (see the sketch below).
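The exact reproducer isn't shown above; a minimal sketch along those lines, using the 18x30 complex size from the comment (so the SVD dispatches to zgesdd) and timed at a few BLAS thread counts, might look like this:

```julia
using LinearAlgebra

A = randn(ComplexF64, 18, 30)
svd(A)                                  # warm up / compile before timing

for n in (1, 2, 6, 12)
    BLAS.set_num_threads(n)
    t = @elapsed for _ in 1:10_000
        svd(A)                          # dispatches to LAPACK's zgesdd
    end
    println("18x30 svd x 10_000, $n BLAS thread(s): $(round(t, digits = 3)) s")
end
```

On a machine affected by the issue described here, the single-thread run would be expected to come out fastest.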
@guolas, can you tell how many threads are being started up in each case? Perhaps via htop or something similar.
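On the Julia of that era, watching htop was the practical way to check; on recent versions the BLAS thread count can also be queried in-process (hedged sketch, assuming Julia ≥ 1.6 where `BLAS.get_num_threads` exists):

```julia
using LinearAlgebra

println("BLAS threads: ", BLAS.get_num_threads())
println("CPU threads:  ", Sys.CPU_THREADS)
```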
@xianyi Why not use multithreading only for problems bigger than something like 200x200? We could run some experiments to find these thresholds.
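The threshold itself would have to live inside OpenBLAS's dispatch, but as a rough user-level illustration of the idea (the function name and the 200x200 cutoff are just placeholders):

```julia
using LinearAlgebra

# Hypothetical wrapper: use a single BLAS thread for small products, all threads otherwise.
function matmul_with_threshold(A, B; threshold = 200, max_threads = Sys.CPU_THREADS)
    small = maximum(size(A)) < threshold && maximum(size(B)) < threshold
    BLAS.set_num_threads(small ? 1 : max_threads)
    try
        return A * B
    finally
        BLAS.set_num_threads(max_threads)   # restore the default thread count
    end
end
```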
I have tried limiting the number of threads as @JeffBezanson suggested. On the Linux machine the time keeps increasing as I add threads, all the way up to the 12 threads it can use. When I don't set anything, all of the CPUs appear busy in top, so it seems to be using all 12 threads.
@blakejohnson In your experiments, how does MKL fare on small problems compared to OpenBLAS?
Since it is probably relevant: the above results were run on a quad-core Intel Xeon E5410.
Hi @blakejohnson, I didn't know GEMV had such a significant performance gap before. Xianyi
Cc @nutsiepully
Thank you for the report. GotoBLAS/OpenBLAS optimized the LU and Cholesky factorizations.
@xianyi How come
@ViralBShah ,
See update here: it looks like we will have to patch OpenBLAS's Makefile.rule.
We do pass
I was experiencing a similar problem running some code first on my PC and later on a cluster, launching 32 Julia instances on 32 cores. I was surprised how much slower (per process) the cluster was. Weeks later I discovered this issue, and indeed forcing the BLAS thread count down fixed it. Maybe this should be mentioned somewhere in the documentation.
I believe it is mentioned, but if not, we certainly should.
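For anyone hitting the same situation, a hedged sketch of the workaround discussed above (the exact mechanism the commenter used isn't shown):

```julia
# Put this at the top of each worker's script to keep OpenBLAS single-threaded.
# Setting the environment variable before launching Julia, e.g.
#   OPENBLAS_NUM_THREADS=1 julia script.jl
# achieves the same effect.
using LinearAlgebra

BLAS.set_num_threads(1)
```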
Hi,
I have just downloaded the latest version available on GitHub, and I have compiled it on two different systems: a Mac running OS X and a Linux machine.
In both systems I have compiled the same version.
After executing the same Julia code on both, the execution times are consistently worse on the Linux machine.
Does that really make any sense?
I mean, the Linux computer is way more powerful, with 12 cores, each of which has almost double the clock speed. While it is executing, I can see with `top` that all the CPUs are being used on the Linux machine, and still it takes longer to finish. I have tried this several times, and it consistently yields the same execution times.
Thank you for any comment on this.
Regards,
Juanjo.