BLAS performance tests #3566

We should have a set of BLAS performance tests that cover various problems of various sizes with different numbers of threads. We can then compare the performance of different BLAS libraries on different platforms.
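Concretely, a run of such tests might sweep matrix sizes and BLAS thread counts along these lines. This is only a minimal sketch: the sizes, thread counts, and the helper name `sweep_gemm` are placeholders, not an agreed design.

```julia
using LinearAlgebra

# Minimal sketch of a size/thread sweep for double-precision gemm.
# The sizes, thread counts, and function name are illustrative placeholders.
function sweep_gemm(; sizes = (64, 256, 1024), threads = (1, 2, 4, 8))
    for t in threads
        BLAS.set_num_threads(t)   # tell the BLAS backend how many threads to use
        for n in sizes
            A = rand(n, n); B = rand(n, n); C = zeros(n, n)
            mul!(C, A, B)                       # warm up before timing
            secs = @elapsed mul!(C, A, B)
            println("threads=$t n=$n time=$(round(secs; sigdigits=3))s")
        end
    end
end
```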
This should also help us decide on a size threshold: when matrix sizes are below it, we should fall back to a hand-crafted, lightweight implementation.
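As a sketch of what that fallback could look like for a level-1 operation (the cutoff value and the function name `dot_with_fallback` are hypothetical; picking the real cutoff is exactly what these tests would inform):

```julia
using LinearAlgebra

const SMALL_N = 64   # hypothetical cutoff; the benchmarks would pick the real one

# Use a plain loop below the threshold, and BLAS above it.
function dot_with_fallback(x::Vector{Float64}, y::Vector{Float64})
    n = length(x)
    n == length(y) || throw(DimensionMismatch("vectors must have equal length"))
    if n < SMALL_N
        s = 0.0
        @inbounds @simd for i in 1:n
            s += x[i] * y[i]
        end
        return s
    end
    return BLAS.dot(n, x, 1, y, 1)
end
```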
This would be useful to help make tuning decisions in OpenBLAS (OpenMathLib/OpenBLAS#103).
The BLAS performance tests are already showing something interesting. @staticfloat, what is the sysblas on criid? It seems to be at least 50% faster than OpenBLAS on the level-1 tests. At level 2, OpenBLAS is finally faster at the "large" and "huge" sizes of gemv. At level 3, the differences are pretty minimal.
sysblas is the flavor of Julia built against the system-provided BLAS. On a Debian-based system, that means whatever is providing `libblas.so.3`. For the purposes of these benchmarks, sysblas is reference BLAS on Linux and Accelerate on OS X.
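For anyone reproducing this today: recent Julia versions can report which backend a given build actually dispatches to. This uses the current LinearAlgebra API, which postdates this thread.

```julia
using LinearAlgebra

# Reports the BLAS/LAPACK libraries the running session actually calls into
# (available on Julia 1.7+ via libblastrampoline).
println(BLAS.get_config())
```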
I see, so since 'criid' is an OS X machine, I'm looking at OpenBLAS vs. Accelerate.
Yep. Exactly. We definitely want to compare the BLAS implementations on various systems and see how they affect us on all our benchmarks.
I probably should have asked this before throwing a bunch of tests out there, but what is the appropriate design for BLAS performance tests, in terms of how many times to loop an operation? I chose iteration counts such that each test would take >100 ms on my machine, and scaled the counts inversely with problem size to keep the runtimes of the tests within an order of magnitude of each other. It now occurs to me that this requires a little extra care when interpreting the results, because the absolute time differences at the smaller problem sizes are 'levered up' by the large iteration counts. Fortunately, one can still compare different BLAS implementations against each other, because the percentage differences should still accurately convey the performance gaps.
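In code, that scaling might look like the following sketch (the flop budget `base_flops` and the helper name `bench_gemv` are made up for illustration):

```julia
using LinearAlgebra

# Pick an iteration count inversely proportional to the work per call, so
# each test runs for a comparable total wall time regardless of problem size.
function bench_gemv(n::Int; base_flops = 2_000_000_000)  # made-up budget
    A = rand(n, n); x = rand(n); y = zeros(n)
    iters = max(1, round(Int, base_flops / (2n^2)))  # gemv costs about 2n^2 flops
    mul!(y, A, x)                                    # warm-up call
    secs = @elapsed for _ in 1:iters
        mul!(y, A, x)
    end
    return (total = secs, per_call = secs / iters, iters = iters)
end
```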
That was what I was going for to start with. This is also good enough to detect regressions over time.
Indeed. I've also seen BLAS tests where the size of the problem is divided out, so that the units being reported are no longer seconds but rather FLOPS (floating-point operations per second). I'm not sure we have a problem here yet, though. I still have yet to track down all the segfaults and bugs in codespeed that are preventing everything from working perfectly. :)
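That normalization is a one-liner for gemv, using the standard ~2n^2 operation count (the helper name `gemv_gflops` is illustrative):

```julia
# gemv on an n×n matrix performs roughly 2n^2 floating-point operations;
# dividing that by the per-call time gives a size-independent rate.
gemv_gflops(n, secs_per_call) = 2n^2 / secs_per_call / 1e9

gemv_gflops(1000, 2.5e-4)   # ≈ 8.0 GFLOPS
```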
Can/should this be closed? |
Closing this as too general; it has been somewhat addressed by all the benchmarking work in recent times.