Add ZEN support #1133
In the second-to-last paragraph: is it slower with both threads active, or is there just no gain over a single thread?
Performance depends on the number of threads in quite a complex way. Here are detailed timing results for zlinpack (last row, "200"):
OPENBLAS_NUM_THREADS=1 taskset 0x1 ./zlinpack.goto => 10.4 GFLOPS
We see that going above 8 threads starts to use SMT threads and sort of kills performance. An odd number of threads also seems to exhibit strange behavior. Here are the results for slinpack:
OPENBLAS_NUM_THREADS=1 taskset 0x1 ./slinpack.goto => 13.5 GFLOPS
Tests were run on an AMD Ryzen 7 1700 at stock clock speeds (3.0 GHz base, 3.2 GHz all-core boost; I cannot see on Linux whether boost was enabled), on an MSI B350 Tomahawk board with 2x 4GB DDR4 2600 MHz RAM.
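For reference, the `taskset 0x1` prefix in the commands above is a hexadecimal CPU affinity mask in which bit N selects logical CPU N, so `0x1` pins the run to logical CPU 0. A minimal Python sketch of building such masks follows. The helper name `cpus_to_mask` is made up for illustration, and which logical CPUs share a physical core depends on how the kernel enumerates them, so the even/odd layout below is an assumption and not necessarily the layout of this machine.

```python
def cpus_to_mask(cpus):
    """Return a taskset-style hex mask with one bit set per logical CPU index."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU indices must be non-negative")
        mask |= 1 << cpu
    return hex(mask)


# Logical CPU 0 only, as in the single-threaded runs.
single = cpus_to_mask([0])

# Hypothetical layout: if SMT siblings were numbered (0,1), (2,3), ...,
# the even indices would select one logical CPU per physical core.
one_per_core = cpus_to_mask(range(0, 16, 2))

print(single, one_per_core)
```

On a real system the sibling layout should be checked first (for example with `lscpu -e`) before choosing which indices to put in the mask.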
Something like this: the first hyperthread alone vs. both running for a second or 10, i.e. whether there is a gain or a loss from concurrent use of the same core (as you can see, with the Ivy Bridge laptop i3 the result is not a regression):
Ok, I understand now (I purposefully avoided putting threads on the same SMT cores):
By the way, I'm very impressed by your laptop CPU. Fortunately, I've tested with 8 threads on my Ryzen and I get 195343.00 MFlops, so scaling seems to work well. With all 16 threads used, I get 157717.47 MFlops.
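To put the two figures above side by side, here is a quick calculation, sketched in Python. It uses only the MFlops numbers quoted in this comment; the variable names are illustrative.

```python
# Throughput figures quoted above, in MFlops.
mflops_8_threads = 195343.00
mflops_16_threads = 157717.47

# Fraction of the 8-thread throughput that the 16-thread run reaches.
ratio = mflops_16_threads / mflops_8_threads
loss_percent = (1.0 - ratio) * 100.0

print(f"16 threads reach {ratio:.1%} of the 8-thread throughput "
      f"({loss_percent:.1f}% lower)")
```

In other words, using both SMT threads of every core costs roughly a fifth of the throughput in this particular test.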
Indeed, your results back your point that 2 threads per core is a loss.
Here it is
Looks reasonable, most likely absolutely correct.
Thanks for the patch, it looks good to me. I had only held back on committing to give more senior team members a chance to comment.
This patch adds the following features
Zen is currently heavily based on Haswell (Excavator param.h tuning, Haswell kernels). I've tried to tune OpenBLAS for Zen but started to get incorrect result notifications. This patch does not do any Zen-specific tuning.
If you are interested, here is what I have observed by trying to optimize several parameters from param.h using blind brute-force:
Please note that I have tested all the above values without really knowing what they mean. Some of them may not make any sense.
As a remaining problem, OpenBLAS detects 16 cores while my Ryzen CPU has 8 cores and 16 threads. Manually forcing OMP_NUM_THREADS to 8 leads to quite a nice performance boost as the threads stop competing for cache and memory accesses.
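Until the detection is fixed, one possible workaround is to count physical cores yourself and set the thread count from that. The following is a minimal sketch, assuming a Linux system whose `/proc/cpuinfo` exposes `physical id` and `core id` fields. The function name `count_physical_cores` is hypothetical and not part of OpenBLAS.

```python
import os


def count_physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = core_id = None
    # The extra empty string flushes the last record if the text
    # does not end with a blank line.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one logical CPU's record.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
    return len(cores)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            n = count_physical_cores(f.read())
    except OSError:
        n = 0  # Not Linux, or /proc is unavailable.
    if n > 0:
        # Must be set before the process loads OpenBLAS
        # (for example, before importing NumPy).
        os.environ["OPENBLAS_NUM_THREADS"] = str(n)
        print("physical cores:", n)
```

If the fields are missing (as on some architectures), the function returns 0 and the sketch leaves the environment untouched, so the library's own default applies.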
If you want SSH access to a Ryzen 1700 machine (that has a public IP address), we can arrange that.