vnni512 is faster than vnni256 on Xeon w5-2445 despite MHz throttling (downclocking) on AVX-512-heavy code #5757
Describe the issue
Despite the report at #3038, the downclocking penalty does not hold true on all CPUs. That report refers to a question at https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency asked more than 5 years ago.
Today, a Xeon w5-2445 with GCC 14 is faster with vnni512 than with vnni256 (and than all other options, such as avx2, bmi2, etc.). I also tested GCC 13 vs GCC 14 on vnni512: the difference is negligible but statistically present.
Attached are the logs of runs made with https://github.com/hazzl/pyshbench
The "bench" parameter makes Stockfish run a single thread, whereas "speedtest" uses all available threads on the CPU (10 cores, 20 threads in my case), which is why the speed increase is more noticeable with the "speedtest" parameter. Although there were just 20 runs of "speedtest", they took more time combined than the 500 runs of "bench".
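One way to check whether a small difference between two builds is "statistically present" is to compare the per-run nodes-per-second samples with Welch's t statistic. The following is a minimal sketch under that assumption; it is not how pyshbench itself reports results, and the function name `compare_runs` is hypothetical.

```python
import math
import statistics


def compare_runs(base_nps, test_nps):
    """Compare two lists of nodes-per-second samples.

    Returns (speedup_percent, welch_t), where speedup_percent is the
    relative change of the test mean over the base mean and welch_t is
    Welch's t statistic (unequal variances assumed).
    """
    mean_base = statistics.mean(base_nps)
    mean_test = statistics.mean(test_nps)
    diff = mean_test - mean_base

    # Standard error of the difference of the two sample means.
    se = math.sqrt(
        statistics.variance(base_nps) / len(base_nps)
        + statistics.variance(test_nps) / len(test_nps)
    )

    speedup_percent = diff / mean_base * 100.0
    if se == 0:
        # No spread at all: the statistic is undefined, so report a limit.
        welch_t = math.copysign(math.inf, diff) if diff else 0.0
    else:
        welch_t = diff / se
    return speedup_percent, welch_t
```

A large absolute value of the t statistic relative to the number of runs suggests the difference is not just noise, even when the speedup itself is only a fraction of a percent.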
@vondele correctly pointed out at #3038 (comment) that (quote): "The problem is that this frequency behavior will change over time, and presumably the widest vectors will eventually be most efficient."
The likely cause is GCC 14 combined with correct code that the compiler can unroll, implemented by @mstembera here:
32e46fc47 (mstembera 2024-01-08 23:20:23 -0800 231) vec_add_dpbusd_32(acc[k], in0, col0[k]);
Unlike the earlier code, this new code does not depend on previous data.
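To illustrate what "no dependency on previous data" means for the quoted line, here is a scalar sketch of the dot-product-accumulate operation that the VNNI `vpdpbusd` instruction performs per 32-bit lane, applied to an array of accumulators. It is only a model of the semantics, not Stockfish's code; the helper names `dpbusd_lane` and `accumulate_independent` are hypothetical.

```python
def dpbusd_lane(acc, a_bytes, b_bytes):
    """Model one 32-bit lane of vpdpbusd.

    a_bytes holds four unsigned bytes (0..255), b_bytes four signed bytes
    (-128..127). The four products are summed and added to acc, wrapping
    around as a signed 32-bit integer (no saturation).
    """
    total = acc + sum(u * s for u, s in zip(a_bytes, b_bytes))
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total & 0x80000000 else total


def accumulate_independent(acc, in0, cols):
    """Update every accumulator acc[k] from the same input in0.

    Each acc[k] depends only on its own previous value, never on acc[j]
    for j != k, so the iterations are independent of each other. In
    compiled code this is what lets a compiler unroll the loop and keep
    several accumulations in flight at once.
    """
    return [dpbusd_lane(acc[k], in0, cols[k]) for k in range(len(acc))]
```

Python does not benefit from this independence itself; the point is only that the result for each `k` can be computed without waiting for any other `k`.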
The Stockfish script at https://github.com/official-stockfish/Stockfish/blob/master/scripts/get_native_properties.sh seems to deliberately avoid vnni512 unless this target is explicitly specified, as in
make -j profile-build ARCH=x86-64-vnni512
Fishtest does the same at https://github.com/official-stockfish/fishtest/blob/master/worker/games.py#L636 (and line 643).
Attached files:
bench-pyshbench-log.txt
speedtest-pyshbench-log.txt
speedtest-13-vs-14.txt
Expected behavior
vnni512 is used by default on capable processors
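As a rough sketch of the requested behavior, the selection could look like the following. This is not the logic of get_native_properties.sh or of Fishtest: the required flag set, the simplified fallback chain, and the names `parse_flags` and `pick_arch` are all assumptions made for illustration.

```python
# Assumed set of CPU flags needed for the vnni512 target (an assumption,
# not copied from the Stockfish scripts).
REQUIRED_VNNI = {"avx512f", "avx512bw", "avx512dq", "avx512vl", "avx512vnni"}


def parse_flags(cpuinfo_text):
    """Extract the CPU flag set from text in /proc/cpuinfo format."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            # Linux prints e.g. "avx512_vnni"; drop underscores so that
            # names compare uniformly.
            return {flag.replace("_", "") for flag in value.split()}
    return set()


def pick_arch(flags, prefer_512=True):
    """Pick a build target; prefer_512=False models avoiding vnni512."""
    if REQUIRED_VNNI <= flags:
        return "x86-64-vnni512" if prefer_512 else "x86-64-vnni256"
    if "avx2" in flags:
        return "x86-64-avx2"
    return "x86-64"
```

On Linux the input could come from `open("/proc/cpuinfo").read()`. With `prefer_512=False` the sketch reproduces the current behavior described in this issue, and with `prefer_512=True` the expected one.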
Steps to reproduce (vnni256 vs vnni512 on GCC 14)
1. Pass the COMPCXX=g++-14 (or 15, 16, whichever is applicable) parameter to make, e.g.
make -j profile-build ARCH=x86-64-vnni256 COMPCXX=g++-14
2. Use two directories, ~/1/ and ~/2/:
2.1 In the first directory, compile Stockfish with the vnni256 target by running
make -j profile-build ARCH=x86-64-vnni256 COMPCXX=g++-14
then rename the stockfish executable to stockfish-x86-64-vnni256 and copy it to ~/1/stockfish-x86-64-vnni256
2.2 In the second directory, compile Stockfish with the vnni512 target by running
make -j profile-build ARCH=x86-64-vnni512 COMPCXX=g++-14
then rename the stockfish executable to stockfish-x86-64-vnni512 and copy it to ~/2/stockfish-x86-64-vnni512
3. Run
./pyshbench ~/1/stockfish-x86-64-vnni256 ~/2/stockfish-x86-64-vnni512 500 > bench.txt
4. Edit ./pyshbench and replace "bench" with "speedtest", then run
./pyshbench ~/1/stockfish-x86-64-vnni256 ~/2/stockfish-x86-64-vnni512 20 > speedtest.txt
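The command lines used in these steps can be generated rather than typed by hand. The sketch below only builds and prints them and does not run anything. It passes the target as ARCH=, matching the explicit make invocation quoted earlier in this issue, and the helper names `make_command` and `pyshbench_command` are hypothetical.

```python
def make_command(arch, compiler):
    """Build the make invocation for one target and one C++ compiler."""
    return ["make", "-j", "profile-build", f"ARCH={arch}", f"COMPCXX={compiler}"]


def pyshbench_command(base_exe, test_exe, runs):
    """Build the pyshbench invocation comparing two executables."""
    return ["./pyshbench", base_exe, test_exe, str(runs)]


if __name__ == "__main__":
    # Print the build commands for both targets, then the comparison run.
    for arch in ("x86-64-vnni256", "x86-64-vnni512"):
        print(" ".join(make_command(arch, "g++-14")))
    print(" ".join(pyshbench_command(
        "~/1/stockfish-x86-64-vnni256",
        "~/2/stockfish-x86-64-vnni512",
        500,
    )))
```

Returning argument lists rather than strings keeps the helpers usable with `subprocess.run` if the steps are later automated, although paths starting with `~` would first need `os.path.expanduser`.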
Steps to reproduce (vnni512 GCC13 vs vnni512 on GCC 14)
make -j profile-build ARCH=x86-64-vnni512 COMPCXX=g++-13
Anything else?
I ran the tests with Ubuntu 24.04.1 LTS and GCC on Windows Subsystem for Linux (WSL2) under Windows 11.
Operating system
Linux
Stockfish version
dev-20250106-c76c1793
P.S. Thanks to @Disservin for guidance, and for help in finding the links to relevant code.