SuperMUC docu
As far as I understand, the thin nodes run the Sandy Bridge-EP Intel Xeon E5-2680 8C processor, which supports AVX but not fused multiply-add (FMA) instructions. The fat nodes don't support AVX, only SSE4. The login nodes also seem to be E5-2680, and indeed I get illegal-instruction errors when using FMA instructions.
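To check which of these instruction sets a given node actually supports (and avoid the illegal-instruction surprise), one can look at the CPU flags; a minimal sketch, assuming a Linux node exposing /proc/cpuinfo:

# list the relevant instruction-set flags of the first CPU on this node;
# a flag that is not printed is not supported there
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(avx|fma|sse4_1|sse4_2)$'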
I configured the latest bgq_omp branch of my tmLQCD fork with the following options, after module load mkl, for a mixed MPI + OpenMP executable:
configure --enable-mpi --with-mpidimension=4 --with-limedir="$HOME/cu/head/lime-1.3.2/" \
  --disable-sse2 --with-alignment=32 --disable-sse3 --with-lapack="$MKL_LIB" \
  --disable-halfspinor --disable-shmem CC=mpicc CFLAGS="-O3 -xAVX -openmp" F77="ifort"
Here is an example of a LoadLeveler job file:
#!/bin/bash
#
#@ job_type = parallel
#@ class = large
#@ node = 8
#@ total_tasks= 128
#@ island_count=1
### other example
##@ tasks_per_node = 16
##@ island_count = 1,18
#@ wall_clock_limit = 0:15:00
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/cu/head/testrun/
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=you@there.com
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
export MP_SINGLE_THREAD=no
export OMP_NUM_THREADS=2
# Pinning
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 128 ./benchmark
The performance I got from a benchmark run with 128 tasks with 2 threads each on a 24^3x48 lattice (local lattice size 24x6x6x6) is:
- 1193 Mflops per core with communication
- 2443 Mflops per core without communication
With 256 tasks:
- 1294 Mflops per core with communication
- 2473 Mflops per core without communication
With 256 tasks and 4 threads each I get only 15 Mflops, so it is better to use 2 threads per core.
Using --with-alignment=32 and -axAVX, performance is better (on 256 tasks again; the changed compiler flag is sketched after this list):
- 1455 Mflops per core with communication
- 3029 Mflops per core without communication
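A hedged note on the flag change (the assumption is that only the CFLAGS setting of the configure call above differs):

# compiler flags for the -axAVX run; with the Intel compiler, -axAVX keeps a
# generic code path in addition to the AVX one, whereas -xAVX emits AVX-only code
CFLAGS="-O3 -axAVX -openmp"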
Using halfspinor gives again better performance (a configure sketch follows the list below):
- 1813 Mflops per core with communication
- 2926 Mflops per core without communication
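Presumably the halfspinor run only needs --disable-halfspinor dropped from (or --enable-halfspinor passed to) the configure call above; a minimal sketch, assuming the -axAVX flags are kept, untested:

configure --enable-mpi --with-mpidimension=4 --with-limedir="$HOME/cu/head/lime-1.3.2/" \
  --disable-sse2 --with-alignment=32 --disable-sse3 --with-lapack="$MKL_LIB" \
  --enable-halfspinor --disable-shmem CC=mpicc CFLAGS="-O3 -axAVX -openmp" F77="ifort"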
I don't know what this is in % of peak performance right now.
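For a rough reference point, assuming the Sandy Bridge peak of 8 double-precision flops per cycle with AVX and the E5-2680 base clock of 2.7 GHz (about 21.6 Gflops peak per core):

# 1813 Mflops (halfspinor, with communication) against ~21600 Mflops peak per core
echo "scale=5; 1813 / 21600 * 100" | bc   # prints 8.39300, i.e. roughly 8.4 % of peak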
The E5-2680 has 2 hardware SMT threads per core, so this is not surprising.
Does --with-alignment=32 help? (we still need to make all the alignments independent of SSE2/SSE3 being defined)
On the machines in Zeuthen, which are similar in clock speed (2.67 GHz, but: 4 cores, 2 SMT, no AVX), I get over 5000 Mflops (nocomm) per core using the full-spinor code and over 4800 Mflops per core with half-spinor. I cannot look at MPI scaling, but the local volume was the same during this test. The fact that your new half-spinor version is so fast is truly remarkable!
It was only a first shot, so I should also try alignment=32|64, and maybe also the other optimisation options, like no AVX and with AVXXX…
Also, you might have much better luck with 2 tasks per node and 8 or even 16 threads per task (since the nodes have, AFAIK, 2 sockets and each CPU has 8 cores with dual SMT).
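A hedged sketch of the job-file changes for such a layout (2 tasks per node with 8 threads each, again on 8 nodes; untested, everything not shown stays as in the example above):

#@ node = 8
#@ tasks_per_node = 2
# (remaining #@ directives as in the job file above)
export MP_SINGLE_THREAD=no
export OMP_NUM_THREADS=8
# pin each task to 8 cores, one per OpenMP thread
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 16 ./benchmark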
24^3x48 lattice, 8x8x2x2 parallelization, OMP_NUM_THREADS=2:
- --enable-omp --disable-sse3 CC="-axAVX": comm: 1502 Mflops, nocomm: 2733 Mflops
- --enable-omp --enable-sse3 CC="-axAVX": comm: 1562 Mflops, nocomm: 2784 Mflops
- --enable-omp --enable-sse3: comm: 1435 Mflops, nocomm: 2277 Mflops
- --disable-omp --enable-sse3 CC="-axAVX": comm: 1575 Mflops, nocomm: 2824 Mflops