SuperMUC docu
As far as I understand, the thin nodes run the Sandy Bridge-EP Intel Xeon E5-2680 8C processor, which supports AVX but not fused multiply-add (FMA) instructions. The fat nodes don't support AVX, only SSE4. The login nodes also seem to be E5-2680, and indeed I get illegal instruction errors when using FMA instructions.
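A quick way to check which of these instruction sets the node you are logged into actually advertises (plain Linux, nothing SuperMUC-specific):

```bash
# Count how many logical CPUs report each flag; on the Sandy Bridge
# E5-2680 you should see sse4_2 and avx, but no fma.
grep -o -w -e sse4_2 -e avx -e fma /proc/cpuinfo | sort | uniq -c
```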
I configured the latest `bgq_omp` branch of my tmLQCD fork with the following options, after `module load mkl`, for a mixed MPI + OpenMP executable:

```bash
configure --enable-mpi --with-mpidimension=4 --with-limedir="$HOME/cu/head/lime-1.3.2/" --disable-sse2 --with-alignment=32 --disable-sse3 --with-lapack="$MKL_LIB" --disable-halfspinor --disable-shmem CC=mpicc CFLAGS="-O3 -xAVX -openmp" F77="ifort"
```
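For completeness, a sketch of the surrounding build steps (an assumption on my side: the `mkl` module on SuperMUC defines `$MKL_LIB`, which the `--with-lapack` option above relies on):

```bash
module load mkl
echo "$MKL_LIB"   # sanity check: should expand to the MKL link line
# run the configure call shown above, then build:
make -j 8         # parallel build; the job count is arbitrary
```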
Here is an example LoadLeveler job file:
```bash
#!/bin/bash
#
#@ job_type = parallel
#@ class = large
#@ node = 8
#@ total_tasks= 128
#@ island_count=1
### other example
##@ tasks_per_node = 16
##@ island_count = 1,18
#@ wall_clock_limit = 0:15:00
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/cu/head/testrun/
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=you@there.com
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
export MP_SINGLE_THREAD=no
export OMP_NUM_THREADS=2
# Pinning
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 128 ./benchmark
```
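Assuming the script is saved as, say, `job.ll` (the filename is arbitrary), it is submitted and monitored with the standard LoadLeveler commands:

```bash
llsubmit job.ll    # submit the job
llq -u $USER       # list your jobs and their state
llcancel <job_id>  # cancel a job if needed
```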
Pure MPI is expected to perform best on this machine, but it is worth trying a 2x8 hybrid approach with 2 processes per node and 8 threads per process (each node has two processors with 8 cores each).
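A sketch of the corresponding changes to the job file and environment for this 2x8 variant (directive names as in the example above; with 8 nodes this gives 16 MPI tasks):

```bash
#@ node = 8
#@ tasks_per_node = 2                          # one MPI task per socket
export OMP_NUM_THREADS=8                       # 8 OpenMP threads per task
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS  # pin each task to 8 cores
mpiexec -n 16 ./benchmark
```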
From the Zeuthen cluster, which has older CPUs at similar clock rates, one would expect this machine to deliver around 3.5 GFlops per core with communication and around 5 GFlops per core without. It is also expected that the halfspinor version of the code will underperform "node-locally" but outperform the full-spinor code with many MPI processes and non-node-local communication. In a pure MPI approach the halfspinor version should always be faster.
The performance I got from a benchmark run with 128 tasks with 2 threads each on a 24^3x48 lattice (local lattice size 24x6x6x6) is:
- A comment on the local lattice size: the CPU has 20MB of L3 cache and you're running 8 processes per CPU if I understand correctly, so even your gauge field won't fit in the cache (see the rough estimate after the benchmark numbers below). Better to decrease the local lattice size by a factor of 2.
- Also, it is worth testing whether two tasks per node with 8 threads each wouldn't be faster.
- Finally, the OpenMP overhead might be so large on Intel that it makes more sense to simply run two processes per core!
- 1193 Mflops per core with communication
- 2443 Mflops per core without communication
With 256 tasks:
- 1294 Mflops per core with communication
- 2473 Mflops per core without communication
Running 256 tasks with 4 threads each gives only 15 Mflops, so it is better to use 2 threads per core.
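A rough estimate backing up the cache comment above (my numbers, assuming SU(3) links stored as full 3x3 complex double matrices, i.e. 4 * 18 * 8 = 576 bytes per site, halos and spinor fields not counted):

```bash
# Gauge field footprint for the 24x6x6x6 local lattice
echo $(( 24*6*6*6 * 576 ))      # ~2.9 MB per MPI task
echo $(( 24*6*6*6 * 576 * 8 ))  # ~23 MB for 8 tasks per socket
```

Eight tasks per socket thus already need more than the 20MB last-level cache for the gauge field alone, which supports the suggestion to halve the local lattice size.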
Using `--with-alignment=32 -axAVX`, performance is better (on 256 tasks again):
- 1455 Mflops per core with communication
- 3029 Mflops per core without communication
Using `halfspinor` again gives better performance:
- 1813 Mflops per core with communication
- 2926 Mflops per core without communication
I don't know what this is as a percentage of peak performance right now.
The E5-2680 has 2 hardware SMT threads per core, so this is not surprising.
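Regarding the peak-performance question: a rough estimate on my part, assuming the 2.7 GHz base clock of the E5-2680 and the Sandy Bridge AVX peak of 8 double-precision flops per cycle (no FMA), gives about 21.6 GFlops peak per core, so roughly 8% of peak with communication and roughly 14% without.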
Does --with-alignment=32 help? (we still need to make all the alignments independent of SSE2/SSE3 being defined)
On the machines in Zeuthen, which are similar in clock speed (2.67 GHz, but 4 cores, 2 SMT, no AVX), I get over 5000 Mflops (nocomm) per core using the full-spinor code and over 4800 Mflops per core with half-spinor. I cannot look at MPI scaling, but the local volume was the same during this test. The fact that your new half-spinor version is so fast is truly remarkable!
It was only a first shot, so I should also try alignment=32|64. Maybe also the other optimisation options, like no AVX and with AVXXX...
Also, you might have much better luck with 2 tasks per node and 8 or even 16 threads per task (since the nodes have, AFAIK, 2 sockets and each CPU has 8 cores with dual SMT).
For a second round of tests I used the following configuration:

```bash
../../tmLQCD/configure --prefix=/home/hpc/pr63po/lu64qov2/build/hmc_supermuc_mpi/ --enable-mpi --with-mpidimension=4 --enable-gaugecopy --enable-halfspinor --with-alignment=32 --disable-sse2 --enable-sse3 --with-limedir=/home/hlrb2/pr63po/lu64qov2/build/lime_supermuc/install CC=mpicc CFLAGS="-O3 -axAVX" --with-lapack="$MKL_LIB" F77=ifort
```
All runs: 24^3x48 lattice, 8x8x2x2 parallelization, OMP_NUM_THREADS=2.

- `--enable-omp --disable-sse3 CC="-axAVX"`: comm 1502 Mflops, nocomm 2733 Mflops
- `--enable-omp --enable-sse3 CC="-axAVX"`: comm 1562 Mflops, nocomm 2784 Mflops
- `--enable-omp --enable-sse3`: comm 1435 Mflops, nocomm 2277 Mflops
- `--disable-omp --enable-sse3 CC="-axAVX"`: comm 1575 Mflops, nocomm 2824 Mflops