Force memcpy inlining to assignments during pack/unpack of some DDTs #6678
Conversation
When using Intel compilers, this patch enables a big latency improvement when packing/unpacking some kinds of indexed datatypes.

Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>
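To illustrate the idea, here is a minimal sketch in plain C. It is not the actual change to the Open MPI datatype engine; the function names and the alignment assumption are illustrative only.

```c
#include <stdint.h>
#include <string.h>

/* With a plain memcpy() some compilers -- icc in the case reported here --
 * emit a library call for each small element copied during pack/unpack. */
static inline void copy8_memcpy(void *dst, const void *src)
{
    memcpy(dst, src, 8);
}

/* The idea of the patch: express the same fixed-size copy as an assignment
 * so it is always inlined to a single load/store pair.  This assumes dst
 * and src are suitably aligned for an 8-byte access, as buffers holding
 * predefined MPI types usually are. */
static inline void copy8_assign(void *dst, const void *src)
{
    *(uint64_t *)dst = *(const uint64_t *)src;
}
```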
@derbeyn I tried this PR but could not see any performance improvement with either ICC 2019 or gcc 9.1.0.
@ggouaillardet sorry for the delay: I didn't get the notification! For icc 2019, I used the configure settings in the attached file.
Note also that the difference between our settings is that OPAL_CUDA_SUPPORT is defined to 0 in my case. I intentionally didn't want to make any change when OPAL_CUDA_SUPPORT is enabled.
@derbeyn I am not seeing any improvements with this patch :-( I am using #6569 and compared this commit with my hacky patch I posted inline in #6617, using the following benchmark:

subroutine pingpong_usempi(buf, rank)
   use mpi
   implicit none
   integer :: buf(0:1048577)
   integer, intent(in) :: rank
   double precision :: t0, t1
   integer i, ierr

   if (rank.eq.0) then
      do i=0,1048577
         buf(i) = i
      enddo
      call mpi_send(buf(1:1048576:2), 524288, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
      call mpi_recv(buf(1:1048576:2), 524288, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      t0 = MPI_Wtime()
      do i=1,1024
         call mpi_send(buf(1:1048576:2), 524288, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
         call mpi_recv(buf(1:1048576:2), 524288, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      end do
      t1 = MPI_Wtime()
      write (*,*) 'BW fortran = ', 1/(t1-t0), ' GW/s'
   elseif (rank.eq.1) then
      buf = -1
      call mpi_recv(buf(1:1048576:2), 524288, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      call mpi_send(buf(1:1048576:2), 524288, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
      do i=1,1024
         call mpi_recv(buf(1:1048576:2), 524288, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
         call mpi_send(buf(1:1048576:2), 524288, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
      end do
      do i=0,1048577,2
         if (buf(i).ne.-1) write (*,*) 'buf(', i, ') = ', buf(i), ' != -1'
      enddo
      do i=1,1048576,2
         if (buf(i).ne.i) write (*,*) 'buf(', i, ') = ', buf(i), ' != ', i
      enddo
   endif
end subroutine

subroutine pingpong_ddt(buf, rank)
   use mpi
   implicit none
   integer :: buf(0:1048577)
   integer, intent(in) :: rank
   double precision :: t0, t1
   integer :: datatype
   integer i, ierr

   call mpi_type_vector(524288, 1, 2, MPI_INTEGER, datatype, ierr)
   call mpi_type_commit(datatype, ierr)
   if (rank.eq.0) then
      do i=0,1048577
         buf(i) = i
      enddo
      call mpi_send(buf(1), 1, datatype, 1, 0, MPI_COMM_WORLD, ierr)
      call mpi_recv(buf(1), 1, datatype, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      t0 = MPI_Wtime()
      do i=1,1024
         call mpi_send(buf(1), 1, datatype, 1, 0, MPI_COMM_WORLD, ierr)
         call mpi_recv(buf(1), 1, datatype, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      end do
      t1 = MPI_Wtime()
      write (*,*) 'BW ddt = ', 1/(t1-t0), ' GW/s'
   elseif (rank.eq.1) then
      buf = -1
      call mpi_recv(buf(1), 1, datatype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      call mpi_send(buf(1), 1, datatype, 0, 0, MPI_COMM_WORLD, ierr)
      do i=1,1024
         call mpi_recv(buf(1), 1, datatype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
         call mpi_send(buf(1), 1, datatype, 0, 0, MPI_COMM_WORLD, ierr)
      end do
      do i=0,1048577,2
         if (buf(i).ne.-1) write (*,*) 'buf(', i, ') = ', buf(i), ' != -1'
      enddo
      do i=1,1048576,2
         if (buf(i).ne.i) write (*,*) 'buf(', i, ') = ', buf(i), ' != ', i
      enddo
   endif
end subroutine

program pingpong
   use mpi
   implicit none
   integer :: buf(0:1048577)
   integer :: datatype, ierr
   integer type_size
   integer rank

   call mpi_init(ierr)
   call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
   call pingpong_usempi(buf, rank)
   call mpi_barrier(MPI_COMM_WORLD, ierr)
   call pingpong_ddt(buf, rank)
   call mpi_barrier(MPI_COMM_WORLD, ierr)
   call mpi_finalize(ierr)
end program

with this commit

but with my patch

I am using icc 19.0.3.199 and my configure command line is

(I do not have your platform file, but I do not expect this explains the gap.) Can you try running the same benchmark with this PR on top of #6659?
@ggouaillardet did you mean #6695? I think it would be nice to transition all pending DDT improvements onto that branch: it has significantly higher performance than the current datatype engine on all the tests I have run so far, and a lot more potential for additional optimizations.
@ggouaillardet: OK, I'll do that, but note two things, as I said in issue #6677.
So I'll take @bosilca's branch and apply my patch on top of the improvements he has already made.
@bosilca sorry, I did mean #6569, but that is not relevant since the benchmark does not involve
@derbeyn I tried on (patched)
@ggouaillardet, I think I understand why you're not getting the same results as I do: I'm using indexed datatypes, while you are using vectors, so we don't go through the same code path in pack_predefined_data() (ompi/opal/datatype/opal_datatype_pack.h, line 56 in b780667). When using MPI_Type_vector, we follow a different path (ompi/opal/datatype/opal_datatype_pack.h, line 65 in b780667).
I didn't want to be too intrusive and wanted to make the change only for indexed datatypes as a first step. So, when testing on top of George's changes, I'll call my "BASIC_DDT_MEMCPY_SUM()" in both cases.
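As background for the two paths mentioned above, here is a deliberately simplified toy version in C, not the real pack_predefined_data() from opal_datatype_pack.h: one routine packs element by element, the other packs a contiguous block in a single copy. The real code distinguishes more cases, but the point is the same: only the small per-element copies benefit from replacing memcpy() with an inlined assignment.

```c
#include <stddef.h>
#include <string.h>

/* Toy per-element path: one small copy per iteration, so each memcpy()
 * call that is not inlined adds measurable overhead. */
static void pack_per_element(char *packed, const char *user,
                             size_t count, size_t elem_size, ptrdiff_t stride)
{
    for (size_t i = 0; i < count; i++) {
        memcpy(packed, user, elem_size);   /* candidate for the assignment trick */
        packed += elem_size;
        user   += stride;
    }
}

/* Toy contiguous path: one large copy, where a regular memcpy() is fine. */
static void pack_contiguous(char *packed, const char *user, size_t block_bytes)
{
    memcpy(packed, user, block_bytes);
}
```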
OK, I can confirm what I was saying in my previous note. I have 2 reproducers:

============== reference ompi (still on v3.x branch):
[derbeyn@genji0 tests_on_master]$ salloc --exclusive -p SKL-20c_edr-ib2_192gb_2666 -N 2 -n 2 mpirun ./col_matrix_lat_vector.icc_2019 2047 double
(./col_matrix_lat_vector.icc_2019) LATENCY MATRIX 4096 x 4096 doubles (usecs)
[derbeyn@genji0 tests_on_master]$ salloc --exclusive -p SKL-20c_edr-ib2_192gb_2666 -N 2 -n 2 mpirun ./col_matrix_lat_indexed.icc_2019 2047 double
(./col_matrix_lat_indexed.icc_2019) LATENCY MATRIX 4096 x 4096 doubles (usecs)

============== reference ompi + patch (still v3.x branch):
[derbeyn@genji0 tests_on_master]$ type mpicc
[derbeyn@genji0 tests_on_master]$ salloc --exclusive -p SKL-20c_edr-ib2_192gb_2666 -N 2 -n 2 mpirun ./col_matrix_lat_indexed.icc_2019 2047 double
[derbeyn@genji0 tests_on_master]$ salloc --exclusive -p SKL-20c_edr-ib2_192gb_2666 -N 2 -n 2 mpirun ./col_matrix_lat_vector.icc_2019 2047 double

Just tell me if you want the reproducers.
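For readers wondering why the vector and indexed reproducers behave differently, here is a hypothetical sketch (not the actual col_matrix_lat_* source) of describing the same matrix column once as a vector and once as an indexed type; the indexed description is the case this PR targets.

```c
#include <mpi.h>
#include <stdlib.h>

/* Describe one column of an n x n row-major matrix of doubles in two ways. */
static void build_column_types(int n, MPI_Datatype *vec, MPI_Datatype *idx)
{
    /* Vector: n blocks of 1 double with a stride of n elements. */
    MPI_Type_vector(n, 1, n, MPI_DOUBLE, vec);
    MPI_Type_commit(vec);

    /* Indexed: the same layout, but expressed with explicit displacements,
     * which sends pack/unpack down a different code path. */
    int *blocklens = malloc(n * sizeof(int));
    int *displs    = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) {
        blocklens[i] = 1;
        displs[i]    = i * n;
    }
    MPI_Type_indexed(n, blocklens, displs, MPI_DOUBLE, idx);
    MPI_Type_commit(idx);
    free(blocklens);
    free(displs);
}
```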
@derbeyn thanks for the explanations. I would appreciate it if you could share the reproducers.
sure!
This work is based on @derbeyn's patch provided on open-mpi#6678. I reworked it to be more inclusive (it now works with both gcc and icc) and to cover more standard size lengths (4, 8, 16). Signed-off-by: George Bosilca <bosilca@icl.utk.edu> Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>
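A rough sketch of what covering the standard size lengths could look like, using a hypothetical helper rather than the actual macro from this commit: the 4-, 8- and 16-byte cases reduce to the assignment form shown near the top of this conversation, and everything else keeps the generic memcpy() path.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte carrier type for the widest special case. */
typedef struct { uint64_t lo, hi; } copy16_t;

/* Hypothetical dispatch on the element size: the fixed-size cases become
 * assignments that both gcc and icc inline; suitable alignment of the
 * source and destination is assumed. */
static inline void ddt_small_copy(void *dst, const void *src, size_t len)
{
    switch (len) {
    case 4:  *(uint32_t *)dst = *(const uint32_t *)src; break;
    case 8:  *(uint64_t *)dst = *(const uint64_t *)src; break;
    case 16: *(copy16_t *)dst = *(const copy16_t *)src; break;
    default: memcpy(dst, src, len);                      break;
    }
}
```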
Thanks Nadia! I made a simple change (and just realized @bosilca came up with a very similar one) and observed similar performance compared to my much bigger patch.
@bosilca thanks a lot for taking care of that.
Can one of the admins verify this patch?
yes, sure!
For closure, I ran @derbeyn's tests with both the old and the new datatype engine. The results are satisfactory (on an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz).
When using Intel compilers, this patch enables a big latency improvement when packing/unpacking some kinds of indexed datatypes.
Fixes issue #6677
Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>