
Force memcpy inlining to assignments during pack/unpack of some DDTs #6678

Closed
wants to merge 1 commit

Conversation

derbeyn
Contributor

derbeyn commented May 17, 2019

When using Intel compilers, this patch enables a big latency improvement when packing/unpacking some kinds of indexed datatypes.

Fixes issue #6677

Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>

When using intel compilers, this patch enables a big latency improvement
when packing/unpacking some kind of indexed datatypes.

Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>
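
The patch itself is not reproduced in this thread. As a rough illustration of the technique the title describes (a hypothetical helper, not the actual Open MPI code), the idea is to turn a generic memcpy() of a small, known element length into a typed assignment that the compiler can inline:

/* Rough sketch only: for a small, known element length, a typed assignment
 * lets the compiler emit an inline load/store instead of a memcpy() library
 * call, which icc in particular may not inline when the length is a runtime
 * value.  Assumes src/dst point at properly aligned basic-type elements. */
#include <stdint.h>
#include <string.h>

static inline void copy_basic(void *dst, const void *src, size_t len)
{
    if (4 == len) {
        *(uint32_t *)dst = *(const uint32_t *)src;   /* e.g. one MPI_INT */
    } else if (8 == len) {
        *(uint64_t *)dst = *(const uint64_t *)src;   /* e.g. one MPI_DOUBLE */
    } else {
        memcpy(dst, src, len);                       /* generic fallback */
    }
}
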
@ggouaillardet
Contributor

@derbeyn I tried this PR but could not see any performance improvement with ICC 2019 or GCC 9.1.0.
Can you please share your configure command line and the version of icc you are using?

@derbeyn
Contributor Author

derbeyn commented May 21, 2019

@ggouaillardet sorry for the delay: I didn't get the notification!

For icc 2019, I used:
[derbeyn@genji0 BRRT-893]$ icc --version
icc (ICC) 19.0.3.199 20190206

Attaching the file that contains the configure command line:

my_configure.txt

@derbeyn
Contributor Author

derbeyn commented May 21, 2019

Note also that one difference between our settings is that OPAL_CUDA_SUPPORT is defined to 0 in my case. I intentionally didn't want to change anything when OPAL_CUDA_SUPPORT is enabled.
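
In other words (a sketch under that assumption; OPAL_CUDA_SUPPORT is the real Open MPI configure-time macro, the helper below is hypothetical), the fast path would only be compiled in when CUDA support is disabled:

/* Sketch only: keep the inlined-assignment fast path out of CUDA builds,
 * so OPAL_CUDA_SUPPORT=1 builds keep the original memcpy() behaviour. */
#include <stdint.h>
#include <string.h>

static inline void copy_int4(void *dst, const void *src)
{
#if !OPAL_CUDA_SUPPORT
    *(uint32_t *)dst = *(const uint32_t *)src;   /* assumes aligned data */
#else
    memcpy(dst, src, 4);                         /* unchanged for CUDA builds */
#endif
}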

@ggouaillardet
Contributor

@derbeyn I am not seeing any improvements with this patch :-(

I am using #6569 and compared this commit with the hacky patch I posted inline in #6617, using the following benchmark:

subroutine pingpong_usempi(buf, rank)
use mpi
implicit none
integer :: buf(0:1048577)
integer, intent(in) :: rank
double precision :: t0, t1
integer i, ierr

if (rank.eq.0) then
  do i=0,1048577
    buf(i) = i
  enddo
  call mpi_send(buf(1:1048576:2), 524288, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
  call mpi_recv(buf(1:1048576:2), 524288, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  t0 = MPI_Wtime()
  do i=1,1024
    call mpi_send(buf(1:1048576:2), 524288, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
    call mpi_recv(buf(1:1048576:2), 524288, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  end do
  t1 = MPI_Wtime()
  write (*,*) 'BW fortran = ', 1/(t1-t0), ' GW/s'
elseif (rank.eq.1) then
  buf = -1
  call mpi_recv(buf(1:1048576:2), 524288, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  call mpi_send(buf(1:1048576:2), 524288, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
  do i=1,1024
    call mpi_recv(buf(1:1048576:2), 524288, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    call mpi_send(buf(1:1048576:2), 524288, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
  end do
  do i=0,1048577,2
    if (buf(i).ne.-1) write (*,*) 'buf(', i, ') = ', buf(i), ' != -1'
  enddo
  do i=1,1048576,2
    if (buf(i).ne.i) write (*,*) 'buf(', i, ') = ', buf(i), ' != ', i
  enddo
endif
end subroutine

subroutine pingpong_ddt(buf, rank)
use mpi
implicit none
integer :: buf(0:1048577)
integer, intent(in) :: rank
double precision :: t0, t1
integer :: datatype
integer i, ierr

call mpi_type_vector(524288, 1, 2, MPI_INT, datatype, ierr)
call mpi_type_commit(datatype, ierr)

if (rank.eq.0) then
  do i=0,1048577
    buf(i) = i
  enddo
  call mpi_send(buf(1), 1, datatype, 1, 0, MPI_COMM_WORLD, ierr)
  call mpi_recv(buf(1), 1, datatype, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  t0 = MPI_Wtime()
  do i=1,1024
    call mpi_send(buf(1), 1, datatype, 1, 0, MPI_COMM_WORLD, ierr)
    call mpi_recv(buf(1), 1, datatype, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  end do
  t1 = MPI_Wtime()
  write (*,*) 'BW ddt = ', 1/(t1-t0), ' GW/s'
elseif (rank.eq.1) then
  buf = -1
  call mpi_recv(buf(1), 1, datatype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  call mpi_send(buf(1), 1, datatype, 0, 0, MPI_COMM_WORLD, ierr)
  do i=1,1024
    call mpi_recv(buf(1), 1, datatype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    call mpi_send(buf(1), 1, datatype, 0, 0, MPI_COMM_WORLD, ierr)
  end do
  do i=0,1048577,2
    if (buf(i).ne.-1) write (*,*) 'buf(', i, ') = ', buf(i), ' != -1'
  enddo
  do i=1,1048576,2
    if (buf(i).ne.i) write (*,*) 'buf(', i, ') = ', buf(i), ' != ', i
  enddo
endif
end subroutine

program pingpong
use mpi
implicit none

integer :: buf(0:1048577)
integer :: ierr
integer :: rank

call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world, rank, ierr)

call pingpong_usempi(buf, rank)
call mpi_barrier(MPI_COMM_WORLD, ierr)

call pingpong_ddt(buf, rank)
call mpi_barrier(MPI_COMM_WORLD, ierr)

call mpi_finalize(ierr)
end program

With this commit:

$ ~/local/ompi-cdesc-intel2019/bin/mpirun -np 2 --mca btl vader,self --mca pml ob1 ./bench
 BW fortran =   0.604091396249432       GW/s
 BW ddt =   0.254574704512480       GW/s

But with my patch:

mpirun --mca pml ob1 --mca btl vader,self ./bench 
 BW fortran =   0.484547829557082       GW/s
 BW ddt =    1.14550195763478       GW/s

I am using icc 19.0.3.199 and my configure command line is

configure '--enable-mpirun-prefix-by-default' '--prefix=/home/usersup/gilles/local/ompi-cdesc-intel2019' 'CC=icc' 'CXX=icpc' 'FC=ifort' 'CFLAGS=-DNDEBUG -O2 -g -pipe -m64' 'CXXFLAGS=-DNDEBUG -O2 -g -pipe -m64' 'FCFLAGS=-O2 -g -pipe -m64'

(I do not have your platform file, but I do not expect that to explain the gap.)

Can you try running the same benchmark with this PR on top of #6659?
The expected result is that BW ddt will be significantly higher than BW fortran.

@bosilca
Member

bosilca commented May 22, 2019

@ggouaillardet did you mean #6695?

I think it would be nice to transition all pending DDT improvements to that branch; it has significantly higher performance than the current datatype engine on all the tests I have run so far, and a lot more potential for additional optimizations.

@derbeyn
Contributor Author

derbeyn commented May 22, 2019

@ggouaillardet: OK, I'll do that, but note two things, as I said in issue #6677:

  1. the patch targets a very specific case: indexed datatypes that involve a single basic type (such as MPI_INT or MPI_DOUBLE)
  2. I did all the tests on the v3.x branch.

So I'll take @bosilca's branch, and apply my patch on top of the improvements he already did.

@ggouaillardet
Contributor

@bosilca sorry, I did mean #6569, but that is not relevant since the benchmark does not involve use mpi_f08. I will now try against #6695 since that is the way forward.

@derbeyn I tried on (patched) master instead of the v3.x branch.
The ddt in this benchmark is only made of MPI_INT,
so you should really compare the benchmark with this PR against the inline patch in #6617.

@derbeyn
Contributor Author

derbeyn commented May 22, 2019

@ggouaillardet, I think I understand why you're not getting the same results as I do: I'm using indexed datatypes, while you are using vectors, so we don't go through the same code path in pack_predefined_data().
On my side, I only patched the memcpy that is here:

MEMCPY_CSUM( *(DESTINATION), _source, _copy_blength, (CONVERTOR) );

But when using MPI_Type_vector, we don't follow the same path: we go through a different call site of
MEMCPY_CSUM( *(DESTINATION), _source, _copy_blength, (CONVERTOR) );

I didn't want to be too intrusive, and wanted to make the change only for indexed datatypes as a first step.

So what I'll do, when testing on top of George's changes, is call my "BASIC_DDT_MEMCPY_SUM()" in both cases.
I'll keep you informed.
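
The BASIC_DDT_MEMCPY_SUM() macro itself is not shown in this thread; as a guess at its shape (name, arguments, and structure are assumptions), a wrapper that both the indexed and the vector call sites could use might look like the following, falling back to the existing MEMCPY_CSUM for anything it does not handle:

/* Hypothetical sketch, not the actual macro: dispatch on the element length
 * and use typed assignments for the common basic-type sizes, otherwise fall
 * back to the existing MEMCPY_CSUM.  A real implementation would also need
 * to keep the MEMCPY_CSUM path whenever the convertor computes a checksum,
 * which this sketch does not handle. */
#include <stdint.h>

#define BASIC_DDT_COPY( DST, SRC, LEN, CONVERTOR )                  \
    do {                                                            \
        if( 4 == (LEN) ) {                                          \
            *(uint32_t *)(DST) = *(const uint32_t *)(SRC);          \
        } else if( 8 == (LEN) ) {                                   \
            *(uint64_t *)(DST) = *(const uint64_t *)(SRC);          \
        } else {                                                    \
            MEMCPY_CSUM( (DST), (SRC), (LEN), (CONVERTOR) );        \
        }                                                           \
    } while (0)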

@derbeyn
Contributor Author

derbeyn commented May 22, 2019

OK, I can confirm what I was saying in my previous note. I have 2 reproducers:
- col_matrix_lat_indexed.icc_2019: measures column latency exchanges using indexed datatypes
- col_matrix_lat_vector.icc_2019: measures column latency exchanges using vector datatypes

============== reference ompi (still on v3.x branch):

[derbeyn@genji0 tests_on_master]$ type mpicc
mpicc is /home_nfs_robin_ib/derbeyn/DISTS/ompi-v3.x-bull-icc-19.0.2.187.ref/bin/mpicc

[derbeyn@genji0 tests_on_master]$ salloc --exclusive -p SKL-20c_edr-ib2_192gb_2666 -N 2 -n 2 mpirun ./col_matrix_lat_vector.icc_2019 2047 double

(./col_matrix_lat_vector.icc_2019) LATENCY MATRIX 4096 x 4096 doubles (usecs)
col=2047 -----------------> 269.285128 usecs

[derbeyn@genji0 tests_on_master]$ salloc --exclusive -p SKL-20c_edr-ib2_192gb_2666 -N 2 -n 2 mpirun ./col_matrix_lat_indexed.icc_2019 2047 double

(./col_matrix_lat_indexed.icc_2019) LATENCY MATRIX 4096 x 4096 doubles (usecs)
col=2047 -----------------> 261.104827 usecs

============== reference ompi + patch (still v3.x branch):

[derbeyn@genji0 tests_on_master]$ type mpicc
mpicc is /home_nfs_robin_ib/derbeyn/DISTS/ompi-v3.x-bull-icc-19.0.2.187.new_inline/bin/mpicc

[derbeyn@genji0 tests_on_master]$ salloc --exclusive -p SKL-20c_edr-ib2_192gb_2666 -N 2 -n 2 mpirun ./col_matrix_lat_indexed.icc_2019 2047 double
(./col_matrix_lat_indexed.icc_2019) LATENCY MATRIX 4096 x 4096 doubles (usecs)
col=2047 -----------------> 89.354105 usecs

[derbeyn@genji0 tests_on_master]$ salloc --exclusive -p SKL-20c_edr-ib2_192gb_2666 -N 2 -n 2 mpirun ./col_matrix_lat_vector.icc_2019 2047 double
(./col_matrix_lat_vector.icc_2019) LATENCY MATRIX 4096 x 4096 doubles (usecs)
col=2047 -----------------> 261.242199 usecs

Just tell me if you want the reproducers.

@ggouaillardet
Contributor

@derbeyn thanks for the explanations. I would appreciate it if you could share the reproducers.

@derbeyn
Contributor Author

derbeyn commented May 22, 2019

sure!
for_gilles.tar.gz

bosilca added a commit to bosilca/ompi that referenced this pull request May 22, 2019
This work is based on @derbeyn patch provided on open-mpi#6678. I reworked it to
be more inclusive (works now with both gcc and icc) and to cover more
standard size lengths (4, 8, 16).

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>
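
The commit message mentions covering the standard sizes 4, 8 and 16; how commit 7e14d75 actually does this is not shown here. One portable way to get an inlined 16-byte copy (an assumption, not necessarily what the commit does) is a small struct assignment, which compilers typically lower to two 8-byte moves or a single vector move:

#include <stdint.h>

/* Hypothetical 16-byte case (e.g. a long double or complex double element),
 * assuming suitably aligned data. */
typedef struct { uint64_t half[2]; } copy16_t;

static inline void copy_16(void *dst, const void *src)
{
    *(copy16_t *)dst = *(const copy16_t *)src;
}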
@bosilca
Member

bosilca commented May 22, 2019

I have imported and improved upon this patch in #6695 (commit 7e14d75). Let's move all future discussions there.

@ggouaillardet
Contributor

Thanks Nadia! I made a simple change (and just realized @bosilca came up with a very similar one) and observed similar performance compared to my much bigger patch.

@derbeyn
Contributor Author

derbeyn commented May 23, 2019

@bosilca thanks a lot for taking care of that.

@AboorvaDevarajan
Member

Can one of the admins verify this patch?

@gpaulsen
Member

gpaulsen commented Jun 7, 2019

@derbeyn can we close this PR in favor of #6695?

@derbeyn
Contributor Author

derbeyn commented Jun 11, 2019

yes, sure!

@jsquyres jsquyres closed this Jun 11, 2019
@bosilca
Member

bosilca commented Jun 13, 2019

For closure, I ran @derbeyn's tests with the old and new datatype engines. The results are satisfactory (on an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz).

master
(./col_matrix_lat_vector) LATENCY MATRIX 4096 x 4096 doubles (usecs)
col=2047 -----------------> 526.881516 usecs
(./col_matrix_lat_indexed) LATENCY MATRIX 4096 x 4096 doubles (usecs)
col=2047 -----------------> 636.442979 usecs

New datatype
(./col_matrix_lat_vector) LATENCY MATRIX 4096 x 4096 doubles (usecs)
col=2047 -----------------> 117.476873 usecs
(./col_matrix_lat_indexed) LATENCY MATRIX 4096 x 4096 doubles (usecs)
col=2047 -----------------> 652.090864 usecs
