Force memcpy inlining to assignments during pack/unpack of some DDTs #6678
Conversation
When using Intel compilers, this patch enables a big latency improvement when packing/unpacking some kinds of indexed datatypes.

Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>
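To illustrate the idea, here is a minimal sketch in plain C. It is not the actual change to the Open MPI datatype engine; the function names and the alignment assumption are illustrative only.

```c
#include <stdint.h>
#include <string.h>

/* With a plain memcpy() some compilers -- icc in the case reported here --
 * emit a library call for each small element copied during pack/unpack. */
static inline void copy8_memcpy(void *dst, const void *src)
{
    memcpy(dst, src, 8);
}

/* The idea of the patch: express the same fixed-size copy as an assignment
 * so it is always inlined to a single load/store pair.  This assumes dst
 * and src are suitably aligned for an 8-byte access, as buffers holding
 * predefined MPI types usually are. */
static inline void copy8_assign(void *dst, const void *src)
{
    *(uint64_t *)dst = *(const uint64_t *)src;
}
```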
@derbeyn I tried this PR but could not see any performance improvement with either ICC 2019 or gcc 9.1.0.
@ggouaillardet sorry for the delay: I didn't get the notification! For icc 2019, I used the configure settings in the attached file.
Note also that the difference between our settings is that OPAL_CUDA_SUPPORT is defined to 0 in my case. I intentionally didn't want to make any change when OPAL_CUDA_SUPPORT is enabled.
@derbeyn I am not seeing any improvements with this patch :-( I am using #6569 and compared this commit with my hacky patch I posted inline in #6617, using the following benchmark:

subroutine pingpong_usempi(buf, rank)
   use mpi
   implicit none
   integer :: buf(0:1048577)
   integer, intent(in) :: rank
   double precision :: t0, t1
   integer i, ierr

   if (rank.eq.0) then
      do i=0,1048577
         buf(i) = i
      enddo
      call mpi_send(buf(1:1048576:2), 524288, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
      call mpi_recv(buf(1:1048576:2), 524288, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      t0 = MPI_Wtime()
      do i=1,1024
         call mpi_send(buf(1:1048576:2), 524288, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
         call mpi_recv(buf(1:1048576:2), 524288, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      end do
      t1 = MPI_Wtime()
      write (*,*) 'BW fortran = ', 1/(t1-t0), ' GW/s'
   elseif (rank.eq.1) then
      buf = -1
      call mpi_recv(buf(1:1048576:2), 524288, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      call mpi_send(buf(1:1048576:2), 524288, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
      do i=1,1024
         call mpi_recv(buf(1:1048576:2), 524288, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
         call mpi_send(buf(1:1048576:2), 524288, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
      end do
      do i=0,1048577,2
         if (buf(i).ne.-1) write (*,*) 'buf(', i, ') = ', buf(i), ' != -1'
      enddo
      do i=1,1048576,2
         if (buf(i).ne.i) write (*,*) 'buf(', i, ') = ', buf(i), ' != ', i
      enddo
   endif
end subroutine

subroutine pingpong_ddt(buf, rank)
   use mpi
   implicit none
   integer :: buf(0:1048577)
   integer, intent(in) :: rank
   double precision :: t0, t1
   integer :: datatype
   integer i, ierr

   call mpi_type_vector(524288, 1, 2, MPI_INTEGER, datatype, ierr)
   call mpi_type_commit(datatype, ierr)
   if (rank.eq.0) then
      do i=0,1048577
         buf(i) = i
      enddo
      call mpi_send(buf(1), 1, datatype, 1, 0, MPI_COMM_WORLD, ierr)
      call mpi_recv(buf(1), 1, datatype, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      t0 = MPI_Wtime()
      do i=1,1024
         call mpi_send(buf(1), 1, datatype, 1, 0, MPI_COMM_WORLD, ierr)
         call mpi_recv(buf(1), 1, datatype, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      end do
      t1 = MPI_Wtime()
      write (*,*) 'BW ddt = ', 1/(t1-t0), ' GW/s'
   elseif (rank.eq.1) then
      buf = -1
      call mpi_recv(buf(1), 1, datatype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      call mpi_send(buf(1), 1, datatype, 0, 0, MPI_COMM_WORLD, ierr)
      do i=1,1024
         call mpi_recv(buf(1), 1, datatype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
         call mpi_send(buf(1), 1, datatype, 0, 0, MPI_COMM_WORLD, ierr)
      end do
      do i=0,1048577,2
         if (buf(i).ne.-1) write (*,*) 'buf(', i, ') = ', buf(i), ' != -1'
      enddo
      do i=1,1048576,2
         if (buf(i).ne.i) write (*,*) 'buf(', i, ') = ', buf(i), ' != ', i
      enddo
   endif
end subroutine

program pingpong
   use mpi
   implicit none
   integer :: buf(0:1048577)
   integer :: datatype, ierr
   integer type_size
   integer rank

   call mpi_init(ierr)
   call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
   call pingpong_usempi(buf, rank)
   call mpi_barrier(MPI_COMM_WORLD, ierr)
   call pingpong_ddt(buf, rank)
   call mpi_barrier(MPI_COMM_WORLD, ierr)
   call mpi_finalize(ierr)
end program

with this commit

but with my patch

I am using icc 19.0.3.199 and my configure command line is

(I do not have your platform file, but I do not expect this explains the gap.) Can you try running the same benchmark with this PR on top of #6659?
@ggouaillardet did you mean #6695? I think it would be nice to transition all pending DDT improvements onto that branch: it has significantly higher performance than the current datatype engine on all the tests I have run so far, and a lot more potential for additional optimizations.
@ggouaillardet: OK, I'll do that, but note two things, as I said in issue #6677.
So I'll take @bosilca's branch and apply my patch on top of the improvements he has already made.
@bosilca sorry, I did mean #6569, but that is not relevant since the benchmark does not involve
@derbeyn I tried on (patched)
@ggouaillardet, I think I understand why you're not getting the same results as I do: I'm using indexed datatypes, while you are using vectors, so we don't go through the same code path in pack_predefined_data() (ompi/opal/datatype/opal_datatype_pack.h, line 56 in b780667). When using MPI_Type_vector, we follow a different path (ompi/opal/datatype/opal_datatype_pack.h, line 65 in b780667).
I didn't want to be too intrusive and wanted to make the change only for indexed datatypes as a first step. So, when testing on top of George's changes, I'll call my "BASIC_DDT_MEMCPY_SUM()" in both cases.
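As background for the two paths mentioned above, here is a deliberately simplified toy version in C, not the real pack_predefined_data() from opal_datatype_pack.h: one routine packs element by element, the other packs a contiguous block in a single copy. The real code distinguishes more cases, but the point is the same: only the small per-element copies benefit from replacing memcpy() with an inlined assignment.

```c
#include <stddef.h>
#include <string.h>

/* Toy per-element path: one small copy per iteration, so each memcpy()
 * call that is not inlined adds measurable overhead. */
static void pack_per_element(char *packed, const char *user,
                             size_t count, size_t elem_size, ptrdiff_t stride)
{
    for (size_t i = 0; i < count; i++) {
        memcpy(packed, user, elem_size);   /* candidate for the assignment trick */
        packed += elem_size;
        user   += stride;
    }
}

/* Toy contiguous path: one large copy, where a regular memcpy() is fine. */
static void pack_contiguous(char *packed, const char *user, size_t block_bytes)
{
    memcpy(packed, user, block_bytes);
}
```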
OK, I can confirm what I was saying in my previous note. I have 2 reproducers:

============== reference ompi (still on v3.x branch):
[derbeyn@genji0 tests_on_master]$ salloc --exclusive -p SKL-20c_edr-ib2_192gb_2666 -N 2 -n 2 mpirun ./col_matrix_lat_vector.icc_2019 2047 double
(./col_matrix_lat_vector.icc_2019) LATENCY MATRIX 4096 x 4096 doubles (usecs)
[derbeyn@genji0 tests_on_master]$ salloc --exclusive -p SKL-20c_edr-ib2_192gb_2666 -N 2 -n 2 mpirun ./col_matrix_lat_indexed.icc_2019 2047 double
(./col_matrix_lat_indexed.icc_2019) LATENCY MATRIX 4096 x 4096 doubles (usecs)

============== reference ompi + patch (still v3.x branch):
[derbeyn@genji0 tests_on_master]$ type mpicc
[derbeyn@genji0 tests_on_master]$ salloc --exclusive -p SKL-20c_edr-ib2_192gb_2666 -N 2 -n 2 mpirun ./col_matrix_lat_indexed.icc_2019 2047 double
[derbeyn@genji0 tests_on_master]$ salloc --exclusive -p SKL-20c_edr-ib2_192gb_2666 -N 2 -n 2 mpirun ./col_matrix_lat_vector.icc_2019 2047 double

Just tell me if you want the reproducers.
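For readers wondering why the vector and indexed reproducers behave differently, here is a hypothetical sketch (not the actual col_matrix_lat_* source) of describing the same matrix column once as a vector and once as an indexed type; the indexed description is the case this PR targets.

```c
#include <mpi.h>
#include <stdlib.h>

/* Describe one column of an n x n row-major matrix of doubles in two ways. */
static void build_column_types(int n, MPI_Datatype *vec, MPI_Datatype *idx)
{
    /* Vector: n blocks of 1 double with a stride of n elements. */
    MPI_Type_vector(n, 1, n, MPI_DOUBLE, vec);
    MPI_Type_commit(vec);

    /* Indexed: the same layout, but expressed with explicit displacements,
     * which sends pack/unpack down a different code path. */
    int *blocklens = malloc(n * sizeof(int));
    int *displs    = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) {
        blocklens[i] = 1;
        displs[i]    = i * n;
    }
    MPI_Type_indexed(n, blocklens, displs, MPI_DOUBLE, idx);
    MPI_Type_commit(idx);
    free(blocklens);
    free(displs);
}
```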
@derbeyn thanks for the explanations. I would appreciate it if you could share the reproducers.
sure!
This work is based on @derbeyn's patch provided on open-mpi#6678. I reworked it to be more inclusive (it now works with both gcc and icc) and to cover more standard size lengths (4, 8, 16). Signed-off-by: George Bosilca <bosilca@icl.utk.edu> Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>
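A rough sketch of what covering the standard size lengths could look like, using a hypothetical helper rather than the actual macro from this commit: the 4-, 8- and 16-byte cases reduce to the assignment form shown near the top of this conversation, and everything else keeps the generic memcpy() path.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte carrier type for the widest special case. */
typedef struct { uint64_t lo, hi; } copy16_t;

/* Hypothetical dispatch on the element size: the fixed-size cases become
 * assignments that both gcc and icc inline; suitable alignment of the
 * source and destination is assumed. */
static inline void ddt_small_copy(void *dst, const void *src, size_t len)
{
    switch (len) {
    case 4:  *(uint32_t *)dst = *(const uint32_t *)src; break;
    case 8:  *(uint64_t *)dst = *(const uint64_t *)src; break;
    case 16: *(copy16_t *)dst = *(const copy16_t *)src; break;
    default: memcpy(dst, src, len);                      break;
    }
}
```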
Thanks Nadia! I made a simple change (and just realized @bosilca came up with a very similar one) and observed similar performance compared to my much bigger patch.
@bosilca thanks a lot for taking care of that.
Can one of the admins verify this patch?
yes, sure!
For closure, I ran @derbeyn's tests with both the old and the new datatype engine. The results are satisfactory (on an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz).
When using Intel compilers, this patch enables a big latency improvement when packing/unpacking some kinds of indexed datatypes.
Fixes issue #6677
Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>