-
Notifications
You must be signed in to change notification settings - Fork 578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Amesos2: Potential fix for issue #2270 #3300
Conversation
Fix multi-vector get1dCopy code error for non-contiguous GID case. Re-enable tests when DEBUG is enabled.
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Using Repos:
Pull Request Author: ndellingwood |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs 30 Mins. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.9.3 # 1357 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.8.4 # 1051 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_intel_17.0.1 # 600 (click to expand)
|
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Using Repos:
Pull Request Author: ndellingwood |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs 30 Mins. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.9.3 # 1358 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.8.4 # 1052 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_intel_17.0.1 # 601 (click to expand)
|
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Using Repos:
Pull Request Author: ndellingwood |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs 30 Mins. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.9.3 # 1359 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.8.4 # 1053 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_intel_17.0.1 # 602 (click to expand)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of broken in these wrappers, but we don't have to fix all of it at once :)
@@ -211,23 +211,14 @@ namespace Amesos2 { | |||
const size_t lclNumRows = redist_mv.getLocalLength(); | |||
for (size_t j = 0; j < redist_mv.getNumVectors(); ++j) { | |||
auto av_j = av(lda*j, lclNumRows); | |||
auto X_j = redist_mv.getVector(j); | |||
//auto X_j = redist_mv.getVector(j); | |||
auto X_lcl_j_2d = redist_mv.template getLocalView<host_execution_space> (); | |||
auto X_lcl_j_1d = Kokkos::subview (X_lcl_j_2d, Kokkos::ALL (), j); | |||
for ( size_t i = 0; i < lclNumRows; ++i ) { | |||
av_j[i] = X_lcl_j_1d(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is broken with respect to UVM -- you'll need a fence before accessing these data on host. That can be a separate issue, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I'll fix it now otherwise it may end up in a list of things that keeps growing.
@@ -211,23 +211,14 @@ namespace Amesos2 { | |||
const size_t lclNumRows = redist_mv.getLocalLength(); | |||
for (size_t j = 0; j < redist_mv.getNumVectors(); ++j) { | |||
auto av_j = av(lda*j, lclNumRows); | |||
auto X_j = redist_mv.getVector(j); | |||
//auto X_j = redist_mv.getVector(j); | |||
auto X_lcl_j_2d = redist_mv.template getLocalView<host_execution_space> (); | |||
auto X_lcl_j_1d = Kokkos::subview (X_lcl_j_2d, Kokkos::ALL (), j); | |||
for ( size_t i = 0; i < lclNumRows; ++i ) { | |||
av_j[i] = X_lcl_j_1d(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could just do a deep_copy
here instead of writing a for loop.
@@ -335,7 +326,7 @@ namespace Amesos2 { | |||
// num_vecs = 1; stride does not matter | |||
auto mv_view_to_modify_2d = mv_->template getLocalView<host_execution_space>(); | |||
for ( size_t i = 0; i < lda; ++i ) { | |||
mv_view_to_modify_2d(i,0) = new_data[i]; | |||
mv_view_to_modify_2d(i,0) = new_data[i]; // Only one vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not do a deep_copy
instead of writing a for loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhoemmen thanks for feedback! This was code I mostly worked on during my first go at using Tpetra awhile back, I wasn't sure if deep_copy
from an ArrayView
(new_data
) and a 1d View was supported - I'm supposing it must be based on the suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Just to clarify -- the only way to use deep_copy
with ArrayView
right now is to turn the ArrayView
into an unowned HostSpace
View
. You did that above :) .)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll also want to fix this bit in the same way that you fixed it above. Also, if you know that the MultiVector has only one column, you could turn that 2-D View into a 1-D View first, before doing the copy.
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Using Repos:
Pull Request Author: ndellingwood |
@ndellingwood : PR testing is failing, should I wait for that to be fixed before reviewing ? |
@srajama1 I think the PR testing issues were due to some other script issues addressed by #3302 unrelated to this PR, so go ahead and give a look though I have a couple small updates to push if you'd rather wait for that. I'm going to add a |
@ndellingwood wrote:
Not sure about that -- probably not. You'll have to turn the ArrayView into an unmanaged |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs 30 Mins. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.9.3 # 1364 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.8.4 # 1058 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_intel_17.0.1 # 607 (click to expand)
|
I suggest keeping a WIP tag until you are done. |
@mhoemmen I replaced loop for element-wise copy with use of |
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Using Repos:
Pull Request Author: ndellingwood |
@srajama1 just pushed extra changes to add a |
For reference from local testing:
|
for ( size_t i = 0; i < lclNumRows; ++i ) { | ||
av_j[i] = X_lcl_j_1d(i); | ||
} | ||
std::memcpy( &(av_j[0]), (X_lcl_j_1d.data()), sizeof( decltype(X_lcl_j_1d(0)) )*lclNumRows ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sizeof
thing will break for Stokhos and Sacado dynamic-length Scalar types. How about this instead:
using val_type = Kokkos::ArithTraits<Scalar>::val_type; // whatever Scalar is
Kokkos::View<val_type*, Kokkos::HostSpace> avk (reinterpret_cast<val_type*> (av.getRawPtr ()), av.size ());
Kokkos::deep_copy (avk, _lcl_j_1d);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhoemmen I didn't think about Fad types, thanks for suggested fix.
for ( size_t i = 0; i < lclNumRows; ++i ) { | ||
X_lcl_j_1d(i) = av_j[i]; | ||
} | ||
std::memcpy( (X_lcl_j_1d.data()), &(av_j[0]), sizeof( decltype(X_lcl_j_1d(0)) )*lclNumRows ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
} | ||
} | ||
|
||
typedef typename multivec_t::node_type::memory_space memory_space; | ||
redist_mv.template sync <memory_space> (); | ||
Kokkos::fence(); // For UVM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sync
invokes fence
, so you don't have to fence again here.
There are comments from @mhoemmen about UQ types. I think this is important. |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs 30 Mins. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.9.3 # 1367 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.8.4 # 1061 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_intel_17.0.1 # 610 (click to expand)
|
Added WIP label, more changes needed. |
The unit tests pass, here are those exercising the functionality:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ndellingwood I don't want to be an obstacle to a bug fix, so I'm approving this, but please see comments. Thanks! :D
} | ||
|
||
using val_type = typename decltype( X_lcl_j_1d )::value_type; | ||
Kokkos::View<val_type*, Kokkos::HostSpace> umavj ( const_cast< val_type* > ( reinterpret_cast<const val_type*> ( av_j.getRawPtr () ) ), av_j.size () ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come av_j
is const? Now is not the time to dig into that, though ;-) .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RHS MV (i.e. 'B') is stored as const in the MVAdapter (after being copied on root proc) to prevent modification, which is why av_j
ends up being const
.
@@ -335,7 +326,7 @@ namespace Amesos2 { | |||
// num_vecs = 1; stride does not matter | |||
auto mv_view_to_modify_2d = mv_->template getLocalView<host_execution_space>(); | |||
for ( size_t i = 0; i < lda; ++i ) { | |||
mv_view_to_modify_2d(i,0) = new_data[i]; | |||
mv_view_to_modify_2d(i,0) = new_data[i]; // Only one vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Just to clarify -- the only way to use deep_copy
with ArrayView
right now is to turn the ArrayView
into an unowned HostSpace
View
. You did that above :) .)
@@ -335,7 +326,7 @@ namespace Amesos2 { | |||
// num_vecs = 1; stride does not matter | |||
auto mv_view_to_modify_2d = mv_->template getLocalView<host_execution_space>(); | |||
for ( size_t i = 0; i < lda; ++i ) { | |||
mv_view_to_modify_2d(i,0) = new_data[i]; | |||
mv_view_to_modify_2d(i,0) = new_data[i]; // Only one vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll also want to fix this bit in the same way that you fixed it above. Also, if you know that the MultiVector has only one column, you could turn that 2-D View into a 1-D View first, before doing the copy.
|
||
using val_type = typename decltype( X_lcl_j_1d )::value_type; | ||
Kokkos::View<val_type*, Kokkos::HostSpace> umavj ( const_cast< val_type* > ( reinterpret_cast<const val_type*> ( av_j.getRawPtr () ) ), av_j.size () ); | ||
Kokkos::deep_copy (umavj, X_lcl_j_1d); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This code begs for a refactor -- this function ("copy from an ArrayView to a MultiVector") seems to come up a few times.
- Does
new_data
come from a MultiVector? If so, you could justTpetra::deep_copy
between the two MultiVectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhoemmen: "this function ("copy from an ArrayView to a MultiVector") seems to come up a few times"
Yeah, we've reimplemented some optimizations from Amesos to avoid this for single process cases when the data pointer can be used directly without worrying about collectives to bring all data to the root proc. Adding @srajama1, we can discuss some possible options, estimate amount of work, availability etc.
"Does new_data
come from a MultiVector"
No, new_data
is an ArrayView storing the result (on root proc) of the solve. It needs to be copied back and redistributed to the MultiVector passed in through the solve(...)
call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ndellingwood : Can you file a different issue for optimizing this code. Let us get the PR with the current changes pushed in and we can look at optimizing this as a next step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srajama1 sounds good :)
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing. |
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Using Repos:
Pull Request Author: ndellingwood |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
|
Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ mhoemmen ]! |
Status Flag 'Pull Request AutoTester' - AutoMerge IS ENABLED, but the Label AT: AUTOMERGE is not set. Either set Label AT: AUTOMERGE or manually merge the PR... |
Fix multi-vector get1dCopy code error for non-contiguous GID case.
Re-enable tests when DEBUG is enabled.
@trilinos/amesos2
Description
Fix #2270 for broken non-contiguous GID cases (found in debug builds).
How Has This Been Tested?
Local laptop (gcc/6.3) with mpi and debug modes enabled.
Checklist