Ensure we correctly identify local vs non-local peers #13111
+125
−83
NOTE: THIS PR SWITCHES THE PRRTE SUBMODULE TO POINT TO THE UPSTREAM MASTER BRANCH. THIS WAS DONE TO ALLOW RESOLUTION OF THE PROBLEM IDENTIFIED IN #13059, WHICH CAN ONLY BE FIXED BY CHANGES IN BOTH PMIX AND OMPI. UNFORTUNATELY, YOUR PRRTE FORK IS BROKEN AND CANNOT WORK WITH THE PMIX MASTER BRANCH.
PMIX_LOCALITY is a value that OMPI computes when we do connect/accept. It is computed in opal_hwloc_compute_relative_locality and stored locally on each proc. The reason is that PMIX_LOCALITY provides the location of a process relative to you - it isn't an absolute value representing the location of the process on the node.
The absolute location of the proc is provided by the runtime in PMIX_LOCALITY_STRING. This is what was retrieved in dpm.c - and then used to compute the relative locality of that proc, which is then stored as PMIX_LOCALITY.
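To make the absolute-versus-relative distinction concrete, here is a minimal sketch. It is not the code in opal_hwloc_compute_relative_locality: the flag names, the abs_location_t struct, and the bitmask encoding are all hypothetical stand-ins for whatever PMIX_LOCALITY_STRING actually encodes. It only illustrates that the result depends on both procs, which is why each proc has to compute and store its own copy.

```c
#include <stdint.h>

/* Hypothetical flags for illustration; the real values are defined
 * in the OMPI/PMIx headers and are not reproduced here. */
enum {
    LOC_ON_NODE    = 1u << 0,
    LOC_ON_PACKAGE = 1u << 1,
    LOC_ON_NUMA    = 1u << 2,
    LOC_ON_CORE    = 1u << 3
};

/* Toy stand-in for an absolute location: one bitmask per hardware
 * level, with bit i set if the proc is bound to object i at that level. */
typedef struct {
    uint64_t packages;
    uint64_t numas;
    uint64_t cores;
} abs_location_t;

/* Relative locality of 'peer' as seen from 'me': a level is shared
 * when the two absolute locations overlap at that level. Both procs
 * are assumed to already be known to sit on the same node. */
static unsigned relative_locality(const abs_location_t *me,
                                  const abs_location_t *peer)
{
    unsigned loc = LOC_ON_NODE;

    if (me->packages & peer->packages) {
        loc |= LOC_ON_PACKAGE;
    }
    if (me->numas & peer->numas) {
        loc |= LOC_ON_NUMA;
    }
    if (me->cores & peer->cores) {
        loc |= LOC_ON_CORE;
    }
    return loc;
}
```

The same peer yields different values depending on who is asking, so the answer cannot be published once by the runtime the way the absolute location can.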
So the reason procs from two unconnected jobs aren't able to get each other's PMIX_LOCALITY values is simply because (a) they didn't go through connect/accept, and therefore (b) they never computed and saved those values.
Also, the runtime provides PMIX_LOCALITY_STRING only for those procs that have a defined location - i.e., procs that are BOUND. If a process is not bound, then it has no fixed location on the node, and so the runtime doesn't provide a locality string for it. Thus, getting "not found" for a modex retrieval on PMIX_LOCALITY_STRING is NOT a definitive indicator that the proc is on a different node.
The only way to determine that a proc is on a different node is to get the list (or array) of procs on the node and see if the proc is on it. We do this in the dpm, but that step was missing from the comm code.
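The check described above can be sketched roughly as follows. This is not the code in this PR: the function and enum names are hypothetical, and it assumes the list of procs on the node arrives as a comma-delimited string of ranks (the form PMIX_LOCAL_PEERS uses). The point is the ordering: node membership decides local vs non-local, and a missing locality string only means the proc is unbound.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical classification used only in this sketch. */
typedef enum {
    PEER_NON_LOCAL,      /* not in the node's peer list          */
    PEER_LOCAL_UNBOUND,  /* on the node, but no locality string  */
    PEER_LOCAL_BOUND     /* on the node, locality string present */
} peer_class_t;

/* Return true if 'rank' appears in a comma-delimited list of ranks. */
static bool rank_in_peer_list(const char *peers, unsigned long rank)
{
    const char *p = peers;

    while (p != NULL && *p != '\0') {
        char *end;
        unsigned long r = strtoul(p, &end, 10);

        if (end == p) {
            break;  /* malformed entry: stop scanning */
        }
        if (r == rank) {
            return true;
        }
        p = (*end == ',') ? end + 1 : end;
    }
    return false;
}

/* 'locality_string' is NULL when the lookup returned "not found". */
static peer_class_t classify_peer(const char *local_peers,
                                  unsigned long rank,
                                  const char *locality_string)
{
    if (!rank_in_peer_list(local_peers, rank)) {
        return PEER_NON_LOCAL;
    }
    if (locality_string == NULL) {
        /* Unbound proc: still local, just no fixed location. */
        return PEER_LOCAL_UNBOUND;
    }
    return PEER_LOCAL_BOUND;
}
```

Only in the PEER_LOCAL_BOUND case is there an absolute location from which a relative locality can be computed; an unbound local proc can at most be marked as sharing the node.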
So what I've done here is create a new function ompi_dpm_set_locality that both connect/accept and get_rprocs can use since the required functionality is identical. This will hopefully avoid similar mistakes in the future.