Add support for PMIx tool connections and queries. #1801

Merged
merged 1 commit into from
Jun 30, 2016

Conversation

rhc54
Contributor

@rhc54 rhc54 commented Jun 20, 2016

Initially, this only supports a request to list all known namespaces (jobids) from ORTE, but other folks will extend that support to include additional information.

Also, modify the OMPI and PMIx configury so that we install libpmix and its associated headers when --with-devel-headers is given. This allows PMIx developers to work using the PMIx version embedded in OMPI, thus ensuring that it matches the ORTE server.

Refs openpmix/openpmix#68

@rhc54
Contributor Author

rhc54 commented Jun 29, 2016

bot:retest

@rhc54
Contributor Author

rhc54 commented Jun 30, 2016

@jjhursey Can you give me some idea of the xl failure? I honestly cannot filter thru all that output to find the issue

@rhc54
Contributor Author

rhc54 commented Jun 30, 2016

huh - now it shows everything as passed. sigh

@rhc54
Contributor Author

rhc54 commented Jun 30, 2016

I am not able to reproduce this make distclean failure - tarballs build fine, distclean works fine. Any suggestions as to why this fails on Jenkins?

@jsquyres
Member

@rhc54 The failure is earlier in configure:

17:27:55 configure: WARNING: LIBEVENT SUPPORT NOT FOUND
17:27:55 configure: error: CANNOT CONTINE
17:27:55 configure: /bin/sh './configure' *failed* for opal/mca/pmix/pmix2x/pmix
17:27:55 checking if MCA component pmix:pmix2x can compile... no

@rhc54
Contributor Author

rhc54 commented Jun 30, 2016

sigh - so why then does the jenkins server continue past that point?? why doesn't it stop upon the failure to configure?

@jsquyres
Member

configure didn't fail because there was no error -- the pmix2x component simply elected not to build, and configure continued on, as usual. But later, since pmix2x didn't set itself up properly, make dist failed.

The key is that even for components that elect not to build, they need to finish setting themselves up properly. I.e., component configure.m4 scripts should never call AC_MSG_ERROR or otherwise abort before they have completed (and generated all their Makefiles, etc.).

…port a request to list all known namespaces (jobids) from ORTE, but other folks will extend that support to include additional information

Update to match PMIx RFC

Fix configury to point to correct libevent and hwloc locations
@ibm-ompi

Build Failed with XL compiler! Please review the log, and get in touch if you have questions.

@rhc54
Contributor Author

rhc54 commented Jun 30, 2016

Mellanox failure is the rdmacm one again

@rhc54
Contributor Author

rhc54 commented Jun 30, 2016

@jjhursey I surrender - is there any way you can make your Jenkins output be a little more obvious as to the problem that caused it to be flagged as failed? I honestly cannot find the error in the midst of these thousands of lines of output. Perhaps just stop at the error instead of bulldozing ahead?

@jjhursey
Member

@rhc54 So we get complaints that there is not enough output to diagnose the problem when we hide things that pass - to reduce output. And we get complaints that there is too much output when we add it back - to give context. I welcome suggestions on how to improve the setup, but there is only so much that we can do at the end of the day. We can continue this discussion on the devel mailing list, if you like.

This particular error is at the very bottom of the output (the file is long enough that you have to click on the "view the full file" link). I added the summary at the bottom of our results to help highlight where to start looking. The failure is identified below by the IBM_CI_FAIL marker.

#################
Run Examples
#################
+ cd /gpfs/gpfs_stage1/jhursey/jenkins/workspace/ompi_public_pr_master_xl/ompi-src/examples
+ timeout --preserve-status -k 22s 20s mpirun -np 2 -mca btl tcp,vader,sm,self hello_c
[p10a601:18357] listen_thread: accept() failed: Invalid argument (22).
+ RC=1
+ echo 'IBM_CI_FAIL : Run examples'
IBM_CI_FAIL : Run examples
+ exit 1

#################
# Summary
#################
# Tests Passed: 5 of 7
#---------------------
IBM_CI_SUCCESS : Autogen
IBM_CI_SUCCESS : Configure
IBM_CI_SUCCESS : Make
IBM_CI_SUCCESS : Make Install
IBM_CI_SUCCESS : Make examples
#################

This might be an intermittent failure, since it passed earlier in the evening for that configuration.
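
For readers who want to pull the markers out of a long log without scrolling, here is a minimal sketch. It assumes only the marker format visible in the excerpt above (`IBM_CI_FAIL : <step>` and `IBM_CI_SUCCESS : <step>` on their own lines); the function name `summarize_ci_log` is hypothetical and is not part of the IBM CI setup.

```python
import re

# Marker format copied from the log excerpt above; nothing else about the
# CI output is assumed.
MARKER = re.compile(r"^(IBM_CI_(?:FAIL|SUCCESS)) : (.+)$")


def summarize_ci_log(text):
    """Return (failed_steps, passed_steps) in order of first appearance.

    Hypothetical helper for illustration only.
    """
    failed, passed = [], []
    for line in text.splitlines():
        match = MARKER.match(line.strip())
        if match is None:
            # Shell trace lines such as "+ echo 'IBM_CI_FAIL : ...'" start
            # with "+" and are skipped, so each step is counted once.
            continue
        kind, step = match.group(1), match.group(2).strip()
        bucket = failed if kind == "IBM_CI_FAIL" else passed
        if step not in bucket:
            bucket.append(step)
    return failed, passed


if __name__ == "__main__":
    import sys

    # Usage: python summarize_ci_log.py < raw_jenkins_log.txt
    failed_steps, passed_steps = summarize_ci_log(sys.stdin.read())
    for step in failed_steps:
        print("FAILED:", step)
    print("passed:", ", ".join(passed_steps))
```

Matching only lines that begin with the marker keeps the `+ echo` trace lines from producing duplicate entries.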

@rhc54
Contributor Author

rhc54 commented Jun 30, 2016

@jjhursey I was unaware that the log (minus asking for the raw file) wasn't showing me the end, nor that there was some magic marker that I could search against. That helps a bit. Frankly, this is all becoming rather overwhelming.

What we really need is for Jenkins, when it sees an apparent failure, to run the unmodified code and compare the output. This might help tell us if this specific PR is actually the source of a problem or not.

Barring that, I guess I'm not sure where to go. We are buried in data, but lacking in information, and I'm not sure if an automated system is capable of sifting between the two.
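
The baseline comparison suggested above could be sketched roughly as follows. This is purely illustrative: it assumes the CI could produce a list of failed step names for both the unmodified branch and the PR, and the function `classify_failures` and its result keys are hypothetical, not something Jenkins or the existing setup provides.

```python
def classify_failures(baseline_failed, pr_failed):
    """Split failed steps by whether the unmodified code fails them too.

    Both arguments are iterables of step names (hypothetical inputs, e.g.
    "Run examples"). Returns a dict of sorted lists.
    """
    baseline, pr = set(baseline_failed), set(pr_failed)
    return {
        # Fails only with the PR applied: the PR is the likely cause.
        "introduced_by_pr": sorted(pr - baseline),
        # Fails with and without the PR: probably not caused by the PR.
        "pre_existing": sorted(pr & baseline),
        # Fails only on the unmodified code: the PR may have fixed it.
        "fixed_by_pr": sorted(baseline - pr),
    }


if __name__ == "__main__":
    result = classify_failures(
        baseline_failed=["Run examples"],
        pr_failed=["Run examples", "Make"],
    )
    for category, steps in result.items():
        print(category, steps)
```

A single baseline run would not settle intermittent failures like the one discussed here; repeating both runs a few times before classifying would be one way to reduce that noise.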

@rhc54 rhc54 merged commit 063f848 into open-mpi:master Jun 30, 2016
@rhc54 rhc54 deleted the topic/toolconnect branch June 30, 2016 03:48
@jsquyres
Member

@rhc54 @jjhursey Sounds like a good topic for the August meeting -- whether it's part of the agenda or an informal/beer discussion.

@rhc54
Contributor Author

rhc54 commented Jun 30, 2016

Agreed - I'm not bashing the test support, just trying to figure out how we make it more valuable and easy for developers to use.

@jjhursey
Member

Sounds great. Thanks for putting it on the agenda.
