SST crashes with LibFabric #1728
Comments
Trying now...
@philip-davis nope, I get the same error.
Here's the output with the environment variables set. I changed to only 2 total ranks to keep the noise to a minimum:
Does this machine have an attached omnipath fabric?
@philip-davis I just learned from our sysadmin that the omnipath isn't working on that machine, so this could very well be a wild goose chase. Although, would you be able to detect if omnipath was working during cmake configuration?
If Omnipath is present but disabled, that might explain the issue. When libfabric builds, it detects which fabrics it can build against, but it doesn't do any runtime validation of them. ADIOS builds against libfabric and inherits that configuration. We do runtime checks to see if there's a "valid" fabric that we can run against, but we're at libfabric's mercy for those checks. In this case, it believes that the Omnipath fabric exists but crashes when it tries to use it. It looks like the crash is coming from inside a thread that libfabric launches for progress management, so this probably comes down to a bug in libfabric's handling of unexpected psm2 states. In any case, since there's no working RDMA fabric in the machine (unless there's also a viable IB fabric besides the OPA?), you are better off sticking to the WAN/evpath dataplane.
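For anyone landing here with the same crash, the suggestion to stick to the WAN/evpath dataplane could be sketched as below. This is a minimal sketch, not ADIOS2 code: `sst_parameters` and the `MY_APP_RDMA_OK` environment variable are hypothetical names made up for illustration, and it assumes the SST engine accepts a `DataTransport` parameter with the value `WAN` as described in the ADIOS2 SST documentation (check this against your installed version).

```python
import os


def sst_parameters(rdma_usable: bool) -> dict:
    """Build SST engine parameters (hypothetical helper, not part of ADIOS2).

    When no RDMA fabric is known to work, explicitly request the WAN data
    plane so SST does not try the libfabric-based one.
    """
    params = {}
    if not rdma_usable:
        # Assumption: "DataTransport" = "WAN" selects the evpath/WAN data plane.
        params["DataTransport"] = "WAN"
    return params


# Hypothetical opt-in switch: treat RDMA as unusable unless told otherwise.
rdma_usable = os.environ.get("MY_APP_RDMA_OK", "0") == "1"
params = sst_parameters(rdma_usable)
print(params)

# With ADIOS2 these would then be handed to the IO object, roughly:
#   io.SetEngine("SST")
#   io.SetParameters(params)
```

The choice is made by the application rather than detected automatically, since, as noted above, libfabric's own runtime checks can report a fabric as available even when it is not usable.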
@khuck unrelated to the issue at hand, I noticed when building you're using the |
@chuckatkins I wish that were true with all CMake configurations, but not every project sets up a proper configuration. :) So my habit is to try with the compiler and when that doesn't work (some file can't compile because |
@philip-davis Are there remaining things to sort out WRT this issue? Or can we close?
Close it. I haven't had problems with SST since, and other stability (i.e. crash) issues with ADIOS2 have been resolved.
Thanks @khuck!
ADIOS2 was configured on a 36-core Linux workstation with the following cmake output:
When trying to run the heatTransfer example on this workstation with SST, the following crash happened (similar/same crash happens without the --mca arguments):
And the backtrace from one of the ranks: