Add sanity check so ParameterInput is not allowed to be different on different MPI ranks #1173

Merged
Yurlungur merged 6 commits into develop on Sep 13, 2024

Conversation

Yurlungur
Collaborator

@Yurlungur Yurlungur commented Sep 11, 2024

PR Summary

We've been hit multiple times by the fact that ParameterInput is stateful: GetOrAdd modifies the object. If ParameterInput differs between MPI ranks, HDF5 output will hang, because writing params is a collective operation.
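
To make the failure mode concrete, here is a minimal sketch of how a "get or add" accessor can leave two ranks with different state. `ToyParameterInput` and `RanksAgreeAfterQuery` are hypothetical stand-ins invented for illustration, not the actual Parthenon classes, and the two "ranks" are just two objects in one process.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for ParameterInput: a plain string-to-string map.
struct ToyParameterInput {
  std::map<std::string, std::string> params;

  // Returns the stored value if the key exists; otherwise stores the
  // default and returns it. A call that looks like a read can mutate state.
  std::string GetOrAdd(const std::string &key, const std::string &def) {
    auto it = params.find(key);
    if (it != params.end()) return it->second;
    params[key] = def;
    return def;
  }
};

// Simulates two ranks that start from the same input, where only
// "rank 0" takes a code path that queries an extra parameter.
inline bool RanksAgreeAfterQuery() {
  ToyParameterInput rank0, rank1;
  rank0.params["mesh/nx1"] = "64";
  rank1.params["mesh/nx1"] = "64";
  rank0.GetOrAdd("output/extra", "true");  // only rank 0 calls this
  return rank0.params == rank1.params;
}
```

In this toy model `RanksAgreeAfterQuery()` returns false: the rank-dependent query silently added a key on one rank only, which is the kind of divergence that later breaks a collective write.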

This is a minimal fix that at least helps with debugging when we hit this issue. I add the ability to compute a hash of ParameterInput. Then, before HDF5 output, I check that this hash is the same on all MPI ranks, using one MPI broadcast and one MPI reduce.
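
A rough sketch of the idea, not the code in this PR: hash the parameters with a Boost-style combine, then compare every rank's hash against rank 0's. The names `ToyParams`, `HashCombine`, `HashParams`, and `AllRanksAgree` are assumptions made for illustration, and the MPI step is simulated serially with a vector holding one hash per rank. In real MPI code the two steps would correspond to something like `MPI_Bcast` of the root hash followed by a logical-AND reduction of the per-rank comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Classic boost::hash_combine formula (newer Boost releases mix differently).
inline std::size_t HashCombine(std::size_t seed, std::size_t value) {
  return seed ^ (value + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Hypothetical stand-in for the parameter store.
using ToyParams = std::map<std::string, std::string>;

// Folds every key and value into one hash. std::map iterates in sorted
// key order, so insertion order does not affect the result.
inline std::size_t HashParams(const ToyParams &params) {
  std::size_t seed = 0;
  std::hash<std::string> hasher;
  for (const auto &kv : params) {
    seed = HashCombine(seed, hasher(kv.first));
    seed = HashCombine(seed, hasher(kv.second));
  }
  return seed;
}

// Serial simulation of the consistency check.
// Step 1 ("broadcast"): every rank learns rank 0's hash.
// Step 2 ("reduce"): logical AND over "my hash equals the root hash".
inline bool AllRanksAgree(const std::vector<std::size_t> &hash_per_rank) {
  if (hash_per_rank.empty()) return true;
  const std::size_t root_hash = hash_per_rank[0];
  bool all_match = true;
  for (std::size_t h : hash_per_rank) {
    all_match = all_match && (h == root_hash);
  }
  return all_match;
}
```

Only a single integer per rank crosses the (simulated) network, so the check stays cheap regardless of how many parameters exist. The price is that a mismatch reports only that the ranks differ, not which parameter caused it.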

I added tests for the hashing machinery in the unit tests. I couldn't test the MPI bit in the unit tests, but I did test it by hand by modifying one of the examples to add a param on rank zero only, and got the desired behavior.

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@Yurlungur Yurlungur added the bug Something isn't working label Sep 11, 2024
@Yurlungur Yurlungur self-assigned this Sep 11, 2024
Collaborator

@lroberts36 lroberts36 left a comment

LGTM, this will definitely prevent some bug hunts in the future!

I don't have a sense for how well the Boost hash combine function works so I worry a little about possible hash collisions. That being said, I don't see a better way to do things without a lot more work.

src/outputs/output_utils.cpp (review thread outdated, resolved)
@Yurlungur
Collaborator Author

> I don't have a sense for how well the Boost hash combine function works so I worry a little about possible hash collisions. That being said, I don't see a better way to do things without a lot more work.

Yeah hash collisions are possible. But I figured this is better than nothing.

Collaborator

@bprather bprather left a comment

LGTM. I am not worried about collisions at all, as the two ParameterInputs will be highly correlated, differing by a single mistaken change, rather than drawn randomly. Since the combine op is non-associative, you'd need to find a parameter that hashes to the inverse of another, not just to the same thing, i.e. a genuine collision in the underlying function.
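
One way to read the point about the combine operation, sketched under assumptions: with a plain XOR fold, reordering entries is invisible and a repeated entry cancels itself out, whereas the classic Boost-style combine depends on both order and grouping. `XorCombine`, `FoldXor`, and `FoldCombine` are hypothetical helpers written for this comparison, and the small integers stand in for per-parameter hashes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive combine: commutative, associative, and self-inverse.
inline std::size_t XorCombine(std::size_t seed, std::size_t value) {
  return seed ^ value;
}

// Classic boost::hash_combine formula: mixes the running seed into the result,
// so the outcome depends on the order and grouping of the inputs.
inline std::size_t HashCombine(std::size_t seed, std::size_t value) {
  return seed ^ (value + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Left folds over a list of per-parameter hashes, starting from seed 0.
inline std::size_t FoldXor(const std::vector<std::size_t> &hashes) {
  std::size_t seed = 0;
  for (std::size_t h : hashes) seed = XorCombine(seed, h);
  return seed;
}

inline std::size_t FoldCombine(const std::vector<std::size_t> &hashes) {
  std::size_t seed = 0;
  for (std::size_t h : hashes) seed = HashCombine(seed, h);
  return seed;
}
```

Here `FoldXor({1, 2})` equals `FoldXor({2, 1})` and `FoldXor({7, 7})` collapses to 0, so an XOR fold could miss a swapped or duplicated entry. `FoldCombine` gives different results for `{1, 2}` and `{2, 1}`, and `{7, 7}` does not collapse back to the empty-list value.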

I could imagine someday synchronizing automatically to rank 0 rather than just erroring, but that sounds like a lot of code to maintain for what amounts to hand-holding

@Yurlungur Yurlungur enabled auto-merge September 13, 2024 20:38
@Yurlungur
Collaborator Author

> I could imagine someday synchronizing automatically to rank 0 rather than just erroring, but that sounds like a lot of code to maintain for what amounts to hand-holding

I considered implementing that. I don't actually think it would be too difficult, but I was concerned it might not be the desired behavior. I think it's better to die and make the developer/user figure out what's wrong.

@Yurlungur Yurlungur merged commit 8966c18 into develop Sep 13, 2024
53 checks passed
Labels
bug Something isn't working
4 participants