Add sanity check so ParameterInput is not allowed to be different on different MPI ranks #1173
LGTM, this will definitely prevent some bug hunts in the future!
I don't have a sense for how well the Boost hash combine function works so I worry a little about possible hash collisions. That being said, I don't see a better way to do things without a lot more work.
Yeah, hash collisions are possible. But I figured this is better than nothing.
LGTM. I am not worried about collisions at all, as the two `ParameterInput`s will be highly correlated with a single mistaken change, not drawn randomly. Since the combine op is non-associative, you'd need to find a parameter that hashes to the inverse of another, not just the same thing, i.e. a genuine collision in the underlying function.
I could imagine someday synchronizing automatically to rank 0 rather than just erroring, but that sounds like a lot of code to maintain for what amounts to hand-holding.
I considered implementing that. I don't actually think it would be too difficult, but I was concerned this might not be the desired behavior. I think it's better to die and make the developer/user figure out what's wrong.
PR Summary
Multiple times we've been hit by the issue that `ParameterInput` is stateful because `GetOrAdd` modifies the object. If `ParameterInput` is different on different MPI ranks, then HDF5 output will hang, because writing to params is a collective action.
This is a minimal fix that at least helps debugging when we hit this issue. I add the ability to compute the hash of `ParameterInput`. Then, before output in HDF5, I check that this hash is the same on all MPI ranks with one MPI broadcast and one MPI reduce.
I added tests for the hashing machinery in the unit tests. I couldn't test the MPI bit in the unit tests, but I did test it by hand by modifying one of the examples to add a param on only rank zero and got the desired behavior.
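For readers who want a feel for the idea, here is a rough sketch of the approach described above. It is not the code in this PR. `Params`, `HashCombine`, `HashParams`, and `AllRanksAgree` are hypothetical names, the mixing formula is the classic one associated with older Boost releases, and the MPI broadcast and reduce are simulated with a plain vector of per-rank hashes so the example runs without MPI.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for ParameterInput: an ordered map of key -> value.
using Params = std::map<std::string, std::string>;

// Order-sensitive mixing step in the style of the classic boost::hash_combine.
inline std::size_t HashCombine(std::size_t seed, std::size_t value) {
  return seed ^ (value + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Fold every key and value into a single hash for the whole parameter set.
std::size_t HashParams(const Params &params) {
  std::hash<std::string> hasher;
  std::size_t seed = 0;
  for (const auto &kv : params) {
    seed = HashCombine(seed, hasher(kv.first));
    seed = HashCombine(seed, hasher(kv.second));
  }
  return seed;
}

// Simulated consistency check. With real MPI, rank 0's hash would be
// broadcast to all ranks and the per-rank mismatch flags would be reduced;
// here rank_hashes[i] plays the role of the hash computed on rank i.
bool AllRanksAgree(const std::vector<std::size_t> &rank_hashes) {
  if (rank_hashes.empty()) return true;
  const std::size_t root_hash = rank_hashes[0];  // what the broadcast delivers
  int mismatches = 0;                            // what the sum-reduce delivers
  for (std::size_t h : rank_hashes) {
    mismatches += (h != root_hash) ? 1 : 0;
  }
  return mismatches == 0;
}
```

In this sketch, a parameter added on only one rank (for example by a `GetOrAdd`-style call that runs on rank 0 alone) changes that rank's hash, so `AllRanksAgree` returns false and the caller can abort with a clear error instead of hanging in a collective write.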
PR Checklist