Fix #609 Restarts with Full/Empty GPUs #611
Conversation
Branch updated from cabdc40 to e07289c (compare)
- particle read during restart hangs if some GPUs have no particles and others do
I'll have a look ASAP.
Take your time, I don't want to push the runtime test on you (but you can test it if you like), I just wanted to prepare the pull already.
For me, a runtime test with
Please do provide more details, e.g. the SPLASH_VERBOSE output, or at least what "exactly the same error" looks like. Update: found your notes, thx.
Actually, I think the complete @PrometheusPi: for I/O errors, enable picLog::INPUT_OUTPUT by setting e.g. Update: information on that has been added in the Wiki under Debugging. Please also add
@f-schmitt-zih To me it still looks like GPUs with particles keep hanging at the first attribute, while the zero-particle GPUs seem to proceed.
I added it. The permissions of my simulations have been adjusted. I started a verbose simulation with your fixes:
Simulation is done. What confuses me is that we have Counting
It looks like something goes wrong when loading the attribute. What confuses me is that it looks like only ranks without particles can load particle attributes.
`if (totalNumParticles != 0)`
is not allowed, because Splash uses collective operations and all processes have to participate. It's correct that you moved this condition, because all ranks have to participate in all Splash operations.
As you can see in the diff, the collective read is not in the `if` branch any more.
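
To make the point concrete, here is a hedged, plain-MPI sketch (not PIConGPU or libSplash code) of why guarding a collective call with a rank-local condition deadlocks; `MPI_Allreduce` only stands in for the collective Splash/HDF5 read:

```cpp
// Minimal sketch, assuming a plain MPI program: a collective call must be
// issued by every rank in the communicator, even by ranks that contribute
// zero elements. MPI_Allreduce is used here only as a stand-in for the
// collective read.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // pretend only odd ranks own particles
    long localNumParticles = (rank % 2 == 1) ? 42 : 0;
    long totalNumParticles = 0;

    // WRONG: `if (localNumParticles != 0) MPI_Allreduce(...);` would let the
    // zero-particle ranks skip the collective, and every other rank would
    // block in it forever.

    // Correct: all ranks participate; contributing zero is perfectly fine.
    MPI_Allreduce(&localNumParticles, &totalNumParticles, 1,
                  MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("total particles: %ld\n", totalNumParticles);

    MPI_Finalize();
    return 0;
}
```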
The change looks good to me in general. It does not look like I will have much time soon to test this myself, unfortunately. They keep me busy over here :)
@psychocoderHPC
A full roll-back is most certainly not possible, but that's a good point to check the offsets again. The weird thing is: for me it looks like all ranks hit the collective read command, but not all return.
@f-schmitt-zih It looks like this Splash read hangs if some of the ranks read zero data. All ranks which read zero data terminate correctly and go back to PIConGPU. All ranks which must read data hang in
@f-schmitt-zih Can we write a minimal libSplash test for that, to see if the error comes from the HDF5 implementation?
Yes, I can do that.
Also, I think @psychocoderHPC made a discovery today by changing the arguments for libSplash to read "one fake element" instead of NULL, and could restart the bunch example just before going home. He might be able to post an update on that tomorrow, so it really looks like it is something in libSplash, and the test would be great.
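
For reference, a standalone reproducer along those lines might look roughly like the sketch below (plain HDF5 + MPI, no libSplash; the file name, dataset path, and data type are placeholders I made up): every rank joins the collective read, but the odd ranks select zero elements and pass a `NULL` buffer, as in the report.

```cpp
// Hedged sketch of a possible minimal reproducer, not an actual libSplash
// test: all ranks call the collective H5Dread, some with an empty selection
// and a NULL destination. "restart.h5" and "/particles/px" are placeholders.
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fopen("restart.h5", H5F_ACC_RDONLY, fapl);

    hid_t dset = H5Dopen(file, "/particles/px", H5P_DEFAULT);
    hid_t filespace = H5Dget_space(dset);

    // even ranks read one element, odd ranks read nothing
    hsize_t count = (rank % 2 == 0) ? 1 : 0;
    hsize_t offset = (hsize_t)rank;
    hsize_t memdim = 1;
    hid_t memspace = H5Screate_simple(1, &memdim, NULL);

    if (count > 0)
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &count, NULL);
    else
    {
        H5Sselect_none(filespace);
        H5Sselect_none(memspace);
    }

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    double buf[1];
    // collective: every rank must call this, including the empty ones;
    // the zero-element ranks pass NULL, mirroring the reported hang
    H5Dread(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl,
            count > 0 ? buf : NULL);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```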
- This is a workaround to avoid the bug that processes hang if some call `DataCollector::read()` with zero elements to read and the destination pointer for the read data set to `NULL`. Splash bug issue: ComputationalRadiationPhysics/libSplash#148
@ax3l, @f-schmitt-zih I pushed the workaround to this pull request.
workaround for read hangs during restart
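
The idea behind the workaround, as a rough sketch (the type and function names below are placeholders, not the exact PIConGPU source): always allocate at least one dummy element, so the pointer handed to the collective read is never `NULL`, even on ranks that own zero particles.

```cpp
#include <cstdint>

// Illustrative sketch of the workaround pattern; names are made up here.
using ComponentType = float;

void loadParticleAttribute(uint64_t localNumParticles)
{
    // "one fake element": never allocate zero, never pass NULL to the read
    const uint64_t allocElements = (localNumParticles > 0) ? localNumParticles : 1;
    ComponentType* tmpArray = new ComponentType[allocElements];

    // ... every rank now participates in the collective DataCollector::read(),
    //     passing tmpArray instead of NULL (actual call elided) ...

    // ranks without particles simply discard the unused dummy element
    delete[] tmpArray;
}
```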
Should I still look into the repro case? I won't be able to do that before the weekend, though. However, I will fix the libSplash issue.
That's great, thank you!
@f-schmitt-zih Can you please comment/merge this pull if you find the time? :)
The change is fine!
lol, "Closed with unmerged commits" @f-schmitt-zih ;) |
I should really work less...
@f-schmitt-zih @psychocoderHPC tested
Fix #609 Restarts with Full/Empty GPUs
```cpp
 * - `libSplash` issue: https://github.com/ComputationalRadiationPhysics/libSplash/issues/148
 * \todo: please remove this workaround after the libSplash bug is fixed
 */
tmpArray = new ComponentType[1];
```
Update: I just ran a test with two species, where the sim did not hang but lost 75% and 95% of the particles during restart, respectively. That work-around seems to be invalid; I'll test the new 1.2.4 libSplash patch.
Update: no, the work-around is valid, but we have another bug with static load balancing restarts right now: #637
- the work-around was valid but is not needed any more (as of libSplash 1.2.4); simplify the code base again