
Fix #609 Restarts with Full/Empty GPUs #611

Merged

Conversation

@ax3l (Member) commented Dec 15, 2014

To Do

  • check at runtime

@ax3l ax3l added the labels bug (a bug in the project's code) and component: plugin (in PIConGPU plugin) on Dec 15, 2014
@ax3l ax3l added this to the Open Beta milestone Dec 15, 2014
@ax3l ax3l force-pushed the fix-restartEmptyGPUs branch from cabdc40 to e07289c on December 15, 2014 20:04
- particle read during restart hangs if some GPUs
  have no particles and others do
@f-schmitt (Member)

I'll have a look asap

@ax3l (Member Author) commented Dec 15, 2014

Take your time, I don't want to push the runtime test on you (but you can test it if you like); I just wanted to prepare the pull request already.

@PrometheusPi (Member)

For me, a runtime test with LWFA failed.

@ax3l (Member Author) commented Dec 16, 2014

Please provide more details, e.g. the SPLASH_VERBOSE output, or at least confirm that it is "exactly the same error".

Update: found your notes, thanks.

@ax3l (Member Author) commented Dec 16, 2014

Actually, I think the complete `if (totalNumParticles != 0)` check is not necessary.
Removing it would also avoid the transaction weirdness.

@PrometheusPi: to see I/O errors, enable picLog::INPUT_OUTPUT by setting e.g. 32+1+16+8=57 at compile time with `ccmake .` or `cmake [...] -DPIC_VERBOSE_LVL=57`.

Update: added information on that to the Wiki - Debugging page.

Please also add group read permission (g+r) to your files (very important: stderr/stdout) :)

@ax3l (Member Author) commented Dec 16, 2014

@f-schmitt-zih to me it still looks like GPUs with particles keep hanging at the first attribute, while the zero-particle GPUs seem to proceed.

@PrometheusPi (Member)

I added the need-info/blocked label to avoid an accidental merge, because this pull request compiles but does not yet solve the problem.

The file permissions of my simulations have been adjusted.

I started a verbose simulation with your fixes: /net/cns/projects/HPL/xray/pausch/PIConGPU/runs/_restartBunch/011_LWFA_ax3l_verbose

@PrometheusPi (Member)

The simulation is done:
the output for the restart is stored in stdout_1 and stderr,
the output after the restart is stored in stdout_2 and stderr_2.

What confuses me: we have 16 GPUs = 2 x 4 x 2.
The first, starting from y=0, 2 x 1 x 2 = 4 GPUs are completely empty. The other 2 x 3 x 2 = 12 GPUs have particles.

Counting `Begin loading field 'FieldE'` or `'FieldB'` in stdout_2 leads to 16 entries. - Perfect.
If I do the same for the particle log lines `HDF5: ( [X] ) load species attribute: [Y]`, I get:

| attribute [Y] | X = start | X = end |
| ------------- | --------- | ------- |
| position      | 16        | 4       |
| momentum      | 4         | 4       |
| weighting     | 4         | 4       |
| globalCellIdx | 4         | 4       |

It looks like something goes wrong when loading the attribute position: only 4 ranks survive. Those, however, run through the entire restart process.

What confuses me is that it looks like only ranks without particles can load particle attributes.


if (totalNumParticles != 0)
Member

is not allowed, because Splash uses collective operations and all processes have to participate.

It's correct that you moved this condition, because all ranks have to participate in all Splash operations.

Member Author

As you can see in the diff, the collective read is not in the if
branch any more.
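
For illustration, here is a minimal sketch (not the actual PIConGPU code) of the rule stated above: a collective call must be executed by every rank of the communicator, so a particle-count check may only guard rank-local work, never the collective itself. `MPI_Allreduce` stands in for the collective libSplash/HDF5 read; the variable names are assumptions.

```cpp
#include <mpi.h>
#include <vector>
#include <cstdint>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Some ranks ("GPUs") own zero particles, others do not (illustrative split).
    uint64_t localNumParticles = (rank % 4 == 0) ? 0u : 1000u;

    // WRONG: skipping the collective on empty ranks deadlocks the others:
    //   if (localNumParticles != 0) MPI_Allreduce(...);
    //
    // RIGHT: every rank participates, even with a zero contribution.
    uint64_t totalNumParticles = 0;
    MPI_Allreduce(&localNumParticles, &totalNumParticles, 1,
                  MPI_UINT64_T, MPI_SUM, MPI_COMM_WORLD);

    // Rank-local work (allocation, filling particle frames, ...) may stay guarded.
    if (localNumParticles != 0)
    {
        std::vector<double> buffer(localNumParticles); // placeholder for local data
        (void)buffer;
    }

    if (rank == 0)
        std::printf("total particles: %llu\n",
                    (unsigned long long)totalNumParticles);

    MPI_Finalize();
    return 0;
}
```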

@f-schmitt (Member)

The change looks good to me in general. It does not look like I will have much time to test this myself soon, unfortunately. They keep me busy over here :)

@f-schmitt (Member)

@psychocoderHPC
I assume that the problem has been introduced with this change. Beforehand, we didn't use totalNumParticles and the resulting condition at all, therefore all ranks always participated. I would propose to see what changed there in detail and to roll back as much as possible.

@ax3l (Member Author) commented Dec 17, 2014

A full roll-back is most certainly not possible, but that's a good point to check the offsets again.
Also, the scalar index (rank) in that loop for totalNumParticles is not necessarily sorted.

The weird thing is: to me it looks like all ranks hit the collective read command, but not all of them return.
Do we have tests in libSplash for zero-element reads of grid/poly data, too (not only for writes)?

@psychocoderHPC (Member)

@f-schmitt-zih It looks like this Splash read hangs if some of the ranks read zero data. All ranks which read zero data terminate normally and return to PIConGPU. All ranks which must read data hang in H5Dread.
This bug was supposedly already fixed in HDF5 1.4.0:

Fixed H5Dread or H5Dwrite calls with H5FD_MPIO_COLLECTIVE requests
     that may hang because not all processes are transfer the same amount
     of data. (A.K.A. prematured collective return when zero amount data
     requested.) Collective calls that may cause hanging is done via the
     corresponding MPI-IO independent calls.
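
To make the requirement concrete, here is a minimal sketch of how a collective parallel HDF5 read can handle ranks that have nothing to read: every rank still calls `H5Dread` with the collective transfer property, and empty ranks use an empty selection plus a small dummy buffer (to stay clear of the `NULL`-pointer case the workaround below addresses). File and dataset names are assumptions; this is not libSplash's internal code.

```cpp
#include <hdf5.h>
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Open the file with the MPI-IO driver (parallel HDF5 build required).
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fopen("checkpoint.h5", H5F_ACC_RDONLY, fapl);        // assumed name
    hid_t dset = H5Dopen(file, "particles/position_x", H5P_DEFAULT);    // assumed path

    // Per-rank portion of the dataset; some ranks legitimately own zero elements.
    hsize_t count  = (rank % 4 == 0) ? 0 : 1000;  // illustrative split
    hsize_t offset = 0;  // a real code would compute this via an exclusive scan of counts

    hid_t fspace = H5Dget_space(dset);
    hsize_t one = 1;
    hid_t mspace = H5Screate_simple(1, &one, nullptr);

    if (count > 0)
    {
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &offset, nullptr, &count, nullptr);
        H5Sclose(mspace);
        mspace = H5Screate_simple(1, &count, nullptr);
    }
    else
    {
        // Empty ranks still take part in the collective call, just with no selection.
        H5Sselect_none(fspace);
        H5Sselect_none(mspace);
    }

    // Never hand over a NULL buffer, even for zero elements (see the workaround below).
    std::vector<double> buf(count > 0 ? count : 1);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf.data());

    H5Pclose(dxpl);
    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```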

@ax3l ax3l added the label affects latest release (a bug that affects the latest stable release) on Jan 5, 2015
@ax3l (Member Author) commented Jan 6, 2015

@f-schmitt-zih can we write a minimal libSplash test for that, to see if the error comes from the HDF5 implementation?

@f-schmitt (Member)

Yes, I can do that

@ax3l (Member Author) commented Jan 7, 2015

Also, I think @psychocoderHPC made a discovery today by tweaking the arguments for libSplash to read "one fake element" instead of NULL, and he could restart the bunch example just before going home. He might be able to give an update on that tomorrow - so it really looks like it is something in libSplash, and the test would be great.

- This is a workaround to avoid the bug that processes hang if some of them call `DataCollector::read()` with zero elements to read and the destination pointer for the read data set to `NULL`.

libSplash bug issue: ComputationalRadiationPhysics/libSplash#148
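
The workaround boils down to never handing the read call a `NULL` destination pointer: ranks with zero elements allocate a single dummy element instead. A minimal sketch of that pattern follows; the helper name is hypothetical and only mirrors the `tmpArray = new ComponentType[1];` line visible in the diff further down.

```cpp
#include <cstdint>

// Sketch of the "one fake element instead of NULL" workaround; illustrative only,
// not PIConGPU's actual code.
template<typename ComponentType>
ComponentType* allocReadBuffer(uint64_t numElements)
{
    // Even a rank with nothing to read hands the collective read a valid,
    // non-NULL pointer of at least one element, so that all ranks can pass
    // through the call (works around ComputationalRadiationPhysics/libSplash#148).
    return new ComponentType[numElements > 0 ? numElements : 1];
}
```

With libSplash 1.2.4 the underlying issue was fixed and the extra element became unnecessary again, as noted in the follow-up commits below.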
@psychocoderHPC (Member)

@ax3l, @f-schmitt-zih I pushed the workaround to this pull request and opened an issue in libSplash.

workaround for read hangs during restart
@f-schmitt (Member)

Should I still look into the repro case? I won't be able to do that before the weekend, though. However, I will fix the libSplash issue.

@ax3l (Member Author) commented Jan 8, 2015

That's great, thank you!
I think it might still be useful to add the test, to debug and fix the libSplash issue itself.

@ax3l (Member Author) commented Jan 12, 2015

@f-schmitt-zih can you please comment on/merge this pull request if you find the time? :)

@f-schmitt (Member)

The change is fine!

@f-schmitt f-schmitt closed this Jan 12, 2015
@ax3l ax3l deleted the fix-restartEmptyGPUs branch January 12, 2015 16:19
@ax3l ax3l restored the fix-restartEmptyGPUs branch January 14, 2015 16:45
@ax3l (Member Author) commented Jan 14, 2015

lol, "Closed with unmerged commits" @f-schmitt-zih ;)

@ax3l ax3l assigned psychocoderHPC and unassigned f-schmitt Jan 14, 2015
@ax3l ax3l reopened this Jan 14, 2015
@f-schmitt (Member)

I should really work less...

@ax3l (Member Author) commented Jan 14, 2015

work party 🎅

@ax3l (Member Author) commented Jan 14, 2015

@f-schmitt-zih @psychocoderHPC tested :shipit:

f-schmitt pushed a commit that referenced this pull request Jan 15, 2015
Fix #609 Restarts with Full/Empty GPUs
@f-schmitt f-schmitt merged commit 4e85460 into ComputationalRadiationPhysics:dev Jan 15, 2015
@ax3l ax3l deleted the fix-restartEmptyGPUs branch January 15, 2015 10:10
* - `libSplash` issue: https://github.com/ComputationalRadiationPhysics/libSplash/issues/148
* \todo: please remove this workaround after the libSplash bug is fixed
*/
tmpArray = new ComponentType[1];
Member Author

Update: I just ran a test with two species, where the sim did not hang but lost 75% and 95% of the particles during restart, respectively. The work-around seems to be invalid; I'll test the new 1.2.4 libSplash patch.

Member Author

Update: no, the work-around is valid, but we currently have another bug with static load balancing restarts: #637

ax3l added a commit to ax3l/picongpu that referenced this pull request Jan 21, 2015
- the work around was valid but is not needed any more
  (as of libSplash 1.2.4)
- simplify code-base again
ax3l added a commit to ax3l/picongpu that referenced this pull request Jan 22, 2015
- the work around was valid but is not needed any more
  (as of libSplash 1.2.4)
- simplify code-base again
@ax3l ax3l mentioned this pull request Jan 22, 2015