
Fix #609 Restarts with Full/Empty GPUs #611

Merged

Conversation

@ax3l (Member) commented Dec 15, 2014

To Do

  • check at runtime

@ax3l ax3l added the labels bug (a bug in the project's code) and component: plugin (in PIConGPU plugin) on Dec 15, 2014
@ax3l ax3l added this to the Open Beta milestone Dec 15, 2014
@ax3l ax3l force-pushed the fix-restartEmptyGPUs branch from cabdc40 to e07289c on December 15, 2014 20:04
- particle read during restart hangs if some GPUs
  have no particles and others do
@f-schmitt (Member)

I'll have a look asap

@ax3l (Member Author) commented Dec 15, 2014

Take your time, I don't want to push the runtime test on you (but you can test it if you like); I just wanted to prepare the pull request already.

@PrometheusPi (Member)

For me, a runtime test with LWFA failed.

@ax3l (Member Author) commented Dec 16, 2014

Please provide more details, e.g. the SPLASH_VERBOSE output, or at least confirm that it is "exactly the same error".

Update: found your notes, thanks.

@ax3l (Member Author) commented Dec 16, 2014

Actually, I think the complete `if (totalNumParticles != 0)` check is not necessary.
Removing it would also avoid the transaction weirdness.

@PrometheusPi: to see I/O errors, enable picLog::INPUT_OUTPUT by setting e.g. 32+1+16+8=57 at compile time with `ccmake .` or `cmake [...] -DPIC_VERBOSE_LVL=57`.

Update: added information on that to the Wiki - Debugging page.

Please also add group read permission (g+r) to your files (very important: stderr/stdout) :)

@ax3l (Member Author) commented Dec 16, 2014

@f-schmitt-zih to me it still looks like GPUs with particles keep hanging at the first attribute, while the zero-particle GPUs seem to proceed.

@PrometheusPi (Member)

I added the need-info/blocked label to avoid an accidental merge, because this pull request compiles but does not yet solve the problem.

The file permissions of my simulations have been adjusted.

I started a verbose simulation with your fixes: /net/cns/projects/HPL/xray/pausch/PIConGPU/runs/_restartBunch/011_LWFA_ax3l_verbose

@PrometheusPi (Member)

The simulation is done:
the output for the restart is stored in stdout_1 and stderr,
the output after the restart is stored in stdout_2 and stderr_2.

What confuses me: we have 16 GPUs = 2 x 4 x 2.
The first, starting from y=0, 2 x 1 x 2 = 4 GPUs are completely empty. The other 2 x 3 x 2 = 12 GPUs have particles.

Counting `Begin loading field 'FieldE'` or `'FieldB'` in stdout_2 leads to 16 entries. - Perfect.
If I do the same for the particle log lines `HDF5: ( [X] ) load species attribute: [Y]`, I get:

| attribute [Y] | X = start | X = end |
| ------------- | --------- | ------- |
| position      | 16        | 4       |
| momentum      | 4         | 4       |
| weighting     | 4         | 4       |
| globalCellIdx | 4         | 4       |

It looks like something goes wrong when loading the attribute position: only 4 ranks survive. Those, however, run through the entire restart process.

What confuses me is that it looks like only ranks without particles can load particle attributes.


if (totalNumParticles != 0)
Member

is not allowed, because Splash uses collective operations and all processes have to participate.

It's correct that you moved this condition, because all ranks have to participate in all Splash operations.

Member Author

As you can see in the diff, the collective read is not in the if
branch any more.
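
For illustration, here is a minimal sketch (not the actual PIConGPU code) of the rule stated above: a collective call must be executed by every rank of the communicator, so a particle-count check may only guard rank-local work, never the collective itself. `MPI_Allreduce` stands in for the collective libSplash/HDF5 read; the variable names are assumptions.

```cpp
#include <mpi.h>
#include <vector>
#include <cstdint>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Some ranks ("GPUs") own zero particles, others do not (illustrative split).
    uint64_t localNumParticles = (rank % 4 == 0) ? 0u : 1000u;

    // WRONG: skipping the collective on empty ranks deadlocks the others:
    //   if (localNumParticles != 0) MPI_Allreduce(...);
    //
    // RIGHT: every rank participates, even with a zero contribution.
    uint64_t totalNumParticles = 0;
    MPI_Allreduce(&localNumParticles, &totalNumParticles, 1,
                  MPI_UINT64_T, MPI_SUM, MPI_COMM_WORLD);

    // Rank-local work (allocation, filling particle frames, ...) may stay guarded.
    if (localNumParticles != 0)
    {
        std::vector<double> buffer(localNumParticles); // placeholder for local data
        (void)buffer;
    }

    if (rank == 0)
        std::printf("total particles: %llu\n",
                    (unsigned long long)totalNumParticles);

    MPI_Finalize();
    return 0;
}
```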

@f-schmitt (Member)

The change looks good to me in general. It does not look like I will have much time to test this myself soon, unfortunately. They keep me busy over here :)

@f-schmitt (Member)

@psychocoderHPC
I assume that the problem has been introduced with this change. Beforehand, we didn't use totalNumParticles and the resulting condition at all, therefore all ranks always participated. I would propose to see what changed there in detail and to roll back as much as possible.

@ax3l (Member Author) commented Dec 17, 2014

A full roll-back is most certainly not possible, but that's a good point to check the offsets again.
Also, the scalar index (rank) in that loop for totalNumParticles is not necessarily sorted.

The weird thing is: to me it looks like all ranks hit the collective read command, but not all of them return.
Do we have tests in libSplash for zero-element reads of grid/poly data, too (not only for writes)?

@psychocoderHPC (Member)

@f-schmitt-zih It looks like this Splash read hangs if some of the ranks read zero data. All ranks which read zero data terminate normally and return to PIConGPU. All ranks which must read data hang in H5Dread.
This bug was supposedly already fixed in HDF5 1.4.0:

Fixed H5Dread or H5Dwrite calls with H5FD_MPIO_COLLECTIVE requests
     that may hang because not all processes are transfer the same amount
     of data. (A.K.A. prematured collective return when zero amount data
     requested.) Collective calls that may cause hanging is done via the
     corresponding MPI-IO independent calls.
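
To make the requirement concrete, here is a minimal sketch of how a collective parallel HDF5 read can handle ranks that have nothing to read: every rank still calls `H5Dread` with the collective transfer property, and empty ranks use an empty selection plus a small dummy buffer (to stay clear of the `NULL`-pointer case the workaround below addresses). File and dataset names are assumptions; this is not libSplash's internal code.

```cpp
#include <hdf5.h>
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Open the file with the MPI-IO driver (parallel HDF5 build required).
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fopen("checkpoint.h5", H5F_ACC_RDONLY, fapl);        // assumed name
    hid_t dset = H5Dopen(file, "particles/position_x", H5P_DEFAULT);    // assumed path

    // Per-rank portion of the dataset; some ranks legitimately own zero elements.
    hsize_t count  = (rank % 4 == 0) ? 0 : 1000;  // illustrative split
    hsize_t offset = 0;  // a real code would compute this via an exclusive scan of counts

    hid_t fspace = H5Dget_space(dset);
    hsize_t one = 1;
    hid_t mspace = H5Screate_simple(1, &one, nullptr);

    if (count > 0)
    {
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &offset, nullptr, &count, nullptr);
        H5Sclose(mspace);
        mspace = H5Screate_simple(1, &count, nullptr);
    }
    else
    {
        // Empty ranks still take part in the collective call, just with no selection.
        H5Sselect_none(fspace);
        H5Sselect_none(mspace);
    }

    // Never hand over a NULL buffer, even for zero elements (see the workaround below).
    std::vector<double> buf(count > 0 ? count : 1);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf.data());

    H5Pclose(dxpl);
    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```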

@ax3l ax3l added the label affects latest release (a bug that affects the latest stable release) on Jan 5, 2015
@ax3l (Member Author) commented Jan 6, 2015

@f-schmitt-zih can we write a minimal libSplash test for that, to see if the error comes from the HDF5 implementation?

@f-schmitt (Member)

Yes, I can do that

@ax3l (Member Author) commented Jan 7, 2015

Also, I think @psychocoderHPC made a discovery today by tweaking the arguments for libSplash to read "one fake element" instead of NULL, and he could restart the bunch example just before going home. He might be able to give an update on that tomorrow - so it really looks like it is something in libSplash, and the test would be great.

- This is a workaround to avoid the bug that processes hang if some of them call `DataCollector::read()` with zero elements to read and the destination pointer for the read data set to `NULL`.

libSplash bug issue: ComputationalRadiationPhysics/libSplash#148
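
The workaround boils down to never handing the read call a `NULL` destination pointer: ranks with zero elements allocate a single dummy element instead. A minimal sketch of that pattern follows; the helper name is hypothetical and only mirrors the `tmpArray = new ComponentType[1];` line visible in the diff further down.

```cpp
#include <cstdint>

// Sketch of the "one fake element instead of NULL" workaround; illustrative only,
// not PIConGPU's actual code.
template<typename ComponentType>
ComponentType* allocReadBuffer(uint64_t numElements)
{
    // Even a rank with nothing to read hands the collective read a valid,
    // non-NULL pointer of at least one element, so that all ranks can pass
    // through the call (works around ComputationalRadiationPhysics/libSplash#148).
    return new ComponentType[numElements > 0 ? numElements : 1];
}
```

With libSplash 1.2.4 the underlying issue was fixed and the extra element became unnecessary again, as noted in the follow-up commits below.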
@psychocoderHPC (Member)

@ax3l, @f-schmitt-zih I pushed the workaround to this pull request and opened an issue in libSplash.

workaround for read hangs during restart
@f-schmitt (Member)

Should I still look into the repro case? I won't be able to do that before the weekend, though. However, I will fix the libSplash issue.

@ax3l (Member Author) commented Jan 8, 2015

That's great, thank you!
I think it might still be useful to add the test, to debug and fix the libSplash issue itself.

@ax3l (Member Author) commented Jan 12, 2015

@f-schmitt-zih can you please comment on/merge this pull request if you find the time? :)

@f-schmitt (Member)

The change is fine!

@f-schmitt f-schmitt closed this Jan 12, 2015
@ax3l ax3l deleted the fix-restartEmptyGPUs branch January 12, 2015 16:19
@ax3l ax3l restored the fix-restartEmptyGPUs branch January 14, 2015 16:45
@ax3l (Member Author) commented Jan 14, 2015

lol, "Closed with unmerged commits" @f-schmitt-zih ;)

@ax3l ax3l assigned psychocoderHPC and unassigned f-schmitt Jan 14, 2015
@ax3l ax3l reopened this Jan 14, 2015
@f-schmitt (Member)

I should really work less...

@ax3l (Member Author) commented Jan 14, 2015

work party 🎅

@ax3l (Member Author) commented Jan 14, 2015

@f-schmitt-zih @psychocoderHPC tested :shipit:

f-schmitt pushed a commit that referenced this pull request Jan 15, 2015
Fix #609 Restarts with Full/Empty GPUs
@f-schmitt f-schmitt merged commit 4e85460 into ComputationalRadiationPhysics:dev Jan 15, 2015
@ax3l ax3l deleted the fix-restartEmptyGPUs branch January 15, 2015 10:10
* - `libSplash` issue: https://github.com/ComputationalRadiationPhysics/libSplash/issues/148
* \todo: please remove this workaround after the libSplash bug is fixed
*/
tmpArray = new ComponentType[1];
Member Author

Update: I just ran a test with two species, where the sim did not hang but lost 75% and 95% of the particles during restart, respectively. The work-around seems to be invalid; I'll test the new 1.2.4 libSplash patch.

Member Author

Update: no, the work-around is valid, but we currently have another bug with static load balancing restarts: #637

ax3l added a commit to ax3l/picongpu that referenced this pull request Jan 21, 2015
- the work around was valid but is not needed any more
  (as of libSplash 1.2.4)
- simplify code-base again
ax3l added a commit to ax3l/picongpu that referenced this pull request Jan 22, 2015
- the work around was valid but is not needed any more
  (as of libSplash 1.2.4)
- simplify code-base again
@ax3l ax3l mentioned this pull request Jan 22, 2015