
diagnostics of restart run #20

Closed
phyax opened this issue Oct 22, 2017 · 20 comments


phyax commented Oct 22, 2017

Hi,

I have some large-scale runs and need to restart the simulation every 12 hours. I checked the initial results and found an interesting time chunk that I should look at more closely. So I added more diagnostics (specifically, screen diagnostics) in the restart run that were not present in the initial run. But the output came up with the warning:

[WARNING] Cannot find attribute DiagScreen0

As a result, the screen diagnostics were not written to disk. I would like to check whether adding more diagnostics is allowed in a restart run.

Thanks,
Xin


iltommi commented Oct 22, 2017

In principle it should be possible, since the checkpoint files contain only particles and fields.

@mccoys any idea why we store the Screen diag number as an attribute in the checkpoint?

The warning comes from this block:
https://github.com/SmileiPIC/Smilei/blob/master/src/Checkpoint/Checkpoint.cpp#L415
and the dump writing is here:
https://github.com/SmileiPIC/Smilei/blob/master/src/Checkpoint/Checkpoint.cpp#L211
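For anyone who wants to check what was actually saved in their own checkpoint files, here is a quick sketch (an assumption on my part: it uses h5py, and the file path below is hypothetical, matching the naming seen later in this thread):

```python
# Hypothetical helper: list the attribute names stored at the root of a
# Smilei HDF5 checkpoint, to see whether e.g. "DiagScreen0" was saved.
import h5py

def list_checkpoint_attrs(path):
    """Return the sorted attribute names at the root of an HDF5 file."""
    with h5py.File(path, "r") as f:
        return sorted(f.attrs.keys())

# Example (path is hypothetical):
# print(list_checkpoint_attrs("checkpoints/dump-00000-0000000000.h5"))
```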


iltommi commented Oct 22, 2017

OK, I see. The problem is that the Screen is "incremental": we need to store the particles that have crossed the screen.

In your case, the screen "appears" at the restart timestep, so the diagnostic should be created and its values should restart from 0.


phyax commented Oct 22, 2017

I did not fully understand why extra diagnostics in the restart run cause a problem. Note that there is no problem when the diagnostics in the restart run are the same as those of the initial run.

Here are some extra error messages, besides the warning, from one of the MPI processes:

HDF5-DIAG: Error detected in HDF5 (1.8.18) MPI-process 289:
#000: H5G.c line 467 in H5Gopen2(): unable to open group
major: Symbol table
minor: Can't open object
#1: H5Gint.c line 320 in H5G__open_name(): group not found
major: Symbol table
minor: Object not found
#2: H5Gloc.c line 430 in H5G_loc_find(): can't find object
major: Symbol table
minor: Object not found
#3: H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
major: Symbol table
minor: Object not found
#4: H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
major: Symbol table
minor: Callback failed
#5: H5Gloc.c line 385 in H5G_loc_find_cb(): object 'FieldsForDiag0' doesn't exist
major: Symbol table
minor: Object not found
HDF5-DIAG: Error detected in HDF5 (1.8.18) MPI-process 289:
#000: H5G.c line 812 in H5Gclose(): not a group
major: Invalid arguments to routine
minor: Inappropriate type


mccoys commented Oct 23, 2017

The issue is that the Screen needs to integrate all the data collected since the beginning of the simulation. Every timestep, it increments its data arrays with the new data collected during that timestep. This is why the Screen needs to store its data in the checkpoints. When you restart the simulation, the Screen expects to find its previous data stored in the checkpoint. In your case, as you did not have a Screen in your first simulation, there was no stored Screen data and the restart failed.
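As an illustration (this is not Smilei's actual code, just a minimal sketch of the logic described above), the accumulate-and-restore behavior, with the proposed zero-fallback fix, could look like:

```python
# Illustrative sketch of why a Screen diag needs its accumulated data in
# the checkpoint: the histogram is a running total, so a restart must
# either reload the old total or, as the fix proposes, fall back to zeros
# when the diag did not exist in the first run.

def restore_screen(checkpoint, name, nbins):
    """Reload the Screen's cumulative histogram, or start from zeros
    if the diag was absent when the checkpoint was written."""
    return checkpoint.get(name, [0.0] * nbins)

def accumulate(hist, new_counts):
    """Each timestep, the Screen adds the newly crossing particles."""
    return [h + c for h, c in zip(hist, new_counts)]

# A restart where "DiagScreen0" was absent from the first run:
hist = restore_screen({"DiagFields0": [5.0]}, "DiagScreen0", 3)
hist = accumulate(hist, [1.0, 0.0, 2.0])
# hist is now [1.0, 0.0, 2.0]: counting starts from the restart point
```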

We can still fix this problem by forcing the data to zero when the diag did not exist before. We should be able to provide a bugfix soon.


iltommi commented Oct 23, 2017

Hi @phyax,
upon reading the code around the lines mentioned above, Smilei should run fine[1] when you add a DiagScreen between two checkpoints.

On the other hand, the HDF5 errors you mention later look like the patch structure changed between the two runs.

What changes did you make between the first run and restart?


[1] There might be an issue if you changed the Screen size between restarts. We are investigating this.


phyax commented Oct 23, 2017

Hi @iltommi ,
The changes between the first run and restart are:

  1. turned off the probe diagnostics of the first run.
  2. added screen diagnostics, and probes at the screen location, in order to do field-particle correlation.

All other setup remains the same, so I believe the patch structure should not change between the first run and the restart.

In fact, I first tried to use DiagParticles in the restart run. It did not work. Then I tried DiagScreen; it also did not work. Perhaps I need to have these diagnostics in the first run so that they work in the restart run.


iltommi commented Oct 23, 2017

We made some modifications to the code (but these should not impact DiagScreen much).

Anyway, here is what I tested (it's based on ../benchmarks/tst1d_4_radiation_pressure_acc.py):

Suppose you are at the Smilei root. Create two dirs, run1 and run2:

mkdir run{1,2}

split the benchmark file in two at line 100 (putting the DiagScreens in run2/run2.py):

split -l 100 benchmarks/tst1d_4_radiation_pressure_acc.py
mv xaa run1/run1.py
mv xab run2/run2.py

run the first simulation (it will create checkpoint files at timestep 10000):

cd run1
mpirun -np 4 ../smilei run1.py 'DumpRestart(dump_step = 10000)'

and run the restart (with the DiagScreens)

cd ../run2
mpirun -np 4 ../smilei ../run1/run1.py run2.py 'DumpRestart(restart_dir="../run1")'

And here are the resulting files:

run1
├── Fields0.h5
├── ParticleDiagnostic0.h5
├── ParticleDiagnostic1.h5
├── checkpoints
│   ├── dump-00000-0000000000.h5
│   ├── dump-00000-0000000001.h5
│   ├── dump-00000-0000000002.h5
│   └── dump-00000-0000000003.h5
├── patch_load.txt
├── profil.txt
├── run1.py
├── scalars.txt
└── smilei.py
run2
├── Fields0.h5
├── ParticleDiagnostic0.h5
├── ParticleDiagnostic1.h5
├── Screen0.h5
├── Screen1.h5
├── Screen2.h5
├── Screen3.h5
├── Screen4.h5
├── Screen5.h5
├── Screen6.h5
├── Screen7.h5
├── profil.txt
├── run2.py
├── scalars.txt
└── smilei.py
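To verify that the restart actually wrote usable Screen data, one could inspect the output files with a sketch like this (the file and dataset layout below are assumptions based on the listing above; the real Smilei output layout may differ):

```python
# Hypothetical sanity check on a Screen output file: list every dataset
# and its shape, to confirm the restart run dumped histogram data.
import h5py

def screen_summary(path):
    """Return sorted (dataset_name, shape) pairs for an HDF5 file."""
    with h5py.File(path, "r") as f:
        return sorted((name, f[name].shape) for name in f.keys())

# Example (path is hypothetical):
# print(screen_summary("run2/Screen0.h5"))
```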

Can you confirm this behaviour?


phyax commented Oct 23, 2017

Hi @iltommi ,

I tried the example you gave and got the same results as yours. At this point I am not sure what caused the error with the restart diagnostics in my large-scale runs.


mccoys commented Oct 24, 2017

At what time during the restarted simulation did this problem appear?
Was it during initialization, in the first iteration, or in subsequent iterations?
Is it possible for you to attach part of your stdout?


mccoys commented Oct 24, 2017

Oh, another possibility: do you have time-averaged DiagFields? If yes, then the version of the code here on GitHub may need to be patched before you can properly restart the simulation.


phyax commented Oct 25, 2017

Hi @iltommi @mccoys ,

I did some research on the HDF5 error and found that it is not a restart issue. It is due to the number of bins in velocity space being too large. To reproduce my error, in the DiagScreen part of the input deck ../benchmarks/tst1d_4_radiation_pressure_acc.py, replace "axes = [["ekin", 0., 0.4, 10]]" with

axes = [["vx", -1., 1., 30],
        ["vy", -1., 1., 30],
        ["vz", -1., 1., 30]]

You will find the HDF5 error message without doing a restart. But it seems some screen data are still dumped; I have not checked whether these data are usable.

The reason I need such a fine grid is that I intend to correlate the particle gyro-phase with the wave phase for different perpendicular and parallel velocities. Is there any way to do this at the moment?


mccoys commented Oct 25, 2017

@phyax, I cannot reproduce this error, and I think the problem is elsewhere.

In your original error log, you can see the error object 'FieldsForDiag0' doesn't exist, which is clearly related to the time-averaged fields diagnostics. This has been corrected in an upcoming version that we have not merged yet, but we can correct this problem today in the version you are using.

If you have another problem with the velocity space being too large, please post your error log. We can investigate what is going on, but I would be very surprised if a space of 30x30x30 points were too much. Technically, HDF5 should support at least 1000x1000x1000.


iltommi commented Oct 25, 2017

@phyax d3dcd5b should fix the problem.
Can you check it?


phyax commented Oct 25, 2017

Hi @iltommi @mccoys ,

I updated my version with the new commit. The error is still there. Please see the file run1.py.txt in the attachment. The only changes I made compared to tst1d_4_radiation_pressure_acc.py are:

  1. changed "axes" in the DiagScreen
  2. added "DumpRestart"

run1.py.txt

Here is a screenshot of the error message:

[screenshot: HDF5 error output]

The simulation still completes, though. These error messages go away if I remove the "DumpRestart" block.


MickaelGrechX commented Oct 25, 2017 via email


mccoys commented Oct 25, 2017

@phyax, the error that you obtain is now a different one. It seems the problem is related to storing DiagScreen in the checkpoint files: it is currently stored as an HDF5 attribute, not a proper dataset. It turns out that some versions of HDF5 restrict the size of attributes in a way I do not fully understand yet. I will look into changing the way it works so that the DiagScreen information is stored as a proper dataset.
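A plausible explanation (an assumption, not confirmed in this thread): HDF5 stores attributes in the object header, and with the classic file format each attribute is capped at 64 KiB unless dense attribute storage is enabled. A quick back-of-the-envelope check, assuming the Screen histogram is stored as a single float64 attribute with the 30x30x30 axes from the input deck:

```python
# Size estimate for a Screen histogram stored as one HDF5 attribute
# (assumption: float64 values, classic 64 KiB object-header limit).
bins = 30 * 30 * 30      # vx * vy * vz axes from the input deck
nbytes = bins * 8        # float64 values
limit = 64 * 1024        # 65536-byte classic attribute size limit

# 27000 bins * 8 bytes = 216000 bytes, well over the 64 KiB limit
print(bins, nbytes, nbytes > limit)
```

This would also explain why the original 10-bin "ekin" axis (80 bytes) never triggered the error.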


phyax commented Oct 25, 2017

@mccoys @iltommi , yes. The error just reported is different from the original one. The original problem went away with the last commit.


iltommi commented Oct 26, 2017

@phyax can you confirm the patch works? I tested your case like this:

Two run files (the first without diags, the second with just the diags for the restart):

run1/run1.py :

# ----------------------------------------------------------------------------------------
# 					SIMULATION PARAMETERS FOR THE PIC-CODE SMILEI
# ----------------------------------------------------------------------------------------

import math

l0 = 2.0*math.pi	# laser wavelength
t0 = l0				# optical cycle
Lsim = 10.*l0		# length of the simulation
Tsim = 40.*t0		# duration of the simulation
resx = 500.			# nb of cells in one laser wavelength
rest = resx/0.95	# nb of timesteps in one optical cycle (0.95 * CFL)

# plasma slab
def f(x):
    if l0 < x < 2.0*l0:
        return 1.0
    else :
        return 0.0

Main(
    geometry = "1d3v",
    
    interpolation_order = 2 ,
    
    cell_length = [l0/resx],
    sim_length  = [Lsim],
    
    number_of_patches = [ 8 ],
    
    timestep = t0/rest,
    sim_time = Tsim,
     
    bc_em_type_x = ['silver-muller'],
     
    random_seed = smilei_mpi_rank
)

Species(
	species_type = 'ion',
	initPosition_type = 'regular',
	initMomentum_type = 'cold',
	n_part_per_cell = 10,
	mass = 1836.0,
	charge = 1.0,
	nb_density = trapezoidal(10.,xvacuum=l0,xplateau=l0),
	temperature = [0.],
	bc_part_type_xmin = 'refl',
	bc_part_type_xmax = 'refl'
)
Species(
	species_type = 'eon',
	initPosition_type = 'regular',
	initMomentum_type = 'cold',
	n_part_per_cell = 10,
	mass = 1.0,
	charge = -1.0,
	nb_density = trapezoidal(10.,xvacuum=l0,xplateau=l0),
	temperature = [0.],
	bc_part_type_xmin = 'refl',
	bc_part_type_xmax = 'refl'
)

LaserPlanar1D(
	boxSide = 'xmin',
	a0 = 10.,
    omega = 1.,
    ellipticity = 1.,
    time_envelope = tconstant(),
)


every = int(rest/2.)

DumpRestart(
    restart_dir = None,
    dump_step = 10000,
    dump_minutes = 0., # dump before maximum wall-clock time
    dump_deflate = 0,
    exit_after_dump = True,
    dump_file_sequence = 2,
)

run the sim:
cd run1 && mpirun -n 2 ../smilei run1.py

and run2/run2.py :

DiagFields(
    every = every,
    fields = ['Ex','Ey','Ez','Rho_ion','Rho_eon']
)

DiagScalar(every=every)

DiagParticles(
	output = "density",
	every = every,
	species = ["ion"],
	axes = [
		["x",  0.,   Lsim, 200],
		["px", -10., 1000., 200]
	]
)

DiagParticles(
	output = "density",
	every = every,
	species = ["ion"],
	axes = [
		["ekin", 0., 200., 200, "edge_inclusive"]
	]
)


for direction in ["forward", "backward", "both", "canceling"]:
	DiagScreen(
	    shape = "sphere",
	    point = [0.],
	    vector = [Lsim/3.],
	    direction = direction,
	    output = "density",
	    species = ["eon"],
	    axes = [["ekin", 0., 0.4, 30],
	            ["vx", -1., 1., 30],
	            ["vy", -1., 1., 30]],
	    every = 3000
	)
	DiagScreen(
	    shape = "plane",
	    point = [Lsim/3.],
	    vector = [1.],
	    direction = direction,
	    output = "density",
	    species = ["eon"],
	    axes = [["ekin", 0., 0.4, 30],
	            ["vx", -1., 1., 30],
	            ["vy", -1., 1., 30]],
	    every = 3000
	)

DumpRestart.restart_dir="../run1"

run the restart

cd ../run2 && mpirun -n 2 ../smilei ../run1/run1.py run2.py


phyax commented Oct 26, 2017

@iltommi @mccoys I confirm that the last patch fixes the conflicts between data dump and DiagScreen. Thank you!

I have not looked at the output of the screen diag yet. But does the histogram produced by the screen diag count particles over a time period defined by 'every', over one timestep of the simulation, or from the beginning of the simulation? I read through the doc but did not find the answer. I understand that the output frequency of the screen diag is 'every', but I am not sure over which time period the screen diag accumulates particle data.


mccoys commented Oct 27, 2017

@phyax, the data of a Screen is accumulated since the beginning of the simulation, or, in your case, since the point when you introduced the diagnostic. We will make this clearer in the doc. Thanks for this report.
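Since the stored data is cumulative, the counts for a single output interval can be recovered in post-processing by differencing consecutive dumps. A minimal sketch, with plain lists standing in for the stored arrays:

```python
# Recover per-interval histograms from a Screen's cumulative dumps by
# subtracting each dump from the next one (the first dump is already
# the counts accumulated up to the first output).

def per_interval(cumulative_dumps):
    """Turn a list of cumulative histograms into per-interval ones."""
    out = [cumulative_dumps[0]]
    for prev, cur in zip(cumulative_dumps, cumulative_dumps[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out

dumps = [[1.0, 0.0], [3.0, 2.0], [3.0, 5.0]]
print(per_interval(dumps))  # [[1.0, 0.0], [2.0, 2.0], [0.0, 3.0]]
```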
