Error: Local QP operation on mlx5_2:1/IB #600

Closed
chiehjen opened this issue Feb 10, 2023 · 3 comments

I tried to run a fairly large 2D simulation on the g100 supercomputer at Cineca. However, the simulation stopped with this error:

[r500n001:59444:0:59444] ib_mlx5_log.c:143 Local QP operation on
mlx5_2:1/IB (synd 0x2 vend 0x68 hw_synd 0/31)
[r500n001:59444:0:59444] ib_mlx5_log.c:143 RC QP 0x34eca wqe[184]:
RDMA_READ s-- [rva 0x7f200c45b92c rkey 0x669cc6] [va 0x7f200c0e9000
len 907924 lkey 0x150937]
==== backtrace (tid: 59444) ====

If I run the same case (the same code and input, with the same MPI and OpenMP layout, here 25 MPI processes × 48 OpenMP threads) but lower the resolution by a factor of 16 (4 in each direction), it runs without any problem. However, this is unlikely to be a memory problem, because I am sure the nodes have far more memory than required. The disk also has far more free space (20 TB) than the estimated output (~200 GB).

Please help me. Thank you.

The job record is:

Cluster: g100
State: FAILED (exit code 143)
Nodes: 25
Cores per node: 48
CPU Utilized: 1-02:53:08
CPU Efficiency: 2.06% of 54-08:20:00 core-walltime
Job Wall-clock time: 01:05:13
Memory Utilized: 406.25 GB (estimated maximum)
Memory Efficiency: 0.55% of 72.33 TB (61.72 GB/core)
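
As a sanity check on the job record, the reported efficiencies can be reproduced from the other figures in it. The sketch below is plain arithmetic on the values printed above. The helper name `to_seconds` is hypothetical, and it assumes the Slurm-style `[D-]HH:MM:SS` duration format and that "TB" in the record means 1024 GB.

```python
def to_seconds(stamp: str) -> int:
    """Convert a [D-]HH:MM:SS duration string to seconds."""
    days, _, clock = stamp.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return (int(days) if days else 0) * 86400 + hours * 3600 + minutes * 60 + seconds

# Values copied from the job record
nodes, cores_per_node = 25, 48
wall = to_seconds("01:05:13")        # Job Wall-clock time
cpu_used = to_seconds("1-02:53:08")  # CPU Utilized

# Core-walltime is the wall-clock time multiplied by the allocated cores
core_walltime = nodes * cores_per_node * wall
cpu_efficiency = 100.0 * cpu_used / core_walltime

# Memory: 72.33 TB allocated in total (assumed 1 TB = 1024 GB)
mem_total_gb = 72.33 * 1024
mem_per_core_gb = mem_total_gb / (nodes * cores_per_node)
mem_efficiency = 100.0 * 406.25 / mem_total_gb

print(f"core-walltime: {core_walltime} s")
print(f"CPU efficiency: {cpu_efficiency:.2f} %")
print(f"memory per core: {mem_per_core_gb:.2f} GB")
print(f"memory efficiency: {mem_efficiency:.2f} %")
```

The numbers agree with the record: the cores were busy for only about 2% of the allocated core time before the job failed.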

The jobscript:
#SBATCH --nodes=25
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --time=24:00:00
#SBATCH --error=test.err # standard error file
#SBATCH --output=test.out # standard output file
#SBATCH --account=icei_Tomass
module load hdf5/1.10.7--intel-oneapi-mpi--2021.4.0--intel--2021.4.0
module load spack/0.17.1-120
spack load py-matplotlib@3.4.3

export OMP_NUM_THREADS=48
export KMP_AFFINITY=compact # or OMP_PLACES=cores
srun --cpu-bind=cores -m block:block ./smilei plasmonics.py > out_1bmem

===============================================
The Smilei output is:
[Smilei ASCII-art banner] Version : 4.7-248-ge563595d9-master

Reading the simulation parameters

HDF5 version 1.10.7
Python version 3.8.12
Parsing pyinit.py
Parsing 4.7-248-ge563595d9-master
Parsing pyprofiles.py
Parsing plasmonics.py
Parsing pycontrol.py
Check for function preprocess()
python preprocess function does not exist
Calling python _smilei_check
Calling python _prepare_checkpoint_dir
Calling python _keep_python_running() :

WARNING src/Params/Params.cpp:656 (Params) Patches distribution: hilbertian

WARNING src/Params/Params.cpp:1082 (compute) simulation_time has been redefined from 634.601716 to 634.598172 to match timestep.

WARNING src/Params/Params.cpp:1187 (compute) Particles cluster width cluster_width set to : 1

Geometry: 2Dcartesian

 Interpolation order : 2
 Maxwell solver : Yee
 simulation duration = 634.598172,   total number of iterations = 98889
 timestep = 0.006417 = 0.947523 x CFL,   time resolution = 155.829317
 Grid length: 502.655, 351.858
 Cell length: 0.00957803, 0.00957803, 0
 Number of cells: 52480, 36736
 Spatial resolution: 104.406, 104.406

Electromagnetic boundary conditions

 xmin silver-muller, absorbing vector [1, 0]
 xmax silver-muller, absorbing vector [-1, -0]
 ymin silver-muller, absorbing vector [0, 1]
 ymax silver-muller, absorbing vector [-0, -1]

Vectorization:

 Mode: off

Initializing MPI

 MPI_THREAD_MULTIPLE enabled
 Number of MPI processes: 25
 Number of threads per MPI process : 48

 Number of patches: 64 x 64
 Number of cells in one patch: 820 x 574
 Dynamic load balancing: never

Initializing the restart environment

 Code will stop after 1200.000000 minutes
 Code will exit after dump

Initializing species

 Creating Species #0: aluminium
	 > Pusher: boris
	 > Boundary conditions: remove remove remove remove
	 > Density profile: 2D user-defined function
 
 Creating Species #1: aluminium_pp
	 > Pusher: boris
	 > Boundary conditions: remove remove remove remove
	 > Density profile: 2D user-defined function
 
 Creating Species #2: electron_target
	 > Pusher: boris
	 > Boundary conditions: remove remove remove remove
	 > Density profile: 2D user-defined function
 
 Creating Species #3: electron_pp
	 > Pusher: boris
	 > Boundary conditions: remove remove remove remove
	 > Density profile: 2D user-defined function
 
 Creating Species #4: carbon
	 > Pusher: boris
	 > Boundary conditions: remove remove remove remove
	 > Density profile: 2D user-defined function
 
 Creating Species #5: proton
	 > Pusher: boris
	 > Boundary conditions: remove remove remove remove
	 > Density profile: 2D user-defined function
 
 Creating Species #6: electron_l
	 > Pusher: boris
	 > Boundary conditions: remove remove remove remove
	 > Density profile: 2D user-defined function

Initializing laser parameters

	Laser #0: separable profile
		omega              : 1
		chirp_profile      : 1D built-in profile `tconstant` (0.000000)
		time envelope      : 1D built-in profile `tgaussian` (start: 0.000000, duration: 634.601716, sigma: 2388.135713, center: 82.855191, order: 2.000000)
		space envelope (y) : 1D user-defined function
		space envelope (z) : 1D user-defined function
		phase (y)          : 1D user-defined function
		phase (z)          : 1D user-defined function
		delay phase (y)    : 0
		delay phase (z)    : 0

Initializing Patches

 First patch created
	 Approximately 10% of patches created
	 Approximately 20% of patches created
	 Approximately 30% of patches created
	 Approximately 40% of patches created
	 Approximately 50% of patches created
	 Approximately 60% of patches created
	 Approximately 70% of patches created
	 Approximately 80% of patches created
	 Approximately 90% of patches created
 All patches created

Creating Diagnostics, antennas, and external fields

 Created ParticleBinning #0: species aluminium
	 Axis x from 0 to 502.655 in 2000 steps
	 Axis y from 0 to 351.858 in 2000 steps
 Created ParticleBinning #1: species aluminium
	 Axis x from 0 to 502.655 in 500 steps
 Created ParticleBinning #2: species aluminium
	 Axis y from 38.3121 to 351.858 in 500 steps
 Created ParticleBinning #3: species aluminium_pp
	 Axis x from 0 to 502.655 in 500 steps
 Created ParticleBinning #4: species aluminium_pp
	 Axis y from 38.3121 to 351.858 in 500 steps
 Created ParticleBinning #5: species aluminium,aluminium_pp
	 Axis x from 0 to 502.655 in 500 steps
 Created ParticleBinning #6: species aluminium,aluminium_pp
	 Axis y from 38.3121 to 351.858 in 500 steps
 Created ParticleBinning #7: species carbon
	 Axis ekin from 0.2 to auto in 400 steps
 Created ParticleBinning #8: species proton
	 Axis ekin from 10 to auto in 400 steps [EDGE INCLUSIVE]
 Created ParticleBinning #9: species electron_target
	 Axis ekin from 10 to auto in 300 steps [EDGE INCLUSIVE]
 Created ParticleBinning #10: species electron_l
	 Axis ekin from 10 to auto in 300 steps [EDGE INCLUSIVE]
 Diagnostic Fields #0  :
	 Ex Ey 
 Probe diagnostic #0
	 500x500 points (total = 250000)
	 origin : 0, 0
	 corner 0 : 502.655, 0
	 corner 1 : 0, 351.858
 Created TrackParticles #0: species carbon
	 id,x,y,px,py,Ex,Ey,w
 Created TrackParticles #1: species proton
	 id,x,y,px,py,Ex,Ey,w
 Created TrackParticles #2: species electron_l
	 id,x,y,px,py,Ex,Ey,w
 Created TrackParticles #3: species electron_pp
	 id,x,y,px,py,Ex,Ey,w
 Created TrackParticles #4: species electron_target
	 id,x,y,px,py,Ex,Ey,w
 Created TrackParticles #5: species aluminium
	 id,x,y,px,py,Ex,Ey,q
 Created TrackParticles #6: species aluminium_pp
	 id,x,y,px,py,Ex,Ey,q

finalize MPI

 Done creating diagnostics, antennas, and external fields

Minimum memory consumption (does not include all temporary buffers)

          Particles: Master 0 MB;   Max 6924 MB;   Global 12.4 GB
             Fields: Master 7769 MB;   Max 7769 MB;   Global 190 GB
        scalars.txt: Master 0 MB;   Max 0 MB;   Global 0 GB
ParticleBinning0.h5: Master 30 MB;   Max 30 MB;   Global 0.745 GB
ParticleBinning1.h5: Master 0 MB;   Max 0 MB;   Global 9.31e-05 GB
ParticleBinning2.h5: Master 0 MB;   Max 0 MB;   Global 9.31e-05 GB
ParticleBinning3.h5: Master 0 MB;   Max 0 MB;   Global 9.31e-05 GB
ParticleBinning4.h5: Master 0 MB;   Max 0 MB;   Global 9.31e-05 GB
ParticleBinning5.h5: Master 0 MB;   Max 0 MB;   Global 9.31e-05 GB
ParticleBinning6.h5: Master 0 MB;   Max 0 MB;   Global 9.31e-05 GB
ParticleBinning7.h5: Master 0 MB;   Max 0 MB;   Global 7.45e-05 GB
ParticleBinning8.h5: Master 0 MB;   Max 0 MB;   Global 7.45e-05 GB
ParticleBinning9.h5: Master 0 MB;   Max 0 MB;   Global 5.59e-05 GB
ParticleBinning10.h5: Master 0 MB;   Max 0 MB;   Global 5.59e-05 GB
Fields0.h5: Master 0 MB;   Max 0 MB;   Global 0 GB
Probes0.h5: Master 0 MB;   Max 0 MB;   Global 0.021 GB
TrackParticlesDisordered_carbon.h5: Master 0 MB;   Max 0 MB;   Global 0 GB
TrackParticlesDisordered_proton.h5: Master 0 MB;   Max 0 MB;   Global 0 GB
TrackParticlesDisordered_electron_l.h5: Master 0 MB;   Max 0 MB;   Global 0 GB
TrackParticlesDisordered_electron_pp.h5: Master 0 MB;   Max 0 MB;   Global 0 GB
TrackParticlesDisordered_electron_target.h5: Master 0 MB;   Max 0 MB;   Global 0 GB
TrackParticlesDisordered_aluminium.h5: Master 0 MB;   Max 0 MB;   Global 0 GB
TrackParticlesDisordered_aluminium_pp.h5: Master 0 MB;   Max 0 MB;   Global 0 GB

Initial fields setup

 Solving Poisson at time t = 0

Initializing E field through Poisson solver

 Poisson solver converged at iteration: 0, relative err is ctrl = 0.000000 x 1e-14
 Poisson equation solved. Maximum err = 0.000000 at i= -1

Time in Poisson : 3.184701
Applying external fields at time t = 0
Applying prescribed fields at time t = 0
Applying antennas at time t = 0

Open files & initialize diagnostics

Running diags at time t = 0

---> The simulation crashed here.
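
For scale, the size of one full-grid field dataset can be estimated from the "Number of cells" line in the log above. This is a rough sketch, not a diagnosis: it assumes 8 bytes per value (double precision), an even split over the 25 MPI processes, and no ghost cells or file metadata. The variable names are illustrative only.

```python
# Values copied from the Smilei log
cells_x, cells_y = 52480, 36736  # "Number of cells"
n_ranks = 25                     # "Number of MPI processes"

# Assumption: each field value is stored as an 8-byte double
bytes_per_value = 8

cells_total = cells_x * cells_y
field_bytes = cells_total * bytes_per_value
per_rank_bytes = field_bytes / n_ranks

# The run that worked used 4x fewer cells in each direction (16x fewer overall)
low_res_field_bytes = (cells_x // 4) * (cells_y // 4) * bytes_per_value

print(f"cells in the grid:      {cells_total:,}")
print(f"one field, full grid:   {field_bytes / 1e9:.2f} GB")
print(f"one field, per process: {per_rank_bytes / 1e6:.0f} MB")
print(f"one field, low-res run: {low_res_field_bytes / 1e9:.2f} GB")
```

Under these assumptions each field written by `DiagFields` (`Ex` and `Ey` here) is about 15 GB at full resolution, against about 1 GB in the low-resolution run that completed.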

chiehjen added the bug label Feb 10, 2023
mccoys (Contributor) commented Feb 10, 2023

Do you have more backtrace? Is it possible for you to share your input file?

chiehjen (Author) commented

Yes. My input file:

import numpy as np
import math
l0 = 2. * math.pi # laser wavelength [in code units]
t0 = l0 # optical cycle
Lsim = [80.*l0, 56.*l0] # length of the simulation
#Tsim = 140.*t0 # duration of the simulation
Tsim = 101.*t0
resx = 656.
resy = 656. # nb of cells in one laser wavelength #initially 41.
rest = resx/0.67 # nb of timesteps in one optical cycle

nppc = 36

l_wl = 0.82
I0 = 5.e21/1.e18

nm = l0/820. # 1nm is l0/820
fs = t0/2.73

def nc(wl):
    return 1.1e21*wl**-2

def crit_den(part_den,wl):
    return part_den/nc(wl)

def a0(l,I0):
    return(0.85*math.sqrt(I0)*l)

def convert_temp(temp):
    c=3e8
    mc2 = 9.1e-31*c**2
    eV = 1.602e-19

    return temp/(mc2/eV)

param = 5.6/5.6
#real aluminium and carbon densities
al_ = 5.98e22
c_ = 4.28e22

al_target = crit_den(al_,l_wl)/param

#densities
c_d = 3.94/param
p_den = 7.89/param

#these are the parameters for the target with the "lower" angle, i.e. "more rectangular"
target_front = 20000.*nm
target_length = 5000.*nm
narrow_edge_y = 22750.*nm #lower right corner of target
ne_w = 500.*nm #width of the rectangle forming the "backbone" of target
ne_w2 = 0.*nm #width of th triangle edge

high = narrow_edge_y + ne_w + ne_w2
low = narrow_edge_y - ne_w2

target_rear = target_front+target_length
pl = 50.*nm

sl = 80. #pre-plasma profile scale length (in nm)
pp_height = 200. #lateral pre-plasma height (in nm)

al_target1 = np.sqrt(al_target)

def x_den(x):

    if x >=  target_front-500.*nm and x <= target_front:
        dens = al_target1*np.exp((x-target_front)/(200.*nm))
    elif x >=  target_front and x <= target_front+target_length:
        dens = al_target1
    else:
        dens = 0.
    return dens

def y_den(y):

    if y >= high and y <= high+pp_height*nm:
        dens = al_target1*np.exp((-y+high)/(sl*nm))
    elif y <= low and y >= low-pp_height*nm:
        dens = al_target1*np.exp((y-low)/(sl*nm))
    elif y >= low and y <= high:
        dens = al_target1
    else:
        dens = 0.

    return dens

def aluminium(x,y):
    if y>=low and y <= high and x>=target_front and x <= target_front+target_length:
        dens = al_target
    else:
        dens = 0.

    return dens

def electron_target(x,y):
    if y>=low and y <= high and x>=target_front and x <= target_front+target_length:
        dens = al_target
    else:
        dens = 0.

    return 13.*dens

def aluminium_pp(x,y):

    if y>=low and y <= high and x>=target_front and x <= target_front+target_length:
        dens = 0.
    else:
        dens = np.tensordot(x_den(x),y_den(y),axes=0)

    return dens

def electrons_pp(x,y):

    if y>=low and y <= high and x>=target_front and x <= target_front+target_length:
        dens = 0.
    else:
        dens = np.tensordot(x_den(x),y_den(y),axes=0)

    return 13.*dens

def c_(x,y):

    if (y >= low and y <= high):
        if (x >= target_rear and x <= target_rear+pl):                     #no more +5nm here since we dont use h5 profiles for target
            dens = c_d
        else:
            dens = 0.
    else:
        dens = 0.

    return dens

def p_(x,y):

    if (y >= low and y <= high):
        if (x >= target_rear and x <= target_rear+pl):
            dens = p_den
        else:
            dens = 0.
    else:
        dens = 0.

    return dens

def e_(x,y):

    if (y >= low and y <= high):
        if (x >= target_rear and x <= target_rear+pl):
            dens = 3.*c_d+p_den
        else:
            dens = 0.
    else:
        dens = 0.

    return dens

Main(
geometry = "2Dcartesian",
interpolation_order = 2,
cell_length = [l0/resx,l0/resy],
grid_length = Lsim,
number_of_patches = [ 64, 64 ],
#patch_arrangement = "linearized_XY",

timestep = t0/rest,
simulation_time = Tsim,

EM_boundary_conditions = [
    ['silver-muller'],
],
reference_angular_frequency_SI=2*math.pi*3.e8/(0.82*1.e-6)

)

bc = [["remove"]] #boundary conditions for all species

Checkpoints(
# restart_dir = "dump1",
#dump_step = 10000,
dump_minutes = 1200,
exit_after_dump = True,
keep_n_dumps = 2,
)
Species(
name = 'aluminium',
#ionization_model = 'tunnel',
#ionization_electrons = 'electron_al',
atomic_number = 13,
position_initialization = 'regular',
momentum_initialization = 'cold',
particles_per_cell = nppc,
mass = 27.*1836.0,
charge = 13.,
temperature = [convert_temp(10.)],
#number_density = trapezoidal(5.,xvacuum=10000.*nm,yplateau=1000.*nm,yvacuum=10000.*nm,yslope1=8000.*nm,
# yslope2=8000.*nm),
#number_density = "test_profile.h5/some_group/the_profile",
#number_density = path+"/aluminium/the_profile",
number_density = aluminium,
boundary_conditions = bc,
#boundary_conditions = [['reflective']]
#time_frozen=Tsim,
)

Species(
name = 'aluminium_pp',
#ionization_model = 'tunnel',
#ionization_electrons = 'electron_al',
atomic_number = 13,
position_initialization = 'regular',
momentum_initialization = 'cold',
particles_per_cell = nppc,
mass = 27.*1836.0,
charge = 13.,
temperature = [convert_temp(10.)],
#number_density = trapezoidal(5.,xvacuum=10000.*nm,yplateau=1000.*nm,yvacuum=10000.*nm,yslope1=8000.*nm,
# yslope2=8000.*nm),
#number_density = "test_profile.h5/some_group/the_profile",
#number_density = path+"/aluminium/the_profile",
#number_density = lambda x, y: np.tensordot(x_den(x),y_den(y),axes=0),
number_density = aluminium_pp,
boundary_conditions = bc,
#boundary_conditions = [['reflective']]
#time_frozen=Tsim,
)

Species(
name = 'electron_target',
#ionization_model = 'tunnel',
#ionization_electrons = 'electron_al',
atomic_number = 13,
position_initialization = 'regular',
momentum_initialization = 'cold',
particles_per_cell = nppc,
mass = 1.,
charge = -1.,
temperature = [convert_temp(10.)],
#number_density = trapezoidal(5.,xvacuum=10000.*nm,yplateau=1000.*nm,yvacuum=10000.*nm,yslope1=8000.*nm,
# yslope2=8000.*nm),
#number_density = "test_profile.h5/some_group/the_profile",
#number_density = path+"/aluminium/the_profile",
number_density = electron_target,
boundary_conditions = bc,
#boundary_conditions = [['reflective']]
#time_frozen=Tsim,
)

Species(
name = 'electron_pp',
#ionization_model = 'tunnel',
#ionization_electrons = 'electron_al',
atomic_number = 13,
position_initialization = 'regular',
momentum_initialization = 'cold',
particles_per_cell = nppc,
mass = 1.,
charge = -1.,
temperature = [convert_temp(10.)],
#number_density = trapezoidal(5.,xvacuum=10000.*nm,yplateau=1000.*nm,yvacuum=10000.*nm,yslope1=8000.*nm,
# yslope2=8000.*nm),
#number_density = "test_profile.h5/some_group/the_profile",
#number_density = path+"/aluminium/the_profile",
#number_density = lambda x, y: 13.*np.tensordot(x_den(x),y_den(y),axes=0),
number_density = electrons_pp,
boundary_conditions = bc,
#boundary_conditions = [['reflective']]
#time_frozen=Tsim,
)

Species(
name = 'carbon',
ionization_model = 'tunnel',
ionization_electrons = 'electron_l',
atomic_number = 6,
position_initialization = 'regular',
momentum_initialization = 'cold',
particles_per_cell = nppc,
mass = 12.*1836.0,
charge = 3.0,
temperature = [convert_temp(10.)],
number_density = c_,
boundary_conditions = bc
#boundary_conditions = [['reflective']]
)

Species(
name = 'proton',
#ionization_model = 'tunnel',
#ionization_electrons = 'electron',
atomic_number = 1,
position_initialization = 'regular',
momentum_initialization = 'cold',
particles_per_cell = nppc,
mass = 1836.0,
charge = 1.0,
temperature = [convert_temp(10.)],
number_density = p_,
boundary_conditions = bc,
#boundary_conditions = [['reflective']]
#time_frozen=Tsim,
)

Species(
name = 'electron_l',
position_initialization = 'regular',
momentum_initialization = 'cold',
particles_per_cell = nppc,
mass = 1.0,
charge = -1.0,
temperature = [convert_temp(100.)],
number_density = e_,
boundary_conditions = bc,
#boundary_conditions = [['reflective']],
#time_frozen = 35.*fs,
)

#waistinI = 2200.*nm #waist of Gaussian laser pulse (in intensity)

FWHMtinI = 25.*fs #FWHM of intensity time profile, from the EPOCH input (w is 15 fs, see below conversion formula or documentation)
FWHMtinE = FWHMtinI*math.sqrt(2.0) #conversion to FWHM of field time profile

#waistinE = (waistinI*(2.0*math.sqrt(math.log(2.0))))/(math.sqrt(2.0*math.log(2.0))) #conversion to waist of laser pulse in E field

fwhm_spot = 3600.*nm #w from EPOCH is 2160 nm, so fwhm is 2*w*sqrt(log(2)) (see documentation)

waistinE = fwhm_spot/np.sqrt(2.*np.log(2.))

LaserGaussian2D(
a0 = a0(l_wl,I0),
omega = 1.,
focus = [20000.*nm, 23000.*nm], # coordinates of laser focus
waist = waistinE,
incidence_angle = 0., # 22.5 degrees in radians
#time_envelope = tgaussian(fwhm=15*fs*2*np.sqrt(np.log(2)),center=33*fs),
time_envelope = tgaussian(fwhm=FWHMtinE,center=36.*fs)
#time_envelope = tgaussian(fwhm=FWHMtinE)
#space_envelope = [gaussian(xcenter=-8280.*nm,xfwhm=21800.*nm*np.log(2)),0.]
)

def target(particles):
    return (particles.x > 32000.*nm)

#################################################################################################################

DiagFields(
every = 9000,
#fields = ['Ex','Ey','Rho_electron','Rho_carbon','Rho_hydrogen','Rho_proton','Rho_carbon3','Jx','Jy']

fields = ['Ex','Ey']

)

#DiagFields(
#every = 7000,
#fields = ['Ex','Ey','Rho_electron','Rho_carbon','Rho_hydrogen','Rho_proton','Rho_carbon3','Jx','Jy']
#fields = ['Rho_aluminium','Rho_aluminium_pp','Rho_electron_pp','Rho_electron_l','Rho_electron_target','Rho_proton','Rho_carbon'])

DiagScalar(
every = 8000
)

track_ts = 9000

DiagTrackParticles(
name="track_carbon",
species = "carbon",
every = track_ts,
attributes = ["x", "y","px","py","Ex","Ey","w"]
)

DiagTrackParticles(
name="track_proton",
species = "proton",
every = track_ts,
attributes = ["x", "y","px","py","Ex","Ey","w"]
)

DiagTrackParticles(
name="track_electron_layer",
species = "electron_l",
every = track_ts,
attributes = ["x", "y","px","py","Ex","Ey","w"]
)

DiagTrackParticles(
name="track_electron_pp",
species = "electron_pp",
every = track_ts,
attributes = ["x", "y","px","py","Ex","Ey","w"]
)

DiagTrackParticles(
name="track_electron_target",
species = "electron_target",
every = 6000,
attributes = ["x", "y","px","py","Ex","Ey","w"]
)

DiagTrackParticles(
name="track_al",
species = "aluminium",
every = 6000,
attributes = ["x", "y","px","py","Ex","Ey","q"]
)

DiagTrackParticles(
name="track_al_pp",
species = "aluminium_pp",
every = track_ts,
attributes = ["x", "y","px","py","Ex","Ey","q"]
)

track_bn = 9000

DiagParticleBinning(
name="aluminium_den",
deposited_quantity = "weight",
every = track_bn,
time_average = 1,
species = ["aluminium"],
axes = [ ["x", 0., Lsim[0], 2000],
["y", 0., Lsim[1], 2000] ]
)

DiagParticleBinning( #lineout in x of pp electrons
name = "pp_lineout_x_al",
deposited_quantity = "weight",
every = track_bn,
time_average = 1,
species = ["aluminium"],
axes = [
["x", 0., Lsim[0], 500],]
)

DiagParticleBinning( #lineout in y of pp_u electrons
name = "pp_lineout_y_al",
deposited_quantity = "weight",
every = track_bn,
time_average = 1,
species = ["aluminium"],
axes = [
["y", 5000.*nm, Lsim[1], 500],
]
)

DiagParticleBinning( #lineout in x of pp electrons

name = "pp_lineout_x_pp",
deposited_quantity = "weight",
every = track_bn,
time_average = 1,
species = ["aluminium_pp"],
axes = [
    ["x", 0., Lsim[0], 500],]

)

DiagParticleBinning( #lineout in y of pp_u electrons

name = "pp_lineout_y_pp",
deposited_quantity = "weight",
every = track_bn,
time_average = 1,
species = ["aluminium_pp"],
axes = [
    ["y", 5000.*nm, Lsim[1], 500],
]

)

DiagParticleBinning( #lineout in x of pp electrons

name = "pp_lineout_x",
deposited_quantity = "weight",
every = track_bn,
time_average = 1,
species = ["aluminium","aluminium_pp"],
axes = [
    ["x", 0., Lsim[0], 500],]

)

DiagParticleBinning( #lineout in y of pp_u electrons

name = "pp_lineout_y",
deposited_quantity = "weight",
every = track_bn,
time_average = 1,
species = ["aluminium","aluminium_pp"],
axes = [
    ["y", 5000.*nm, Lsim[1], 500],
]

)

DiagParticleBinning(
name="carbon_ekin",
deposited_quantity = "weight",
every = track_bn,
time_average = 1,
species = ["carbon"],
axes = [
["ekin", 0.2, "auto", 400],
]
)

DiagParticleBinning(
name="proton_ekin",
deposited_quantity = "weight",
every = track_bn,
time_average = 1,
species = ["proton"],
axes = [
["ekin", 10., "auto", 400,"edge_inclusive"],
]
)

DiagParticleBinning(
name="electron_ekin",
deposited_quantity = "weight",
every = track_bn,
time_average = 1,
species = ["electron_target"],
axes = [
["ekin", 10., "auto", 300,"edge_inclusive"],
]
)

DiagParticleBinning(
name="electron_layer_ekin",
deposited_quantity = "weight",
every = track_bn,
time_average = 1,
species = ["electron_l"],
axes = [
["ekin", 10., "auto", 300,"edge_inclusive"],
]
)

DiagProbe(
name = "fields",
every = track_bn,
origin = [0.,0.],
corners = [
[Lsim[0],0.],
[0.,Lsim[1]],
],
number = [500, 500],
fields = ["Ex","Ey","Bz","Bx"],
)
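
The grid figures that Smilei prints in its log follow directly from the first few parameters of this input file. The sketch below recomputes them outside Smilei as a cross-check. It copies `l0`, `resx`, `resy`, `rest`, `Lsim` and the patch counts from the namelist. The names `n_cells`, `cells_per_patch` and `cfl_fraction` are introduced here for illustration, and the CFL limit is taken as `dx/sqrt(2)`, the standard 2D Yee limit for equal cell sizes in units where c = 1.

```python
import math

# Parameters copied from the input file
l0 = 2.0 * math.pi             # laser wavelength in code units
t0 = l0                        # optical cycle
resx = resy = 656.0            # cells per laser wavelength
rest = resx / 0.67             # timesteps per optical cycle
Lsim = [80.0 * l0, 56.0 * l0]  # simulation box size
number_of_patches = [64, 64]

# Derived quantities, as passed to Main()
cell_length = [l0 / resx, l0 / resy]
timestep = t0 / rest

# Number of cells per direction and per patch
n_cells = [round(L / d) for L, d in zip(Lsim, cell_length)]
cells_per_patch = [n // p for n, p in zip(n_cells, number_of_patches)]

# Timestep as a fraction of the 2D Yee CFL limit (dx == dy here)
cfl_fraction = timestep / (cell_length[0] / math.sqrt(2.0))

print("cell length:    ", cell_length)
print("timestep:       ", timestep)
print("number of cells:", n_cells)
print("cells per patch:", cells_per_patch)
print("fraction of CFL:", cfl_fraction)
```

These match the log: 52480 x 36736 cells, 820 x 574 cells per patch, a timestep of about 0.006417 and about 0.9475 of the CFL limit. Dividing `resx` and `resy` by 4 gives the 16 times smaller grid of the run that completed.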

The full backtrace:

==== backtrace (tid: 59444) ====
0 0x0000000000055969 ucs_debug_print_backtrace() ???:0
1 0x00000000000200a9 uct_ib_mlx5_completion_with_err() ???:0
2 0x0000000000030669 uct_rc_mlx5_iface_progress() ???:0
3 0x0000000000030402 uct_rc_mlx5_iface_progress() ???:0
4 0x0000000000024c1a ucp_worker_progress() ???:0
5 0x000000000000a6d1 mlx_ep_progress() mlx_ep.c:0
6 0x00000000000227fd ofi_cq_progress() osd.c:0
7 0x0000000000022787 ofi_cq_readfrom() osd.c:0
8 0x000000000060e9d7 fi_cq_read() /p/pdsd/scratch/Uploads/IMPI/other/software/libfabric/linux/v1.9.0/include/rdma/fi_eq.h:385
9 0x00000000001de002 MPIDI_Progress_test() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:145
10 0x00000000001de002 MPID_Progress_test() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:218
11 0x00000000001de002 MPID_Progress_wait() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:279
12 0x000000000079c74a MPIR_Waitall_impl() /build/impi/_buildspace/release/../../src/mpi/request/waitall.c:46
13 0x000000000079c74a MPID_Waitall() /build/impi/_buildspace/release/../../src/mpid/ch4/include/mpidpost.h:186
14 0x000000000079c74a MPIR_Waitall() /build/impi/_buildspace/release/../../src/mpi/request/waitall.c:173
15 0x000000000079c74a PMPI_Waitall() /build/impi/_buildspace/release/../../src/mpi/request/waitall.c:331
16 0x00000000000925da ADIOI_W_Exchange_data() /build/impi/_buildspace/release/src/mpi/romio/../../../../../src/mpi/romio/adio/common/ad_write_coll.c:757
17 0x000000000008ffa0 ADIOI_Exch_and_write() /build/impi/_buildspace/release/src/mpi/romio/../../../../../src/mpi/romio/adio/common/ad_write_coll.c:466
18 0x000000000008ffa0 ADIOI_GEN_WriteStridedColl() /build/impi/_buildspace/release/src/mpi/romio/../../../../../src/mpi/romio/adio/common/ad_write_coll.c:189
19 0x0000000000b37bf4 MPIOI_File_write_all() /build/impi/_buildspace/release/src/mpi/romio/../../../../../src/mpi/romio/mpi-io/write_all.c:114
20 0x0000000000b37ec2 PMPI_File_write_at_all() /build/impi/_buildspace/release/src/mpi/romio/../../../../../src/mpi/romio/mpi-io/write_atall.c:58
21 0x0000000000166ade H5FD_mpio_write() /scratch_local/propro01/spack-stage-hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/spack-src/src/H5FDmpio.c:1636
22 0x00000000001604f7 H5FD_write() /scratch_local/propro01/spack-stage-hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/spack-src/src/H5FDint.c:248
23 0x00000000001380dc H5F__accum_write() /scratch_local/propro01/spack-stage-hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/spack-src/src/H5Faccum.c:823
24 0x000000000027720a H5PB_write() /scratch_local/propro01/spack-stage-hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/spack-src/src/H5PB.c:1031
25 0x00000000001457d9 H5F_block_write() /scratch_local/propro01/spack-stage-hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/spack-src/src/H5Fio.c:160
26 0x00000000000f9922 H5D__mpio_select_write() /scratch_local/propro01/spack-stage-hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/spack-src/src/H5Dmpio.c:490
27 0x00000000000fb8d8 H5D__final_collective_io() /scratch_local/propro01/spack-stage-hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/spack-src/src/H5Dmpio.c:2124
28 0x00000000000fb8d8 H5D__link_chunk_collective_io() /scratch_local/propro01/spack-stage-hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/spack-src/src/H5Dmpio.c:1234
29 0x00000000001042b5 H5D__chunk_collective_io() /scratch_local/propro01/spack-stage-hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/spack-src/src/H5Dmpio.c:883
30 0x00000000001042b5 H5D__chunk_collective_write() /scratch_local/propro01/spack-stage-hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/spack-src/src/H5Dmpio.c:960
31 0x00000000000f66e8 H5D__write() /scratch_local/propro01/spack-stage-hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/spack-src/src/H5Dio.c:812
32 0x00000000000f5daf H5Dwrite() /scratch_local/propro01/spack-stage-hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/spack-src/src/H5Dio.c:334
33 0x000000000049e90c DiagnosticFields2D::writeField() /g100/home/userexternal/ptomass1/install_smilei/Smilei/src/Tools/H5.h:362
34 0x000000000049e90c ???() /g100/home/userexternal/ptomass1/install_smilei/Smilei/src/Tools/H5.h:402
35 0x000000000049e90c ???() /g100/home/userexternal/ptomass1/install_smilei/Smilei/src/Tools/H5.h:394
36 0x000000000049e90c DiagnosticFields2D::writeField() /g100/home/userexternal/ptomass1/install_smilei/Smilei/src/Diagnostic/DiagnosticFields2D.cpp:187
37 0x00000000004ecb47 DiagnosticFields::run() /g100/home/userexternal/ptomass1/install_smilei/Smilei/src/Diagnostic/DiagnosticFields.cpp:323
38 0x00000000004ecb47 ???() /usr/include/c++/8/bits/basic_string.h:220
39 0x00000000004ecb47 ???() /usr/include/c++/8/bits/basic_string.h:657
40 0x00000000004ecb47 DiagnosticFields::run() /usr/include/c++/8/bits/basic_string.h:656
41 0x00000000007be5e0 VectorPatch::runAllDiags() /g100/home/userexternal/ptomass1/install_smilei/Smilei/src/Patch/VectorPatch.cpp:1348
42 0x00000000007be5e0 VectorPatch::runAllDiags() /g100/home/userexternal/ptomass1/install_smilei/Smilei/src/Patch/VectorPatch.cpp:1351
43 0x00000000004801c9 main() /g100/home/userexternal/ptomass1/install_smilei/Smilei/src/Smilei.cpp:332
44 0x0000000000023493 __libc_start_main() ???:0
45 0x0000000000481f1e _start() ???:0

Stack trace (most recent call last):
#31 Object "/g100_scratch/userexternal/ptomass1/plasmonics_1nm/./smilei", at 0x7be5df, in VectorPatch::runAllDiags(Params&, SmileiMPI*, unsigned int, Timers&, SimWindow*)
#30 Object "/g100_scratch/userexternal/ptomass1/plasmonics_1nm/./smilei", at 0x4ecb46, in DiagnosticFields::run(SmileiMPI*, VectorPatch&, int, SimWindow*, Timers&)
#29 Object "/g100_scratch/userexternal/ptomass1/plasmonics_1nm/./smilei", at 0x49e90b, in DiagnosticFields2D::writeField(H5Write*, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, int)
#28 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-cascadelake/intel-2021.4.0/hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/lib/libhdf5.so.103", at 0x154a42939dae, in H5Dwrite
#27 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-cascadelake/intel-2021.4.0/hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/lib/libhdf5.so.103", at 0x154a4293a6e7, in H5D__write
#26 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-cascadelake/intel-2021.4.0/hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/lib/libhdf5.so.103", at 0x154a429482b4, in H5D__chunk_collective_write
#25 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-cascadelake/intel-2021.4.0/hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/lib/libhdf5.so.103", at 0x154a4293f8d7, in
#24 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-cascadelake/intel-2021.4.0/hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/lib/libhdf5.so.103", at 0x154a4293d921, in H5D__mpio_select_write
#23 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-cascadelake/intel-2021.4.0/hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/lib/libhdf5.so.103", at 0x154a429897d8, in H5F_block_write
#22 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-cascadelake/intel-2021.4.0/hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/lib/libhdf5.so.103", at 0x154a42abb209, in H5PB_write
#21 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-cascadelake/intel-2021.4.0/hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/lib/libhdf5.so.103", at 0x154a4297c0db, in H5F__accum_write
#20 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-cascadelake/intel-2021.4.0/hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/lib/libhdf5.so.103", at 0x154a429a44f6, in H5FD_write
#19 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-cascadelake/intel-2021.4.0/hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/lib/libhdf5.so.103", at 0x154a429aaadd, in
#18 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-skylake_avx512/gcc-8.4.1/intel-oneapi-mpi-2021.4.0-lai2e7gyrel5cjlbg3f7cy5afpl56u4y/mpi/2021.4.0//lib/release/libmpi.so.12", at 0x154a4039eec1, in PMPI_File_write_at_all
#17 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-skylake_avx512/gcc-8.4.1/intel-oneapi-mpi-2021.4.0-lai2e7gyrel5cjlbg3f7cy5afpl56u4y/mpi/2021.4.0//lib/release/libmpi.so.12", at 0x154a4039ebf3, in
#16 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-skylake_avx512/gcc-8.4.1/intel-oneapi-mpi-2021.4.0-lai2e7gyrel5cjlbg3f7cy5afpl56u4y/mpi/2021.4.0//lib/release/libmpi.so.12", at 0x154a3f8f6f9f, in
#15 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-skylake_avx512/gcc-8.4.1/intel-oneapi-mpi-2021.4.0-lai2e7gyrel5cjlbg3f7cy5afpl56u4y/mpi/2021.4.0//lib/release/libmpi.so.12", at 0x154a3f8f95d9, in
#14 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-skylake_avx512/gcc-8.4.1/intel-oneapi-mpi-2021.4.0-lai2e7gyrel5cjlbg3f7cy5afpl56u4y/mpi/2021.4.0//lib/release/libmpi.so.12", at 0x154a40003749, in PMPI_Waitall
#13 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-skylake_avx512/gcc-8.4.1/intel-oneapi-mpi-2021.4.0-lai2e7gyrel5cjlbg3f7cy5afpl56u4y/mpi/2021.4.0//lib/release/libmpi.so.12", at 0x154a3fa45001, in
#12 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-skylake_avx512/gcc-8.4.1/intel-oneapi-mpi-2021.4.0-lai2e7gyrel5cjlbg3f7cy5afpl56u4y/mpi/2021.4.0//lib/release/libmpi.so.12", at 0x154a3fe759d6, in
#11 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-skylake_avx512/gcc-8.4.1/intel-oneapi-mpi-2021.4.0-lai2e7gyrel5cjlbg3f7cy5afpl56u4y/mpi/2021.4.0//libfabric/lib/prov/libmlx-fi.so", at 0x154a37b78786, in
#10 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-skylake_avx512/gcc-8.4.1/intel-oneapi-mpi-2021.4.0-lai2e7gyrel5cjlbg3f7cy5afpl56u4y/mpi/2021.4.0//libfabric/lib/prov/libmlx-fi.so", at 0x154a37b787fc, in
#9 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-skylake_avx512/gcc-8.4.1/intel-oneapi-mpi-2021.4.0-lai2e7gyrel5cjlbg3f7cy5afpl56u4y/mpi/2021.4.0//libfabric/lib/prov/libmlx-fi.so", at 0x154a37b606d0, in
#8 Object "/lib64/libucp.so.0", at 0x154a37910c19, in ucp_worker_progress
#7 Object "/usr/lib64/ucx/libuct_ib.so.0", at 0x154a367e1401, in uct_rc_mlx5_iface_progress
#6 Object "/usr/lib64/ucx/libuct_ib.so.0", at 0x154a367e1668, in
#5 Object "/usr/lib64/ucx/libuct_ib.so.0", at 0x154a367d10a8, in uct_ib_mlx5_completion_with_err
#4 Object "/usr/lib64/libucs.so.0", at 0x154a3738ae73, in ucs_log_dispatch
#3 Object "/usr/lib64/libucs.so.0", at 0x154a3738ad26, in ucs_log_default_handler
#2 Object "/usr/lib64/libucs.so.0", at 0x154a37386425, in ucs_fatal_error_message
#1 Object "/lib64/libc.so.6", at 0x154a3ead6db4, in abort
#0 Object "/lib64/libc.so.6", at 0x154a3eaec37f, in gsignal
Aborted (Signal sent by tkill() 59444 34924)
srun: error: r500n001: task 0: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=7464704.0
srun: error: r500n016: task 15: Terminated

@mccoys
Contributor

mccoys commented Feb 10, 2023

This looks related to HDF5 using MPI to write files in parallel. The failure itself seems to come from the InfiniBand network on your system, so you will probably have to ask a system administrator at Cineca.
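One way to check this reading of the trace is to group the frames by the shared library they belong to. Below is a minimal Python sketch of that idea, not part of Smilei: `FRAME_RE` and `summarize_frames` are hypothetical names, and the frame format is assumed to match the "Stack trace" lines in the report (`#N Object "path", at 0xADDR, in function`). The sample contains only a handful of frames copied from the report.

```python
import re
from collections import OrderedDict

# Assumed frame format, taken from the "Stack trace" section of the report:
#   #28 Object "/path/to/libhdf5.so.103", at 0x154a42939dae, in H5Dwrite
# The function name may be missing (the line then ends with "in").
FRAME_RE = re.compile(r'^#(\d+)\s+Object "([^"]+)", at (0x[0-9a-f]+), in ?(.*)$')


def summarize_frames(trace_text):
    """Group frame numbers by library basename, in order of first appearance."""
    layers = OrderedDict()
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a stack frame
        number = int(match.group(1))
        library = match.group(2).rsplit("/", 1)[-1]
        layers.setdefault(library, []).append(number)
    return list(layers.items())


# A few frames copied from the report (outermost call first).
sample = """
#31 Object "/g100_scratch/userexternal/ptomass1/plasmonics_1nm/./smilei", at 0x7be5df, in VectorPatch::runAllDiags(Params&, SmileiMPI*, unsigned int, Timers&, SimWindow*)
#28 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-cascadelake/intel-2021.4.0/hdf5-1.10.7-qxqtgt3gmfukp4wkyu2w4gdpqqdeijb7/lib/libhdf5.so.103", at 0x154a42939dae, in H5Dwrite
#18 Object "/g100_work/PROJECTS/spack/v0.17/prod/0.17.1/install/0.17/linux-centos8-skylake_avx512/gcc-8.4.1/intel-oneapi-mpi-2021.4.0-lai2e7gyrel5cjlbg3f7cy5afpl56u4y/mpi/2021.4.0//lib/release/libmpi.so.12", at 0x154a4039eec1, in PMPI_File_write_at_all
#8 Object "/lib64/libucp.so.0", at 0x154a37910c19, in ucp_worker_progress
#7 Object "/usr/lib64/ucx/libuct_ib.so.0", at 0x154a367e1401, in uct_rc_mlx5_iface_progress
#5 Object "/usr/lib64/ucx/libuct_ib.so.0", at 0x154a367d10a8, in uct_ib_mlx5_completion_with_err
#1 Object "/lib64/libc.so.6", at 0x154a3ead6db4, in abort
"""

for library, frames in summarize_frames(sample):
    print(f"{library}: frames {frames}")
```

Read from the outermost call inward, the layers are the Smilei diagnostics, HDF5, the MPI-IO collective write, and then the UCX InfiniBand transport, which is where the abort is raised. That ordering is consistent with an interconnect-level failure during parallel output rather than a fault in the simulation code.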

@mccoys mccoys closed this as completed Mar 1, 2023