Error: Local QP operation on mlx5_2:1/IB #600
Do you have more of the backtrace? Is it possible for you to share your input file?
Yes. My input file (function bodies and the contents of the Main/Species/Diag blocks were collapsed by the page viewer; they are shown here as `...`):

import numpy as np

nppc = 36
l_wl = 0.82
nm = l0/820.    # 1 nm is l0/820

def nc(wl): ...
def crit_den(part_den, wl): ...
def a0(l, I0): ...
def convert_temp(temp): ...

param = 5.6/5.6
al_target = crit_den(al_, l_wl)/param   # densities

# these are the parameters for the target with the "lower" angle, i.e. "more rectangular"
high = narrow_edge_y + ne_w + ne_w2
target_rear = target_front + target_length
sl = 80.    # pre-plasma profile scale length (in nm)
al_target1 = np.sqrt(al_target)

def x_den(x): ...
def y_den(y): ...
def aluminium(x, y): ...
def electron_target(x, y): ...
def aluminium_pp(x, y): ...
def electrons_pp(x, y): ...
def c_(x, y): ...
def p_(x, y): ...
def e_(x, y): ...

Main( ... )

bc = [["remove"]]   # boundary conditions for all species

Checkpoints( ... )

Species( ... )   # six Species(...) blocks

#waistinI = 2200.*nm   # waist of Gaussian laser pulse (in intensity)
FWHMtinI = 25.*fs      # FWHM of intensity time profile, from the EPOCH input (w is 15 fs, see below conversion formula or documentation)
#waistinE = (waistinI*(2.0*math.sqrt(math.log(2.0))))/(math.sqrt(2.0*math.log(2.0)))   # conversion to waist of laser pulse in E field
fwhm_spot = 3600.*nm   # w from EPOCH is 2160 nm, so fwhm is 2*w*sqrt(log(2)) (see documentation)
waistinE = fwhm_spot/np.sqrt(2.*np.log(2.))

LaserGaussian2D( ... )

def target(particles): ...

DiagFields( ... )
#DiagFields( ... )
DiagScalar( ... )

track_ts = 9000
DiagTrackParticles( ... )   # seven DiagTrackParticles(...) blocks

track_bn = 9000
DiagParticleBinning( ... )  # several DiagParticleBinning(...) blocks, incl. lineouts in x and y of the pp / pp_u electrons
DiagProbe( ... )
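The waist/FWHM conversion used in the input can be checked numerically. A minimal sketch, using only the numbers and formulas quoted in the input file's own comments above (w = 2160 nm from EPOCH, FWHM = 2*w*sqrt(log 2), and waistinE = FWHM/sqrt(2*log 2)):

```python
import numpy as np

# values quoted in the input file's comments (in nm)
w_epoch = 2160.0       # "w" from the EPOCH input
fwhm_spot = 3600.0     # stated FWHM of the spot in intensity

# FWHM recovered from EPOCH's w, per the comment: fwhm = 2*w*sqrt(log(2))
fwhm_from_w = 2.0 * w_epoch * np.sqrt(np.log(2.0))

# waist of the E-field profile, as computed in the input
waistinE = fwhm_spot / np.sqrt(2.0 * np.log(2.0))

print(fwhm_from_w)   # ~3596.6, consistent with the rounded 3600 nm
print(waistinE)      # ~3057.6
```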
This looks like it is related to HDF5 using MPI to write files in parallel. The error seems to come from the InfiniBand network on your system, not from Smilei itself. You probably have to ask a system administrator at Cineca.
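If asking the admins does not resolve it, one low-risk experiment is to take the InfiniBand transport out of the picture for a test run. A sketch, assuming Intel MPI 2021 over libfabric (as loaded in the jobscript below); `FI_PROVIDER` and `I_MPI_DEBUG` are standard Intel MPI / libfabric variables, but the provider names actually available on g100 should be confirmed with Cineca staff:

```shell
# Force the (much slower) TCP libfabric provider instead of the
# InfiniBand/verbs one, just to see whether the crash goes away:
export FI_PROVIDER=tcp

# Print which fabric/provider Intel MPI actually selected:
export I_MPI_DEBUG=5

srun --cpu-bind=cores -m block:block ./smilei plasmonics.py > out_test
```

If the run survives with TCP at the full resolution, that points at the IB transport (or its memory-registration limits) rather than at Smilei or HDF5.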
I tried to run a quite large 2D simulation on supercomputer g100, Cineca. However, the simulation stopped with the error:
[r500n001:59444:0:59444] ib_mlx5_log.c:143 Local QP operation on
mlx5_2:1/IB (synd 0x2 vend 0x68 hw_synd 0/31)
[r500n001:59444:0:59444] ib_mlx5_log.c:143 RC QP 0x34eca wqe[184]:
RDMA_READ s-- [rva 0x7f200c45b92c rkey 0x669cc6] [va 0x7f200c0e9000
len 907924 lkey 0x150937]
==== backtrace (tid: 59444) ====
I found that if I run the same thing (i.e., the same MPI and
OpenMP configuration, which is 25*48 in this case, and the same code with the same input) but lower the
resolution by a factor of 16 (4 in each direction), it runs without any
problem. It is unlikely to be a memory problem, because I am sure the nodes have much more than
the required memory, and the disk has much more space (20 TB) than the estimated output (~200 GB).
Please help me. Thank you.
The job record is:
Cluster: g100
State: FAILED (exit code 143)
Nodes: 25
Cores per node: 48
CPU Utilized: 1-02:53:08
CPU Efficiency: 2.06% of 54-08:20:00 core-walltime
Job Wall-clock time: 01:05:13
Memory Utilized: 406.25 GB (estimated maximum)
Memory Efficiency: 0.55% of 72.33 TB (61.72 GB/core)
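The job record itself supports the "not a memory problem" argument; a quick sanity check on the numbers reported above:

```python
# numbers taken from the job record above
mem_used_gb = 406.25     # "Memory Utilized (estimated maximum)"
gb_per_core = 61.72
cores = 25 * 48          # 25 nodes x 48 cores per node

mem_avail_gb = gb_per_core * cores        # ~74064 GB, i.e. ~72.33 TB
efficiency = 100.0 * mem_used_gb / mem_avail_gb

print(round(efficiency, 2))   # 0.55, matching the reported 0.55%
```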
The jobscript:
#SBATCH --nodes=25
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --time=24:00:00
#SBATCH --error=test.err # standard error file
#SBATCH --output=test.out # standard output file
#SBATCH --account=icei_Tomass
module load hdf5/1.10.7--intel-oneapi-mpi--2021.4.0--intel--2021.4.0
module load spack/0.17.1-120
spack load py-matplotlib@3.4.3
export OMP_NUM_THREADS=48
export KMP_AFFINITY=compact # or OMP_PLACES=cores
srun --cpu-bind=cores -m block:block ./smilei plasmonics.py > out_1bmem
===============================================
The Smilei output is:
[Smilei ASCII-art banner]
Version : 4.7-248-ge563595d9-master
Reading the simulation parameters
HDF5 version 1.10.7
Python version 3.8.12
Parsing pyinit.py
Parsing 4.7-248-ge563595d9-master
Parsing pyprofiles.py
Parsing plasmonics.py
Parsing pycontrol.py
Check for function preprocess()
python preprocess function does not exist
Calling python _smilei_check
Calling python _prepare_checkpoint_dir
Calling python _keep_python_running() :
WARNING src/Params/Params.cpp:656 (Params) Patches distribution: hilbertian
WARNING src/Params/Params.cpp:1082 (compute) simulation_time has been redefined from 634.601716 to 634.598172 to match timestep.
WARNING src/Params/Params.cpp:1187 (compute) Particles cluster width cluster_width set to : 1
Geometry: 2Dcartesian
Electromagnetic boundary conditions
Vectorization:
Initializing MPI
Initializing the restart environment
Initializing species
Initializing laser parameters
r: 82.855191, order: 2.000000)
space envelope (y) : 1D user-defined function
space envelope (z) : 1D user-defined function
phase (y) : 1D user-defined function
phase (z) : 1D user-defined function
delay phase (y) : 0
delay phase (z) : 0
Initializing Patches
Creating Diagnostics, antennas, and external fields
finalize MPI
Minimum memory consumption (does not include all temporary buffers)
ParticleBinning10.h5: Master 0 MB; Max 0 MB; Global 5.59e-05 GB
Fields0.h5: Master 0 MB; Max 0 MB; Global 0 GB
Probes0.h5: Master 0 MB; Max 0 MB; Global 0.021 GB
TrackParticlesDisordered_carbon.h5: Master 0 MB; Max 0 MB; Global 0 GB
TrackParticlesDisordered_proton.h5: Master 0 MB; Max 0 MB; Global 0 GB
TrackParticlesDisordered_electron_l.h5: Master 0 MB; Max 0 MB; Global 0 GB
TrackParticlesDisordered_electron_pp.h5: Master 0 MB; Max 0 MB; Global 0 GB
TrackParticlesDisordered_electron_target.h5: Master 0 MB; Max 0 MB; Global 0 GB
TrackParticlesDisordered_aluminium.h5: Master 0 MB; Max 0 MB; Global 0 GB
TrackParticlesDisordered_aluminium_pp.h5: Master 0 MB; Max 0 MB; Global 0 GB
Initial fields setup
Initializing E field through Poisson solver
Time in Poisson : 3.184701
Applying external fields at time t = 0
Applying prescribed fields at time t = 0
Applying antennas at time t = 0
Open files & initialize diagnostics
Running diags at time t = 0
---> The simulation crashed here!