Simulation does not run #291
Comments
I launched this simulation on IRENE KNL with 200 nodes, 800 MPI ranks and 32 OpenMP threads per MPI rank. |
Hi Julien, is your problem reproducible? It's crashing during the parsing of the namelist, at "Calling python _smilei_check". Either there is a problem with one node which didn't start the program correctly (the crash happened in the first ...). Julien |
Hi Julien, I tried 3 times and it always crashes at the same point. |
Was the 2500 x 2500 configuration submitted with the same resource distribution (200 nodes, 800 MPI ranks and 32 OpenMP threads per MPI rank)? |
No, just with 40 nodes. |
The reproducible aspect could point to the second hypothesis, but last year I ran simulations up to 512 KNL nodes on Irene ... Could you send me all log files and error files? |
Thanks Julien for your help;) |
Could you check that it goes further if you don't read the hydro file ? |
Yes, it goes further, until "Initializing MPI". On rank 0: [Python] NameError: global name 'profilene_BN' is not defined. I tried with 40 nodes (160 MPI). |
Ok, but on 40 nodes it was ok, the question was for 200 nodes. |
For 200 nodes it stops at the error above, so it goes much further... |
Hi Julien, So do you think I can contact TGCC hotline for some help? Thanks Julien |
Not really, it's a problem of the namelist. Try to replace:
x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = np.loadtxt('hydro.txt', unpack=True)
by something which looks like:
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = 0
if rank==0 :
    x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = np.loadtxt('hydro.txt', unpack=True)
comm.bcast( x_prof, root = 0)
... |
Thanks Julien. I will study that. |
The more data there is to read, the more likely this problem is to appear. During our last internal Smilei meeting (a few hours before the creation of this issue!), we discussed how to provide a benchmark for this kind of problem, which sits at the boundary of the code itself because of the Python namelists. |
Be careful with the mpi4py module that you will use: it must be compatible with your main MPI library.
$ python setup.py build
$ python setup.py install --user
And some corrections in the namelist:
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
x_prof = np.full(6600, 0.)
y_prof = np.full(6600, 0.)
...
if rank==0 :
    x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = np.loadtxt('hydro.txt', unpack=True)
x_prof = comm.bcast( x_prof, root = 0)
y_prof = comm.bcast( y_prof, root = 0)
... |
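For reference, a self-contained version of this read-on-rank-0-and-broadcast pattern is sketched below. Only the file name hydro.txt, the seven profile names and the length 6600 come from the snippets above; everything else is an illustrative assumption, not the actual namelist.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Pre-allocate every profile on every rank so the names always exist.
names = ['x_prof', 'y_prof', 'Te_prof', 'Ti_prof', 'ne_prof', 'vx0_prof', 'vy0_prof']
profiles = {name: np.full(6600, 0.) for name in names}

# Only rank 0 touches the file system.
if rank == 0:
    columns = np.loadtxt('hydro.txt', unpack=True)
    for name, column in zip(names, columns):
        profiles[name] = column

# bcast must be called by every rank, and its return value must be
# assigned: it is not an in-place operation on the receiving ranks.
for name in names:
    profiles[name] = comm.bcast(profiles[name], root=0)

x_prof, y_prof, Te_prof, Ti_prof, ne_prof, vx0_prof, vy0_prof = [profiles[n] for n in names]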
I tried this morning for several timesteps at low resolution and it works with: if rank==0 : comm.bcast( x_prof, root = 0 ). Do you think I have to recompile anyway? |
If it works, no. The mpi4py should be compatible with the MPI library; otherwise, it would have crashed. |
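A quick way to check that the installed mpi4py is linked against the intended MPI library is to print the library version it reports and compare it with the MPI module loaded in the job; a minimal sketch:

from mpi4py import MPI

# Prints something like "Intel(R) MPI Library ..." or "Open MPI v4.x ...",
# which should match the MPI module loaded in the batch environment.
print(MPI.Get_library_version())
print(MPI.Get_version())   # (major, minor) of the MPI standard implemented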
I am sorry Julien but the simulation still crashes at the same point, even with the broadcast... I will decrease the number of nodes and make some tests... I will come back after to tell you. |
Now you can contact the hotline. |
I have just remembered a Smilei case which deadlocked on Irene KNL with a large Python array. Waiting for a better solution, a workaround could consist in splitting ... |
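The end of this comment is truncated; assuming the workaround means splitting a large array into smaller pieces before broadcasting it, a rough sketch could look like the following (the chunk size and dtype are arbitrary choices, not from the thread):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def bcast_in_chunks(array, chunk_size=100000, root=0):
    # `array` holds the full data on `root` and may be None elsewhere.
    rank = comm.Get_rank()
    n = comm.bcast(array.size if rank == root else None, root=root)
    if rank != root:
        array = np.empty(n, dtype='d')
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Upper-case Bcast works directly on the numpy buffer (no pickling).
        comm.Bcast(array[start:stop], root=root)
    return array

# Example usage: x_prof = bcast_in_chunks(x_prof if rank == 0 else None)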
It's quite strange! I now have this error message in my out file, and this in the err file: Stack trace (most recent call last): ... I already ran simulations with hydro txt files (without broadcast) on 200 nodes with KNL, and these txt files were more than 1 MB. Here it is 500 kB... When you say many files, have you got an idea of the number of files? Thanks again Julien |
This error is pure Python. It means you assign several values to fewer variables, which is not allowed, for instance "x, y = a, b, c". |
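A minimal illustration of the kind of statement that triggers this error (the variable names are placeholders):

a, b, c = 1, 2, 3
x, y = a, b, c    # ValueError: too many values to unpack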
Thanks Fred! I found my mistake. I am relaunching the simulation... |
I tried to run your case in a new environment (OpenMPI 4) directly on KNL with 1600 MPI ranks on 400 nodes. It crashes during the interpolations done in the namelist:
slurmstepd-irene3002: error: Detected 2 oom-kill event(s) in step 5326988.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
You can maybe try to use 1 MPI per node x 128 OpenMP threads per node. I will not run such large simulations without access to your GENCI time; I only have some for developments. |
Thanks Julien. I will try with 1 MPI per node * 128 OpenMP threads. |
I agree with you. I ran this kind of simulation 6 months ago with a preparatory access and it worked. Thanks again Julien |
Hi @jderouillat With the hotline, we tried lots of things... until using OpenMPI. But Smilei does not compile... I am not good enough at computing to understand their advice. Do you think that you can look at the issue with them? |
Of course, but it's maybe not necessary. You'll find below a protocol to define the environment (I write it here more as a memo to which we can refer for other users). If IntelMPI is available we still recommend using it, in place of OpenMPI, so first:
$ module unload mpi/openmpi
Doing this, the compiler is unloaded, so reload it with the associated IntelMPI:
$ module load intel/19.0.5.281
$ module load mpi/intelmpi/2019.0.5.281
Then check if an HDF5 library is available and compatible with your MPI environment. I'm happi to discover that it's the case now:
$ module show hdf5/1.8.20
...
4 : module load flavor/buildcompiler/intel/19 flavor/buildmpi/intelmpi/2019 flavor/hdf5/parallel
...
Load it as recommended by the computing center:
$ module load hdf5
$ module switch flavor/hdf5/serial flavor/hdf5/parallel
$ export HDF5_ROOT_DIR=${HDF5_ROOT} # This is the variable recommended in the Smilei documentation
Compile with the ad hoc machine file (the recommended flag is ...):
$ make -j8 machine=joliot_curie_rome
I ran a small simulation on 4 AMD nodes with the binary generated by this process. |
Argh ! It's not so simple with an additional Python. |
Yes, with module load python. When I compile I have this warning message: /ccc/products/python-2.7.14/intel--17.0.4.196__openmpi--2.0.2/default/lib/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h(84): warning #2650: attributes ignored here |
This is a warning; it could be resolved by reinstalling a Python toolchain in the new environment (not sure that the cost of such a burden is worth it).
$ module load hdf5
$ module switch flavor/hdf5/serial flavor/hdf5/parallel
$ export HDF5_ROOT_DIR=${HDF5_ROOT}
$ module load python
$ make clean; make -j8 machine=irene_rome config=no_mpi_tm
I ran the same small simulation on a single AMD node in this environment (few resources are available right now). |
Since this morning I have been trying, but it does not compile. I have an issue with the hdf5 module loading. |
Is your default environment modified (do you add something in your .bashrc)? |
My .bashrc is empty (just some environment variable exports). The error is: module dfldatadir/own (Data Directory) cannot be unloaded |
I see the hdf5 flavor in your list, but not the main hdf5 module (nor the export HDF5_ROOT_DIR, which is used to find the hdf5.h file). |
It is because I tried to load the parallel hdf5 directly, without the switch:
unload module mpi/openmpi/4.0.2 (Open MPI)
Loading hdf5/1.8.20
Switching from flavor/hdf5/serial to flavor/hdf5/parallel
I retried with a module load mpi/openmpi/4.0.2. I just have to check whether, with that Python configuration, scipy.interpolate can be imported. I need it in many of my runs to interpolate hydrodynamics data ;) |
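A small sanity-check script, run with the same modules and LD_LIBRARY_PATH as the batch job, can confirm which Python is picked up and whether scipy.interpolate imports; this is only a suggested sketch, not part of the original setup:

import sys
print(sys.executable)
print(sys.version)

import numpy
import scipy
from scipy import interpolate   # the module needed by the namelist
print("numpy %s from %s" % (numpy.__version__, numpy.__file__))
print("scipy %s from %s" % (scipy.__version__, scipy.__file__))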
Hi @jderouillat The problem is still the use of the scipy module. The hotline advised me to use Smilei 4.1 with module load smilei, but there have been lots of changes since version 4.1... I am puzzled by that. |
By default, the python module provided by the computing center does not set the following variable. The Python library associated with the smilei binary is then the system one (check with ...):
export LD_LIBRARY_PATH=$PYTHON_ROOT/lib:$LD_LIBRARY_PATH
I do not set ... |
I am sorry, but the simulation always crashes at the same point: ImportError: libmpi.so.20: cannot open shared object file. My Rome environment for compiling is like that: ... My compile file: compile smileicd Smilei_hub. I really don't understand... |
This environment does not try to load a |
/ccc/work/cont003/ra5390/bonvalej/Smilei/Smilei_hub/smileirome: error while loading shared libraries: libhdf5.so.10: cannot open shared object file: No such file or directory |
I can't access your directory (ask the hotline to add my login to your project if you want), and even if I could, the result depends on the environment set when the command is executed. |
I have this for the binary generated by the hotline:
And for my version: ldd smileirome |
You need to reinstall mpi4py in the targeted environment. |
It seems to work now ;) I am happi! |
It reads the namelist ! |
Great ! More precisely, it runs the namelist as a Python script. In your case, it reads the hydro file and interpolates the read quantities. |
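To illustrate what that stage of the namelist typically looks like, here is a rough sketch of reading a hydro column and turning it into an interpolated profile function; the column indices, names, and the use of interp1d are assumptions for illustration, not the contents of BNH2d.py:

import numpy as np
from scipy import interpolate

# Read the x and ne columns from the hydro file (in the real namelist this
# sits behind the rank-0 read / broadcast pattern discussed above).
x_prof, ne_prof = np.loadtxt('hydro.txt', unpack=True, usecols=(0, 4))

# Build the interpolator once; the profile function is then evaluated per cell.
ne_interp = interpolate.interp1d(x_prof, ne_prof, bounds_error=False, fill_value=0.)

def ne_profile(x, y):
    return float(ne_interp(x))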
Yes, it seems to be long. Wait and see. I start from module purge; to compile, I put the no_mpi_tm config option as you advised. To use scipy, before ccc_mprun the hotline added:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PYTHON3_ROOT/lib |
I am afraid that I have a problem again: in my out file, I have been stuck on "python preprocess function does not exist" for one hour... Could this be normal? |
You could try to use the binary I compiled, with a script inspired by mine, especially concerning the modules used (with a ...). You will also find in this directory a namelist derived from yours, and outputs from a 10-minute single-node run (the idea was to check that the Python functions are executed). A few interpolations are not performed during these 10 minutes (52 ...). But first, can you confirm whether this test goes further than yours or not? |
I copied your folder and your binary. I still have this issue:
[Smilei banner] Version : v4.4-784-gc3f8cc81-master
Reading the simulation parameters
HDF5 version 1.8.20 |
This morning you were using another Python environment; can you confirm that you reinstalled the mpi4py module? |
No, I did not. I tried a simple run with a simulation of a 2D Gaussian laser in an empty box. Smilei works with this configuration. |
Hi Julien, |
Dear Smilei experts,
Hope all of you are fine!
I have a simulation that does not start.
I am afraid the simulation may be too big (25000 * 20000 cells in 2D), but I am not sure. The error message is:
Invalid knl_memoryside_cache header, expected "version: 1".
[irene3354][[26206,0],315][btl_portals4_component.c:1115] mca_btl_portals4_component_progress_event() ERROR 0: PTL_EVENT_ACK with ni_fail_type 10 (PTL_NI_TARGET_INVALID) with target (nid=508,pid=73) and initator (nid=507,pid=73) found
Stack trace (most recent call last):
#14 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in
#13 Object "./smileiKNL", at 0x458568, in
#12 Object "/lib64/libc.so.6", at 0x2b3e9d86f544, in __libc_start_main
#11 Object "./smileiKNL", at 0x8f379f, in main
#10 Object "./smileiKNL", at 0x6e93ab, in Params::Params(SmileiMPI*, std::vector<std::string, std::allocator<std::string> >)
#9 Object "/opt/selfie-1.0.2/lib64/selfie.so", at 0x2b3e9b907ab7, in MPI_Barrier
#8 Object "/ccc/products/openmpi-2.0.4/intel--17.0.6.256/default/lib/libmpi.so.20", at 0x2b3e9ccdaea0, in MPI_Barrier
#7 Object "/ccc/products/openmpi-2.0.4/intel--17.0.6.256/default/lib/libmpi.so.20", at 0x2b3e9cd15a82, in ompi_coll_base_barrier_intra_bruck
#6 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/openmpi/mca_pml_ob1.so", at 0x2b3ea7b527a6, in mca_pml_ob1_send
#5 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/libopen-pal.so.20", at 0x2b3e9ff69330, in opal_progress
#4 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/openmpi/mca_btl_portals4.so", at 0x2b3ea5fd384d, in mca_btl_portals4_component_progress
#3 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/openmpi/mca_btl_portals4.so", at 0x2b3ea5fd3a59, in mca_btl_portals4_component_progress_event
#2 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/libmca_common_portals4.so.20", at 0x2b3ea61defd8, in common_ptl4_printf_error
#1 Object "/lib64/libc.so.6", at 0x2b3e9d884a67, in abort
#0 Object "/lib64/libc.so.6", at 0x2b3e9d883377, in gsignal
Aborted (Signal sent by tkill() 150381 35221)
The simulation stops at :
HDF5 version 1.8.20
Python version 2.7.14
Parsing pyinit.py
Parsing v4.4-706-gb5c12a5a-master
Parsing pyprofiles.py
Parsing BNH2d.py
Parsing pycontrol.py
Check for function preprocess()
python preprocess function does not exist
The version of Smilei is : v4.4-706-gb5c12a5a-master
Thanks for your help.
Here is the input:
BNH2d.txt