
Simulation does not run #291

Closed · juliencelia opened this issue Sep 3, 2020 · 65 comments

@juliencelia

Dear Smilei experts,

Hope all of you are fine!

I have a simulation that does not start.
I am afraid the simulation may be too big (25000 x 20000 cells in 2D), but I am not sure. The error message is:

Invalid knl_memoryside_cache header, expected "version: 1".
[irene3354][[26206,0],315][btl_portals4_component.c:1115] mca_btl_portals4_component_progress_event() ERROR 0: PTL_EVENT_ACK with ni_fail_type 10 (PTL_NI_TARGET_INVALID) with target (nid=508,pid=73) and initator (nid=507,pid=73) found
Stack trace (most recent call last):
#14 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in
#13 Object "./smileiKNL", at 0x458568, in
#12 Object "/lib64/libc.so.6", at 0x2b3e9d86f544, in __libc_start_main
#11 Object "./smileiKNL", at 0x8f379f, in main
#10 Object "./smileiKNL", at 0x6e93ab, in Params::Params(SmileiMPI*, std::vector<std::string, std::allocator<std::string> >)
#9 Object "/opt/selfie-1.0.2/lib64/selfie.so", at 0x2b3e9b907ab7, in MPI_Barrier
#8 Object "/ccc/products/openmpi-2.0.4/intel--17.0.6.256/default/lib/libmpi.so.20", at 0x2b3e9ccdaea0, in MPI_Barrier
#7 Object "/ccc/products/openmpi-2.0.4/intel--17.0.6.256/default/lib/libmpi.so.20", at 0x2b3e9cd15a82, in ompi_coll_base_barrier_intra_bruck
#6 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/openmpi/mca_pml_ob1.so", at 0x2b3ea7b527a6, in mca_pml_ob1_send
#5 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/libopen-pal.so.20", at 0x2b3e9ff69330, in opal_progress
#4 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/openmpi/mca_btl_portals4.so", at 0x2b3ea5fd384d, in mca_btl_portals4_component_progress
#3 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/openmpi/mca_btl_portals4.so", at 0x2b3ea5fd3a59, in mca_btl_portals4_component_progress_event
#2 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/libmca_common_portals4.so.20", at 0x2b3ea61defd8, in common_ptl4_printf_error
#1 Object "/lib64/libc.so.6", at 0x2b3e9d884a67, in abort
#0 Object "/lib64/libc.so.6", at 0x2b3e9d883377, in gsignal
Aborted (Signal sent by tkill() 150381 35221)

The simulation stops at:
HDF5 version 1.8.20
Python version 2.7.14
Parsing pyinit.py
Parsing v4.4-706-gb5c12a5a-master
Parsing pyprofiles.py
Parsing BNH2d.py
Parsing pycontrol.py
Check for function preprocess()
python preprocess function does not exist

The version of Smilei is v4.4-706-gb5c12a5a-master.

Thanks for your help.
Here is the input:

BNH2d.txt

@juliencelia
Author

I launched this simulation on IRENE KNL with 200 nodes, 800 MPI processes and 32 OpenMP threads per MPI process.

@jderouillat
Contributor

Hi Julien,

Is your problem reproducible?

It's crashing during the parsing of the namelist.
The next log line should be:

         Calling python _smilei_check

Either there is a problem with one node which didn't start the program correctly (the crash happened in the first MPI_Barrier of the program), or there is a problem executing the Python program simultaneously on all nodes.

Julien

@juliencelia
Author

Hi Julien

I tried 3 times and it always crashes at the same point.
Before that, I tried with 2500 x 2000 cells just to check the initial conditions, and it worked...

@jderouillat
Contributor

Was the 2500 x 2000 configuration submitted with the same resource distribution (200 nodes, 800 MPI processes and 32 OpenMP threads per MPI process)?

@juliencelia
Author

No, just with 40 nodes.

@jderouillat
Contributor

The reproducibility would point to the second hypothesis, but last year I ran simulations on up to 512 KNL nodes on Irene...

Could you send me all the log and error files?

@juliencelia
Author

@juliencelia
Author

Thanks Julien for your help;)

@jderouillat
Contributor

Could you check whether it goes further if you don't read the hydro file?

@juliencelia
Author

juliencelia commented Sep 4, 2020

Yes, it goes further, until

Initializing MPI

On rank 0 [Python] NameError: global name 'profilene_BN' is not defined
ERROR src/Profiles/Profile.cpp:276 (Profile) Profile nb_density eon: does not seem to return a correct value
(this is due to hydro.txt not being read)

I tried with 40 nodes (160 MPI).

@jderouillat
Contributor

OK, but on 40 nodes it was already fine; the question was about 200 nodes.

@juliencelia
Author

For 200 nodes it stops at the error above, so it does go further...

@mccoys mccoys added the bug label Sep 7, 2020
@juliencelia
Author

Hi Julien,

So do you think I should contact the TGCC hotline for some help?
It seems more like a problem between the machine and Python, doesn't it?

Thanks

Julien

@jderouillat
Contributor

Not really, it's a problem with the namelist, in which you ask all processes to read the same file simultaneously. This is known to be bad practice.
You should have only one process read the file and then broadcast it to all processes.
Replace:

x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = np.loadtxt('hydro.txt', unpack=True)

By something like:

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = 0
if rank==0 :
    x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof =np.loadtxt('hydro.txt', unpack=True)
comm.bcast( x_prof, root = 0)
...
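For reference, a complete and self-contained version of this read-once-and-broadcast pattern (a sketch assuming the same seven-column hydro.txt as above; note that comm.bcast returns the broadcast object, so its result must be assigned on every rank):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Only rank 0 touches the file system.
    data = np.loadtxt('hydro.txt', unpack=True)
else:
    data = None

# bcast returns the broadcast object on every rank.
data = comm.bcast(data, root=0)
x_prof, y_prof, Te_prof, Ti_prof, ne_prof, vx0_prof, vy0_prof = data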

@juliencelia
Author

Thanks Julien. I will study that.
I am quite puzzled, because I often run this kind of simulation reading hydro files with gas targets,
with simulations of 1 mm x 60 µm (20000 x 1200 cells), and it worked perfectly on 200 nodes / 800 MPI and 16 OpenMP.
For this simulation, the number of cells is larger, but not the number of nodes...

@jderouillat
Contributor

The more data there is to read, the more likely this problem is to appear.

During our last internal Smilei meeting (a few hours before this issue was created!), we discussed how to provide a benchmark for this kind of problem, which sits at the boundary of the code itself because of the Python namelists.

@jderouillat
Contributor

Be careful with the mpi4py module that you use: it must be compatible with your main MPI library.
The best thing is to recompile it for your environment; you can download it from Bitbucket (git clone https://bitbucket.org/mpi4py/mpi4py.git).
Then, to compile it:

$ python setup.py build
$ python setup.py install --user
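A quick way to sanity-check that an installed mpi4py was built against the MPI library you load at runtime (my suggestion, not from the thread) is to compare its build configuration with the library it actually picks up:

import mpi4py
from mpi4py import MPI

print(mpi4py.get_config())          # compiler/mpicc recorded at build time
print(MPI.Get_library_version())    # MPI library actually loaded at runtime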

And some corrections in the namelist :

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
x_prof = np.full(6600, 0.)
y_prof = np.full(6600, 0.)
...
if rank==0 :
    x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof =np.loadtxt('hydro.txt', unpack=True)
x_prof = comm.bcast( x_prof, root = 0)
y_prof = comm.bcast( y_prof, root = 0)
...

@juliencelia
Author

I tried this morning with a few timesteps at low resolution, and it works with:
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank==0 :
    x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = np.loadtxt('hydro.txt', unpack=True)
else:
    x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = np.empty(6600,dtype='float64')

comm.bcast( x_prof, root = 0)
comm.bcast( y_prof, root = 0)
comm.bcast( Te_prof, root = 0)
comm.bcast( Ti_prof, root = 0)
comm.bcast( ne_prof, root = 0)
comm.bcast( vx0_prof, root = 0)
comm.bcast( vy0_prof, root = 0)

Do you think I have to recompile anyway?

@jderouillat
Contributor

If it works, no. The mpi4py module should then be compatible with the MPI library; otherwise, it would have crashed.

@juliencelia
Author

juliencelia commented Sep 10, 2020

I am sorry Julien, but the simulation still crashes at the same point even with the broadcast... I will decrease the number of nodes and run some tests... I will come back afterwards to tell you.

@jderouillat
Contributor

Now you can contact the hotline.
Do not hesitate to put me in cc.

@jderouillat
Contributor

I have just remembered a Smilei case which deadlocked on Irene KNL with large Python arrays.
It was worked around using smaller patches, but we didn't really solve it.

While waiting for a better solution, a workaround could be to split hydro.txt into several files, each with its associated arrays.
I will do some tests on my own.
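A minimal sketch of such a split (my example; the column names are the ones used in the namelist above): write one text file per profile, so that each array can later be read, and broadcast, separately.

import numpy as np

names = ['x_prof', 'y_prof', 'Te_prof', 'Ti_prof', 'ne_prof', 'vx0_prof', 'vy0_prof']
columns = np.loadtxt('hydro.txt', unpack=True)
for name, column in zip(names, columns):
    # One small file per profile, e.g. hydro_x_prof.txt, hydro_y_prof.txt, ...
    np.savetxt('hydro_' + name + '.txt', column)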

@juliencelia
Author

juliencelia commented Sep 13, 2020

It's quite strange! I now have this error message in my output:
On rank 145 [Python] ValueError: too many values to unpack
ERROR src/Params/Params.cpp:1283 (runScript) error parsing BNH2d.py

and this in the err file:

Stack trace (most recent call last):
#5 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in
#4 Object "./smileiKNL", at 0x458568, in
#3 Object "/lib64/libc.so.6", at 0x2ab2999d0544, in __libc_start_main
#2 Object "./smileiKNL", at 0x8f379f, in main
#1 Object "./smileiKNL", at 0x6fd5c7, in Params::Params(SmileiMPI*, std::vec$
#0 Object "/lib64/libpthread.so.0", at 0x2ac8b93cb4fb, in raise
Segmentation fault (Signal sent by tkill() [0x89950001b20e])

I have already run simulations with hydro txt files (without broadcast) on 200 KNL nodes, and those txt files were more than 1 MB. Here it is 500 KB...
The only difference was that the resolution was lower (15000 x 600).

When you say many files, do you have an idea of how many files?

Thanks again Julien

@mccoys
Contributor

mccoys commented Sep 14, 2020

This error is pure Python. It means you are assigning more values than you have variables, which is not allowed. For instance: x, y = a, b, c
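For illustration, the else branch in the namelist snippet above is a likely source of exactly this error: it unpacks a single 6600-element array into seven names. A minimal sketch of the failure and of one possible fix:

import numpy as np

# Fails with "ValueError: too many values to unpack":
# x_prof, y_prof, Te_prof, Ti_prof, ne_prof, vx0_prof, vy0_prof = np.empty(6600)

# Works: one placeholder array per profile on non-root ranks.
x_prof, y_prof, Te_prof, Ti_prof, ne_prof, vx0_prof, vy0_prof = (np.empty(6600) for _ in range(7))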

@mccoys mccoys closed this as completed Sep 14, 2020
@mccoys mccoys reopened this Sep 14, 2020
@juliencelia
Author

Thanks Fred! I found my mistake. I am relaunching the simulation...

@jderouillat
Contributor

I tried to run your case in a new environment (OpenMPI 4) directly on KNL with 1600 MPI on 400 nodes. It crashes during interpolations done in the namelist.
It should perform 9600 interpolations; it crashes after 7168 (I print a message after each interp_prof).

slurmstepd-irene3002: error: Detected 2 oom-kill event(s) in step 5326988.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

You could maybe try 1 MPI process per node x 128 OpenMP threads per node.

I will not run such large simulations without access to your GENCI allocation; I only have enough for some developments.
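For scale, a rough back-of-the-envelope estimate (my numbers, based only on the 25000 x 20000 cell grid quoted at the top of this issue): every MPI rank that evaluates the namelist holds its own copy of each interpolated 2D array, and with 1600 MPI on 400 nodes there are 4 such copies per node.

# One double-precision array covering the full grid:
nx, ny = 25000, 20000
print(nx * ny * 8 / 1e9)   # ~4 GB per array, per MPI rank holding it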

@juliencelia
Author

Thanks Julien. I will try with 1 MPI per node x 128 OpenMP.
I will let you know what happens.

@juliencelia
Author

I agree with you. I ran this kind of simulation 6 months ago with a preparatory access and it worked.
I will contact the hotline.

Thanks again Julien

@juliencelia
Author

Hi @jderouillat

With the hotline, we tried lots of things... up to using OpenMPI.
They had me change CXXFLAGS to CXXFLAGS += -O2 -axCORE-AVX2,AVX,CORE-AVX512,MIC-AVX512 -mavx2 -ip -inline-factor=1000 -D__INTEL_SKYLAKE_8168 -qopt-zmm-usage=high -fno-alias #-ipo with this environment:
module purge
module load intel/19.0.5.281
module load mpi/openmpi/4.0.2
module load hdf5/1.8.20 # do you know whether your hdf5 is compiled in parallel or serial?
export HDF5_ROOT_DIR=${HDF5_ROOT}
module load python/2.7.17

But Smilei does not compile...

I am not good enough at computing to understand their advice.

Do you think you could look into the issue with them?

@jderouillat
Contributor

Of course, but it may not be necessary. You'll find below a protocol to set up the environment (I write it here mostly as a memo that we can point other users to).

If IntelMPI is available, we still recommend using it instead of OpenMPI, so first:

$ module unload mpi/openmpi

Doing this, the compiler is unloaded as well, so reload it together with the associated IntelMPI:

$ module load intel/19.0.5.281
$ module load mpi/intelmpi/2019.0.5.281

Then check whether an HDF5 library is available and compatible with your MPI environment. I'm happi to discover that this is now the case:

$ module show hdf5/1.8.20
...
    4 : module load flavor/buildcompiler/intel/19 flavor/buildmpi/intelmpi/2019 flavor/hdf5/parallel
...

Load it as recommended by the computing center:

$ module load hdf5
$ module switch flavor/hdf5/serial flavor/hdf5/parallel
$ export HDF5_ROOT_DIR=${HDF5_ROOT} # This is the variable recommended in the Smilei documentation

Compile with the ad hoc machine file (the recommended flag is -march=core-avx2):

$ make -j8 machine=joliot_curie_rome
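After the build, a quick linkage check (the same ldd inspection that comes up later in this thread) can confirm whether the binary picked up the parallel HDF5 and the intended MPI and Python:

$ ldd smilei | grep -E 'hdf5|mpi|python'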

I ran a small simulation on 4 AMD nodes with the binary generated by this process.
Don't hesitate to copy me on your exchanges with the computing center.

@jderouillat
Contributor

Argh! It's not so simple with an additional Python.
Compilation is OK if we add a module load python/2.7.17, but at runtime there seems to be a conflict.

@juliencelia
Author

Yes, with module load python, when I compile I get this warning message:

/ccc/products/python-2.7.14/intel--17.0.4.196__openmpi--2.0.2/default/lib/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h(84): warning #2650: attributes ignored here
NPY_CHAR NPY_ATTR_DEPRECATE("Use NPY_STRING"),
^
and with python/2.7.17, the same thing:
/ccc/products/python-2.7.17/intel--19.0.5.281__openmpi--4.0.1/default/lib/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h(84): warning #2650: attributes ignored here
NPY_CHAR NPY_ATTR_DEPRECATE("Use NPY_STRING"),
^

@jderouillat
Contributor

This is only a warning; it could be resolved by reinstalling a Python toolchain in the new environment (not sure that such a burden is worth the cost).
The problem is that python/2.7.17 depends on OpenMPI, so the simplest is to forget IntelMPI and the THREAD_MULTIPLE feature, using config=no_mpi_tm:

$ module load hdf5
$ module switch flavor/hdf5/serial flavor/hdf5/parallel
$ export HDF5_ROOT_DIR=${HDF5_ROOT}
$ module load python
$ make clean;make -j8 machine=irene_rome config=no_mpi_tm

I ran the same small simulation on a single AMD node in this environment (only a few resources are available right now).

@juliencelia
Author

I have been trying since this morning, but it does not compile. I have an issue with the hdf5 module loading.
The hotline is running tests. I will keep you posted...

@jderouillat
Contributor

Is your default environment modified (do you add anything in ~/.bash_profile or ~/.bashrc)? Mine is not.

@juliencelia
Author

My bashrc is almost empty (just a few environment variable exports).
When I compile, I always have a problem with hdf5:

module dfldatadir/own (Data Directory) cannot be unloaded
module ccc/1.0 (CCC User Environment) cannot be unloaded
load module flavor/hdf5/parallel (HDF5 flavor)
load module feature/mkl/single_node (MKL without mpi interface)
load module flavor/buildcompiler/intel/19 (Compiler build flavor)
load module flavor/buildmpi/openmpi/4.0 (MPI build flavor)
load module feature/openmpi/mpi_compiler/intel (MPI Compiler feature)
load module flavor/openmpi/standard (Open MPI flavor)
load module feature/openmpi/net/auto (MPI Network backend feature)
load module licsrv/intel (License service)
load module c/intel/19.0.5.281 (Intel C Compiler)
load module c++/intel/19.0.5.281 (Intel C++ Compiler)
load module fortran/intel/19.0.5.281 (Intel Fortran compiler)
load module flavor/libccc_user/hwloc2 (libccc_user flavor)
load module hwloc/2.0.4 (Hwloc)
load module flavor/hcoll/standard (hcoll flavor)
load module feature/hcoll/multicast/enable (Hcoll features)
load module sharp/2.0 (Mellanox backend)
load module hcoll/4.4.2938 (Mellanox hcoll)
load module pmix/3.1.3 (Process Management Interface (PMI) for eXascale)
load module flavor/ucx/standard (ucx flavor)
load module ucx/1.7.0 (Mellanox backend)
load module feature/mkl/lp64 (MKL feature)
load module feature/mkl/sequential (MKL feature)
load module feature/mkl/vector/avx2 (MKL vectorization feature)
load module mkl/19.0.5.281 (Intel MKL LP64 Sequential without mpi interfaces)
load module intel/19.0.5.281 (Intel Compiler Suite)
load module mpi/openmpi/4.0.2 (Open MPI)
load module python/2.7.14 (Python)
Cleaning build
Creating binary char for src/Python/pyprofiles.py
Creating binary char for src/Python/pycontrol.py
Creating binary char for src/Python/pyinit.py
Checking dependencies for src/Tools/tabulatedFunctions.cpp
Checking dependencies for src/Tools/Timer.cpp
Checking dependencies for src/Tools/Timers.cpp
Checking dependencies for src/Tools/userFunctions.cpp
Checking dependencies for src/Tools/Tools.cpp
Checking dependencies for src/Tools/H5.cpp
Checking dependencies for src/Tools/backward.cpp
Checking dependencies for src/Tools/PyTools.cpp
In file included from src/Tools/H5.cpp(1):
src/Tools/H5.h(4): catastrophic error: cannot open source file "hdf5.h"
#include <hdf5.h>

@jderouillat
Contributor

I see the hdf5 flavor in your list but not the main hdf5 module (nor the export HDF5_ROOT_DIR, which is used to find the hdf5.h file).

@juliencelia
Author

That is because I tried to load hdf5 parallel directly, without the switch.
With the same environment as yours, I have this issue:

unload module mpi/openmpi/4.0.2 (Open MPI)
unload module intel/19.0.5.281 (Intel Compiler Suite)
unload module mkl/19.0.5.281 (Intel MKL LP64 Sequential without mpi interfaces)
unload module feature/mkl/vector/avx2 (MKL vectorization feature)
unload module feature/mkl/sequential (MKL feature)
unload module feature/mkl/lp64 (MKL feature)
unload module ucx/1.7.0 (Mellanox backend)
unload module flavor/ucx/standard (ucx flavor)
unload module pmix/3.1.3 (Process Management Interface (PMI) for eXascale)
unload module hcoll/4.4.2938 (Mellanox hcoll)
unload module sharp/2.0 (Mellanox backend)
unload module feature/hcoll/multicast/enable (Hcoll features)
unload module flavor/hcoll/standard (hcoll flavor)
unload module hwloc/2.0.4 (Hwloc)
unload module flavor/libccc_user/hwloc2 (libccc_user flavor)
unload module fortran/intel/19.0.5.281 (Intel Fortran compiler)
unload module c++/intel/19.0.5.281 (Intel C++ Compiler)
unload module c/intel/19.0.5.281 (Intel C Compiler)
unload module licsrv/intel (License service)
unload module feature/openmpi/net/auto (MPI Network backend feature)
unload module flavor/openmpi/standard (Open MPI flavor)
unload module feature/openmpi/mpi_compiler/intel (MPI Compiler feature)
unload module flavor/buildmpi/openmpi/4.0 (MPI build flavor)
unload module flavor/buildcompiler/intel/19 (Compiler build flavor)
unload module feature/mkl/single_node (MKL without mpi interface)
module dfldatadir/own (Data Directory) cannot be unloaded
module ccc/1.0 (CCC User Environment) cannot be unloaded
load module flavor/hdf5/serial (HDF5 flavor)
load module flavor/buildcompiler/intel/19 (Compiler build flavor)
load module licsrv/intel (License service)
load module c/intel/19.0.5.281 (Intel C Compiler)
load module c++/intel/19.0.5.281 (Intel C++ Compiler)
load module fortran/intel/19.0.5.281 (Intel Fortran compiler)
load module feature/mkl/lp64 (MKL feature)
load module feature/mkl/sequential (MKL feature)
load module feature/mkl/single_node (MKL without mpi interface)
load module feature/mkl/vector/avx2 (MKL vectorization feature)
load module mkl/19.0.5.281 (Intel MKL LP64 Sequential without mpi interfaces)
load module intel/19.0.5.281 (Intel Compiler Suite)
load module hdf5/1.8.20 (HDF5)
unload module hdf5/1.8.20 (HDF5)
unload module flavor/hdf5/serial (HDF5 flavor)
load module flavor/hdf5/parallel (HDF5 flavor)

Loading hdf5/1.8.20
ERROR: hdf5/1.8.20 cannot be loaded due to missing prereq.
HINT: the following module must be loaded first: mpi

Switching from flavor/hdf5/serial to flavor/hdf5/parallel
WARNING: Reload of dependent hdf5/1.8.20 failed
load module flavor/buildmpi/openmpi/4.0 (MPI build flavor)
load module feature/openmpi/mpi_compiler/intel (MPI Compiler feature)
load module flavor/openmpi/standard (Open MPI flavor)
load module feature/openmpi/net/auto (MPI Network backend feature)
load module flavor/libccc_user/hwloc2 (libccc_user flavor)
load module hwloc/2.0.4 (Hwloc)
load module flavor/hcoll/standard (hcoll flavor)
load module feature/hcoll/multicast/enable (Hcoll features)
load module sharp/2.0 (Mellanox backend)
load module hcoll/4.4.2938 (Mellanox hcoll)
load module pmix/3.1.3 (Process Management Interface (PMI) for eXascale)
load module flavor/ucx/standard (ucx flavor)
load module ucx/1.7.0 (Mellanox backend)
load module mpi/openmpi/4.0.2 (Open MPI)
load module python/2.7.14 (Python)
Cleaning build
Creating binary char for src/Python/pyprofiles.py
Creating binary char for src/Python/pycontrol.py
Creating binary char for src/Python/pyinit.py
Checking dependencies for src/Tools/tabulatedFunctions.cpp
Checking dependencies for src/Tools/Timer.cpp
Checking dependencies for src/Tools/userFunctions.cpp
Checking dependencies for src/Tools/Timers.cpp
Checking dependencies for src/Tools/Tools.cpp
Checking dependencies for src/Tools/H5.cpp
Checking dependencies for src/Tools/backward.cpp
Checking dependencies for src/Tools/PyTools.cpp
In file included from src/Tools/H5.cpp(1):
src/Tools/H5.h(4): catastrophic error: cannot open source file "hdf5.h"
#include <hdf5.h>
^

I retried with a module load mpi/openmpi/4.0.2.
Smilei compiles now!

I just have to check whether scipy.interpolate can be imported with that Python configuration. I need it in many of my runs to interpolate hydrodynamics data ;)

@juliencelia
Author

Hi @jderouillat

The problem is still the use of the scipy module.
I tried to compile Smilei with Python 3, with this export in the environment, but the compilation crashes (export PYTHONEXE=$PYTHON_EXEDIR).

The hotline advised me to use Smilei 4.1 with module load smilei, but there have been lots of changes since version 4.1...

I am puzzled by that.

@jderouillat
Contributor

By default, the python module provided by the computing center does not set the following variable, so the Python library associated with the smilei binary is the system one (check with ldd PATH_TO/smilei).
Can you resubmit your job, adding this to your batch script (after the module load python):

export LD_LIBRARY_PATH=$PYTHON_ROOT/lib:$LD_LIBRARY_PATH

I do not set LD_PRELOAD, and just use ccc_mprun ./smilei BNH2d.py
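Putting the pieces of this comment together, the relevant fragment of the batch script might look like this (just a sketch; the scheduler directives and resource options are omitted since they depend on your project settings):

module load python
export LD_LIBRARY_PATH=$PYTHON_ROOT/lib:$LD_LIBRARY_PATH
ccc_mprun ./smilei BNH2d.py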

@juliencelia
Author

I am sorry, but the simulation still crashes at the same point:

ImportError: libmpi.so.20: cannot open shared object file:

My Rome environment for compiling is as follows:
module purge
module load mpi/openmpi/4.0.2
module load hdf5
module switch flavor/hdf5/serial flavor/hdf5/parallel
export HDF5_ROOT_DIR=${HDF5_ROOT}
module load python

My compile file:
source /ccc/cont003/home/ra5390/bonvalej/.env_smilei_rome

# compile smilei

cd Smilei_hub
make clean
make -j8 machine=joliot_curie_rome config=no_mpi_tm
mv smilei smileirome
mv smilei_test smileirome_test
cd ../.

I really don't understand...

@jderouillat
Contributor

This environment does not try to load a libmpi.so.20 but a libmpi.so.40.
Can you post the result of ldd smilei?

@juliencelia
Author

/ccc/work/cont003/ra5390/bonvalej/Smilei/Smilei_hub/smileirome: error while loading shared libraries: libhdf5.so.10: cannot open shared object file: No such file or directory

@jderouillat
Contributor

I can't access your directory (ask the hotline to add my login to your project if you want), and even if I could, the result would depend on the environment set when the command is executed.
Could you answer the question? If the question is not clear, tell me.

@juliencelia
Author

Here is what I get for the binary generated by the hotline:

linux-vdso.so.1 =>  (0x00007ffe715cc000)
libhdf5.so.10 => not found
libpython3.7m.so.1.0 => not found
libm.so.6 => /lib64/libm.so.6 (0x00002b989a4cd000)
libmpi_cxx.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libmpi_cxx.so.40 (0x00002b989a7cf000)
libmpi.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libmpi.so.40 (0x00002b989a9eb000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00002b989ad26000)
libiomp5.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libiomp5.so (0x00002b989b02d000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b989b422000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b989b638000)
libc.so.6 => /lib64/libc.so.6 (0x00002b989b854000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b989bc22000)
/lib64/ld-linux-x86-64.so.2 (0x00002b989a2a9000)
libopen-rte.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libopen-rte.so.40 (0x00002b989be26000)
libopen-pal.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libopen-pal.so.40 (0x00002b989c0eb000)
librt.so.1 => /lib64/librt.so.1 (0x00002b989c3b0000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002b989c5b8000)
libz.so.1 => /lib64/libz.so.1 (0x00002b989c7bb000)
libhwloc.so.15 => /ccc/products/hwloc-2.0.4/system/default/lib/libhwloc.so.15 (0x00002b989c9d1000)
libudev.so.1 => /lib64/libudev.so.1 (0x00002b989cc1c000)
libpciaccess.so.0 => /lib64/libpciaccess.so.0 (0x00002b989ce32000)
libxml2.so.2 => /lib64/libxml2.so.2 (0x00002b989d03c000)
libevent-2.0.so.5 => /lib64/libevent-2.0.so.5 (0x00002b989d3a6000)
libevent_pthreads-2.0.so.5 => /lib64/libevent_pthreads-2.0.so.5 (0x00002b989d5ee000)
libimf.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libimf.so (0x00002b989d7f1000)
libirng.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libirng.so (0x00002b989de76000)
libcilkrts.so.5 => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libcilkrts.so.5 (0x00002b989e1e1000)
libintlc.so.5 => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libintlc.so.5 (0x00002b989e41e000)
libsvml.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libsvml.so (0x00002b989e690000)
libcap.so.2 => /lib64/libcap.so.2 (0x00002b98a011c000)
libdw.so.1 => /lib64/libdw.so.1 (0x00002b98a0321000)
liblzma.so.5 => /lib64/liblzma.so.5 (0x00002b98a0572000)
libattr.so.1 => /lib64/libattr.so.1 (0x00002b98a0798000)
libelf.so.1 => /lib64/libelf.so.1 (0x00002b98a099d000)
libbz2.so.1 => /lib64/libbz2.so.1 (0x00002b98a0bb5000)

And for my version (ldd smileirome):
linux-vdso.so.1 => (0x00007ffeb3483000)
libhdf5.so.10 => not found
libpython2.7.so.1.0 => /lib64/libpython2.7.so.1.0 (0x00002b77bbd5e000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b77bc12a000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b77bc346000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002b77bc54a000)
libm.so.6 => /lib64/libm.so.6 (0x00002b77bc74d000)
libmpi_cxx.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libmpi_cxx.so.40 (0x00002b77bca4f000)
libmpi.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libmpi.so.40 (0x00002b77bcc6b000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00002b77bcfa6000)
libiomp5.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libiomp5.so (0x00002b77bd2ad000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b77bd6a2000)
libc.so.6 => /lib64/libc.so.6 (0x00002b77bd8b8000)
/lib64/ld-linux-x86-64.so.2 (0x00002b77bbb3a000)
libopen-rte.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libopen-rte.so.40 (0x00002b77bdc86000)
libopen-pal.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libopen-pal.so.40 (0x00002b77bdf4b000)
librt.so.1 => /lib64/librt.so.1 (0x00002b77be210000)
libz.so.1 => /lib64/libz.so.1 (0x00002b77be418000)
libhwloc.so.15 => /ccc/products/hwloc-2.0.4/system/default/lib/libhwloc.so.15 (0x00002b77be62e000)
libudev.so.1 => /lib64/libudev.so.1 (0x00002b77be879000)
libpciaccess.so.0 => /lib64/libpciaccess.so.0 (0x00002b77bea8f000)
libxml2.so.2 => /lib64/libxml2.so.2 (0x00002b77bec99000)
libevent-2.0.so.5 => /lib64/libevent-2.0.so.5 (0x00002b77bf003000)
libevent_pthreads-2.0.so.5 => /lib64/libevent_pthreads-2.0.so.5 (0x00002b77bf24b000)
libimf.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libimf.so (0x00002b77bf44e000)
libirng.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libirng.so (0x00002b77bfad3000)
libcilkrts.so.5 => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libcilkrts.so.5 (0x00002b77bfe3e000)
libintlc.so.5 => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libintlc.so.5 (0x00002b77c007b000)
libsvml.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libsvml.so (0x00002b77c02ed000)
libcap.so.2 => /lib64/libcap.so.2 (0x00002b77c1d79000)
libdw.so.1 => /lib64/libdw.so.1 (0x00002b77c1f7e000)
liblzma.so.5 => /lib64/liblzma.so.5 (0x00002b77c21cf000)
libattr.so.1 => /lib64/libattr.so.1 (0x00002b77c23f5000)
libelf.so.1 => /lib64/libelf.so.1 (0x00002b77c25fa000)
libbz2.so.1 => /lib64/libbz2.so.1 (0x00002b77c2812000)

@jderouillat
Contributor

You need to reinstall mpi4py in the targeted environment.
(You can also try to do without it; the problem that you observed on KNL may be less critical on a more conventional architecture.)
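A sketch of that reinstall, reusing the clone and build steps given earlier in this thread, with the target environment loaded first so that mpi4py links against the same OpenMPI 4:

$ module load mpi/openmpi/4.0.2
$ module load python
$ cd mpi4py
$ python setup.py build
$ python setup.py install --user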

@juliencelia
Author

It seems to work now ;) I am happi!
Just a general question: what is Smilei doing during "Parsing input.py"?

@mccoys
Contributor

mccoys commented Oct 13, 2020

Just a general question: what is Smilei doing during "Parsing input.py"?

It reads the namelist!

@jderouillat
Contributor

Great!

More precisely, it runs the namelist as a Python script. In your case, it reads the hydro file and interpolates the quantities it has read.
This can take time.
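To see where that time goes, here is a small self-contained sketch (my example with synthetic data, not the actual namelist): any top-level interpolation in the namelist runs on every MPI rank during the "Parsing BNH2d.py" step, and a single scipy.interpolate call onto a fine grid can already take a noticeable amount of time.

import time
import numpy as np
from scipy.interpolate import griddata

# Synthetic stand-in for the hydro.txt columns read with np.loadtxt above.
x_prof = np.random.uniform(0.0, 1.0, 6600)
y_prof = np.random.uniform(0.0, 1.0, 6600)
ne_prof = np.exp(-((x_prof - 0.5)**2 + (y_prof - 0.5)**2) / 0.01)

t0 = time.time()
xi, yi = np.meshgrid(np.linspace(0.0, 1.0, 2000), np.linspace(0.0, 1.0, 2000))
ne_2d = griddata(np.column_stack((x_prof, y_prof)), ne_prof, (xi, yi), method='linear')
print("interpolated ne onto a 2000 x 2000 grid in %.1f s" % (time.time() - t0))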

@juliencelia
Author

Yes, it seems to take a long time. Wait and see.
Anyway, Smilei now runs on IRENE. The environment used is:

module purge
module load intel/19.0.5.281
module load mpi/openmpi/4.0.2
module load flavor/hdf5/parallel hdf5/1.8.20
export HDF5_ROOT_DIR=${HDF5_ROOT}
export PYTHONEXE=${PYTHON3_EXEDIR}
module load python3/3.7.5

To compile, I used the no_mpi_tm config option as you advised.

To use SciPy, before ccc_mprun the hotline added: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PYTHON3_ROOT/lib

@juliencelia
Author

I am afraid I have a problem again: my output has been stuck at "python preprocess function does not exist" for 1 hour... Could that be normal?

@jderouillat
Contributor

You could try the binary I compiled, with a submission script inspired by mine, especially concerning the modules used (with an mpi4py installed in this environment).
Both are available in /ccc/work/cont003/smilei/derouilj/Issue291.

You will also find in this directory a namelist derived from yours and the outputs from a 10-minute single-node run (the idea was to check that the Python functions are executed). A few interpolations are not completed within these 10 minutes: 52 interp_prof messages are printed while 56 are expected, but they operate on a very large grid (25600 x 20480) without being distributed. To speed this up, you could do the first interpolation with one MPI process while another process does another interpolation, and so on (see the sketch below).
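A sketch of that kind of round-robin split (my illustration; interpolate_one is a hypothetical stand-in for whatever each interp_prof call does in the real namelist): each rank computes only the interpolations it owns and broadcasts the result to the others.

from mpi4py import MPI
import numpy as np

def interpolate_one(name):
    # Hypothetical placeholder for one expensive per-profile interpolation.
    return np.zeros((256, 256))

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

results = {}
for i, name in enumerate(['Te', 'Ti', 'ne', 'vx0', 'vy0']):
    owner = i % size          # round-robin assignment of interpolations to ranks
    local = interpolate_one(name) if rank == owner else None
    results[name] = comm.bcast(local, root=owner)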

But first, can you confirm whether this test goes further than yours or not?

@juliencelia
Author

I copied your folder and your binary.
I added the hydro.txt file to the folder.

I still have this issue:
ImportError: libmpi.so.20: cannot open shared object file: No such file or directory

   linux-vdso.so.1 =>  (0x00007ffc8b7ad000)
    /opt/selfie-1.0.2/lib64/selfie.so (0x00002ac55ce98000)
    libhdf5.so.10 => /ccc/products/hdf5-1.8.20/intel--19.0.5.281__openmpi--4.0.1/parallel/lib/libhdf5.so.10 (0x00002ac55d11f000)
    libpython2.7.so.1.0 => /ccc/products/python-2.7.14/intel--17.0.4.196__openmpi--2.0.2/default/lib/libpython2.7.so.1.0 (0x00002ac55d6de000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac55dda0000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002ac55dfbc000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00002ac55e1c0000)
    libm.so.6 => /lib64/libm.so.6 (0x00002ac55e3c3000)
    libmpi_cxx.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libmpi_cxx.so.40 (0x00002ac55e6c5000)
    libmpi.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libmpi.so.40 (0x00002ac55e8e1000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00002ac55ec1c000)
    libiomp5.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libiomp5.so (0x00002ac55ef23000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ac55f318000)
    libc.so.6 => /lib64/libc.so.6 (0x00002ac55f52e000)
    libyaml-0.so.2 => /lib64/libyaml-0.so.2 (0x00002ac55f8fc000)
    libz.so.1 => /ccc/products/python-2.7.14/intel--17.0.4.196__openmpi--2.0.2/default/lib/libz.so.1 (0x00002ac55fb1c000)
    libimf.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libimf.so (0x00002ac55fe4b000)
    libsvml.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libsvml.so (0x00002ac5604d0000)
    libirng.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libirng.so (0x00002ac561f5c000)
    libintlc.so.5 => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libintlc.so.5 (0x00002ac5622c7000)
    libirc.so => /ccc/products2/ifort-17.0.4.196/Atos_7__x86_64/system/default/lib/intel64/libirc.so (0x00002ac562539000)
    /lib64/ld-linux-x86-64.so.2 (0x00002ac55cc74000)
    libopen-rte.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libopen-rte.so.40 (0x00002ac5627a3000)
    libopen-pal.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libopen-pal.so.40 (0x00002ac562a68000)
    librt.so.1 => /lib64/librt.so.1 (0x00002ac562d2d000)
    libhwloc.so.15 => /ccc/products/hwloc-2.0.4/system/default/lib/libhwloc.so.15 (0x00002ac562f35000)
    libudev.so.1 => /lib64/libudev.so.1 (0x00002ac563180000)
    libpciaccess.so.0 => /lib64/libpciaccess.so.0 (0x00002ac563396000)
    libxml2.so.2 => /ccc/products/python-2.7.14/intel--17.0.4.196__openmpi--2.0.2/default/lib/libxml2.so.2 (0x00002ac5635a0000)
    libevent-2.0.so.5 => /lib64/libevent-2.0.so.5 (0x00002ac563c60000)
    libevent_pthreads-2.0.so.5 => /lib64/libevent_pthreads-2.0.so.5 (0x00002ac563ea8000)
    libcilkrts.so.5 => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libcilkrts.so.5 (0x00002ac5640ab000)
    libcap.so.2 => /lib64/libcap.so.2 (0x00002ac5642e8000)
    libdw.so.1 => /lib64/libdw.so.1 (0x00002ac5644ed000)
    liblzma.so.5 => /lib64/liblzma.so.5 (0x00002ac56473e000)
    libattr.so.1 => /lib64/libattr.so.1 (0x00002ac564964000)
    libelf.so.1 => /lib64/libelf.so.1 (0x00002ac564b69000)
    libbz2.so.1 => /lib64/libbz2.so.1 (0x00002ac564d81000)
[Smilei ASCII-art banner] Version : v4.4-784-gc3f8cc81-master

Reading the simulation parameters

HDF5 version 1.8.20
Python version 2.7.14
Parsing pyinit.py
Parsing v4.4-784-gc3f8cc81-master
Parsing pyprofiles.py
Parsing BNH2d.py
On rank 12 [Python] ImportError: libmpi.so.20: cannot open shared object file: No such file or directory
ERROR src/Params/Params.cpp:1283 (runScript) error parsing BNH2d.py

@jderouillat
Contributor

This morning you were using another Python environment; can you confirm that you reinstalled the mpi4py module in this environment?

@juliencelia
Author

juliencelia commented Oct 13, 2020

No, I did not.
It was the hotline that compiled Smilei with the OpenMPI environment and python/3.7.

I tried a simple run of a 2D Gaussian laser in an empty box. Smilei works with that configuration.

@jderouillat
Contributor

Hi Julien,
I know that the situation has not been completely stable since the opening of this issue, but the problem has evolved a lot (KNL, Rome, MPI, Python, deadlocks...) and it now runs. I propose to close this issue and, if necessary, to open a new one dedicated to any new problem.
