
Wexac Cluster at Weizmann #3496

Closed
ax3l opened this issue Jan 24, 2021 · 65 comments
Assignees: PrometheusPi
Labels: documentation, install, machine/system

Comments

@ax3l
Member

ax3l commented Jan 24, 2021

Hi,

I was contacted by Dan Levy (@danlevy100) asking for help with setting up PIConGPU on the Wexac cluster at the Weizmann Institute of Science (wexac-wis).
The cluster has 12 nodes with 8x V100 each (plus some nodes with 4x V100).

The cluster uses LSF as a batch system but does not seem to provide jsrun (so mpiexec is probably the way to go).
He already has PIConGPU installed via Spack.

This is an interactive startup command for FBPIC:

bsub -J sim_fbpic -o out.%J -e err.%J -q gpu-short -gpu "num=1:mode=shared:j_exclusive=no" -R "rusage[mem=16000]" 'python lwfa_script.py'

Someone needs to finalize the .tpl template for tbg and the picongpu.profile instructions for our manual with him, please.

Resources:

cc @PrometheusPi (recently published PIConGPU sims with Dan, maybe you can finalize this?)
cc @hightower8083 (not with Weizmann anymore but might have some hints)

ax3l added the documentation, install, and machine/system labels on Jan 24, 2021
@danlevy100

Hi guys and welcome to my first github comment!

Here's the .tpl file Axel helped me to create:
gpu_batch.tpl.txt

Sadly, things are not yet working, i.e., I can't get tbg to submit to the user-given queue at the moment.

Thanks in advance for your help!

@PrometheusPi
Member

@danlevy100 I would be glad to help you set up the configuration for Wexac. Since I am busy until Tuesday evening, I could start looking into this on Wednesday. Would that be fine with you?

@PrometheusPi PrometheusPi self-assigned this Jan 24, 2021
@ax3l
Member Author

ax3l commented Jan 25, 2021

Thank you for taking care of this, @PrometheusPi 👍

@danlevy100

That would be great, @PrometheusPi. Thanks! I'll try to make some progress on my own in the meantime.

@ax3l
Member Author

ax3l commented Jan 25, 2021

@danlevy100 can you please document the current error message about the memory here?

@danlevy100

After submitting the LaserWakefield example with

tbg -s bsub -c etc/picongpu/1.cfg -t etc/picongpu/wexac-wis/gpu_batch.tpl ~/picOutput/LaserWakefield -f

I get:
Memory reservation is (MB): 8192
Memory Limit is (MB): 8192
femalka: No such queue. Job not submitted.

"femalka" is Victor's username in fact... I have no idea why it appears here.

@sbastrakov
Member

sbastrakov commented Jan 25, 2021

In order to figure it out, it would be helpful to see the resulting submission command after tbg and your .tpl file are applied. For the provided tbg command line there should be a file ~/picOutput/LaserWakefield/tbg/submit.start. If it is there, could you attach it? It should contain, among other things, the plain bsub command, so we can compare it to what @ax3l wrote for FBPIC.

@danlevy100

Sure, here it is:
submit.start.txt

@sbastrakov
Member

sbastrakov commented Jan 25, 2021

Thank you @danlevy100 .

So far I see an issue in the gpu_batch.tpl file attached earlier in this topic. On line 30 there is a spurious space in #B SUB (should be #BSUB). I believe it causes that line and the following #BSUB lines to have no effect, and so leads to an improper set of parameters. I don't know if it is the only issue, and I do not have access to a similar machine to check.

@sbastrakov
Member

sbastrakov commented Jan 25, 2021

In case that does not fix the problem, I think the relevant information may not be just in submit.start, as otherwise it looks fine to me. According to the documentation linked by Axel, it is probably worth looking into the output of bjobs -l JOBID to see if the partition and other settings are correct.

@danlevy100

Thanks @sbastrakov for having a look. I saw this but thought that maybe it was just there to comment out the line. I have removed the spurious space anyway and resubmitted, and it is still a no-go.
As for bjobs -l, since the job is not being submitted, there is no information displayed about it.

@PrometheusPi
Member

PrometheusPi commented Jan 27, 2021

Since the error message was about memory, could you please decrease the requested memory from #BSUB -M 8192 to half of that? Why do the FBPIC runs use so much more memory (16000), or is this defined differently?

EDIT:
This value is in kB, so 8192 kB is definitely very low. Please adjust the requested memory accordingly and use the same setting as with FBPIC:

-R "rusage[mem=16000]"
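
Expressed as a change to the attached gpu_batch.tpl, the suggestion amounts to something like this (a sketch; only the memory-related line changes, the rest of the #BSUB header stays as in the attached file):

-#BSUB -M 8192
+#BSUB -R "rusage[mem=16000]"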

@PrometheusPi
Member

Furthermore, you seem not to define a project in #BSUB -P. I am not sure how this is handled; since your FBPIC run does not define a project, I assume you have a default one or none is used. Please try to remove that line - perhaps setting an empty project creates an error, while setting none just uses the default.

@PrometheusPi
Member

If this does not work, we could schedule a video meeting to try things out live.

@ax3l
Member Author

ax3l commented Jan 27, 2021

Yep, we tried those already. I guess you will be most efficient with a VC :)

@danlevy100

danlevy100 commented Jan 27, 2021

Something that should be mentioned: the way things are set up is that I have installed picongpu at the node level ("interactive session" like getNode on hemera). Submitting a job is thus only possible at the node level. Perhaps this was a mistake, but I could not get things to work otherwise.

When submitting a job, it appears that the memory is limited by the memory requested for the interactive session. Strange, but I think that it is the case.

Also, the error as far as I understand it is not a memory error but a "femalka: No such queue" error.

VC would be great. I'm available throughout most of the day tomorrow and on Friday if that works for you.

@PrometheusPi
Member

@danlevy100 Okay then let's do a VC tomorrow. @sbastrakov Do you want to join as well?

@sbastrakov
Member

I can

@danlevy100

Does 14:00 Dresden time work for you?

@PrometheusPi
Member

@danlevy100 That would be fine with me. How about you, @sbastrakov?
In order to work together on the submit file more easily (rather than us only suggesting changes to your submit file over screen sharing), I would recommend the Atom editor together with the Teletype package, so that we can all type together. Would that be fine with you two?

@PrometheusPi
Member

PrometheusPi commented Jan 28, 2021

@danlevy100
Is the following submit script queued/executed by LSF?

#!/usr/bin/env bash
#BSUB -J test 
#BSUB -o test.out 
#BSUB -e test.err
#BSUB -q gpu-short 
#BSUB -gpu "num=1:mode=shared:j_exclusive=no" 
#BSUB -R "rusage[mem=16000]" 

hostname
nvidia-smi

and then just submitted via bsub without extra arguments?

@sbastrakov
Member

That is fine with me as well

@danlevy100

danlevy100 commented Jan 28, 2021

> Is the following submit script queued/executed by LSF? [...] and then just submitted via bsub without extra arguments?

bsub in fact fails with the same error.

@danlevy100

I could also try to get a cluster admin to join our meeting, do you think this could prove useful?

@danlevy100

danlevy100 commented Jan 28, 2021


UPDATE:
I got this script to work. The secret is to execute bsub < test_script.sh and not bsub test_script.sh.
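
For reference, the difference in behavior (as observed here; classic LSF only reads the embedded #BSUB directives when the script arrives on stdin):

# the #BSUB directives in test_script.sh are honored when the script is piped in
bsub < test_script.sh
# passing the script as an argument failed here; the #BSUB lines were apparently not honored
# bsub test_script.sh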

@PrometheusPi
Member

Yes, if a cluster admin could join the meeting, that would be great 👍

@PrometheusPi
Member

PrometheusPi commented Jan 28, 2021

We could get PIConGPU to run, but only if all tasks were on the same node.
mpiexec -n seems to schedule tasks only on the MPI rank 0 node.
However, LSF does schedule multiple nodes, as can be seen by checking the variable $LSB_HOSTS while running.
Thus there seems to be some misconfiguration in how mpiexec finds the available machines (it seems to use only the first one in that list).
To get to multiple nodes, we manually defined a machinefile and used it via mpiexec --machinefile as follows:

echo $LSB_HOSTS | sed -e 's/ /\n/g' > machinefile.txt
mpiexec -n 16 --machinefile machinefile.txt hostname

This apparently told mpiexec to use the nodes scheduled by LSF, but when mpiexec tried to connect to these nodes via ssh, it failed with an authentication error.

@psychocoderHPC
Member

Is it possible to ask the admin how to start MPI jobs on multiple nodes? I would say MPI is not compiled with support for the batch system, and therefore MPI is not using the information stored in $LSB_HOSTS.

@sbastrakov
Member

Yes, that's the plan. We were told that not many users run multi-node jobs there and that it may require a certain MPI version to work. That shouldn't be a problem once we know which version it is.

@PrometheusPi
Member

Update: there is no password-less ssh into the GPU nodes; it is only available on non-GPU nodes or for admins. The next test will be done together with an admin, running multi-node GPU jobs in admin mode.

@danlevy100

danlevy100 commented Feb 4, 2021

Another update:
We eventually gave up on the Spack approach and went with modules.

Using openmpi/2.0.1, we finally got MPI to work today. We successfully ran a "bare-bones" version of PIConGPU. Now we need to install the remaining modules, which I will do with the help of the cluster admin next week.

Here is the simple.profile file that we used:

module load gcc/6.3.0
module load cmake/3.18.4
module load openmpi/2.0.1
module load cuda/9.2
module load boost/1.69.0

export CXX=$(which g++)
export CC=$(which gcc)

export PICSRC=$HOME/src/picongpu
export PIC_EXAMPLES=$PICSRC/share/picongpu/examples
export PIC_BACKEND="cuda:70"

export PATH=$PATH:$PICSRC
export PATH=$PATH:$PICSRC/bin
export PATH=$PATH:$PICSRC/src/tools/bin
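
For completeness, the usual next steps once this profile is sourced would look roughly like this (a sketch following the standard PIConGPU workflow; the target paths are just examples):

source ~/simple.profile
pic-create $PIC_EXAMPLES/LaserWakefield ~/picInputs/lwfa
cd ~/picInputs/lwfa
pic-build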

@PrometheusPi
Member

@danlevy100 As promised on Monday, you can find a setup script here:
https://gist.github.com/PrometheusPi/3b873c754fbb0f0a2684480d0969410f

Please be aware of the comments stating which lines should also be copied to your picongpu.profile.

I have not yet tested that script, so there might still be some bugs in it. If any install fails, please let me know.

After you have installed all dependencies, you should be able to run PIConGPU as on hemera. If that is the case, I would be very happy if you could share a submit.start file here, so that we can develop a general *.tpl file for the Wexac cluster.

@danlevy100

@PrometheusPi That's really wonderful. Thanks!!

I gave it a shot and ran into a couple of issues:

  1. The curl command for zlib should include the filename as well. Not a problem, installed correctly.
  2. openPMD failed to install. I tried but could not solve it. The error log is attached.

openPMD_install_fail.txt

@psychocoderHPC
Member

> openPMD failed to install. I tried but could not solve it. The error log is attached: openPMD_install_fail.txt

The linker error is saying you should compile ADIOS with -fPIC enabled. You should use

./configure CFLAGS=-fPIC CXXFLAGS=-fPIC --enable-static --enable-shared --prefix=$LIB/adios --with-mpi=$MPI_ROOT --with-zlib=$LIB/zlib --with-blosc=$LIB/c-blosc 

@PrometheusPi Could you please update your gist.

For testing, ADIOS1 is fine, but I would suggest switching to ADIOS2, because there is no real support for ADIOS1 anymore and openPMD-api also works much better with ADIOS2.

@PrometheusPi
Member

@danlevy100 Thanks - yes, I quickly changed my initial wget command to curl but forgot that curl requires an output filename. 😓 I have now changed it back to wget.

@psychocoderHPC I fixed the gist. Thanks for taking a look at it. Is the readthedocs documentation correct, or is the order wrong or the CXX flag missing?

@PrometheusPi
Member

@danlevy100 It might be that you have to rebuild libpng as well. It might have linked to the system zlib, not the one you installed. I fixed the gist on that.

@danlevy100

Alright, it seems like everything installed fine. However, running the simulation fails with an openPMD error:

../LWF/input/bin/picongpu: error while loading shared libraries: libopenPMD.so: cannot open shared object file: No such file or directory

I reinstalled everything with the new script and rebuilt the simulation, but still no go.

P.S. There's a small typo in the gist: wegt -> wget.

@PrometheusPi
Member

PrometheusPi commented Feb 11, 2021

@danlevy100 Sorry for the typo 😓 - I fixed the gist.

I have an idea: could you please check whether there is a lib directory in $LIB/openPMD-api/, or only a lib64 directory?
If there is only a lib64 directory, please change the LD_LIBRARY_PATH extension to:

-export LD_LIBRARY_PATH="$LIB/openPMD-api/lib:$LD_LIBRARY_PATH"
+export LD_LIBRARY_PATH="$LIB/openPMD-api/lib64:$LD_LIBRARY_PATH"
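
A quick way to check (assuming $LIB is set as in the gist):

ls -d "$LIB"/openPMD-api/lib*    # shows whether lib, lib64, or both exist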

@danlevy100

danlevy100 commented Feb 11, 2021

@PrometheusPi That solved it!

Seems like everything is installed and set up correctly now. But... now there's a new error, MPI related:

[hgn10.wexac.weizmann.ac.il:45451] 17 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[hgn02.wexac.weizmann.ac.il:26305] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

When running on the DGX nodes (V100s), the error is slightly different:

[ibdgx009.wexac.weizmann.ac.il:80283] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[ibdgx009.wexac.weizmann.ac.il:80283] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

In fact, this error now also shows up when running the bare-bones simulation. I have no idea what changed since the time it worked. The error also occurs when requesting only 1 GPU (that's the error above).

@sbastrakov
Member

That is MPI related. So either something changed or broke on the cluster side, or something in your environment has changed, either by introducing more dependencies or accidentally. To check on your side, you could try making a fresh session, loading the environment we used last time during the VC, and re-compiling and re-running that MPI hello-world mini application on the CPU and GPU partitions. If it is also broken now, it probably needs the attention of a cluster admin. If it still works, I would first suspect your environment.

@danlevy100

danlevy100 commented Feb 14, 2021

Well, that was my bad: the simulation actually ran successfully! It was just extremely fast, so I didn't even look at stdout...
It turns out this error was also there when we ran the bare-bones simulation. It doesn't seem to hurt, so I don't really care.

The simulation runs fine on the Quadro RTX 6000 nodes. However, that's not the case for the V100 nodes, which fail due to some memory allocation issue. I have attached the stdout and stderr files for this job. The submit.start file is attached as well.

While we're at it, I have also attached the output of pic-build for the FoilLCT example. It shows some warnings that I'm not familiar with from running on hemera. Maybe that is helpful in some way. EDIT: it turns out the cmake warnings did not end up in the attached file (pic-build | tee output.file didn't do the trick). I'll look into this later.

FoilLCT.build.txt
stderr.149243.txt
stdout.149243.txt
submit.start.txt

@sbastrakov
Member

Regarding the crash on V100s, from the attached files I am not sure what went wrong, besides your observation that it's probably something with memory allocation. You could try investigating by enabling more debug output: rebuilding with pic-build -c "-DPMACC_BLOCKING_KERNEL=ON -DCMAKE_BUILD_TYPE=Debug -DPIC_VERBOSE=21" (when using an existing directory, please remove subdirectory .build first) and re-running. It may produce more details about what went wrong.
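
Spelled out as commands, the suggested debug rebuild would be roughly (a sketch; run it from your simulation input directory, the path below is just an example):

cd ~/picInputs/FoilLCT
rm -rf .build
pic-build -c "-DPMACC_BLOCKING_KERNEL=ON -DCMAKE_BUILD_TYPE=Debug -DPIC_VERBOSE=21"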

Regarding the FoilLCT compilation, I do not see anything suspicious in the attached file. Perhaps that part of the output went to stderr, not stdout?

@danlevy100

danlevy100 commented Feb 16, 2021

Seems like the V100 issue is inconsistent. I did get it to work a couple of times. I'm still trying to figure out what's going on.

In any case, things are working nicely and quickly on the Quadro nodes! There are just some minor issues left to solve:

  • It would be convenient to have a .tpl file.
  • The build warnings (which were indeed in the stderr!) might have some significance, I don't know. The file is attached now.
  • PNGwriter is not found, so the png plugin does not work (see the stdout of pic-build, also attached here).
  • openPMD is not fully working, i.e., --checkpoint.restart openPMD results in
Unhandled exception of type 'St13runtime_error' with message '
Using ADIOS1 through PIConGPU's openPMD plugin is not supported.
Please pick either of the following:
* Use the ADIOS plugin.
* Use the openPMD plugin with another backend, such as ADIOS2.
  If the openPMD API has been compiled with support for ADIOS2, the openPMD API
  will automatically prefer using ADIOS2 over ADIOS1.
  Make sure that environment variable OPENPMD_BP_BACKEND is not set to ADIOS1.
                ', terminating

picbuild.stdout.txt
picbuild.stderr.txt

EDIT: It's --checkpoint.backend openPMD of course, not --checkpoint.restart openPMD

@sbastrakov
Member

sbastrakov commented Feb 17, 2021

This exception with checkpointing occurs because we no longer support ADIOS1 as an openPMD backend for output in newly created simulations. We only have legacy support for it, so that a user can still restart from an older checkpoint written with ADIOS1. So currently you need openPMD-api to be compiled with, and then use, either the ADIOS2 or the HDF5 backend. That means you need to install one or both of those, yourself or via an admin, to use the dev version of PIConGPU.

Regarding the png support, in case you installed it yourself, probably the profile has to be extended with lines like these, where line 55 points to your local installation. The png support is of course optional.

Those warnings (#2381-D: dynamic exception specifications are deprecated) concern libSplash, which we still use for output of HDF5 files in some plugins. For the main output and checkpoints it has already been replaced with openPMD-api. So you can ignore those warnings; they do not cause issues and will disappear once we fully drop libSplash. The last couple of warnings come from the openPMD API itself; I will check whether they exist in its current version.

@sbastrakov
Member

Regarding ADIOS1 vs. 2 there is this explanation, with a TL;DR at the end.

@danlevy100

danlevy100 commented Feb 17, 2021

Ok, I see.

For installing ADIOS2, I looked at https://adios2.readthedocs.io/en/latest/setting_up/setting_up.html but am unsure which flags I should use when building, i.e., what the exact commands should be, as in @PrometheusPi's gist file from a few comments up. I don't want to mess up things that are already working...

For the .tpl file, I think I can write it myself. I'll upload it here once it's working.

EDIT: The only thing I don't know how to do is tell tbg to submit with bsub < submit.start instead of bsub submit.start.
EDIT 2: Found it... I tried tbg -s "bsub <" ... but that did not work, so I changed ~/src/picongpu/tbg to do it.
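
An alternative to patching tbg could be a small wrapper script (a sketch; bsub_stdin is a hypothetical name, and this assumes tbg calls the submit command with the generated submit file as its only argument):

cat > ~/bin/bsub_stdin << 'EOF'
#!/usr/bin/env bash
# feed the submit file to bsub via stdin so its #BSUB directives are parsed
bsub < "$1"
EOF
chmod +x ~/bin/bsub_stdin

tbg -s bsub_stdin -c etc/picongpu/1.cfg -t etc/picongpu/wexac-wis/gpu_batch.tpl ~/picOutput/LaserWakefield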

@sbastrakov
Member

Regarding the installation of ADIOS2, I found out that we actually forgot to update our instructions page, which currently only covers ADIOS1. I am working on fixing it. I don't think it would break anything, as you can always hide it from the openPMD API if something does not work.
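
Until the instructions are updated, a minimal MPI-enabled ADIOS2 build might look roughly like this (a sketch, not the official instructions; it assumes the $LIB install-prefix convention from the gist and the modules from simple.profile being loaded, and openPMD-api would need to be rebuilt afterwards to pick up the new backend):

git clone https://github.com/ornladios/ADIOS2.git
mkdir adios2-build && cd adios2-build
cmake ../ADIOS2 -DCMAKE_INSTALL_PREFIX=$LIB/adios2 -DADIOS2_USE_MPI=ON -DADIOS2_USE_Fortran=OFF
make -j 8 install

# make the install visible at build and run time (lib or lib64, depending on the system)
export CMAKE_PREFIX_PATH="$LIB/adios2:$CMAKE_PREFIX_PATH"
export LD_LIBRARY_PATH="$LIB/adios2/lib64:$LD_LIBRARY_PATH"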

@PrometheusPi
Member

@danlevy100 Is your last submit.start file from 20 days ago the one working best and still valid?

If yes, we can set up a *.tpl file. If not, could you please share your current version?

@sbastrakov
Member

Now our (dev version of) readthedocs is updated with links on installing ADIOS2 as an openPMD-API backend.

@danlevy100

Here it is.

The latest status: the V100 and RTX 2000 nodes are not working reliably, probably due to MPI issues. I don't know how to solve this, and in the meantime I'm working with the RTX 6000/8000 nodes.

I have also added this text to the file:

# There are 3 relevant GPU queues on WEXAC: gpu-short (32 GPUs / 6 hours max), gpu-medium (24 GPUs / 12 hours max)
# and gpu-long (16 GPUs / 10 days max). There are different GPU nodes: RTX 2000/6000/8000 and V100.
# The nodes can be explicitly selected using the BSUB -m option.
# The RTX 2000 and V100 nodes do not work reliably at the moment, apparently due to MPI issues.

submit_short.tpl.txt
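
For readers without access to the attachment, the LSF header of such a template presumably looks something like the following (a sketch assembled from the bsub commands in this thread, not the attached file's content; the !TBG_ placeholders follow tbg's template-variable convention and the concrete values are assumptions):

#!/usr/bin/env bash
#BSUB -J !TBG_jobName
#BSUB -o stdout.%J
#BSUB -e stderr.%J
#BSUB -q gpu-short
#BSUB -gpu "num=8:mode=shared:j_exclusive=no"
#BSUB -R "rusage[mem=16000]"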

@danlevy100

There was a small but important mistake in the memory request in that file (24 GB instead of 48 GB).
Here is the corrected file.
submit_short.tpl.txt

@PrometheusPi PrometheusPi mentioned this issue May 6, 2021
@BrianMarre
Member

Since there does not seem to be any progress on this issue any more, can we just close this issue for now? We may reopen it in the future if anything changes.

@PrometheusPi
Member

I agree. @danlevy100, if you still want to run PIConGPU on Wexac, please let us know. I will close this issue for now.
