Add wexac cluster #3613

@PrometheusPi PrometheusPi commented May 6, 2021

This pull request concludes the discussion in issue #3496 on how to set up PIConGPU on the wexac cluster at the Weizmann Institute in Israel. Many thanks to @danlevy100 for the very valuable input. I copied various lines from your example tpl file and therefore added you as a co-author.

@danlevy100 could you test both the gpu.tpl and gpu_picongpu.example.profile?

Please note:

  • Since tbg does not (yet) support streaming the submit file to bsub, tbg prints a hint on what to do next. (Or is there an option to do that, @psychocoderHPC?)

  • Since there are various combinations of queues and GPU hardware, I avoided generating all MxN configurations via MxN *.tpl files. Instead, both options are selected in the profile and passed on to tbg.

  • run time test on wexac

@PrometheusPi
Member Author

@danlevy100 I think I used the wrong email address for you in the co-authorship. Could you please email me the address you use for GitHub?

@PrometheusPi PrometheusPi added the machine/system machine & HPC system specific issues label May 6, 2021

# "tbg" default options #######################################################
# currently the submit script, generated by tbg, needs to be streamed to bsub
export TBG_SUBMIT="echo 'manually execute: bsub < '"
Member

Does having export TBG_SUBMIT="bsub <" not work?

Member

I am not very familiar with the internals of tbg itself, so I do not know.

Member Author

@PrometheusPi PrometheusPi May 6, 2021

Good point 👍 - I have no access to test this. @danlevy100 could you test this? Or @psychocoderHPC could you comment whether this could work?

Member

IMO this line should be

export TBG_SUBMIT="bsub"

Maybe the line above is generating a valid example but I do not understand why you would do it instead of executing bsub directly.

Member Author

@psychocoderHPC We have to do that because the admins of the wexac cluster, as far as we understand it, do not allow passing the submit file as an argument and only allow "streaming" it to bsub.
@danlevy100 Did that configuration change in the meantime?

@danlevy100 danlevy100 May 23, 2021

@PrometheusPi @psychocoderHPC No, the configuration did not change.
The only way I could get bsub to submit using a script file is by "bsub < submit.start".
There is some information here: https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=bsub-write-job-scripts

I had to change ~/src/picongpu/tbg in order to be able to submit with tbg, as mentioned in one of my comments in the #3496 thread.

Member Author

@PrometheusPi PrometheusPi May 25, 2021

@sbastrakov According to @danlevy100's post here, setting TBG_SUBMIT="bsub <" should not work. @danlevy100 How did you change tbg? (Could you provide a diff?)
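
Editorial sketch of the shell behaviour behind this thread: a `<` stored inside a variable is not treated as a redirection when the variable is simply expanded, only when the resulting string is re-parsed (for example via `eval`). Whether tbg expands `TBG_SUBMIT` plainly or re-parses it is not stated in this conversation, so this only illustrates the two possibilities. `fake_bsub` is a hypothetical stand-in for `bsub`, and the job file content is a placeholder.

```shell
#!/bin/sh
# Hypothetical stand-in for bsub: reports whether it received
# arguments or a job script on standard input.
fake_bsub() {
    if [ "$#" -gt 0 ]; then
        echo "args: $*"
    else
        echo "stdin: $(cat)"
    fi
}

# Placeholder job script (stands in for tbg's generated submit file).
job_file=$(mktemp)
echo "#BSUB -J demo" > "$job_file"

submit_cmd="fake_bsub <"

# Plain expansion: "<" is passed as a literal argument, no redirection happens.
plain_result=$($submit_cmd "$job_file")

# eval re-parses the string, so "<" becomes a real redirection.
eval_result=$(eval "$submit_cmd \"$job_file\"")

echo "$plain_result"
echo "$eval_result"

rm -f "$job_file"
```

If tbg falls into the first case, that would be consistent with the report above that tbg itself had to be modified before it could submit on wexac.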

@PrometheusPi PrometheusPi force-pushed the add_Wexac_cluster branch 3 times, most recently from 645466d to 074aebd Compare May 6, 2021 15:35
@PrometheusPi
Member Author

Thanks @danlevy100 for sending your email to me. The co-authored commit is fixed.

.TBG_gpusPerNode=`if [ $TBG_tasks -gt $TBG_numHostedGPUPerNode ] ; then echo $TBG_numHostedGPUPerNode; else echo $TBG_tasks; fi`

# number of cores to block per GPU - we got 2 cpus per gpu
# and we will be accounted 2 CPUs per GPU anyway
Member

The comment is wrong; it should be 7 CPUs per GPU, based on the variable below.

Member Author

You are right, thanks for catching this.

Member Author

fixed
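
Editorial sketch: the `.TBG_gpusPerNode` expression quoted at the top of this thread picks the smaller of the requested task count and the number of GPUs hosted per node. The function name and the example values below are hypothetical and only illustrate that selection logic; they are not taken from the template.

```shell
#!/bin/sh
# Return the smaller of the task count ($1) and the GPUs hosted per node ($2),
# mirroring the if/else in the quoted .TBG_gpusPerNode line.
gpus_per_node() {
    tasks=$1
    hosted=$2
    if [ "$tasks" -gt "$hosted" ]; then
        echo "$hosted"
    else
        echo "$tasks"
    fi
}

# Fewer tasks than GPUs on a node: use only as many GPUs as there are tasks.
few=$(gpus_per_node 2 8)

# More tasks than GPUs on a node: cap at the node's GPU count.
many=$(gpus_per_node 16 8)

echo "2 tasks, 8 GPUs per node  -> $few GPUs per node"
echo "16 tasks, 8 GPUs per node -> $many GPUs per node"
```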

Co-authored-by: danlevy100 <danlevy100@gmail.com>
@psychocoderHPC psychocoderHPC marked this pull request as draft June 28, 2021 06:04
@psychocoderHPC
Member

I switched this PR to draft because there is currently no progress on it and it is always shown as mergeable.

@PrometheusPi
Member Author

@danlevy100 or Sheroy Tata, are you interested in testing this?
If yes, I will reopen this pull request.
