
fix: use -hl to enforce per job and per host memory limits #5

Closed
wants to merge 1 commit into from

Conversation

dlaehnemann
Contributor

We might additionally have to check whether LSB_RESOURCE_ENFORCE contains the "memory" string, and if it doesn't, fall back to the /job syntax in the rusage[] statement.
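The check suggested above could look something like this minimal sketch, assuming the lsf.conf settings are visible in the environment (e.g. after sourcing profile.lsf); the helper name is hypothetical:

```python
import os


def memory_resource_string(mem_mb: int) -> str:
    """Hypothetical helper: pick the rusage syntax based on LSB_RESOURCE_ENFORCE.

    If LSB_RESOURCE_ENFORCE contains "memory", LSF enforces memory limits
    itself, so a plain rusage[mem=X] should suffice; otherwise fall back
    to the per-job /job syntax.
    """
    enforce = os.environ.get("LSB_RESOURCE_ENFORCE", "")
    if "memory" in enforce.split():
        return f"rusage[mem={mem_mb}]"
    return f"rusage[mem={mem_mb}/job]"
```

Whether the variable is actually exported to the submission environment depends on the site setup, which is exactly the parsing problem discussed below.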

@BEFH
Owner

BEFH commented Mar 21, 2024

Bad news: -R rusage[mem=X/job] does not seem to be working on further testing. I am trying to figure things out, but I think we need a rethink. Maybe go back to something close to the original code and detect your specific setup.

@BEFH
Owner

BEFH commented Mar 21, 2024

The issue with this is that it is actually (at least on my server) reserving mem × threads as the resource request, but killing the job once it reaches mem. This is the worst of both worlds.

Here is an example where I requested 128 MB of memory and 6 threads, and wrote a dummy rule (with the help of ChatGPT) that uses 256 MB of memory (tested and confirmed). You can see here that it was killed at 128 MB.

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 1.

Resource usage summary:

    CPU time :                                   3.00 sec.
    Max Memory :                                 128 MB
    Average Memory :                             43.40 MB
    Total Requested Memory :                     768.00 MB
    Delta Memory :                               640.00 MB
    Max Swap :                                   -
    Max Processes :                              6
    Max Threads :                                8
    Run time :                                   5 sec.
    Turnaround time :                            11 sec.

The output (if any) follows:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Provided resources: mem_mb=128, mem_mib=123, disk_mb=1000, disk_mib=954
Select jobs to execute...
Execute 1 jobs...

I really think we need to use memory per core by default and allow changing the behavior via an environment variable or a resource request.
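The per-core default amounts to simple arithmetic: since LSF multiplies a per-core rusage reservation by the core count, the plugin would divide the job's total memory by the number of threads before submitting. A minimal sketch (helper name hypothetical):

```python
import math


def per_core_mem_mb(mem_mb: int, threads: int) -> int:
    """Divide a job's total memory across its cores, so that LSF's
    per-core multiplication reproduces the intended total reservation.
    Rounds up so the total is never under-reserved."""
    return math.ceil(mem_mb / threads)


# With the numbers from the log above: 128 MB over 6 threads gives
# ceil(128 / 6) = 22 MB per core, i.e. 132 MB reserved in total
# rather than the 768 MB shown in the resource usage summary.
```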

@dlaehnemann
Contributor Author

Do you have LSB_RESOURCE_ENFORCE set in the lsf.conf file? And does it contain memory, so something like:

LSB_RESOURCE_ENFORCE="memory cpu"

@dlaehnemann
Contributor Author

The recommendation from our local admins was not to try to parse too much of the configuration, as settings can live in loads of different places. Some of them can even be set or altered during submission (by things like the esub script). So we would have to parse and check a lot of things.

So unless we find a restricted set of informative settings, we might have to resort to setting some kind of environment variable manually for each cluster configuration. This is annoying, because every user will first have to work out their cluster setup by trial and error to find a working manual setting...

@BEFH
Owner

BEFH commented Mar 21, 2024

LSB_RESOURCE_ENFORCE="memory cpu gpu"

What about a variable SNAKEMAKE_LSF_MEMFMT that can be perjob or unset? Do you need -hl for the command to work?

@BEFH
Owner

BEFH commented Mar 21, 2024

I made some changes here: https://github.com/BEFH/snakemake-executor-plugin-lsf/tree/flexible_mem_behavior

I don't know if I should make a competing pull request or if you want to take a look?

You should be able to set SNAKEMAKE_LSF_MEMFMT=perjob to get the desired behavior.
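The proposed switch could be read like this minimal sketch (the function name is hypothetical; the variable name SNAKEMAKE_LSF_MEMFMT and the perjob/unset semantics are from the comments above):

```python
import os


def lsf_mem_format() -> str:
    """Sketch: SNAKEMAKE_LSF_MEMFMT=perjob selects per-job memory
    requests; anything else (or the variable being unset) keeps the
    default per-core behavior."""
    if os.environ.get("SNAKEMAKE_LSF_MEMFMT", "").lower() == "perjob":
        return "perjob"
    return "percore"
```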

@dlaehnemann
Contributor Author

With the other fixes and the added documentation, I think starting an alternative pull request with your branch is probably a good idea. One last thought:

The -hl command line argument might be a way to get memory enforcement to work on a per-job (rather than per-task / per-CPU) basis more generally. See the option description here:
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=options-hl

So I just wanted to double-check whether you have tried the version in this pull request, or only the version with the /job syntax? Because if the -hl version works in both our setups, that would be much nicer than having to add an environment variable to the mix...
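For concreteness, a submission along these lines is what the pull request's approach amounts to: pass -hl so the -M limit is enforced host-based and per job (per the IBM docs linked above). This is a sketch, not the plugin's actual command construction; the helper name and argument layout are assumptions:

```python
def bsub_args(jobname: str, mem_mb: int, threads: int) -> list[str]:
    """Sketch of a bsub invocation using -hl for per-job, host-based
    enforcement of the -M memory limit (flag meanings taken from the
    IBM Spectrum LSF docs linked above)."""
    return [
        "bsub",
        "-J", jobname,
        "-n", str(threads),
        "-hl",                       # enforce limits per job on the host
        "-M", str(mem_mb),           # memory limit (unit set by LSF_UNIT_FOR_LIMITS)
        "-R", f"rusage[mem={mem_mb}]",
    ]
```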

@BEFH
Owner

BEFH commented Mar 22, 2024 via email

@BEFH
Owner

BEFH commented Mar 22, 2024

I've merged the other pull request and made a release. Could you please approve this and test on your cluster:

bioconda/bioconda-recipes#46693

I also updated the docs to more fully cover configuration. You can use lsf_extra to test what is needed on your cluster.

@dlaehnemann
Contributor Author

Done. And I'll test the new version next week.

@dlaehnemann dlaehnemann deleted the patch-1 branch March 22, 2024 20:29
@BEFH
Owner

BEFH commented Mar 23, 2024

Perfect, thanks. We can use the environment variable to make further changes if your cluster needs them. I have tested that the environment variable does what I describe, so now we need to see whether it works for you.
