
fix: use -hl to enforce per job and per host memory limits #5

Closed
wants to merge 1 commit into from

Conversation

dlaehnemann
Contributor

We might additionally have to check whether LSB_RESOURCE_ENFORCE contains the "memory" string, and if it doesn't, fall back to the /job syntax in the rusage[] statement.
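The check suggested above could look something like this minimal sketch, assuming the lsf.conf settings are visible in the environment (e.g. after sourcing profile.lsf); the helper name is hypothetical:

```python
import os


def memory_resource_string(mem_mb: int) -> str:
    """Hypothetical helper: pick the rusage syntax based on LSB_RESOURCE_ENFORCE.

    If LSB_RESOURCE_ENFORCE contains "memory", LSF enforces memory limits
    itself, so a plain rusage[mem=X] should suffice; otherwise fall back
    to the per-job /job syntax.
    """
    enforce = os.environ.get("LSB_RESOURCE_ENFORCE", "")
    if "memory" in enforce.split():
        return f"rusage[mem={mem_mb}]"
    return f"rusage[mem={mem_mb}/job]"
```

Whether the variable is actually exported to the submission environment depends on the site setup, which is exactly the parsing problem discussed below.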

@BEFH
Owner

BEFH commented Mar 21, 2024

Bad news: -R rusage[mem=X/job] does not seem to be working on further testing. I am trying to figure things out, but I think we need a rethink. Maybe go back to something close to the original code and detect your specific setup.

@BEFH
Owner

BEFH commented Mar 21, 2024

The issue with this is that it is actually (at least on my server) reserving mem × threads as the resource request, but killing the job once it reaches mem. This is the worst of both worlds.

Here is an example where I requested 128 MB of memory and 6 threads, and wrote a dummy rule (with the help of ChatGPT) that uses 256 MB of memory (tested and confirmed). You can see here that it was killed at 128 MB.

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 1.

Resource usage summary:

    CPU time :                                   3.00 sec.
    Max Memory :                                 128 MB
    Average Memory :                             43.40 MB
    Total Requested Memory :                     768.00 MB
    Delta Memory :                               640.00 MB
    Max Swap :                                   -
    Max Processes :                              6
    Max Threads :                                8
    Run time :                                   5 sec.
    Turnaround time :                            11 sec.

The output (if any) follows:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Provided resources: mem_mb=128, mem_mib=123, disk_mb=1000, disk_mib=954
Select jobs to execute...
Execute 1 jobs...

I really think we need to use memory per core by default and allow changing the behavior via an environment variable or a resource request.
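The per-core default amounts to simple arithmetic: since LSF multiplies a per-core rusage reservation by the core count, the plugin would divide the job's total memory by the number of threads before submitting. A minimal sketch (helper name hypothetical):

```python
import math


def per_core_mem_mb(mem_mb: int, threads: int) -> int:
    """Divide a job's total memory across its cores, so that LSF's
    per-core multiplication reproduces the intended total reservation.
    Rounds up so the total is never under-reserved."""
    return math.ceil(mem_mb / threads)


# With the numbers from the log above: 128 MB over 6 threads gives
# ceil(128 / 6) = 22 MB per core, i.e. 132 MB reserved in total
# rather than the 768 MB shown in the resource usage summary.
```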

@dlaehnemann
Contributor Author

Do you have LSB_RESOURCE_ENFORCE set in the lsf.conf file? And does it contain memory, so something like:

LSB_RESOURCE_ENFORCE="memory cpu"

@dlaehnemann
Contributor Author

The recommendation from our local admins was not to try to parse too much of the configuration, as settings can live in loads of different places. Some of them can even be set or altered during submission (by things like the esub script). So we would have to parse and check a lot of things.

So unless we find a restricted set of informative settings, we might have to resort to setting some kind of environment variable manually for each cluster configuration. This is annoying, because every user will first have to work out their cluster setup by trial and error to find a working manual setting...

@BEFH
Owner

BEFH commented Mar 21, 2024

LSB_RESOURCE_ENFORCE="memory cpu gpu"

What about a variable SNAKEMAKE_LSF_MEMFMT that can be perjob or unset? Do you need -hl for the command to work?

@BEFH
Owner

BEFH commented Mar 21, 2024

I made some changes here: https://github.com/BEFH/snakemake-executor-plugin-lsf/tree/flexible_mem_behavior

I don't know if I should make a competing pull request or if you want to take a look?

You should be able to set SNAKEMAKE_LSF_MEMFMT=perjob to get the desired behavior.
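The proposed switch could be read like this minimal sketch (the function name is hypothetical; the variable name SNAKEMAKE_LSF_MEMFMT and the perjob/unset semantics are from the comments above):

```python
import os


def lsf_mem_format() -> str:
    """Sketch: SNAKEMAKE_LSF_MEMFMT=perjob selects per-job memory
    requests; anything else (or the variable being unset) keeps the
    default per-core behavior."""
    if os.environ.get("SNAKEMAKE_LSF_MEMFMT", "").lower() == "perjob":
        return "perjob"
    return "percore"
```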

@dlaehnemann
Contributor Author

With the other fixes and the added documentation, I think starting an alternative pull request with your branch is probably a good idea. One last thought:

The -hl command line argument might be a way to get memory enforcement to work on a per-job (rather than per-task / per-CPU) basis more generally. See the option description here:
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=options-hl

So I just wanted to double-check whether you have tried the version in this pull request, or only the version with the /job syntax? Because if the -hl version works in both our setups, that would be much nicer than having to add an environment variable to the mix...
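For concreteness, a submission along these lines is what the pull request's approach amounts to: pass -hl so the -M limit is enforced host-based and per job (per the IBM docs linked above). This is a sketch, not the plugin's actual command construction; the helper name and argument layout are assumptions:

```python
def bsub_args(jobname: str, mem_mb: int, threads: int) -> list[str]:
    """Sketch of a bsub invocation using -hl for per-job, host-based
    enforcement of the -M memory limit (flag meanings taken from the
    IBM Spectrum LSF docs linked above)."""
    return [
        "bsub",
        "-J", jobname,
        "-n", str(threads),
        "-hl",                       # enforce limits per job on the host
        "-M", str(mem_mb),           # memory limit (unit set by LSF_UNIT_FOR_LIMITS)
        "-R", f"rusage[mem={mem_mb}]",
    ]
```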

@BEFH
Owner

BEFH commented Mar 22, 2024 via email

@BEFH
Owner

BEFH commented Mar 22, 2024

I've merged the other pull request and made a release. Could you please approve this and test on your cluster:

bioconda/bioconda-recipes#46693

I also updated the docs to more fully cover configuration. You can use lsf_extra to test what is needed on your cluster.

@dlaehnemann
Contributor Author

Done. And I'll test the new version next week.

@dlaehnemann dlaehnemann deleted the patch-1 branch March 22, 2024 20:29
@BEFH
Owner

BEFH commented Mar 23, 2024

Perfect, thanks. We can use the environment variable to make further changes if your cluster needs them. I have tested that the environment variable does what I describe, so now we need to see whether it works for you.
