From 654f636f4c35dee0f8fa6eb5e2a0a7bafd3036ee Mon Sep 17 00:00:00 2001
From: Robert McLay
Date: Mon, 17 Jun 2024 09:59:30 -0600
Subject: [PATCH 1/2] add FAQ about HPE/Cray and collections

---
 docs/source/040_FAQ.rst | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/docs/source/040_FAQ.rst b/docs/source/040_FAQ.rst
index bd590c747..931d6b75c 100644
--- a/docs/source/040_FAQ.rst
+++ b/docs/source/040_FAQ.rst
@@ -398,3 +398,15 @@ prepend_path() or append_path() a bad idea?
         local testdir = os.getenv("HOME") .. "/test"
         setenv("TESTDIR", testdir)
         prepend_path("TESTPATH", testdir)
+
+Why don't collections work on HPE/Cray systems?
+
+    Currently, program environments on login nodes are different from
+    the ones on compute nodes. Collections require that the
+    modulefiles have the same name (versions could be different but not
+    the name). When a collection is *restored*, Lmod purges **ALL**
+    modules and loads all the modules listed in the collection. Note
+    that load() calls inside a modulefile are ignored for very
+    complicated reasons. Instead, Lmod loads all modules listed in the
+    collection. This works well except when the list of modules is
+    different.

From 2fdfa8ca1bff34bd85a361493230e26fce6f3697 Mon Sep 17 00:00:00 2001
From: Robert McLay
Date: Mon, 17 Jun 2024 10:13:35 -0600
Subject: [PATCH 2/2] add FAQ about HPE/Cray and collections and why

---
 docs/source/040_FAQ.rst | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/docs/source/040_FAQ.rst b/docs/source/040_FAQ.rst
index 931d6b75c..99db330e0 100644
--- a/docs/source/040_FAQ.rst
+++ b/docs/source/040_FAQ.rst
@@ -409,4 +409,19 @@ Why don't collections work on HPE/Cray systems?
     that load() calls inside a modulefile are ignored for very
     complicated reasons. Instead, Lmod loads all modules listed in the
     collection. This works well except when the list of modules is
-    different.
+    different on different nodes.
+
+    The reason why Lmod loads the list of modules in the collection
+    and ignores load()-type functions in the modulefiles is complex.
+    The problem arises when two or more modulefiles set the same
+    environment variable. Suppose that your site sets the variable
+    **MPI_HOME** (using *setenv()*) in each MPI modulefile. If Lmod
+    obeyed the load() function in each modulefile, then it would have
+    to unload the extra modules not in the collection. In the case
+    where the user switched MPI modules, Lmod would load both MPI
+    modules and then unload one to match the list of names in the
+    collection. Unloading the second module would unset **MPI_HOME**
+    and leave this variable with no value. If a site depended on
+    **MPI_HOME** as part of an MPI program startup script, then those
+    users would not be able to submit MPI programs.
+
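
The **MPI_HOME** scenario described in the FAQ text can be illustrated with a small toy model. This is only a sketch in Python and not Lmod's implementation: the module names (``app``, ``mpi_a``, ``mpi_b``), the paths, and the helper functions ``load``, ``unload``, and ``restore`` are all hypothetical. It assumes, as the FAQ text states, that unloading a module that used setenv() unsets the variable outright.

```python
# Toy model only: NOT Lmod's implementation. All names and paths below
# are hypothetical and exist purely for illustration.

MODULEFILES = {
    # Each MPI modulefile sets MPI_HOME with setenv().
    "mpi_a": {"setenv": {"MPI_HOME": "/opt/mpi_a"}, "load": []},
    "mpi_b": {"setenv": {"MPI_HOME": "/opt/mpi_b"}, "load": []},
    # The app modulefile contains load("mpi_a").
    "app": {"setenv": {}, "load": ["mpi_a"]},
}


def load(name, env, loaded, obey_load):
    """Load one module, optionally following load() calls in its modulefile."""
    if name in loaded:
        return
    modulefile = MODULEFILES[name]
    if obey_load:
        for dep in modulefile["load"]:
            load(dep, env, loaded, obey_load)
    env.update(modulefile["setenv"])
    loaded.append(name)


def unload(name, env, loaded):
    """Unloading reverses setenv() by unsetting the variable outright."""
    for var in MODULEFILES[name]["setenv"]:
        env.pop(var, None)
    loaded.remove(name)


def restore(collection, obey_load):
    """Restore a collection starting from a purged environment."""
    env, loaded = {}, []
    for name in collection:
        load(name, env, loaded, obey_load)
    # Unload anything a load() call pulled in that the collection lacks.
    for name in [m for m in loaded if m not in collection]:
        unload(name, env, loaded)
    return env, loaded


# The user loaded app, switched from mpi_a to mpi_b, then saved this list.
COLLECTION = ["app", "mpi_b"]

env_obey, loaded_obey = restore(COLLECTION, obey_load=True)
env_list, loaded_list = restore(COLLECTION, obey_load=False)

print("obeying load():", loaded_obey, env_obey)
print("loading by list:", loaded_list, env_list)
```

In this model both strategies end with the same list of loaded modules, but only loading strictly by the collection's list leaves **MPI_HOME** set. When load() is obeyed, unloading the extra ``mpi_a`` module removes the variable that ``mpi_b`` had just set.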