pgi_acc support is broken on Titan #1006

Closed
worleyph opened this issue Aug 10, 2016 · 14 comments

@worleyph
Contributor

by something that has changed on the OLCF side?

Building with pgi or pgi_acc (and I assume with intel as well) generates:

 > cat software_environment.txt 
 Currently Loaded Modulefiles:
 ModuleCmd_List.c(146):FATAL:997: The environment variables LOADEDMODULES and _LMFILES_ have inconsistent lengths.

This does not prevent pgi builds, because we are using the default version of pgi. pgi_acc fails when it tries to switch to a different version.

@mrnorman is on vacation. This almost sounds like the issue that @ndkeen had to deal with when module loads were documented in two different lists? Noel - does this look familiar?

In any case, easily documented, and I will submit a report to help at OLCF. Just a note that this problem has been reported.

@worleyph
Contributor Author

Note that this has something to do with the "preamble" at the beginning of the env_mach_specific script. We may have to go back to the old style without the module purge etc. I don't know. Not my area of expertise, and will reassign this to @mrnorman unless someone else thinks that they can figure this out. Not something that I know how to complain to the OLCF about, since I do not know the reason behind the current logic.

@worleyph
Contributor Author

Just noticed that env_mach_specific.titan was working fine as recently as yesterday. I did go ahead and submit a bug report to the OLCF. Hopefully this can be corrected quickly.

@worleyph
Contributor Author

OLCF staff had no insight. They were able to reproduce the problem in csh or tcsh, but it worked when starting from bash:

"My shell is set to bash and I am able to run all the commands you listed, even if I start a csh shell. One of my colleagues has his shell set to tcsh and he sees the same issue."

(env_mach_specific is a csh script.) They did suggest:

"We generally recommend against running 'module purge' since it can sometimes
mess with your environment and break modules."

@amametjanov
Member

On Edison and Cori, module rm instead of module purge is used to avoid unloading other modules loaded from shell dot files. Loading of version-specific modules had to be done in a particular order.

Tagging a related issue #1004.

@worleyph
Contributor Author

That was the style that we used to use on Titan. I thought that the purge was added to address a module loading problem. Unfortunately @mrnorman is not available to explain the history of the current logic.

@worleyph
Contributor Author

The new style was introduced only 3 weeks ago, in PR #958.

@worleyph
Contributor Author

@ndkeen , this should look familiar:

 Ok, it looks like the problem comes from the fact that under csh/tcsh, if too
 many modules are loaded, the $_LMFILES_ variable is split into two:
 $_LMFILES_000 and $_LMFILES_001. When you do a 'module purge', the
 $LOADEDMODULES and $_LMFILES_ environment variables are cleared, however,
 $_LMFILES_000 and $_LMFILES_001 are not. This seems like a bug in the module
 command rather than a change in the environment.

 ... (a) newer version of module that is
 available on Titan doesn't seem to have this problem

 One workaround could be to switch from 'module purge' to just swapping the
 programming environment. Alternatively, you could source instead the newer
 version of environment modules.

 Could you give /opt/modules/3.2.10.4/init/csh a try and see if you still get
 this error?

I'll try the new init routine. @mrnorman , since @ndkeen already worked through this at NERSC, you should touch base with Noel and decide on a common, robust, style.
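
One way to check for the state described in the quoted reply is a small diagnostic over the environment. The following is only a sketch: the function name `lmfiles_report` and the report keys are made up for illustration, and it assumes that `LOADEDMODULES` and `_LMFILES_` are colon-separated lists and that the split pieces are named `_LMFILES_` followed by digits, as in the reply above.

```python
import os
import re


def lmfiles_report(env):
    """Summarize module bookkeeping variables found in an environment mapping.

    Hypothetical helper: assumes colon-separated lists and numbered
    chunk variables such as _LMFILES_000 and _LMFILES_001.
    """
    def split(value):
        # Drop empty pieces so an empty variable counts as zero entries.
        return [part for part in value.split(":") if part]

    loaded = split(env.get("LOADEDMODULES", ""))
    lmfiles = split(env.get("_LMFILES_", ""))
    # Numbered chunk variables that may survive a 'module purge'.
    chunks = sorted(k for k in env if re.fullmatch(r"_LMFILES_\d+", k))

    return {
        "loaded_count": len(loaded),
        "lmfiles_count": len(lmfiles),
        "counts_match": len(loaded) == len(lmfiles),
        "chunk_vars": chunks,
        # Chunks present while nothing is loaded suggests leftovers.
        "stale_chunks": bool(chunks) and not loaded,
    }


if __name__ == "__main__":
    print(lmfiles_report(dict(os.environ)))
```

Running it right after a `module purge` in the affected shell would show whether any numbered chunk variables were left behind. It only reads the environment and does not try to repair it.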

@worleyph
Contributor Author

Update: /opt/modules/3.2.10.4/init/csh has its own problems. Guess that we will have to back off the module purge-style logic and reinstate the explicit module rms.

@ndkeen
Contributor

ndkeen commented Aug 11, 2016

Yuck. You could ask about why those env variables are not being wiped clean with a purge.
I'm actually not a fan of 'module purge' because sometimes I really do want to load a module that I'd like ACME to leave alone. But I do see the utility of being able to say "start with nothing". In general, this is a tricky problem. One thing I also see is that sometimes when we remove modules, it leaves your env in an odd state and perhaps it's better to tackle one thing at a time, instead of removing all, then adding all. That is, remove a module, then immediately add the version you want, etc. Module switching should be (effectively) the same as this, though I would have a slight preference to minimizing module commands (i.e., say only use load/rm).
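
The "one thing at a time" ordering can be sketched as a small generator of module commands. This is an illustration only: `paired_module_commands` is a hypothetical helper, and the module names and versions in the usage lines are placeholders, not the actual Titan module list.

```python
def paired_module_commands(wanted):
    """Return module command lines that handle one module at a time.

    For each (name, version) pair, the old module is removed and the
    requested version is loaded immediately, before moving on.
    Hypothetical helper; uses only the 'rm' and 'load' subcommands.
    """
    lines = []
    for name, version in wanted:
        lines.append(f"module rm {name}")
        lines.append(f"module load {name}/{version}")
    return lines


if __name__ == "__main__":
    # Placeholder names and versions, chosen only for the example.
    for line in paired_module_commands([("compilerX", "1.0"), ("libY", "2.3")]):
        print(line)
```

The printed lines could be pasted into a machine-specific environment script. Keeping each `rm` next to its `load` means the environment is never left with every module removed at once.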

@worleyph worleyph removed their assignment Aug 17, 2016
@worleyph
Contributor Author

@mrnorman is back from vacation now, and I think that he has a fix (correct, Matt?). The default modules were changed today, and another error crept in (hdf5 version). Updating netcdf and hdf to the latest solves the problem in my experiments.

@mrnorman
Contributor

I never saw the error:

cat software_environment.txt
Currently Loaded Modulefiles:
ModuleCmd_List.c(146):FATAL:997: The environment variables LOADEDMODULES and _LMFILES_ have inconsistent lengths.

But the pgi not found error, and the netcdf version errors are fixed. I'm also updating to the latest defaults for libsci, mpich, and atp. Tests should be done by the end of today, and I'll update master then.

@mrnorman
Contributor

Alright, I'm giving up module purge entirely.

@worleyph
Contributor Author

cat software_environment.txt
Currently Loaded Modulefiles:
ModuleCmd_List.c(146):FATAL:997: The environment variables LOADEDMODULES and _LMFILES_ have inconsistent lengths.

This was a csh/tcsh issue that OLCF were able to address at their end. FYI.

@rljacob rljacob assigned minxu74 and unassigned mrnorman Apr 19, 2017
@worleyph
Contributor Author

worleyph commented Jul 4, 2017

I haven't seen any more issues with this @minxu74 and @mrnorman . I have a different pgiacc problem now. I'm closing this. If you feel that it needs to be reopened, please do. I'm opening a new pgiacc issue though.

@worleyph worleyph closed this as completed Jul 4, 2017