Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix for conda on gaea #428

Closed
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion ush/load_modules_run_task.sh
Original file line number Diff line number Diff line change
Expand Up @@ -183,10 +183,11 @@ module list

if [ -n "${SRW_ENV:-}" ] ; then
set +u
conda activate ${SRW_ENV}
if [ $machine = "gaea" ]; then
conda deactivate
conda activate ${SRW_ENV}
else
conda activate ${SRW_ENV}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a problem on Gaea and Jenkins, which is why this activate->deactivate->activate route is used specifically for Gaea. Please take a look at this discussion:
#404 (comment)
Maybe something has changed since moving to Lua, but pretty sure it was working with Jenkins as it was a couple of days ago.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Daniel. The deactivate worked for me when I tried it without first activating an environment, but I am looking into this further now. I think that we may not need the deactivate at all, actually. The problem I was trying to fix on my end was because I had an old conda init stanza in my .bash_profile that was confusing the new conda. I will see if something similar is messing up the Jenkins job.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mark-a-potts Maybe there is a way to make gaea consistent with other systems (one conda activate) once EPIC installs are used in PR #419 ? I also don't like the hack either.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, the Jenkins CI pipeline for Gaea successfully ran through to completion for this PR. If it was possible to remove the conda deactivate entirely, that would be great, but it might need to wait until after the updated miniconda PRs come in.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelLueken Indeed. I was under the impression that the conda deactivate will be a no-op before conda activate $SRW is executed, but it looks like it does deactivate the conda from the parent shell. I guess this is one less command for gaea but would be desirable to get rid of the conda deactivate as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested a little more this afternoon, and I can't quite figure out what is going on. When I load wflow_gaea by hand, the python path is right and conda is known to be a function, but when I run a step with rocoto, the load_modules_and_run.sh script does not know that conda is a function and so the conda activate line does not work without a conda deactivate first.

You are probably right @MichaelLueken, that we should test again once the miniconda PRs come through.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some more testing today and think I have found the problem, though I am not entirely sure why it is a problem. It seems that the extra module load for miniconda3 that happens in the modulefiles/tasks/gaea/*.lua files adds an extra level of conda stuff that messes things up. I removed all but the vx modulefile under gaea and everything seems to be working correctly without a need to deactivate the conda environment. I don't quite understand why the extra conda modules seem to be okay on other platforms, but not gaea, but I think this is a bit better of a solution than the deactivate workaround. I will re-open this and push the latest updates I have made.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is happening on all systems now after transition to EPIC modules with @natalie-perlin 's PR?
#431 (comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it is currently not activating conda in load_modules_run_task.sh because SRW_ENV is not set for it. So gaea is still an outlier compared to other Tier-1 systems. I wish there was a better solution that switches the python paths to */envs/regional_workflow/bin instead of */bin which is the root cause of the problem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it appears as though @natalie-perlin's PR #431 is having the same issue for all machines, merging this PR would likely break what Natalie has done. Would it be best to open a PR in Natalie's fork to include these changes in PR #431 and then ask her to make similar modifications to the local modulefiles for machines that aren't Gaea (since Mark has already done this for Gaea)?

fi
set -u
fi
Expand Down