Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix addresses issue with mulitple mpi comm groups when dyn_npes<npes #1604

Closed
wants to merge 2 commits into from

Conversation

AaronDonahue
Copy link
Contributor

Addresses issue #1593

Addresses an issue in HOMME where multiple communication groups are
generated in cases where dyn_npes < npes for the ATM model. This
causes the code to crash for certain choices of dyn_npes with respects
to npes.
* Change to initmp subroutine for the case of CAM defined
* Change to namelist read for the case of CAM defined
* Change to dynamics initialization (dyn_init1) to be
compatible with changes to HOMME

[BFB] - Bit-For-Bit

Addresses an issue in HOMME where multiple communication groups are
generated in cases where dyn_npes < npes for the ATM model.  This
causes the code to crash for certain choices of dyn_npes with respects
to npes.
	* Change to initmp subroutine for the case of CAM defined
	* Change to namelist read for the case of CAM defined
	* Change to dynamics initialization (dyn_init1) to be
	  compatible with changes to HOMME

[BFB] - Bit-For-Bit
@rljacob
Copy link
Member

rljacob commented Jun 28, 2017

@lazarusM3B can you add a real name to your github profile? All the usernames are hard to keep track of.

@gold2718
Copy link

@lazarusM3B (aka Aaron Donahue as indicated in the branch name),
You could easily try out @mt5555's preferred method by gong back to just using par and its communicator so that line 135 of dyn_comp.F90 becomes:

if (iam < par%nprocs) then
    call readnl(par, NLFileName)
end if

and make the assignments to masterproc etc. inside of readnl.
Testing should reveal any issues with physics tasks not having reasonable namelist values (I don't expect any).

@AaronDonahue
Copy link
Contributor Author

@rljacob , Absolutely, I've been meaning to do that for this reason.

@gold2718 @mt5555 , I'll go ahead and try this fix and see what happens. I'll let you know what I get.

@AaronDonahue
Copy link
Contributor Author

AaronDonahue commented Jun 28, 2017

@gold2718 @mt5555 , It looks like this fix works and is less intrusive then the previous approach. I'm going to run a quick B4B test to make sure that it isn't somehow running but has messed with the solution. Then push a new commit.

This commit is less invasive by leaving the readnl subroutine alone
and just having the ATM model skip calling the subroutine for
processors not assigned to the dynamics solve.
	* Reverted the homme file namelist_mod.F90 to original version
	* Changed dyn_init1 in dyn_comp.F90 to skip the namelist read
	  for cores not assigned to solve dynamics.

[BFB] - Bit-For-Bit
@AaronDonahue
Copy link
Contributor Author

I pushed a new commit to the branch with the above changes made.

I am about to leave on vacation for a few weeks so I may not be able to respond to any new comments on this PR until I get back.

@mt5555
Copy link
Contributor

mt5555 commented Jun 28, 2017

does this work?
I worry about statements like this:

 tstep = dtime/real(se_nsplit*qsplit,r8)

all processor will compute tstep, but in this case the wont all have the same value of se_nsplit, since it's in the namelist. In this case it is harmless, becuase non-dycore pes would never use 'tstep'.

Sorry to waffle on this, but your other solution might be better.

Or, how about making a fake "par",

type (parallel_t) :: par _all

par=initmp(npes_se)
par_all%comm = mpicom
par_all%root = 0
par_all%masterproc = (iam==0)
etc...

call readnl(par_all, NLFileName)

@mt5555
Copy link
Contributor

mt5555 commented Jun 28, 2017

old comment: (moved from issue #1593 ) adding here for reference:

So are you saying that there are dycore namelist variables that are needed by the physics procs? I would have hoped that anything in the namelist read by the dycore is not needed by the physics, but that might be wishfull thinking.

An alternative would be to call readnl from the dp coupling layer (since it wont be a big deal to call it twice).

I can see your solution may be the best way - but my first reaction is I dont like dycore code sometimes working with one communicator, and other times working with a different communicator.

jgfouca pushed a commit that referenced this pull request Jul 14, 2017
Fixed issue with GET_REFCASE not being set correctly that was introduced with PR#1561 and
an issue with X compset mapping files that was introduced in #1586

Test suite: scripts_regression_tests.py
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes

User interface changes?:

Code review:
@rljacob rljacob requested review from amametjanov and removed request for singhbalwinder July 24, 2017 17:09
@rljacob
Copy link
Member

rljacob commented Jul 24, 2017

Where are we with this PR?

@rljacob rljacob requested review from mt5555 and removed request for amametjanov July 24, 2017 17:10
@rljacob
Copy link
Member

rljacob commented Jul 24, 2017

@mt5555 or @golaz you need to approve this.

@AaronDonahue
Copy link
Contributor Author

@rljacob @mt5555 @golaz , I've been working on this this morning. Sorry it took so long to get back but I wasn't able to work on this until I got up and running on LC machines, which we just got working last week.

But to summarize what I've got so far. I took a look at implementing the suggestions made by @mt5555 up above but this lead to another set of errors further down the pipe in the code. I was trying to hunt down exactly why. As it stands the fix that is in this commit appears to work, but as @mt5555 pointed out it means that non-dynamics processors won't read in the dynamics namelist. So far this doesn't look like a problem, but could be hard to test for. If the reviewers are ok with this then we can probably call it finished, but if there are still concerns I am happy to continue working on this issue.

@singhbalwinder
Copy link
Contributor

@mt5555 and @gold2718 : Is it okay to move forward with this PR?

@mt5555
Copy link
Contributor

mt5555 commented Aug 17, 2017

@AaronDonahue : looks good - your knowledge of this is deeper than mine, so I defer to your judgement as to the best approach.

@AaronDonahue
Copy link
Contributor Author

AaronDonahue commented Aug 21, 2017

@mt5555 , Excellent, thank you. @singhbalwinder, I would like to take one more look at this code before we go ahead an integrate it. It has been over a month since I started this PR and I want to make sure that its ready to go.

@AaronDonahue AaronDonahue reopened this Aug 21, 2017
@rljacob
Copy link
Member

rljacob commented Aug 24, 2017

@AaronDonahue please merge this to next today.

@singhbalwinder
Copy link
Contributor

@rljacob : Is your comment above for this PR? This PR was created by @AaronDonahue and I am the integrator. @AaronDonahue indicated in his comment above that he would like to work on the suggestions made by reviewers before merging it. Is that still the case @AaronDonahue ?

@AaronDonahue
Copy link
Contributor Author

@singhbalwinder , yes, I was planning on making my final check this morning before giving the green light.

@singhbalwinder
Copy link
Contributor

Sounds good. Let me know when you think you are okay with this PR and I will merge it to next.

@rljacob
Copy link
Member

rljacob commented Aug 24, 2017

Sorry, I mean you should integrate it @singhbalwinder.

@AaronDonahue
Copy link
Contributor Author

@singhbalwinder @rljacob , I've dug deeper into potential issues with handling communication groups in this way and found that the code can run into problems with a history write. Based on this I will need some more time to work on the issue.

@rljacob
Copy link
Member

rljacob commented Aug 24, 2017

@AaronDonahue in that case, please close the PR and you can reopen it when its ready.

@AaronDonahue
Copy link
Contributor Author

@rljacob , absolutely, will do

@rljacob rljacob deleted the aarondonahue/cam/nprocs_fix branch February 15, 2018 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants