MPAS-O running on mira / cetus #260

Closed
douglasjacobsen opened this issue Jun 30, 2015 · 30 comments

@douglasjacobsen
Member

MPAS-O has not been confirmed to run correctly on mira or cetus yet.

@douglasjacobsen
Member Author

@jayeshkrishna I only assigned you because you're the POC. I'll help make sure it's working, but we can have the discussions about it here.

@jayeshkrishna
Contributor

Thanks.
(FYI: Please refer to the earlier discussion of MPAS-O on Mira/Cetus in issue #173.)

@jayeshkrishna
Contributor

Unfortunately the 2-day run on Cetus failed.
The logs are here -
https://gist.github.com/f726e12d3f637ccaa82b

@douglasjacobsen, does anything in particular in the logs stand out to you as the source of the failure?

@douglasjacobsen
Member Author

@jayeshkrishna Not really, but it looks like it's dying in finalization (since it's getting through almost two days' worth of time steps).

What command did you use to create the case? I want to try it out on cetus, but previously it was dying when it was trying to init the model for me.

@jayeshkrishna
Contributor

I used the following command to create the case (the failure is the same error I got when I was working with Mark on #173):

./create_newcase -compset CMPASO-NYF -mach cetus -res T62_oEC60to30 -case mpaso_testing_CMPASO-NYF_cetus_T62_oEC60to30_062615_01

@jayeshkrishna
Contributor

I found two versions of ESMF_TimeMod in the repo (see below). The difference between them is possibly the source of one of the error messages in the cesm log (the format specification error):

[jayesh@cetuslac1 ACME_mpaso_testing (douglasjacobsen/mpas-o/add-model=)]$ find . -name ESMF_TimeMod.F90
./models/ocn/mpas-o/model/src/external/esmf_time_f90/ESMF_TimeMod.F90
./models/utils/esmf_wrf_timemgr/ESMF_TimeMod.F90
[jayesh@cetuslac1 ACME_mpaso_testing (douglasjacobsen/mpas-o/add-model=)]$ diff ./models/utils/esmf_wrf_timemgr/ESMF_TimeMod.F90 ./models/ocn/mpas-o/model/src/external/esmf_time_f90/ESMF_TimeMod.F90 | more
839c839
<            "(I", yearWidth, ".", yearWidth, "'-',I2.2,'-',I2.2,'_',I2.2,':',I2.2,':',I2.2)"
---
>            "(I", yearWidth, ".", yearWidth, ",'-',I2.2,'-',I2.2,'_',I2.2,':',I2.2,':',I2.2)"
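
To make the one-character difference in the diff concrete, here is a minimal Fortran sketch. It is not the actual ESMF_TimeMod code: it assumes yearWidth is 4 and that the format pieces are simply concatenated in order, and the program and variable names are made up for illustration. The first string has no comma between the year descriptor and the '-' literal (as in the models/utils copy); the second has it (as in the MPAS copy). Standard Fortran expects a comma there, which is probably why a strict runtime complains about the first form.

```fortran
program format_comma_demo
  implicit none
  integer, parameter :: yearWidth = 4        ! assumed width, for illustration only
  character(len=8)   :: w
  character(len=96)  :: fmt_missing_comma, fmt_with_comma
  character(len=32)  :: stamp

  ! Render the width as text so it can be spliced into the format string.
  write(w, '(I0)') yearWidth

  ! Variant with no comma between the year descriptor and the '-' literal.
  fmt_missing_comma = "(I" // trim(w) // "." // trim(w) // &
                      "'-',I2.2,'-',I2.2,'_',I2.2,':',I2.2,':',I2.2)"

  ! Variant with the comma in place.
  fmt_with_comma    = "(I" // trim(w) // "." // trim(w) // &
                      ",'-',I2.2,'-',I2.2,'_',I2.2,':',I2.2,':',I2.2)"

  ! Show both assembled formats side by side.
  print '(A)', trim(fmt_missing_comma)
  print '(A)', trim(fmt_with_comma)

  ! Only the comma-separated format is used for real output here.
  write(stamp, fmt_with_comma) 2015, 6, 30, 0, 0, 0
  print '(A)', trim(stamp)
end program format_comma_demo
```

With the comma-separated format, the last line printed is 2015-06-30_00:00:00.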

@douglasjacobsen
Member Author

@jayeshkrishna Yeah, the one in the ocean isn't built as part of this. (i.e. models/ocn/mpas-o/model/src/external/esmf_time_f90). Instead, we use the one from models/utils/esmf_wrf_timemgr.

So, unless the bottom format specifier looks like the incorrect one, I'd be surprised if this is actually causing an issue.

@douglasjacobsen
Member Author

@jayeshkrishna I had a run using the same command you used that made it through the queue over the long weekend, but it didn't make it as far as your run did.
https://gist.github.com/douglasjacobsen/3fbde002f0a1b9cb201c

I'm going to keep looking into it, but I'm not sure what to do about this. I had built this run in debug mode, but since it failed differently from how yours failed, I'm going to try it without debug mode.

@douglasjacobsen
Member Author

@jayeshkrishna I can get through a run, but now when it's deallocating everything, I get the following error message:

Error encountered while attempting to deallocate a data object

It specifically happens on this line: https://github.com/ACME-Climate/MPAS/blob/ocean/develop/src/framework/mpas_pool_routines.F#L252

I tried changing the compiler version to see if it was a bug, but that didn't work. Any ideas?

@jayeshkrishna
Contributor

Was (dptr % r2) NULLified before use? Some compilers don't nullify ptrs by default.

@douglasjacobsen
Member Author

Also, I added an "if associated" test right before the deallocate to check whether the pointer is something it should be able to deallocate, and it is.

@jayeshkrishna
Contributor

How about r2 % array? It is not NULLified in its declaration (mpas_field_types.inc).

@worleyph
Contributor

worleyph commented Jul 8, 2015

I haven't been paying any attention to this, so please excuse the possibly stupid comment, but have you tried running with fewer processes per node (and more nodes), just to eliminate any possibility that this is a memory corruption issue due to maxing out the memory?

@douglasjacobsen
Member Author

@jayeshkrishna r2 % array is deallocated fine right before that. I could try nullifying it after it's deallocated, just as a test though.

@worleyph I've not had good luck trying to change the number of processes on mira / cetus. So, I'm just using the default for now. However, this grid shouldn't run into memory issues. It can easily run on 64 processors (slowly at least), and this case is running on 2048.

@jayeshkrishna
Contributor

I am assuming that deallocating an unassociated ptr might have corrupted data structures.

@douglasjacobsen
Member Author

@jayeshkrishna It wasn't unassociated though. It was an actual allocated pointer (meaning r2 % array).

@jayeshkrishna
Contributor

Ok, I was thinking along the lines of: r2%array is not NULLified => associated() returns true even though the pointer is unassociated => deallocate(r2%array) corrupts memory. It might be worthwhile to print a couple of elements of the array to make sure it is what you expect it to be.
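
The chain of reasoning above can be sketched in a few lines of Fortran. This is an illustration only: field2d_sketch is a hypothetical stand-in, not the real type from mpas_field_types.inc, and the program does not reproduce the failure on mira. In standard Fortran a pointer that is never initialized or nullified has an undefined association status, so the result of associated() on it cannot be trusted. Giving the component a default => null() removes that uncertainty, and stat= on deallocate lets the code report a failure instead of aborting.

```fortran
program pointer_status_demo
  implicit none

  ! Hypothetical stand-in for an MPAS field type; not the real declaration.
  type field2d_sketch
     real, dimension(:,:), pointer :: array => null()  ! starts disassociated
  end type field2d_sketch

  type(field2d_sketch), pointer :: r2 => null()
  integer :: ierr

  allocate(r2)
  ! Thanks to the default initialization, this status is well defined.
  print *, 'array associated after allocate(r2)? ', associated(r2 % array)

  allocate(r2 % array(3, 4))
  r2 % array = 1.0
  ! Printing a few elements is a cheap sanity check on the target.
  print *, 'first elements: ', r2 % array(1:2, 1)

  if (associated(r2 % array)) then
     ! stat= captures a failure code instead of letting the runtime abort.
     deallocate(r2 % array, stat=ierr)
     if (ierr /= 0) print *, 'deallocate(array) failed, stat = ', ierr
     nullify(r2 % array)
  end if

  deallocate(r2, stat=ierr)
  if (ierr /= 0) print *, 'deallocate(r2) failed, stat = ', ierr
  nullify(r2)
end program pointer_status_demo
```

If the `=> null()` initializers were dropped, the first associated() call would be querying a pointer with undefined status, which is the situation the hypothesis above describes.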

@jayeshkrishna
Contributor

  • As I mentioned above, the difference in ESMF_TimeMod is the cause of the warning messages in the cesm log (changing the format string got rid of the warning messages).
  • With DEBUG mode on, as you mentioned before, I am not able to get past initialization.

@singhbalwinder
Contributor

Not sure if it is relevant here, but I ran into a lot of unassociated pointer errors and ended up nullifying a lot of pointers in MPAS-O. My changes to the code are on my local machine. If you think they may be useful, I can push them to the branch.

On Wed, Jul 8, 2015 at 10:57 AM, jayeshkrishna notifications@github.com
wrote:

Ok, I was thinking in the lines of "r2%array not NULLified" =>
associated() returns true even though the ptr is unassociated =>
deallocate(r%array) corrupts memory. It might be worthwhile just trying to
print couple of elements of the array to make sure it is what you expect it
to be.



@jayeshkrishna
Contributor

You could also try using the "-qinitauto" compiler option to see if that helps.

@douglasjacobsen
Member Author

@jayeshkrishna Yeah, the ESMF change needs to happen within ACME either way to prevent that message. I think the MPAS version already has that change, but within ACME the MPAS version doesn't get used, so we don't need to modify it at all.

I tried adding a nullify(dptr % r2 % array) and it didn't help, but I'll let you know about -qinitauto.

@douglasjacobsen
Member Author

@jayeshkrishna Adding -qinitauto actually made the model die during init. I'll have to look into why, but the error messages are not useful right now.

@jayeshkrishna
Contributor

I was able to get past the init failure with DEBUG turned on. However, the run did not complete (it crashed).
The logs and the patch are here - https://gist.github.com/759b2bbfeefb39794558

@douglasjacobsen
Member Author

Thanks @jayeshkrishna. I'll look into it with debug on now and see if it gets us any further.

@maltrud
Contributor

maltrud commented Jul 9, 2015

On Jul 9, 2015, at 8:28 AM, Doug Jacobsen <notifications@github.com> wrote:

Thanks @jayeshkrishna. I'll look into it with debug on now and see if it gets us any further.

Hi guys, I think you may just need to decrease the time step to 10 minutes. Or has it run on other platforms with a 20-minute step? -mat

@douglasjacobsen
Member Author

@maltrud It runs fine with a 20 minute time step. The error on mira is some weird fortran + xlf issue that we're trying to figure out.

@rljacob
Member

rljacob commented Nov 23, 2015

Can this issue be closed?

@douglasjacobsen
Member Author

@rljacob I haven't tested it again yet. I need to get around to it, but the hope is that #412 should fix this issue.

@douglasjacobsen
Member Author

This is fixed now.

rljacob added a commit that referenced this issue Jul 20, 2016