ocean_only resting/z and resting/layer don't run #298

Closed

nichannah opened this issue May 23, 2016 · 5 comments
@nichannah (Collaborator)

These two experiments don't appear to run under any compiler/build/domain configuration.

Example output:
https://climate-cms.nci.org.au/jenkins/job/mom-ocean.org/job/MOM6_run/build=DEBUG,compiler=intel,experiment=ocean_only-resting-z,memory_type=dynamic/

Error output for vanilla run:

[r2816:7741] *** An error occurred in MPI_Wait
[r2816:7741] *** reported by process [47732917534721,140720308486158]
[r2816:7741] *** on communicator MPI_COMM_WORLD
[r2816:7741] *** MPI_ERR_TRUNCATE: message truncated
[r2816:7741] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[r2816:7741] *** and potentially your MPI job)

Valgrind reports an error like this just before the crash, but I don't know whether it is related:

==1376== by 0x66956A4: walk_type_array (libmpiwrap.c:908)
==1376== by 0x66956A4: make_mem_defined_if_addressable (libmpiwrap.c:1015)
==1376== by 0x66956A4: maybe_complete (libmpiwrap.c:1359)
==1376== by 0x6696AF6: PMPI_Wait (libmpiwrap.c:1463)
==1376== by 0x7E10CDF: PMPI_WAIT (pwait_f.c:74)
==1376== by 0x481B58E: mpp_mod_mp_mpp_sync_self_ (mpp_util_mpi.inc:223)
==1376== by 0x416294D: mpp_domains_mod_mp_mpp_do_update_r8_3d_ (mpp_do_update.h:245)
==1376== by 0x4024582: mpp_domains_mod_mp_mpp_update_domain2d_r8_3d_ (mpp_update_domains2D.h:145)
==1376== by 0x2B50479: mom_domains_mp_pass_var_3d_ (MOM_domains.F90:157)
==1376== by 0x10355C1: mom_state_initialization_mp_mom_initialize_state_ (MOM_state_initialization.F90:253)
==1376== by 0x2DD1864: mom_mp_initialize_mom_ (MOM.F90:1800)
==1376== by 0x1BA5C3E: MAIN__ (MOM_driver.F90:263)

Full valgrind output can be found at:

https://climate-cms.nci.org.au/jenkins/job/mom-ocean.org/job/MOM6_runtime_analyzer/analyzer=valgrind,build=DEBUG,compiler=intel,experiment=ocean_only-resting-z,memory_type=dynamic/37/console

@angus-g (Collaborator) commented May 23, 2016

As I've mentioned to Nic, it looks like too many CPUs are being allocated to this case, and it may be resolved by using fewer.
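
A minimal sketch of how the decomposition could be pinned down, assuming this experiment is driven by the usual MOM_input / MOM_override parameter files; LAYOUT is the standard MOM6 layout parameter, but the values below are illustrative for the 20x2 grid and untested:

! Illustrative MOM_override entry (untested): request a coarse layout so the
! x compute domains stay wider than the 3-point halo, instead of letting the
! layout default to the full PE count.
#override LAYOUT = 4, 1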

@adcroft (Collaborator) commented May 23, 2016

I think @angus-g is right. I'm not sure what is meant to happen within halo updates when the halo becomes larger than the computational domain:

MOM domain decomposition
whalo =    3, ehalo =    3, shalo =    3, nhalo =    3
  X-AXIS =    2   2   1   1   1   1   1   1   1   1   1   1   1   1   2   2
  Y-AXIS =    1   1
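
To make the mismatch concrete, here is a small standalone sketch (plain Fortran, not FMS or MOM6 code) that compares the compute-domain widths listed above against the requested halo widths; every rank it flags would need halo data from beyond its immediate neighbour:

! Standalone sketch: compare the per-PE compute widths from the decomposition
! above against the requested halo widths.  Any PE whose compute domain is
! narrower than the halo needs data from beyond its immediate neighbour.
program check_halo_vs_compute
  implicit none
  integer, parameter :: whalo = 3, ehalo = 3, shalo = 3, nhalo = 3
  integer, parameter :: xwidth(16) = (/2,2,1,1,1,1,1,1,1,1,1,1,1,1,2,2/)
  integer, parameter :: ywidth(2)  = (/1,1/)
  integer :: i

  do i = 1, size(xwidth)
    if (xwidth(i) < max(whalo, ehalo)) then
      print *, 'x-PE', i, ': compute width', xwidth(i), '< halo', max(whalo, ehalo)
    end if
  end do
  do i = 1, size(ywidth)
    if (ywidth(i) < max(shalo, nhalo)) then
      print *, 'y-PE', i, ': compute width', ywidth(i), '< halo', max(shalo, nhalo)
    end if
  end do
end program check_halo_vs_compute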

@Zhi-Liang (Contributor)

Hi,

I tested the case with (nx=20, ny=2) and layout=(16,2) in test_mpp_domains, and the unit test runs fine for the double cyclic condition. However, I think the halo update is not correct in the y-direction at j=-2 and j=5 (the last points in the halo); these two end halo points either get wrong values or get no values at all.

The current FMS cannot support a halo size greater than nx or ny. I will add an error check for this in mpp_domains.

Greetings,

Zhi
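
A sketch of the kind of check described above, purely hypothetical: the real change would live inside FMS's mpp_domains_mod where the domain and update are set up, and the routine and argument names here are placeholders rather than the actual FMS internals (only mpp_error with FATAL is the standard FMS error call).

! Hypothetical sketch of the proposed check (not the actual FMS change):
! abort if the requested halo is wider than the local compute domain.
subroutine check_halo_fits(isize, jsize, whalo, ehalo, shalo, nhalo)
  use mpp_mod, only : mpp_error, FATAL
  implicit none
  integer, intent(in) :: isize, jsize                 ! local compute-domain extents
  integer, intent(in) :: whalo, ehalo, shalo, nhalo   ! requested halo widths

  if (max(whalo, ehalo) > isize .or. max(shalo, nhalo) > jsize) then
    call mpp_error(FATAL, 'mpp_domains: halo width exceeds the compute domain; '// &
                          'use a coarser layout (fewer PEs).')
  end if
end subroutine check_halo_fits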

@adcroft (Collaborator) commented May 26, 2016

Thanks @Zhi-Liang. @Hallberg-NOAA thought you might support halos wider than the compute domain because of the wide-halo work you both did on the shallow-water model and the barotropic solver. It could be that we never push the decomposition fine enough to hit this limit. Certainly I can see why you wouldn't want to support it, since it would need repeated communication.

We could consider a special flag at the MOM framework level to indicate homogeneity along an axis; we already have several 1-d and 2-d tests where we "mimic" a half dimension by using identical values at j=1 and j=2. However, this is probably overkill for a problem whose easiest and simplest solution is not to use too many PEs.
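
A very rough sketch of that idea, entirely hypothetical (this routine does not exist in MOM6, and the index names only mirror common FMS conventions): if a field is known to be uniform in y, its y-halo could be filled locally by replicating the nearest interior row instead of communicating in that direction.

! Hypothetical helper for a y-homogeneous field: fill the y-halo by
! replicating the nearest interior row rather than doing an MPI update.
! isd:ied, jsd:jed are data-domain bounds; jsc:jec is the compute domain.
subroutine fill_y_halo_homogeneous(a, isd, ied, jsd, jed, jsc, jec)
  implicit none
  integer, intent(in) :: isd, ied, jsd, jed, jsc, jec
  real, intent(inout) :: a(isd:ied, jsd:jed)
  integer :: j

  do j = jsd, jsc-1      ! southern halo rows
    a(:, j) = a(:, jsc)
  end do
  do j = jec+1, jed      ! northern halo rows
    a(:, j) = a(:, jec)
  end do
end subroutine fill_y_halo_homogeneous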

@adcroft (Collaborator) commented Jul 6, 2016

According to build 27 this case is passing. I am not able to reproduce the problem interactively with the latest on dev/master.

@adcroft closed this as completed Jul 6, 2016
gustavo-marques pushed a commit to gustavo-marques/MOM6 referencing this issue on Sep 3, 2024: "MARBL: convert salt_flux to tracer flux and add to STF"