Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Qceff inflation fix; Explicitly handling BNRHF transform failures. #794

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

jlaucar
Copy link
Contributor

@jlaucar jlaucar commented Jan 9, 2025

Description:

The BNRHF transform fails when the computed ensemble standard deviation is not positive. Existing code returns an untransformed ensemble but no error return. The transforms and the probit transforms that call them were modified to return an error. The inflation in filter_mod and the increments in assim_tools_mod were updated to check for this error. These routines now explicitly do nothing to variables that fail in the transform step.

Fixes issue

fixes #749

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Documentation changes needed?

  • My change requires a change to the documentation.
    • I have updated the documentation accordingly.

Tests

Was tested with Lorenz_96_tracer model QCEFF test suites. Produced answer changes in 7 of the tests but rest were bitwise. The answer changes do not appear to be associated with significantly changed perfomance.
Was also tested with the original CAM reanalysis case that led to identification of this problem. Tests there bitwise reproduced over several days of testing with previous results that had to use clamping; clamping is no longer required with the BNRHF as would be expected.

Checklist for merging

  • Updated changelog entry
  • Documentation updated
  • Update conf.py

Checklist for release

  • Merge into main
  • Create release from the main branch with appropriate tag
  • Delete feature-branch

Testing Datasets

  • Dataset needed for testing available upon request
  • Dataset download instructions included
  • No dataset needed

… failed because the sample ensemble

covariance was not positive. The existing code simply returned the input ensemble but did not indicate
that an error had occurred. This resulted in the untransformed ensemble being used as if it had been
transformed in filter_mod.f90 for inflation and in assim_tools_mod.f90 for assimilation. In general,
this was not a problem because zero covariance in the ensemble led to both inflation and assimilation
doing nothing. However, due to numerical round-off, it was possible to get cases where the inflation
could result in violating the bounds of the bnrh distribution. This commit attempts to correct this
behavior by updates to probit_transform_mod.f90, filter_mod.f90, and assim_tools_mod.f90.

probit_transform_mod.f90:
An error return argument was added to subroutine to_probit_bounded_normal_rh. It returns a 1 if the
ensemble standard deviation was not positive (failure) and a 0 otherwise. This returned argument is
passed up through transform_to_probit and transform_all_to_probit, where it becomes an array with one
entry for each state variable being transformed. In the case of an error, the returned ‘transformed’
ensemble is simply the input original ensemble. The check for a non-positive standard deviation was
eliminated in the case where ‘use_input_p’ is true since this should never be called if the original
transformation failed. Subroutine transform_all_from_probit was modified to include an additional
logical array that indicates whether each of the state variables to be transformed was successfully
transformed to probit space. The inverse transform is only computed for the state variables for which
the original transform was successful.

filter_mod.f90:
All changes are in subroutine filter_ensemble_inflate. The inflation is only performed if the error
return from the transform_to_probit is 0. Note that this is called for each group if a group filter is
being run and it would be possible for some groups to transform successfully while others failed. This
case is handled properly.

assim_tools_mod.f90:
Two logical arrays were added to keep track of which transforms fail for the observed variables
(obs_probit_trans_ok) and state variables (state_probit_trans_ok); these are allocated on entry into
subroutine filter_assim and deallocated on exit. The result of the transforms is only copied into the
appropriate ensemble_handle storage if the transform was successful. When observations are sequentially
assimilated, the observation ensemble is only converted back to physical space if its original transform
was successful. The observation prior and posterior are transformed back to probit space in the
SEQUENTIAL_OBS loop. If this transform fails for any group, the loop is cycled for this observation so
that it has no impact. In the STATE_UPDATE loop, if the transform to probit space for the current state
variable failed, the loop is cycled. The same is done in the OBS_UPDATE loop. Note that groups are not
relevant to this step. Finally, when transform_all_from_probit is called for state variables, only those
that were successfully transformed originally are transformed back.
for the transform_all case.

This version has been tested with the lorenz_96_tracer tests. All tests run
to completion. All of the RMSE tests reproduce the baseline values except for:
Case 1 ensemble size 6 and 15
Case 2 ensemble size 6 and 15
Case 3 ensemble size 6 and 10
results the same as reported in previous commit.
bounded_below, bounded_above, lower_bound, upper_bound)
bounded_below, bounded_above, lower_bound, upper_bound, ierr)
! If probit transform fails for any group, this observation cannot be assimimlated
if(ierr /= 0) cycle SEQUENTIAL_OBS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could be helpful to add a CAUTION comment here.
Because of the new error check, the interaction between the transformations and the groups can become messy. If it happens and the probit transform fails, then only part of the ensemble may be updated for a certain obs. Because of the cycle SEQUENTIAL_OBS the remaining groups, which could be transformed without issues, are disregarded.

Copy link
Contributor

@mgharamti mgharamti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good to me. I just have one comment regrading the interaction between QCEFF and groups with the newly added error check.

Copy link
Member

@hkershaw-brown hkershaw-brown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, summary here actual comments are on the code.

Question about the error code.

Performance question on setting the qc if the observation ensemble to_probit fails

Definite change is to remove the unused routine transform_all_to_probit, because it is unused and i think incorrect.

The intent(in) intent(out) I think will bite us in the future, but I think fixing it in this pull request is going to confuse what this pull request is fixing.

Comment on lines +171 to +178
if(abs(maxval(diff)) > 1.0e-8_r8) then
write(*, *) 'Maximum allowed value of probit to/from difference exceeded'
write(*, *) 'Location of minimum ensemble member ', minloc(state_ens)
write(*, *) 'Location of maximum ensemble member ', maxloc(state_ens)
do i = 1, ens_size
write(*, *) i, state_ens(i), state_ens_temp(i), diff(i)
enddo
stop
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the failed processes will stop in an mpi program, it will leave the other processes running.

call error_handler(E_ERR, ....) will call MPI_ABORT

costs you $$ Derecho cpu hours

Comment on lines +189 to +197
if(abs(maxval(diff)) > 1.0e-8_r8) then
write(*, *) 'Maximum allowed value of probit to/from difference for input p exceeded'
write(*, *) 'Location of minimum ensemble member ', minloc(state_ens)
write(*, *) 'Location of maximum ensemble member ', maxloc(state_ens)
do i = 1, ens_size
write(*, *) i, state_ens(i), state_ens_temp(i), diff(i)
enddo
stop
endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment - call error_handler

@@ -62,7 +62,7 @@ module probit_transform_mod
!------------------------------------------------------------------------

subroutine transform_all_to_probit(ens_size, num_vars, state_ens, distribution_type, &
p, probit_ens, use_input_p, bounded_below, bounded_above, lower_bound, upper_bound)
p, probit_ens, use_input_p, bounded_below, bounded_above, lower_bound, upper_bound, ierr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This routine is not called at all.
I think rather than editing to have the new error code, it would be better to remove this routine.
The arguments are scalars: use_input_p, bounded_below, bounded_above, lower_bound, upper_bound so you are going to give the same bounds to the whole ensemble if use_input_p=.false.

@@ -108,6 +115,7 @@ subroutine transform_to_probit(ens_size, state_ens_in, distribution_type, p, &
logical, intent(in) :: use_input_p
logical, intent(in) :: bounded_below, bounded_above
real(r8), intent(in) :: lower_bound, upper_bound
integer, intent(out) :: ierr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use an integer return code? For different types of failures?
The oks are logicals.
logical, allocatable :: obs_probit_trans_ok(:), state_probit_trans_ok(:)

Comment on lines +509 to +510
bounded_below, bounded_above, lower_bound, upper_bound, ierr)
state_probit_trans_ok(i) = (ierr == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to the question about the error code being an integer, if you had the error be a logical you could do:
call transform_to_probit(........., state_probit_trans_ok(i))

that way you there is less chance of a future person, putting something in the middle of these two statements.

call transform_to_probit(....., ierr)
call something_else_mathsie(.....,ierr)
state_probit_trans_ok(i) = (ierr == 0)

Comment on lines +384 to 386
ierr = 1
! Just return the original ensemble
probit_ens = state_ens
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is quite a nice contract, that if the transform fails, just return the original ensemble.

obs_ens_handle%copies(1:ens_size, i) = probit_ens
bounded_below, bounded_above, lower_bound, upper_bound, ierr)
obs_probit_trans_ok(i) = (ierr == 0)
if(obs_probit_trans_ok(i)) obs_ens_handle%copies(1:ens_size, i) = probit_ens
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added if(obs_probit_trans_ok(i)) & at line 511 gives you the same answer as before, because of the bnrh

fail -> ! Just return the original ensemble
I guess this is for algorithm readability?

Comment on lines +645 to +646
call transform_from_probit(ens_size, obs_ens_handle%copies(1:ens_size, owners_index) , &
obs_dist_params(owners_index), obs_ens_handle%copies(1:ens_size, owners_index))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is still dangerous since you have the same input for intent(in) and intent(out), but the original code does that so yippee ki yay I guess.

subroutine transform_from_probit(ens_size, probit_ens, p, state_ens)
integer, intent(in) :: ens_size
real(r8), intent(in) :: probit_ens(ens_size)
type(distribution_params_type), intent(inout) :: p
real(r8), intent(out) :: state_ens(ens_size)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you return the original if the transform fails, might be a good time to change this to inout

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see that calling a routine with the same array with both an intent(in) and intent(out) goes several levels deep in this code so it would be pretty disruptive to change it. but in the future i'd avoid passing in the same array multiple times with different intents. it's leaving a trap for future code maintainers because someone looking only at the called subroutines could decide to set the outgoing state_ens(:) to 0 or MISSING_R8 as an initial value for debugging, which would result in the probit_ens(:) mysteriously getting overwritten without any protection from the intent(in). you're requiring some behavior in the called subroutine because of how the calling code is written.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is standard conforming to alias in this way so not just people changing the code, its undefined what the compiler will do.

call transform_from_probit(ens_size, obs_ens_handle%copies(1:ens_size, owners_index) , &
obs_dist_params(owners_index), obs_ens_handle%copies(1:ens_size, owners_index))
! Convert this observation ensemble from probit back to regular space if to_probit was successful
if(obs_probit_trans_ok(owners_index)) then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the to_probit was not ok, do you want to set the qc to failed fwd op?

The other processes don't have access to obs_probit_trans_ok(owners_index) and so they will go ahead and calculate obs_increments because the qc is 0, then transform_to_probit which may or may not work?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: inflation creating out-of-bounds values qceff
4 participants