-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Qceff inflation fix; Explicitly handling BNRHF transform failures. #794
base: main
Are you sure you want to change the base?
Conversation
… failed because the sample ensemble covariance was not positive. The existing code simply returned the input ensemble but did not indicate that an error had occurred. This resulted in the untransformed ensemble being used as if it had been transformed in filter_mod.f90 for inflation and in assim_tools_mod.f90 for assimilation. In general, this was not a problem because zero covariance in the ensemble led to both inflation and assimilation doing nothing. However, due to numerical round-off, it was possible to get cases where the inflation could result in violating the bounds of the bnrh distribution. This commit attempts to correct this behavior by updates to probit_transform_mod.f90, filter_mod.f90, and assim_tools_mod.f90. probit_transform_mod.f90: An error return argument was added to subroutine to_probit_bounded_normal_rh. It returns a 1 if the ensemble standard deviation was not positive (failure) and a 0 otherwise. This returned argument is passed up through transform_to_probit and transform_all_to_probit, where it becomes an array with one entry for each state variable being transformed. In the case of an error, the returned ‘transformed’ ensemble is simply the input original ensemble. The check for a non-positive standard deviation was eliminated in the case where ‘use_input_p’ is true since this should never be called if the original transformation failed. Subroutine transform_all_from_probit was modified to include an additional logical array that indicates whether each of the state variables to be transformed was successfully transformed to probit space. The inverse transform is only computed for the state variables for which the original transform was successful. filter_mod.f90: All changes are in subroutine filter_ensemble_inflate. The inflation is only performed if the error return from the transform_to_probit is 0. Note that this is called for each group if a group filter is being run and it would be possible for some groups to transform successfully while others failed. This case is handled properly. assim_tools_mod.f90: Two logical arrays were added to keep track of which transforms fail for the observed variables (obs_probit_trans_ok) and state variables (state_probit_trans_ok); these are allocated on entry into subroutine filter_assim and deallocated on exit. The result of the transforms is only copied into the appropriate ensemble_handle storage if the transform was successful. When observations are sequentially assimilated, the observation ensemble is only converted back to physical space if its original transform was successful. The observation prior and posterior are transformed back to probit space in the SEQUENTIAL_OBS loop. If this transform fails for any group, the loop is cycled for this observation so that it has no impact. In the STATE_UPDATE loop, if the transform to probit space for the current state variable failed, the loop is cycled. The same is done in the OBS_UPDATE loop. Note that groups are not relevant to this step. Finally, when transform_all_from_probit is called for state variables, only those that were successfully transformed originally are transformed back.
variable had been successfully tranformed.
Merge branch 'main' into qceff_inflation_fix
for the transform_all case. This version has been tested with the lorenz_96_tracer tests. All tests run to completion. All of the RMSE tests reproduce the baseline values except for: Case 1 ensemble size 6 and 15 Case 2 ensemble size 6 and 15 Case 3 ensemble size 6 and 10
results the same as reported in previous commit.
bounded_below, bounded_above, lower_bound, upper_bound) | ||
bounded_below, bounded_above, lower_bound, upper_bound, ierr) | ||
! If probit transform fails for any group, this observation cannot be assimimlated | ||
if(ierr /= 0) cycle SEQUENTIAL_OBS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it could be helpful to add a CAUTION comment here.
Because of the new error check, the interaction between the transformations and the groups can become messy. If it happens and the probit transform fails, then only part of the ensemble may be updated for a certain obs. Because of the cycle SEQUENTIAL_OBS
the remaining groups, which could be transformed without issues, are disregarded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good to me. I just have one comment regrading the interaction between QCEFF and groups with the newly added error check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments, summary here actual comments are on the code.
Question about the error code.
Performance question on setting the qc if the observation ensemble to_probit fails
Definite change is to remove the unused routine transform_all_to_probit
, because it is unused and i think incorrect.
The intent(in) intent(out) I think will bite us in the future, but I think fixing it in this pull request is going to confuse what this pull request is fixing.
if(abs(maxval(diff)) > 1.0e-8_r8) then | ||
write(*, *) 'Maximum allowed value of probit to/from difference exceeded' | ||
write(*, *) 'Location of minimum ensemble member ', minloc(state_ens) | ||
write(*, *) 'Location of maximum ensemble member ', maxloc(state_ens) | ||
do i = 1, ens_size | ||
write(*, *) i, state_ens(i), state_ens_temp(i), diff(i) | ||
enddo | ||
stop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only the failed processes will stop in an mpi program, it will leave the other processes running.
call error_handler(E_ERR, ....) will call MPI_ABORT
costs you $$ Derecho cpu hours
if(abs(maxval(diff)) > 1.0e-8_r8) then | ||
write(*, *) 'Maximum allowed value of probit to/from difference for input p exceeded' | ||
write(*, *) 'Location of minimum ensemble member ', minloc(state_ens) | ||
write(*, *) 'Location of maximum ensemble member ', maxloc(state_ens) | ||
do i = 1, ens_size | ||
write(*, *) i, state_ens(i), state_ens_temp(i), diff(i) | ||
enddo | ||
stop | ||
endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment - call error_handler
@@ -62,7 +62,7 @@ module probit_transform_mod | |||
!------------------------------------------------------------------------ | |||
|
|||
subroutine transform_all_to_probit(ens_size, num_vars, state_ens, distribution_type, & | |||
p, probit_ens, use_input_p, bounded_below, bounded_above, lower_bound, upper_bound) | |||
p, probit_ens, use_input_p, bounded_below, bounded_above, lower_bound, upper_bound, ierr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This routine is not called at all.
I think rather than editing to have the new error code, it would be better to remove this routine.
The arguments are scalars: use_input_p, bounded_below, bounded_above, lower_bound, upper_bound so you are going to give the same bounds to the whole ensemble if use_input_p=.false.
@@ -108,6 +115,7 @@ subroutine transform_to_probit(ens_size, state_ens_in, distribution_type, p, & | |||
logical, intent(in) :: use_input_p | |||
logical, intent(in) :: bounded_below, bounded_above | |||
real(r8), intent(in) :: lower_bound, upper_bound | |||
integer, intent(out) :: ierr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why use an integer return code? For different types of failures?
The oks are logicals.
logical, allocatable :: obs_probit_trans_ok(:), state_probit_trans_ok(:)
bounded_below, bounded_above, lower_bound, upper_bound, ierr) | ||
state_probit_trans_ok(i) = (ierr == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is related to the question about the error code being an integer, if you had the error be a logical you could do:
call transform_to_probit(........., state_probit_trans_ok(i))
that way you there is less chance of a future person, putting something in the middle of these two statements.
call transform_to_probit(....., ierr)
call something_else_mathsie(.....,ierr)
state_probit_trans_ok(i) = (ierr == 0)
ierr = 1 | ||
! Just return the original ensemble | ||
probit_ens = state_ens |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is quite a nice contract, that if the transform fails, just return the original ensemble.
obs_ens_handle%copies(1:ens_size, i) = probit_ens | ||
bounded_below, bounded_above, lower_bound, upper_bound, ierr) | ||
obs_probit_trans_ok(i) = (ierr == 0) | ||
if(obs_probit_trans_ok(i)) obs_ens_handle%copies(1:ens_size, i) = probit_ens |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The added if(obs_probit_trans_ok(i)) & at line 511 gives you the same answer as before, because of the bnrh
fail -> ! Just return the original ensemble
I guess this is for algorithm readability?
call transform_from_probit(ens_size, obs_ens_handle%copies(1:ens_size, owners_index) , & | ||
obs_dist_params(owners_index), obs_ens_handle%copies(1:ens_size, owners_index)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is still dangerous since you have the same input for intent(in) and intent(out), but the original code does that so yippee ki yay I guess.
DART/assimilation_code/modules/assimilation/probit_transform_mod.f90
Lines 596 to 601 in f0e7f99
subroutine transform_from_probit(ens_size, probit_ens, p, state_ens) | |
integer, intent(in) :: ens_size | |
real(r8), intent(in) :: probit_ens(ens_size) | |
type(distribution_params_type), intent(inout) :: p | |
real(r8), intent(out) :: state_ens(ens_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you return the original if the transform fails, might be a good time to change this to inout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see that calling a routine with the same array with both an intent(in) and intent(out) goes several levels deep in this code so it would be pretty disruptive to change it. but in the future i'd avoid passing in the same array multiple times with different intents. it's leaving a trap for future code maintainers because someone looking only at the called subroutines could decide to set the outgoing state_ens(:) to 0 or MISSING_R8 as an initial value for debugging, which would result in the probit_ens(:) mysteriously getting overwritten without any protection from the intent(in). you're requiring some behavior in the called subroutine because of how the calling code is written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is standard conforming to alias in this way so not just people changing the code, its undefined what the compiler will do.
call transform_from_probit(ens_size, obs_ens_handle%copies(1:ens_size, owners_index) , & | ||
obs_dist_params(owners_index), obs_ens_handle%copies(1:ens_size, owners_index)) | ||
! Convert this observation ensemble from probit back to regular space if to_probit was successful | ||
if(obs_probit_trans_ok(owners_index)) then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the to_probit was not ok, do you want to set the qc to failed fwd op?
The other processes don't have access to obs_probit_trans_ok(owners_index) and so they will go ahead and calculate obs_increments because the qc is 0, then transform_to_probit which may or may not work?
Description:
The BNRHF transform fails when the computed ensemble standard deviation is not positive. Existing code returns an untransformed ensemble but no error return. The transforms and the probit transforms that call them were modified to return an error. The inflation in filter_mod and the increments in assim_tools_mod were updated to check for this error. These routines now explicitly do nothing to variables that fail in the transform step.
Fixes issue
fixes #749
Types of changes
Documentation changes needed?
Tests
Was tested with Lorenz_96_tracer model QCEFF test suites. Produced answer changes in 7 of the tests but rest were bitwise. The answer changes do not appear to be associated with significantly changed perfomance.
Was also tested with the original CAM reanalysis case that led to identification of this problem. Tests there bitwise reproduced over several days of testing with previous results that had to use clamping; clamping is no longer required with the BNRHF as would be expected.
Checklist for merging
Checklist for release
Testing Datasets