Qceff inflation fix; Explicitly handling BNRHF transform failures. #794

jlaucar · 2025-01-09T19:44:08Z

Description:

The BNRHF transform fails when the computed ensemble standard deviation is not positive. Existing code returns an untransformed ensemble but no error return. The transforms and the probit transforms that call them were modified to return an error. The inflation in filter_mod and the increments in assim_tools_mod were updated to check for this error. These routines now explicitly do nothing to variables that fail in the transform step.

Fixes issue

fixes #749

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update

Documentation changes needed?

My change requires a change to the documentation.
- I have updated the documentation accordingly.

Tests

Was tested with Lorenz_96_tracer model QCEFF test suites. Produced answer changes in 7 of the tests but rest were bitwise. The answer changes do not appear to be associated with significantly changed perfomance.
Was also tested with the original CAM reanalysis case that led to identification of this problem. Tests there bitwise reproduced over several days of testing with previous results that had to use clamping; clamping is no longer required with the BNRHF as would be expected.

Checklist for merging

Updated changelog entry
Documentation updated
Update conf.py

Checklist for release

Merge into main
Create release from the main branch with appropriate tag
Delete feature-branch

Testing Datasets

Dataset needed for testing available upon request
Dataset download instructions included
No dataset needed

… failed because the sample ensemble covariance was not positive. The existing code simply returned the input ensemble but did not indicate that an error had occurred. This resulted in the untransformed ensemble being used as if it had been transformed in filter_mod.f90 for inflation and in assim_tools_mod.f90 for assimilation. In general, this was not a problem because zero covariance in the ensemble led to both inflation and assimilation doing nothing. However, due to numerical round-off, it was possible to get cases where the inflation could result in violating the bounds of the bnrh distribution. This commit attempts to correct this behavior by updates to probit_transform_mod.f90, filter_mod.f90, and assim_tools_mod.f90. probit_transform_mod.f90: An error return argument was added to subroutine to_probit_bounded_normal_rh. It returns a 1 if the ensemble standard deviation was not positive (failure) and a 0 otherwise. This returned argument is passed up through transform_to_probit and transform_all_to_probit, where it becomes an array with one entry for each state variable being transformed. In the case of an error, the returned ‘transformed’ ensemble is simply the input original ensemble. The check for a non-positive standard deviation was eliminated in the case where ‘use_input_p’ is true since this should never be called if the original transformation failed. Subroutine transform_all_from_probit was modified to include an additional logical array that indicates whether each of the state variables to be transformed was successfully transformed to probit space. The inverse transform is only computed for the state variables for which the original transform was successful. filter_mod.f90: All changes are in subroutine filter_ensemble_inflate. The inflation is only performed if the error return from the transform_to_probit is 0. Note that this is called for each group if a group filter is being run and it would be possible for some groups to transform successfully while others failed. This case is handled properly. assim_tools_mod.f90: Two logical arrays were added to keep track of which transforms fail for the observed variables (obs_probit_trans_ok) and state variables (state_probit_trans_ok); these are allocated on entry into subroutine filter_assim and deallocated on exit. The result of the transforms is only copied into the appropriate ensemble_handle storage if the transform was successful. When observations are sequentially assimilated, the observation ensemble is only converted back to physical space if its original transform was successful. The observation prior and posterior are transformed back to probit space in the SEQUENTIAL_OBS loop. If this transform fails for any group, the loop is cycled for this observation so that it has no impact. In the STATE_UPDATE loop, if the transform to probit space for the current state variable failed, the loop is cycled. The same is done in the OBS_UPDATE loop. Note that groups are not relevant to this step. Finally, when transform_all_from_probit is called for state variables, only those that were successfully transformed originally are transformed back.

variable had been successfully tranformed.

Merge branch 'main' into qceff_inflation_fix

for the transform_all case. This version has been tested with the lorenz_96_tracer tests. All tests run to completion. All of the RMSE tests reproduce the baseline values except for: Case 1 ensemble size 6 and 15 Case 2 ensemble size 6 and 15 Case 3 ensemble size 6 and 10

results the same as reported in previous commit.

mgharamti · 2025-01-09T21:17:39Z

assimilation_code/modules/assimilation/assim_tools_mod.f90

-         bounded_below, bounded_above, lower_bound, upper_bound)
+         bounded_below, bounded_above, lower_bound, upper_bound, ierr)
+      ! If probit transform fails for any group, this observation cannot be assimimlated
+      if(ierr /= 0) cycle SEQUENTIAL_OBS


I think it could be helpful to add a CAUTION comment here.
Because of the new error check, the interaction between the transformations and the groups can become messy. If it happens and the probit transform fails, then only part of the ensemble may be updated for a certain obs. Because of the cycle SEQUENTIAL_OBS the remaining groups, which could be transformed without issues, are disregarded.

mgharamti

All looks good to me. I just have one comment regrading the interaction between QCEFF and groups with the newly added error check.

hkershaw-brown

A few comments, summary here actual comments are on the code.

Question about the error code.

Performance question on setting the qc if the observation ensemble to_probit fails

Definite change is to remove the unused routine transform_all_to_probit, because it is unused and i think incorrect.

The intent(in) intent(out) I think will bite us in the future, but I think fixing it in this pull request is going to confuse what this pull request is fixing.

hkershaw-brown · 2025-01-09T20:16:12Z

assimilation_code/modules/assimilation/probit_transform_mod.f90

+            if(abs(maxval(diff)) > 1.0e-8_r8) then
+               write(*, *) 'Maximum allowed value of probit to/from difference exceeded'
+               write(*, *) 'Location of minimum ensemble member ', minloc(state_ens)
+               write(*, *) 'Location of maximum ensemble member ', maxloc(state_ens)
+               do i = 1, ens_size
+                  write(*, *) i, state_ens(i), state_ens_temp(i), diff(i)
+               enddo
+               stop


Only the failed processes will stop in an mpi program, it will leave the other processes running.

call error_handler(E_ERR, ....) will call MPI_ABORT

costs you $$ Derecho cpu hours

hkershaw-brown · 2025-01-09T20:16:56Z

assimilation_code/modules/assimilation/probit_transform_mod.f90

+            if(abs(maxval(diff)) > 1.0e-8_r8) then
+               write(*, *) 'Maximum allowed value of probit to/from difference for input p exceeded'
+               write(*, *) 'Location of minimum ensemble member ', minloc(state_ens)
+               write(*, *) 'Location of maximum ensemble member ', maxloc(state_ens)
+               do i = 1, ens_size
+                  write(*, *) i, state_ens(i), state_ens_temp(i), diff(i)
+               enddo
+               stop
+            endif


same comment - call error_handler

hkershaw-brown · 2025-01-09T20:39:06Z

assimilation_code/modules/assimilation/probit_transform_mod.f90

@@ -62,7 +62,7 @@ module probit_transform_mod
 !------------------------------------------------------------------------

 subroutine transform_all_to_probit(ens_size, num_vars, state_ens, distribution_type, &
-   p, probit_ens, use_input_p, bounded_below, bounded_above, lower_bound, upper_bound)
+   p, probit_ens, use_input_p, bounded_below, bounded_above, lower_bound, upper_bound, ierr)


This routine is not called at all.
I think rather than editing to have the new error code, it would be better to remove this routine.
The arguments are scalars: use_input_p, bounded_below, bounded_above, lower_bound, upper_bound so you are going to give the same bounds to the whole ensemble if use_input_p=.false.

hkershaw-brown · 2025-01-09T20:40:17Z

assimilation_code/modules/assimilation/probit_transform_mod.f90

@@ -108,6 +115,7 @@ subroutine transform_to_probit(ens_size, state_ens_in, distribution_type, p, &
 logical, intent(in)                           :: use_input_p
 logical, intent(in)                           :: bounded_below, bounded_above
 real(r8), intent(in)                          :: lower_bound,   upper_bound
+integer,  intent(out)                         :: ierr


why use an integer return code? For different types of failures?
The oks are logicals.
logical, allocatable :: obs_probit_trans_ok(:), state_probit_trans_ok(:)

hkershaw-brown · 2025-01-09T20:48:11Z

assimilation_code/modules/assimilation/assim_tools_mod.f90

+      bounded_below, bounded_above, lower_bound, upper_bound, ierr)
+   state_probit_trans_ok(i) = (ierr == 0)


This is related to the question about the error code being an integer, if you had the error be a logical you could do:
call transform_to_probit(........., state_probit_trans_ok(i))

that way you there is less chance of a future person, putting something in the middle of these two statements.

call transform_to_probit(....., ierr) call something_else_mathsie(.....,ierr) state_probit_trans_ok(i) = (ierr == 0)

hkershaw-brown · 2025-01-09T21:00:50Z

assimilation_code/modules/assimilation/probit_transform_mod.f90

+      ierr = 1
      ! Just return the original ensemble
      probit_ens = state_ens 


this is quite a nice contract, that if the transform fails, just return the original ensemble.

hkershaw-brown · 2025-01-09T21:04:31Z

assimilation_code/modules/assimilation/assim_tools_mod.f90

-      obs_ens_handle%copies(1:ens_size, i) = probit_ens
+         bounded_below, bounded_above, lower_bound, upper_bound, ierr)
+      obs_probit_trans_ok(i) = (ierr == 0)
+      if(obs_probit_trans_ok(i)) obs_ens_handle%copies(1:ens_size, i) = probit_ens


The added if(obs_probit_trans_ok(i)) & at line 511 gives you the same answer as before, because of the bnrh

fail -> ! Just return the original ensemble
I guess this is for algorithm readability?

hkershaw-brown · 2025-01-09T21:13:33Z

assimilation_code/modules/assimilation/assim_tools_mod.f90

+            call transform_from_probit(ens_size, obs_ens_handle%copies(1:ens_size, owners_index) , &
+               obs_dist_params(owners_index), obs_ens_handle%copies(1:ens_size, owners_index))


this is still dangerous since you have the same input for intent(in) and intent(out), but the original code does that so yippee ki yay I guess.

DART/assimilation_code/modules/assimilation/probit_transform_mod.f90

Lines 596 to 601 in f0e7f99

subroutine transform_from_probit(ens_size, probit_ens, p, state_ens)

integer, intent(in) :: ens_size

real(r8), intent(in) :: probit_ens(ens_size)

type(distribution_params_type), intent(inout) :: p

real(r8), intent(out) :: state_ens(ens_size)

since you return the original if the transform fails, might be a good time to change this to inout

i see that calling a routine with the same array with both an intent(in) and intent(out) goes several levels deep in this code so it would be pretty disruptive to change it. but in the future i'd avoid passing in the same array multiple times with different intents. it's leaving a trap for future code maintainers because someone looking only at the called subroutines could decide to set the outgoing state_ens(:) to 0 or MISSING_R8 as an initial value for debugging, which would result in the probit_ens(:) mysteriously getting overwritten without any protection from the intent(in). you're requiring some behavior in the called subroutine because of how the calling code is written.

I don't think it is standard conforming to alias in this way so not just people changing the code, its undefined what the compiler will do.

hkershaw-brown · 2025-01-09T21:21:30Z

assimilation_code/modules/assimilation/assim_tools_mod.f90

-         call transform_from_probit(ens_size, obs_ens_handle%copies(1:ens_size, owners_index) , &
-            obs_dist_params(owners_index), obs_ens_handle%copies(1:ens_size, owners_index))
+         ! Convert this observation ensemble from probit back to regular space if to_probit was successful
+         if(obs_probit_trans_ok(owners_index)) then 


If the to_probit was not ok, do you want to set the qc to failed fwd op?

The other processes don't have access to obs_probit_trans_ok(owners_index) and so they will go ahead and calculate obs_increments because the qc is 0, then transform_to_probit which may or may not work?

jlaucar added 5 commits December 9, 2024 11:51

Corrected use of 1 in a loop over i that was checking to see if each

844163f

variable had been successfully tranformed.

Bringing the qceff_inflation_fix branch up-to-date with main for review.

7bc2724

Merge branch 'main' into qceff_inflation_fix

Corrected syntax error in probit_tranform_mod.f90. Reran tests with

f0e7f99

results the same as reported in previous commit.

jlaucar requested review from hkershaw-brown and mgharamti January 9, 2025 19:44

mgharamti reviewed Jan 9, 2025

View reviewed changes

hkershaw-brown requested changes Jan 9, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Qceff inflation fix; Explicitly handling BNRHF transform failures. #794

Qceff inflation fix; Explicitly handling BNRHF transform failures. #794

jlaucar commented Jan 9, 2025 •

edited by hkershaw-brown

Loading

mgharamti Jan 9, 2025

mgharamti left a comment

hkershaw-brown left a comment •

edited

Loading

hkershaw-brown Jan 9, 2025

hkershaw-brown Jan 9, 2025

hkershaw-brown Jan 9, 2025

hkershaw-brown Jan 9, 2025

hkershaw-brown Jan 9, 2025

hkershaw-brown Jan 9, 2025

hkershaw-brown Jan 9, 2025

hkershaw-brown Jan 9, 2025

hkershaw-brown Jan 9, 2025

nancycollins Jan 9, 2025

hkershaw-brown Jan 10, 2025

hkershaw-brown Jan 9, 2025

		bounded_below, bounded_above, lower_bound, upper_bound, ierr)
		state_probit_trans_ok(i) = (ierr == 0)

		call transform_from_probit(ens_size, obs_ens_handle%copies(1:ens_size, owners_index) , &
		obs_dist_params(owners_index), obs_ens_handle%copies(1:ens_size, owners_index))

	subroutine transform_from_probit(ens_size, probit_ens, p, state_ens)

	integer, intent(in) :: ens_size
	real(r8), intent(in) :: probit_ens(ens_size)
	type(distribution_params_type), intent(inout) :: p
	real(r8), intent(out) :: state_ens(ens_size)

Qceff inflation fix; Explicitly handling BNRHF transform failures. #794

Are you sure you want to change the base?

Qceff inflation fix; Explicitly handling BNRHF transform failures. #794

Conversation

jlaucar commented Jan 9, 2025 • edited by hkershaw-brown Loading

Description:

Fixes issue

Types of changes

Documentation changes needed?

Tests

Checklist for merging

Checklist for release

Testing Datasets

Choose a reason for hiding this comment

mgharamti left a comment

Choose a reason for hiding this comment

hkershaw-brown left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jlaucar commented Jan 9, 2025 •

edited by hkershaw-brown

Loading

hkershaw-brown left a comment •

edited

Loading