
Refine split block reconstruction #7934

Merged · 3 commits into openzfs:master · Oct 1, 2018

Conversation

behlendorf (Contributor)

Motivation and Context

Address the concerns raised by @sdimitro in openzfs/openzfs#625.
This follow up patch implements the suggested improvements.

Description

Due to a flaw in 4589f3a the number of unique combinations
could be calculated incorrectly. This could result in the
random combinations reconstruction being used when it would
have been possible to check all combinations.

This change fixes the unique combinations calculation and
simplifies the reconstruction logic by maintaining a per-
segment list of unique copies.
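Conceptually, the corrected count is just the product, over all split segments, of each segment's number of unique children. A minimal standalone sketch (the helper name and array representation are illustrative, not the actual vdev_indirect code):

```c
#include <stdint.h>

/*
 * Illustrative sketch only: the total number of reconstruction
 * combinations is the product, across all split segments, of each
 * segment's count of *unique* child copies.  Duplicate copies are
 * filtered out first, so two identical copies contribute 1, not 2.
 */
uint64_t
split_combinations(const uint64_t *unique_children, int nsegments)
{
	uint64_t combinations = 1;

	for (int i = 0; i < nsegments; i++)
		combinations *= unique_children[i];

	return (combinations);
}
```

For example, segments with 2, 1, and 3 unique copies yield 2 × 1 × 3 = 6 combinations; when that total is at or below the attempt budget, exhaustive enumeration can be used instead of random sampling.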

How Has This Been Tested?

For debugging, the code was instrumented to intentionally damage
random segment copies, then log the number of possible combinations
and how many attempts were required before a successful reconstruction.
This was done in addition to the normal damage which is tested by ztest.

A 12 hour run of ztest with the additional debugging showed, as expected,
that in the normal case where a copy was not damaged there was only a
single possible combination. When damage was intentionally injected
the unique copies were detected and the number of combinations
increased.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@behlendorf behlendorf added the "Status: Work in Progress" (Not yet ready for general review) label Sep 20, 2018
@ahrens ahrens (Member) left a comment

this is great! I was worried that the linked list would make the code messier but it actually makes a few things more clear. Thanks!

* number of unique combinations when attempting reconstruction.
* Determine the unique children for a split segment and add them
* to the is_unique_child list. By restricting reconstruction
* attempts to these children only unique combinations will
ahrens (Member):

should read will *be* considered

Personally, I'd add commas after children and after considered

behlendorf (Contributor, Author):

Will do.

is->is_child[j].ic_data) == 0) {
is->is_child[j].ic_duplicate = i;
}
ASSERT3P(ic_j->ic_duplicate, ==, NULL);
ahrens (Member):

this seems self-evident given the code 1 line above.

behlendorf (Contributor, Author):

Removed, it was a bit of extra debug code I forgot to drop.

(two further review threads on module/zfs/vdev_indirect.c, resolved)
@behlendorf behlendorf (Contributor, Author) left a comment

Thanks for the quick feedback, I've refreshed the PR.

I also bumped up the default number of reconstruction attempts from 100 to 256. Making this a power of two feels more correct considering we expect two copies to be the common case.
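To illustrate why a power-of-two budget lines up with the two-copy common case: with two unique copies per segment, exhaustive enumeration needs 2^n attempts for n segments, so 256 attempts fully cover up to eight segments. A hypothetical helper (not part of the PR) makes this concrete:

```c
#include <stdint.h>

/*
 * Hypothetical helper: how many split segments, each with `copies`
 * unique children, can be exhaustively enumerated within
 * `attempts_max` attempts?  Returns -1 for degenerate input.
 */
int
max_enumerable_segments(uint64_t copies, uint64_t attempts_max)
{
	int segments = 0;
	uint64_t combinations = 1;

	if (copies < 2)
		return (-1);

	while (combinations * copies <= attempts_max) {
		combinations *= copies;
		segments++;
	}

	return (segments);
}
```

With two copies per segment, a budget of 256 covers eight segments exhaustively, whereas the previous budget of 100 covered only six.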

@behlendorf behlendorf added the "Status: Code Review Needed" (Ready for review and testing) label and removed the "Status: Work in Progress" (Not yet ready for general review) label Sep 20, 2018
@ahrens ahrens (Member) commented Sep 20, 2018

Did your testing cover the case where we have to try randomly because there are too many combinations? Just checking, since you've done such a good job reducing the number of possible combinations that it's a lot harder to hit the randomization code now.

@behlendorf behlendorf (Contributor, Author)

@ahrens yes, I took the opportunity to manually inflict some severe damage to test that code. I was tempted to leave that debug code in place to make it easier to test, but in the end decided against it since it was a bit ugly. If you'd like I'm happy to put something like it back in.

zio->io_error = EIO;
vdev_indirect_all_checksum_errors(zio);
zio_checksum_verified(zio);
return;
}

combinations *= is_copies;
combinations *= is->is_unique_children;
is->is_good_child = list_head(&is->is_unique_child);
}

for (;;) {
sdimitro (Contributor):

So I remember looking at this code the first time, and it does make sense, although I found it a bit hard to read because we loop back here and then decide again whether we are enumerating or randomly picking (plus reasoning about attempts and more).

I believe it is fine to leave the code as it is; on the other hand, I thought of the layout below, which may be a bit simpler. In place of the infinite loop we would have the following:

int ret;
if (combinations <= attempts_max) {
   ret = vdev_indirect_splits_enumerate_all(iv, zio);
} else {
   ret = vdev_indirect_splits_enumerate_randomly(iv, zio, max_attempts);
}

if (ret != 0) {
    /* All attempted combinations failed. */
    zio->io_error = ret;
    vdev_indirect_all_checksum_errors(zio);
    zio_checksum_verified(zio);
}

where:

int
vdev_indirect_splits_checksum_validate(iv)
{
    for (indirect_split_t *is = list_head(&iv->iv_splits);
        is != NULL; is = list_next(&iv->iv_splits, is)) {
        ASSERT3P(is->is_good_child->ic_data, !=, NULL);
        ASSERT3P(is->is_good_child->ic_duplicate, ==, NULL);
        abd_copy_off(zio->io_abd, is->is_good_child->ic_data,
            is->is_split_offset, 0, is->is_size);
    }
    ... zbc;
    return (zio_checksum_error(zio, &zbc));
}

int
vdev_indirect_splits_enumerate_all(iv, zio)
{
    ... do the enumeration part ... assert that you don't do more than the *combinations* variable ....
   ... check every combination with vdev_indirect_splits_checksum_validate() ... check its return value ...
}

int
vdev_indirect_splits_enumerate_randomly(iv, zio, max_attempts)
{
    for (attempt = 0; attempt < max_attempts; attempt++) {
      ... do the spa_get_random_part , and check every time with vdev_indirect_splits_checksum_validate() ...
      ... break when you match .... [you can also have a dprintf saying ("matched after %d attempts", attempt)]
    }
}

The disadvantage of the above layout is that we add auxiliary functions, but it hopefully makes the code more readable: instead of one big loop that decides on every iteration whether we are randomly sampling or enumerating everything, we have two separate loops with more straightforward logic. An additional benefit may be that the auxiliary functions make things easier to trace with DTrace/eBPF and give us more introspection into which strategy was chosen.

Again, if you think it isn't worth it or there is something wrong, feel free to drop this.
The code LGTM as it is.
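Stripped of the ZFS structures, the randomized strategy in the sketch above boils down to: pick one copy per segment at random, validate the assembled result, and stop on a match or after max_attempts. A toy, self-contained version (all names here are made up for illustration; validation is reduced to matching a known-good combination, standing in for the checksum check):

```c
#include <stdlib.h>

/*
 * Toy illustration of the randomized strategy: each attempt picks a
 * random copy index per segment; "validation" here simply compares
 * against a known-good target combination.  Returns the 1-based
 * attempt number on success, or -1 if max_attempts were exhausted.
 */
int
enumerate_randomly(const int *ncopies, const int *target, int nsegments,
    int max_attempts)
{
	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		int match = 1;

		for (int i = 0; i < nsegments; i++) {
			if (rand() % ncopies[i] != target[i])
				match = 0;
		}
		if (match)
			return (attempt);
	}
	return (-1);
}
```

With two segments of two copies each, any single attempt matches with probability 1/4, so success within a few dozen attempts is overwhelmingly likely.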

@ahrens ahrens (Member) commented Sep 21, 2018

If you'd like I'm happy to put something like it back in.

I don't think that's necessary. Just wanted to be sure that the rare-case code got exercised at least once :)

Due to a flaw in 4589f3a the number of unique combinations
could be calculated incorrectly.  This could result in the
random combinations reconstruction being used when it would
have been possible to check all combinations.

This change fixes the unique combinations calculation and
simplifies the reconstruction logic by maintaining a per-
segment list of unique copies.

The vdev_indirect_splits_damage() function was introduced
to validate both the enumeration and random reconstruction
logic with ztest.  It is implemented such that it will never
make a known recoverable block unrecoverable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf behlendorf (Contributor, Author)

Refreshed, please take another look.

I've updated the patch to use the refactoring @sdimitro proposed. Overall I do think it is an improvement, and it allowed me to add a vdev_indirect_splits_damage() function which is used by ztest to stress both the enumeration and random reconstruction logic.

For comparison purposes the previous version of this PR was moved to the https://github.com/behlendorf/zfs/tree/issue-6900-original branch.

for (indirect_split_t *is = list_head(&iv->iv_splits);
is != NULL; is = list_next(&iv->iv_splits, is)) {
if ((is->is_good_child = list_next(
&is->is_unique_child, is->is_good_child)) != NULL) {
ahrens (Member):

stylistically, I find it easier to read without side effects in if statements where practical. e.g.:

is_good_child = list_next();
if (is_good_child != NULL) {

behlendorf (Contributor, Author):

Sure, me too as a general rule. I'll update it.


abd_zero(ic->ic_data, ic->ic_data->abd_size);
}
iv->iv_attempts_max *= 2;
ahrens (Member):

could we add an assertion that this does not overflow? Or cap it to UINT64_MAX? Or is there something else that prevents that from happening?

behlendorf (Contributor, Author):

Let's instead stop damaging copies if this gets out of control; making this take longer doesn't improve the test coverage.
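For reference, a saturating double along the lines of the overflow concern raised above would look roughly like this (illustrative only; the PR instead limits how much damage ztest inflicts, so the counter never grows unboundedly):

```c
#include <stdint.h>

/* Double a 64-bit counter, saturating at UINT64_MAX instead of wrapping. */
uint64_t
double_saturating(uint64_t v)
{
	if (v > UINT64_MAX / 2)
		return (UINT64_MAX);
	return (v * 2);
}
```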

}

/* Attempt to select a valid one randomly. */
int error = vdev_indirect_splits_enumerate_randomly(iv, zio);
ahrens (Member):

You could be more specific: set each is_good_child to a randomly-selected child


zfs_dbgmsg("reconstruction failed (%d) after %llu / %llu "
"allowed attempts, %llu unique combination(s)\n", error,
(u_longlong_t)iv->iv_attempts,
ahrens (Member):

I'm not sure if we want to spam dbgmsg if a disk is silently damaged. I don't think we do that today.

/*
* Checksum failed; try a different combination of split
* children.
* The checksum has been successfully validated issue
ahrens (Member):

I think you mean ... validated. Issue ...

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@codecov codecov bot commented Sep 27, 2018

Codecov Report

Merging #7934 into master will increase coverage by 0.05%.
The diff coverage is 91.96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7934      +/-   ##
==========================================
+ Coverage   78.64%    78.7%   +0.05%     
==========================================
  Files         377      377              
  Lines      114014   114078      +64     
==========================================
+ Hits        89668    89786     +118     
+ Misses      24346    24292      -54
Flag      Coverage           Δ
#kernel   78.9%  <2.8%>      (-0.02%) ⬇️
#user     67.93% <92.79%>    (-0.11%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d126145...e1aea23.

@ahrens ahrens added the "Status: Accepted" (Ready to integrate: reviewed, tested) label and removed the "Status: Code Review Needed" (Ready for review and testing) label Sep 28, 2018
Due to the extensive damage which ztest can inflict on the
pool during testing, increasing the default maximum number
of attempts is still required.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf behlendorf (Contributor, Author)

I've updated this PR to keep the increased number of reconstruction attempts when ztest is calling zdb (though it was reduced to 65536). While things are significantly improved, we did hit this in the automated testing, which is why I put it back. See the end of the following ztest log.

http://build.zfsonlinux.org/builders/Ubuntu%2018.04%20x86_64%20%28TEST%29/builds/1360/steps/shell_6/logs/1.ztest.out

@behlendorf behlendorf merged commit 1258bd7 into openzfs:master Oct 1, 2018
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
Due to a flaw in 4589f3a the number of unique combinations
could be calculated incorrectly.  This could result in the
random combinations reconstruction being used when it would
have been possible to check all combinations.

This change fixes the unique combinations calculation and
simplifies the reconstruction logic by maintaining a per-
segment list of unique copies.

The vdev_indirect_splits_damage() function was introduced
to validate both the enumeration and random reconstruction
logic with ztest.  It is implemented such that it will never
make a known recoverable block unrecoverable.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6900 
Closes openzfs#7934
@behlendorf behlendorf deleted the issue-6900 branch April 19, 2021 21:03
Labels: Status: Accepted (Ready to integrate: reviewed, tested)