
[WIP] raidz expansion, alpha preview 1 #8853

Closed
wants to merge 207 commits

Conversation

@ahrens (Member) commented Jun 4, 2019

Motivation and Context

This is an alpha-quality preview of RAID-Z expansion. This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally. This feature is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (which typically means doubling the number of disks).

For additional context as well as a design overview, see my short talk from the 2017 OpenZFS Developer Summit: [slides](http://www.open-zfs.org/w/images/6/68/RAIDZ_Expansion_v2.pdf), [video](https://www.youtube.com/watch?v=ZF8V7Tc9G28)

Description

Functionality that's currently implemented:

  • Can expand raidz device with zpool attach poolname raidz2-0 newdisk
  • Simple test script in scripts/raidz_expand_test.sh
  • During reflow/expansion:
    • All allocated space in device is rewritten (copied to its new location in the RAIDZ vdev)
    • Reflow happens in background over multiple txg’s
    • Reads and writes during reflow are handled
    • Can reboot or export/import; resumes after import (except if interrupted at the very beginning of reflow)
    • Progress is reported in zpool status
  • After expansion completes:
    • Can initiate additional expansions
    • Additional space available
    • Device failure and silent damage are handled
    • Can reboot or export/import
    • Status (completion time) reported in zpool status

Functionality that still needs to be implemented:

  • Add on-disk feature flag
  • Progress should be reported in terms of offset on disk, not bytes copied
  • Logical stripe width does not increase
  • Crash in the very beginning of reflow can trash the pool
  • Pool must be healthy during reflow
  • Does not use SIMD instructions for raidz math
  • Documentation
  • Automated tests!

This feature should only be used on test pools. The pool will eventually need to be DESTROYED, because the on-disk format will not be compatible with the final release. Additionally, there are currently bugs in RAID-Z expansion which can occasionally cause data loss.

I would especially appreciate it if anyone has time to write some automated tests for RAID-Z expansion in the ZFS Test Suite (including converting the raidz_expand_test.sh script into a proper test).

Sponsored by: The FreeBSD Foundation

How Has This Been Tested?

scripts/raidz_expand_test.sh
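
The script itself is not reproduced here, but a minimal manual smoke test along the same lines might look like the sketch below. Everything in it is illustrative (pool name, file-backed vdevs, sizes); the only syntax taken from this PR is the `zpool attach poolname raidz1-0 newdisk` form of the attach command, and it only works on a build that includes these patches.

```sh
#!/bin/sh
# Throwaway pool backed by sparse files -- for experimentation only; run as root.
truncate -s 2G /var/tmp/d1 /var/tmp/d2 /var/tmp/d3 /var/tmp/d4
zpool create testpool raidz1 /var/tmp/d1 /var/tmp/d2 /var/tmp/d3

# Write some data so the reflow has something to copy.
dd if=/dev/urandom of=/testpool/testfile bs=1M count=512

# Expand the raidz vdev by one disk.
zpool attach testpool raidz1-0 /var/tmp/d4

# Reflow progress is reported here while the expansion runs;
# once it completes, the extra capacity shows up in zpool list.
zpool status testpool
zpool list testpool
```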

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@ahrens added the labels "Status: Work in Progress" (Not yet ready for general review) and "Type: Feature" (Feature request or new feature) on Jun 4, 2019

@codecov bot commented Jun 4, 2019

Codecov Report

Merging #8853 (fe2ca28) into master (161ed82) will increase coverage by 4.46%.
The diff coverage is 92.79%.

❗ Current head fe2ca28 differs from pull request most recent head 7384b93. Consider uploading reports for the commit 7384b93 to get more accurate results

@@            Coverage Diff             @@
##           master    #8853      +/-   ##
==========================================
+ Coverage   75.17%   79.63%   +4.46%     
==========================================
  Files         402      394       -8     
  Lines      128071   125151    -2920     
==========================================
+ Hits        96283    99670    +3387     
+ Misses      31788    25481    -6307     
Flag Coverage Δ
kernel 80.38% <93.56%> (+1.62%) ⬆️
user 65.89% <88.58%> (+18.46%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
cmd/raidz_test/raidz_bench.c 0.00% <0.00%> (ø)
cmd/raidz_test/raidz_test.h 0.00% <ø> (ø)
include/sys/vdev_raidz_impl.h 100.00% <ø> (ø)
module/zfs/vdev_raidz.c 93.17% <ø> (-1.53%) ⬇️
cmd/ztest/ztest.c 77.82% <80.00%> (+70.57%) ⬆️
cmd/raidz_test/raidz_test.c 81.50% <89.28%> (-1.12%) ⬇️
module/zfs/vdev.c 90.36% <92.85%> (+1.00%) ⬆️
cmd/zpool/zpool_main.c 80.53% <93.02%> (-0.29%) ⬇️
lib/libzfs/libzfs_pool.c 73.55% <100.00%> (+2.51%) ⬆️
module/zfs/abd.c 90.72% <100.00%> (-4.24%) ⬇️
... and 275 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@kithrup (Contributor) left a comment


First pass, this looks ok. It's missing an overview in the code of how the expansion works (I found a couple of smaller comments talking about the reconstructions). It doesn't seem hugely complicated, which surprised me, but I was still looking for that overview in the code.

(Review comment on module/zfs/spa.c, resolved)
@ahrens (Member, Author) commented Jun 4, 2019

@kithrup Thanks for taking a look! The code still has a lot of missing pieces, including a "big theory statement" of how it works. I'm going to wait to write that until the remaining tricky pieces are implemented, so that I can be sure the comment reflects the details in the final design.

@kithrup (Contributor) commented Jun 4, 2019

No sweat. So far, as I said, it looks good.

@ronnyegner

@ahrens: I am willing to help (except coding). How?

@ahrens (Member, Author) commented Jun 6, 2019

@ronnyegner Thanks! Things aside from coding that you could help with include:

  • manual testing & reporting bugs
  • writing automated tests (shell scripting)
  • writing documentation

Feel free to get in touch with me directly if you'd like more specific direction.

@kwinz commented Jul 30, 2019

Thanks a lot @ahrens! In the talk you said that the existing old data keeps the old parity ratio, a.k.a. logical stripe width (at 6:50 and 13:20). What is the recommended way to convert the existing data to the new stripe width after expanding? Do I have to delete snapshots and mv data? Thanks in advance!

@ahrens (Member, Author) commented Jul 30, 2019

@kwinz In the current PR, all data has the old parity ratio (even newly-written data), so there's no way to do that currently. But before we integrate this, I'll make it so that newly-written data has the new data:parity ratio.

Then the easiest way to convert existing data would usually be to use zfs send ... | zfs receive ... within the same pool, and then delete the old datasets.
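
Concretely, once newly-written blocks get the new data:parity ratio, that conversion could look something like this sketch (dataset names are hypothetical, and snapshot/property/mountpoint handling is simplified):

```sh
# Copy a dataset within the same pool so its blocks are rewritten with
# the current data:parity ratio, then retire the old copy.
zfs snapshot tank/data@migrate
zfs send tank/data@migrate | zfs receive tank/data-new
# ...verify the new dataset and switch users/mountpoints over...
zfs destroy -r tank/data
zfs rename tank/data-new tank/data
```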

@mufunyo commented Aug 2, 2019

Then the easiest way to convert existing data would usually be to use zfs send ... | zfs receive ... within the same pool, and then delete the old datasets.

That sounds like a pretty janky way to do it. It'd be a lot more user-friendly if you could just rewrite a dataset in the new data:parity ratio with a dedicated command like zfs rewrite <dataset> that would do it in-place.

@snajpa (Contributor) commented Oct 20, 2019

On the send/recv note, maybe @mufunyo's zfs rewrite could be taken a bit further: perhaps we could do something like what KVM and similar systems do for live migration. I'm now wondering how many lines it would take to support a live rewrite, which would keep syncing until the delta is at a minimum, then grab the z_teardown_lock on the original zfsvfs and swap it with the new dataset's zfsvfs... that doesn't sound too crazy, does it?

cc @ahrens

@ronnyegner

@ahrens I have sent you two emails so far about wanting to help: one on June 6th and another on July 8th. No reply. Did you get them?

@ahrens (Member, Author) commented Nov 6, 2019

@ronnyegner The last email I have from you is January 2015, about device removal. I did reply to your comment on this PR on June 6th. #8853 (comment)

@kenthinson commented Nov 22, 2019

@kwinz In the current PR, all data has the old parity ratio (even newly-written data), so there's no way to do that currently. But before we integrate this, I'll make it so that newly-written data has the new data:parity ratio.

Then the easiest way to convert existing data would usually be to use zfs send ... | zfs receive ... within the same pool, and then delete the old datasets.

Hello. I was wondering whether a way to pause the reflow would be possible. That way we could do a workflow like this:

  1. Add the disk, with the reflow paused.
  2. zfs send ... | zfs receive to change the stripe width (without I/O contention from the reflow).
  3. Delete the old data.
  4. Resume the normal reflow (in case we as the user missed anything that needs to be moved around).

@ahrens (Member, Author) commented Nov 22, 2019

@kenthinson We could add a way to pause reflow. However, writes to locations that have not yet been reflown have to use the old stripe width. So the workflow you outlined won't achieve your goal. The send|recv won't change the stripe width until after the reflow has (at least mostly) completed.

@kenthinson

@ahrens ah, that's very interesting. I didn't think free space would have to be reflowed. Thanks for the info :)

@ahrens (Member, Author) commented Nov 22, 2019

@kenthinson Note that free space isn't copied, but it exists within whichever physical stripe width surrounds it. We track how far the reflow has progressed: whatever is before that point is in the new physical stripe width. We don't keep a map of which ranges are in which physical stripe width.

@scineram

Then don't you always have to use the old stripe width until the reflow completes? How can the stripe width be determined for a block written during the reflow?

@kenthinson

@ahrens again thanks for the information :)

@kenthinson commented Nov 22, 2019

@scineram that's an interesting question. I would love to understand that as well. However, I'm afraid it's above my head, and I don't want to bother Matt too much; he's a busy guy lol.

How I had thought it would work is that, as soon as the disk is added, writes would immediately start using the new stripe width. However, I can see how that could be a problem, because then there would be no way to cancel the operation without having to shrink the stripe width back to match the original set.

@owlshrimp

How safe is the reflow (once everything is all tidied up) conceptually? Would the array be in a degraded state? Unavailable?

@amotin (Member) commented Jan 16, 2020

@ahrens Hi Matt. I've looked through the patch and I have big worries about vdev performance after reflow:

  • When the logical and physical widths are different, the new code generates ashift-sized I/Os to every disk to read and process each row separately. Those will probably be aggregated later, but with ashift=9 I expect the CPU overhead to be huge.
  • Also, even if we try to optimize the I/Os somehow, each row now has its parity data in different place(s). That means that after we expand a 3-disk RAIDZ1 to a 4-disk RAIDZ1, each disk on average will be able to read only about 2 ashifts' worth of data before hitting a parity block. The ZIO aggregation code will likely extend the read over the gap, but that means every disk will on average lose about 33% of its linear speed.

I haven't tested this in practice yet, so please tell me that I am wrong here.

@ahrens (Member, Author) commented Jan 16, 2020

@amotin Thanks for taking a look!

When the logical and physical widths are different, the new code generates ashift-sized I/Os to every disk to read and process each row separately. Those will probably be aggregated later, but with ashift=9 I expect the CPU overhead to be huge.

They will definitely be aggregated later. So we will be processing the same amount of bytes as before, but we will have more zio_t's which will cause some more overhead. The increase in number of zio_t's would be greater with greater recordsize. With the default recordsize=128K, and a 10-wide RAIDZ2, and 4K-sector disks, we would have around 4x as many zio_t's, because there are 4 sectors of this block on each disk. (for 512 byte sectors it would be 32x). I agree we'll want to measure the practical impact of this and see if there are some shortcuts we can take in the zio code to minimize the per-zio overhead for these.
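
Spelled out, the arithmetic behind those figures is:

$$\frac{128\,\text{KiB record}}{(10-2)\ \text{data disks}} = 16\,\text{KiB per disk},\qquad \frac{16\,\text{KiB}}{4\,\text{KiB sectors}} = 4\times,\qquad \frac{16\,\text{KiB}}{512\,\text{B sectors}} = 32\times$$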

Also, even if we try to optimize the I/Os somehow, each row now has its parity data in different place(s). That means that after we expand a 3-disk RAIDZ1 to a 4-disk RAIDZ1, each disk on average will be able to read only about 2 ashifts' worth of data before hitting a parity block. The ZIO aggregation code will likely extend the read over the gap, but that means every disk will on average lose about 33% of its linear speed.

Another way of thinking of this is that without expansion, each read only hits (vdev_width - parity) disks, i.e. it doesn't have to read from the "parity disks". If you have many concurrent reads outstanding (at least vdev_width reads), you should be able to average slightly more than the max iops of a single disk, because each read leaves its parity disks idle (and different reads leave different "parity disks" idle).

After expansion, each read hits all the disks in the vdev. In the worst case I think we'd expect (parity / old_width) reduction.
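
For the 3-disk RAIDZ1 expanded to a 4-disk RAIDZ1 case above, that worst case works out to

$$\frac{\text{parity}}{\text{old\_width}} = \frac{1}{3} \approx 33\%$$

which lines up with the ~33% figure you mention.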

Do you have any thoughts on how we could improve this?

@amotin (Member) commented Jan 16, 2020

With the default recordsize=128K, and a 10-wide RAIDZ2, and 4K-sector disks, we would have around 4x as many zio_t's, because there are 4 sectors of this block on each disk. (for 512 byte sectors it would be 32x). I agree we'll want to measure the practical impact of this and see if there are some shortcuts we can take in the zio code to minimize the per-zio overhead for these.

This worries me, since even now the completion of aggregated 16KB requests causes a burst of ZIOs, stressing random code paths and locks. Multiplying that by 32 does not sound promising to me. It has to be tested, but I'm not sure we have many optimization possibilities left in the ZIO pipeline, considering how much would be needed.

And I would personally target this feature not at somebody running a 10-wide RAIDZ2, but at a 3-wide RAIDZ1, i.e. SOHO. I think people running a 10-wide RAIDZ2 are a different market who should not bother adding a single disk, especially if they won't get the full benefit from it. Besides, we typically do not recommend RAIDZ vdevs that wide, due to very low IOPS and long rebuild times.

After expansion, each read hits all the disks in the vdev. In the worst case I think we'd expect (parity / old_width) reduction.

Yes. Instead of the (new_width - parity) / (old_width - parity) increase you would get in the offload/reload case.
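
For the same 3-disk to 4-disk RAIDZ1 example, that would be

$$\frac{\text{new\_width} - \text{parity}}{\text{old\_width} - \text{parity}} = \frac{4-1}{3-1} = 1.5\times$$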

Do you have any thoughts on how we could improve this?

Not really. The first problem might be approached by reading the whole column and then reshuffling the data, but that is an additional memory copy, which is also not great. The problem of non-sequential data placement probably has no solution with this approach to expansion.

@justinclift

Would a subsequent (optional?) pass to re-layout / optimise the data afterwards be feasible?

@ahrens ahrens self-assigned this Jun 10, 2021
@ahrens (Member, Author) commented Jun 11, 2021

Superseded by #12225

@ahrens ahrens closed this Jun 11, 2021
@Fmstrat commented Jun 11, 2021

@ahrens Since this was superseded, does this mean contributors will no longer get credit for commits? (Granted, my part was the most minuscule thing on earth, but it looks like there were a number of contributors in the commits.)

@ahrens (Member, Author) commented Jun 11, 2021

@Fmstrat The OpenZFS project generally requires PRs to be squashed to one commit, with multiple authorships preserved via Contributions-by commit message tags. I'm sorry I missed you and @thorsteneb! I've added you both to the commit message and the PR description of #12225. Let me know if there's anyone else that I missed.

@Fmstrat commented Jun 11, 2021

@ahrens Oh, np at all. As I said, my contribution was about as small as one can be; I just remembered seeing other names in there when I submitted. Thanks! And thanks for all the work on this feature; I'm so happy to see it come out of alpha.

@justinclift commented Jun 11, 2021

Yep, half the people in the storage world are going to be happy when this is released (and even more so when it's production ready!). 😄
