
[WIP] raidz expansion, alpha preview 1 #8853

Closed
wants to merge 207 commits

Conversation

@ahrens (Member) commented Jun 4, 2019

Motivation and Context

This is an alpha-quality preview of RAID-Z expansion. This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally. This feature is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (which typically means doubling the number of disks).

For additional context as well as a design overview, see my short talk from the 2017 OpenZFS Developer Summit: [slides](http://www.open-zfs.org/w/images/6/68/RAIDZ_Expansion_v2.pdf), [video](https://www.youtube.com/watch?v=ZF8V7Tc9G28)

Description

Functionality that's currently implemented:

  • Can expand raidz device with zpool attach poolname raidz2-0 newdisk
  • Simple test script in scripts/raidz_expand_test.sh
  • During reflow/expansion:
    • All allocated space in device is rewritten (copied to its new location in the RAIDZ vdev)
    • Reflow happens in background over multiple txg’s
    • Reads and writes during reflow are handled
    • Can reboot or export/import; resumes after import (except if interrupted at the very beginning of reflow)
    • Progress is reported in zpool status
  • After expansion completes:
    • Can initiate additional expansions
    • Additional space available
    • Device failure and silent damage are handled
    • Can reboot or export/import
    • Status (completion time) reported in zpool status

Functionality that still needs to be implemented:

  • Add on-disk feature flag
  • Progress should be reported in terms of offset on disk, not bytes copied
  • Logical stripe width does not increase
  • Crash in the very beginning of reflow can trash the pool
  • Pool must be healthy during reflow
  • Does not use SIMD instructions for raidz math
  • Documentation
  • Automated tests!

This feature should only be used on test pools. The pool will eventually need to be DESTROYED, because the on-disk format will not be compatible with the final release. Additionally, there are currently bugs in RAID-Z expansion which can occasionally cause data loss.

I would especially appreciate it if anyone has time to write some automated tests for RAID-Z expansion in the ZFS Test Suite (including converting the raidz_expand_test.sh script into a proper test).

Sponsored by: The FreeBSD Foundation

How Has This Been Tested?

scripts/raidz_expand_test.sh
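
The script itself is not reproduced here, but a minimal manual smoke test along the same lines might look like the sketch below. Everything in it is illustrative (pool name, file-backed vdevs, sizes); the only syntax taken from this PR is the `zpool attach poolname raidz1-0 newdisk` form of the attach command, and it only works on a build that includes these patches.

```sh
#!/bin/sh
# Throwaway pool backed by sparse files -- for experimentation only; run as root.
truncate -s 2G /var/tmp/d1 /var/tmp/d2 /var/tmp/d3 /var/tmp/d4
zpool create testpool raidz1 /var/tmp/d1 /var/tmp/d2 /var/tmp/d3

# Write some data so the reflow has something to copy.
dd if=/dev/urandom of=/testpool/testfile bs=1M count=512

# Expand the raidz vdev by one disk.
zpool attach testpool raidz1-0 /var/tmp/d4

# Reflow progress is reported here while the expansion runs;
# once it completes, the extra capacity shows up in zpool list.
zpool status testpool
zpool list testpool
```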

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@ahrens added the labels "Status: Work in Progress" (Not yet ready for general review) and "Type: Feature" (Feature request or new feature) on Jun 4, 2019

@codecov bot commented Jun 4, 2019

Codecov Report

Merging #8853 (fe2ca28) into master (161ed82) will increase coverage by 4.46%.
The diff coverage is 92.79%.

❗ Current head fe2ca28 differs from pull request most recent head 7384b93. Consider uploading reports for the commit 7384b93 to get more accurate results

@@            Coverage Diff             @@
##           master    #8853      +/-   ##
==========================================
+ Coverage   75.17%   79.63%   +4.46%     
==========================================
  Files         402      394       -8     
  Lines      128071   125151    -2920     
==========================================
+ Hits        96283    99670    +3387     
+ Misses      31788    25481    -6307     
Flag Coverage Δ
kernel 80.38% <93.56%> (+1.62%) ⬆️
user 65.89% <88.58%> (+18.46%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
cmd/raidz_test/raidz_bench.c 0.00% <0.00%> (ø)
cmd/raidz_test/raidz_test.h 0.00% <ø> (ø)
include/sys/vdev_raidz_impl.h 100.00% <ø> (ø)
module/zfs/vdev_raidz.c 93.17% <ø> (-1.53%) ⬇️
cmd/ztest/ztest.c 77.82% <80.00%> (+70.57%) ⬆️
cmd/raidz_test/raidz_test.c 81.50% <89.28%> (-1.12%) ⬇️
module/zfs/vdev.c 90.36% <92.85%> (+1.00%) ⬆️
cmd/zpool/zpool_main.c 80.53% <93.02%> (-0.29%) ⬇️
lib/libzfs/libzfs_pool.c 73.55% <100.00%> (+2.51%) ⬆️
module/zfs/abd.c 90.72% <100.00%> (-4.24%) ⬇️
... and 275 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@kithrup (Contributor) left a comment


First pass, this looks ok. It's missing an overview in the code of how the expansion works (I found a couple of smaller comments talking about the reconstructions). It doesn't seem hugely complicated, which surprised me, but I was still looking for that overview in the code.

(Review comment on module/zfs/spa.c, resolved)
@ahrens (Member, Author) commented Jun 4, 2019

@kithrup Thanks for taking a look! The code still has a lot of missing pieces, including a "big theory statement" of how it works. I'm going to wait to write that until the remaining tricky pieces are implemented, so that I can be sure the comment reflects the details in the final design.

@kithrup (Contributor) commented Jun 4, 2019

No sweat. So far, as I said, it looks good.

@ronnyegner

@ahrens: I am willing to help (except coding). How?

@ahrens (Member, Author) commented Jun 6, 2019

@ronnyegner Thanks! Things aside from coding that you could help with include:

  • manual testing & reporting bugs
  • writing automated tests (shell scripting)
  • writing documentation

Feel free to get in touch with me directly if you'd like more specific direction.

@kwinz commented Jul 30, 2019

Thanks a lot @ahrens! In the talk you said that the existing old data keeps the old parity ratio, a.k.a. logical stripe width (at 6:50 and 13:20). What is the recommended way to convert the existing data to the new stripe width after expanding? Do I have to delete snapshots and mv data? Thanks in advance!

@ahrens (Member, Author) commented Jul 30, 2019

@kwinz In the current PR, all data has the old parity ratio (even newly-written data), so there's no way to do that currently. But before we integrate this, I'll make it so that newly-written data has the new data:parity ratio.

Then the easiest way to convert existing data would usually be to use zfs send ... | zfs receive ... within the same pool, and then delete the old datasets.
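
Concretely, once newly-written blocks get the new data:parity ratio, that conversion could look something like this sketch (dataset names are hypothetical, and snapshot/property/mountpoint handling is simplified):

```sh
# Copy a dataset within the same pool so its blocks are rewritten with
# the current data:parity ratio, then retire the old copy.
zfs snapshot tank/data@migrate
zfs send tank/data@migrate | zfs receive tank/data-new
# ...verify the new dataset and switch users/mountpoints over...
zfs destroy -r tank/data
zfs rename tank/data-new tank/data
```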

@mufunyo commented Aug 2, 2019

Then the easiest way to convert existing data would usually be to use zfs send ... | zfs receive ... within the same pool, and then delete the old datasets.

That sounds like a pretty janky way to do it. It'd be a lot more user-friendly if you could just rewrite a dataset in the new data:parity ratio with a dedicated command like zfs rewrite <dataset> that would do it in-place.

@snajpa (Contributor) commented Oct 20, 2019

On the send/recv note, maybe @mufunyo's zfs rewrite could be taken a bit further: perhaps we could do something like what KVM and similar systems do for live migration. I'm now wondering how many lines it would take to support a live rewrite, which would keep syncing until the delta is at a minimum, then grab the z_teardown_lock on the original zfsvfs and swap it with the new dataset's zfsvfs... that doesn't sound too crazy, does it?

cc @ahrens

@ronnyegner

@ahrens I have sent you two emails so far about wanting to help: one on June 6th and another on July 8th. No reply. Did you get them?

@ahrens (Member, Author) commented Nov 6, 2019

@ronnyegner The last email I have from you is January 2015, about device removal. I did reply to your comment on this PR on June 6th. #8853 (comment)

@kenthinson commented Nov 22, 2019

@kwinz In the current PR, all data has the old parity ratio (even newly-written data), so there's no way to do that currently. But before we integrate this, I'll make it so that newly-written data has the new data:parity ratio.

Then the easiest way to convert existing data would usually be to use zfs send ... | zfs receive ... within the same pool, and then delete the old datasets.

Hello. I was wondering whether a way to pause the reflow would be possible. That way we could do a workflow like this:

  1. Add the disk, with the reflow paused.
  2. zfs send ... | zfs receive to change the stripe width (without I/O contention from the reflow).
  3. Delete the old data.
  4. Resume the normal reflow (in case we as the user missed anything that needs to be moved around).

@ahrens (Member, Author) commented Nov 22, 2019

@kenthinson We could add a way to pause reflow. However, writes to locations that have not yet been reflown have to use the old stripe width. So the workflow you outlined won't achieve your goal. The send|recv won't change the stripe width until after the reflow has (at least mostly) completed.

@kenthinson

@ahrens ah, that's very interesting. I didn't think free space would have to be reflowed. Thanks for the info :)

@ahrens (Member, Author) commented Nov 22, 2019

@kenthinson Note that free space isn't copied, but it exists within whichever physical stripe width surrounds it. We track how far the reflow has progressed: whatever is before that point is in the new physical stripe width. We don't keep a map of which ranges are in which physical stripe width.

@scineram

Then don't you always have to use the old stripe width until the reflow completes? How can the stripe width be determined for a block written during the reflow?

@kenthinson

@ahrens again thanks for the information :)

@kenthinson commented Nov 22, 2019

@scineram that's an interesting question. I would love to understand that as well. However, I'm afraid it's above my head, and I don't want to bother Matt too much; he's a busy guy lol.

How I had thought it would work is that, as soon as the disk is added, writes would immediately start using the new stripe width. However, I can see how that could be a problem, because then there would be no way to cancel the operation without having to shrink the stripe width back to match the original set.

@owlshrimp

How safe is the reflow (once everything is all tidied up) conceptually? Would the array be in a degraded state? Unavailable?

@amotin (Member) commented Jan 16, 2020

@ahrens Hi Matt. I've looked through the patch and I have big worries about vdev performance after reflow:

  • When the logical and physical widths are different, the new code generates ashift-sized I/Os to every disk to read and process each row separately. Those will probably be aggregated later, but with ashift=9 I expect the CPU overhead to be huge.
  • Also, even if we try to optimize the I/Os somehow, each row now has its parity data in different place(s). That means that after we expand a 3-disk RAIDZ1 to a 4-disk RAIDZ1, each disk on average will be able to read only about 2 ashifts' worth of data before hitting a parity block. The ZIO aggregation code will likely extend the read over the gap, but that means every disk will on average lose about 33% of its linear speed.

I haven't tested this in practice yet, so please tell me that I am wrong here.

@ahrens (Member, Author) commented Jan 16, 2020

@amotin Thanks for taking a look!

When the logical and physical widths are different, the new code generates ashift-sized I/Os to every disk to read and process each row separately. Those will probably be aggregated later, but with ashift=9 I expect the CPU overhead to be huge.

They will definitely be aggregated later. So we will be processing the same amount of bytes as before, but we will have more zio_t's which will cause some more overhead. The increase in number of zio_t's would be greater with greater recordsize. With the default recordsize=128K, and a 10-wide RAIDZ2, and 4K-sector disks, we would have around 4x as many zio_t's, because there are 4 sectors of this block on each disk. (for 512 byte sectors it would be 32x). I agree we'll want to measure the practical impact of this and see if there are some shortcuts we can take in the zio code to minimize the per-zio overhead for these.
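
Spelled out, the arithmetic behind those figures is:

$$\frac{128\,\text{KiB record}}{(10-2)\ \text{data disks}} = 16\,\text{KiB per disk},\qquad \frac{16\,\text{KiB}}{4\,\text{KiB sectors}} = 4\times,\qquad \frac{16\,\text{KiB}}{512\,\text{B sectors}} = 32\times$$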

Also, even if we try to optimize the I/Os somehow, each row now has its parity data in different place(s). That means that after we expand a 3-disk RAIDZ1 to a 4-disk RAIDZ1, each disk on average will be able to read only about 2 ashifts' worth of data before hitting a parity block. The ZIO aggregation code will likely extend the read over the gap, but that means every disk will on average lose about 33% of its linear speed.

Another way of thinking of this is that without expansion, each read only hits (vdev_width - parity) disks, i.e. it doesn't have to read from the "parity disks". If you have many concurrent reads outstanding (at least vdev_width reads), you should be able to average slightly more than the max iops of a single disk, because each read leaves its parity disks idle (and different reads leave different "parity disks" idle).

After expansion, each read hits all the disks in the vdev. In the worst case I think we'd expect (parity / old_width) reduction.
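
For the 3-disk RAIDZ1 expanded to a 4-disk RAIDZ1 case above, that worst case works out to

$$\frac{\text{parity}}{\text{old\_width}} = \frac{1}{3} \approx 33\%$$

which lines up with the ~33% figure you mention.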

Do you have any thoughts on how we could improve this?

@amotin (Member) commented Jan 16, 2020

With the default recordsize=128K, and a 10-wide RAIDZ2, and 4K-sector disks, we would have around 4x as many zio_t's, because there are 4 sectors of this block on each disk. (for 512 byte sectors it would be 32x). I agree we'll want to measure the practical impact of this and see if there are some shortcuts we can take in the zio code to minimize the per-zio overhead for these.

This worries me, since even now the completion of aggregated 16KB requests causes a burst of ZIOs, stressing random code paths and locks. Multiplying that by 32 does not sound promising to me. It has to be tested, but I'm not sure we have many optimization possibilities left in the ZIO pipeline, considering how much would be needed.

And I would personally target this feature not at somebody running a 10-wide RAIDZ2, but at a 3-wide RAIDZ1, i.e. SOHO. I think people running a 10-wide RAIDZ2 are a different market who should not bother adding a single disk, especially if they won't get the full benefit from it. Besides, we typically do not recommend RAIDZ vdevs that wide, due to very low IOPS and long rebuild times.

After expansion, each read hits all the disks in the vdev. In the worst case I think we'd expect (parity / old_width) reduction.

Yes. Instead of the (new_width - parity) / (old_width - parity) increase you would get in the offload/reload case.
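
For the same 3-disk to 4-disk RAIDZ1 example, that would be

$$\frac{\text{new\_width} - \text{parity}}{\text{old\_width} - \text{parity}} = \frac{4-1}{3-1} = 1.5\times$$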

Do you have any thoughts on how we could improve this?

Not really. The first problem might be approached by reading the whole column and then reshuffling the data, but that is an additional memory copy, which is also not great. The problem of non-sequential data placement probably has no solution with this approach to expansion.

@justinclift

Would a subsequent (optional?) pass to re-layout / optimise the data afterwards be feasible?

@ahrens ahrens self-assigned this Jun 10, 2021
@ahrens (Member, Author) commented Jun 11, 2021

Superseded by #12225

@ahrens ahrens closed this Jun 11, 2021
@Fmstrat commented Jun 11, 2021

@ahrens Since this was superseded, does this mean contributors will no longer get credit for commits? (Granted, my part was the most minuscule thing on earth, but it looks like there were a number of contributors in the commits.)

@ahrens (Member, Author) commented Jun 11, 2021

@Fmstrat The OpenZFS project generally requires PRs to be squashed to one commit, with multiple authorships preserved via Contributions-by commit message tags. I'm sorry I missed you and @thorsteneb! I've added you both to the commit message and the PR description of #12225. Let me know if there's anyone else that I missed.

@Fmstrat commented Jun 11, 2021

@ahrens Oh, np at all. As I said, my contribution was about as small as one can be; I just remembered seeing other names in there when I submitted. Thanks! And thanks for all the work on this feature; I'm so happy to see it come out of alpha.

@justinclift commented Jun 11, 2021

Yep, half the people in the storage world are going to be happy when this is released (and even more so when it's production ready!). 😄
