[WIP] raidz expansion, alpha preview 1 #8853
Conversation
This is an alpha-quality preview of RAID-Z expansion. This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally. It is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (typically doubling the number of disks).

For additional context as well as a design overview, see my short talk from the 2017 OpenZFS Developer Summit: [slides](http://www.open-zfs.org/w/images/6/68/RAIDZ_Expansion_v2.pdf) [video](https://www.youtube.com/watch?v=ZF8V7Tc9G28)

Functionality that's currently implemented:

* Can expand a raidz vdev with `zpool attach poolname raidz2-0 newdisk` (see the example below)
* Simple test script in `scripts/raidz_expand_test.sh`
* During reflow/expansion:
    * All allocated space in the device is rewritten (copied to its new location in the RAID-Z vdev)
    * Reflow happens in the background over multiple txg's
    * Reads and writes during reflow are handled
    * Can reboot or export/import; the reflow resumes after import (except at the very beginning of the reflow)
    * Progress is reported in `zpool status`
* After expansion completes:
    * Can initiate additional expansions
    * Additional space is available
    * Device failure and silent damage are handled
    * Can reboot or export/import
    * Status (completion time) is reported in `zpool status`

Functionality that still needs to be implemented:

* Add an on-disk feature flag
* Progress should be reported in terms of offset on disk, not bytes copied
* Logical stripe width does not increase
* A crash at the very beginning of the reflow can trash the pool
* The pool must be healthy during the reflow
* Does not use SIMD instructions for raidz math
* Documentation
* Automated tests!

This feature should only be used on test pools. The pool will eventually need to be **DESTROYED**, because the on-disk format will not be compatible with the final release. Additionally, there are currently bugs in RAID-Z expansion which can occasionally cause data loss.

I would especially appreciate it if anyone has time to write some automated tests for RAID-Z expansion in the ZFS Test Suite (including converting the `raidz_expand_test.sh` script into a proper test).

Sponsored by: The FreeBSD Foundation
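For anyone who wants to try the attach workflow above on a disposable pool, here is a rough sketch using sparse file vdevs. The pool name, file paths, and sizes are made up for illustration, and the raidz vdev name (`raidz1-0` here) should be checked against the actual `zpool status` output.

```sh
# Disposable test pool on sparse file vdevs (all names and paths are examples).
truncate -s 2G /var/tmp/d1 /var/tmp/d2 /var/tmp/d3 /var/tmp/d4
sudo zpool create testpool raidz1 /var/tmp/d1 /var/tmp/d2 /var/tmp/d3

# Expand the raidz group by one disk; "raidz1-0" is the vdev name that
# `zpool status testpool` reports for this pool.
sudo zpool attach testpool raidz1-0 /var/tmp/d4

# Watch the reflow progress, then the added capacity once it completes.
zpool status testpool
zpool list testpool
```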
Codecov Report
| Coverage Diff | master | #8853 | +/- |
|---|---|---|---|
| Coverage | 75.17% | 79.63% | +4.46% |
| Files | 402 | 394 | -8 |
| Lines | 128071 | 125151 | -2920 |
| Hits | 96283 | 99670 | +3387 |
| Misses | 31788 | 25481 | -6307 |
|
First pass, this looks ok. It's missing an overview in the code of how the expansion works (I found a couple smaller ones talking about the reconstructions). It doesn't seem hugely complicated, which surprised me, but I was still looking for it in the code.
@kithrup Thanks for taking a look! The code still has a lot of missing pieces, including a "big theory statement" of how it works. I'm going to wait to write that until the remaining tricky pieces are implemented, so that I can be sure the comment reflects the details in the final design. |
No sweat. So far, as I said, it looks good. |
@ahrens: I am willing to help (except coding). How? |
@ronnyegner Thanks! Things aside from coding that you could help with include:
Feel free to get in touch with me directly if you'd like more specific direction. |
Thanks a lot @ahrens! In the talk you said that the existing old data keeps the old parity ratio aka. logical stripe width (at 6:50 and 13:20). What is the recommended way to convert the existing data to the new stripe width after expanding? Do I have to delete snapshots and |
@kwinz In the current PR, all data has the old parity ratio (even newly-written data), so there's no way to do that currently. But before we integrate this, I'll make it so that newly-written data has the new data:parity ratio. Then the easiest way to convert existing data would usually be to use `zfs send | zfs recv` to copy it within the pool. |
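For reference, the send/recv approach mentioned above would look roughly like the sketch below. The pool and dataset names are hypothetical; the received copy is simply rewritten with whatever data:parity ratio new writes get at that point.

```sh
# Rewrite a dataset by copying it within the pool (hypothetical names).
zfs snapshot -r tank/data@rewrite
zfs send -R tank/data@rewrite | zfs receive tank/data.new

# After verifying the copy, swap the datasets.
zfs destroy -r tank/data
zfs rename tank/data.new tank/data
```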
That sounds like a pretty janky way to do it. It'd be a lot more user-friendly if you could just rewrite a dataset in the new data:parity ratio with a dedicated command. |
On the send/recv note, maybe @mufunyo's cc @ahrens |
@ahrens I've sent you two emails so far about wanting to help, with no reply: one on 6th June and another on 8th July. Did you get them? |
@ronnyegner The last email I have from you is from January 2015, about device removal. I did reply to your comment on this PR on June 6th. #8853 (comment) |
Hello. I was wondering if a way to pause the reflow would be possible. That way we could do a workflow like this:
|
@kenthinson We could add a way to pause reflow. However, writes to locations that have not yet been reflown have to use the old stripe width. So the workflow you outlined won't achieve your goal. The send|recv won't change the stripe width until after the reflow has (at least mostly) completed. |
@ahrens ah. That's very interesting. I didn't think free space would have to be reflowed. Thanks for the info :) |
@kenthinson Note that free space isn't copied, but it exists in whichever physical stripe width surrounds it. We track how far the reflow has progressed: whatever's before that point is in the new physical stripe width. We don't keep a map of which ranges are in which physical stripe widths. |
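As an illustration only (the real bookkeeping lives in the raidz vdev code, not in a script), the single-offset scheme described above boils down to a comparison like this; the offsets are made-up numbers:

```sh
# Toy model of the "how far has been reflowed" bookkeeping (illustrative only).
reflow_progress=$((40 * 1024 * 1024 * 1024))    # hypothetical: 40 GiB reflowed so far

layout_for_offset() {
    # $1 = a block's starting offset within the raidz vdev, in bytes
    if [ "$1" -lt "$reflow_progress" ]; then
        echo "new physical stripe width"        # already copied to the expanded layout
    else
        echo "old physical stripe width"        # not yet reflowed
    fi
}

layout_for_offset $((10 * 1024 * 1024 * 1024))  # prints: new physical stripe width
layout_for_offset $((90 * 1024 * 1024 * 1024))  # prints: old physical stripe width
```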
Then don't you have to always use the old stripe width before reflow completes? How can stripe width be determined for a block written during reflow? |
@ahrens again thanks for the information :) |
@scineram that's an interesting question. I would love to understand that as well. However, I'm afraid it's above my head, and I don't want to bother Matt too much; he's a busy guy, lol. The way I had thought it would work is that writes start using the new stripe width immediately, as soon as the disk is added. However, I can see how that could be a problem, because then you have no way to cancel the operation without having to deal with shrinking the stripe width back to match the original set. |
How safe is the reflow (once everything is all tidied up) conceptually? Would the array be in a degraded state? Unavailable? |
@ahrens Hi Matt. I've looked through the patch and I have big worries about vdev performance after the reflow:
I haven't tested this in practice yet, so please tell me that I am wrong here. |
@amotin Thanks for taking a look!
They will definitely be aggregated later, so we will be processing the same number of bytes as before, but we will have more zio_t's, which will cause some more overhead. The increase in the number of zio_t's would be greater with a larger recordsize. With the default recordsize=128K, a 10-wide RAIDZ2, and 4K-sector disks, we would have around 4x as many zio_t's, because there are 4 sectors of this block on each disk (for 512-byte sectors it would be 32x). I agree we'll want to measure the practical impact of this and see if there are some shortcuts we can take in the zio code to minimize the per-zio overhead for these.
Another way of thinking of this is that without expansion, each read only hits (vdev_width - parity) disks, i.e. it doesn't have to read from the "parity disks". If you have many concurrent reads outstanding (at least vdev_width reads), you should be able to average slightly more than the max iops of a single disk, because each read leaves its parity disks idle (and different reads leave different "parity disks" idle). After expansion, each read hits all the disks in the vdev. In the worst case I think we'd expect (parity / old_width) reduction. Do you have any thoughts on how we could improve this? |
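To make the multipliers above concrete, here is the arithmetic spelled out with the same example numbers (these are illustrative figures from the comment, not measurements):

```sh
# Example figures only: 128K records on a 10-wide RAIDZ2.
recordsize=$((128 * 1024))
width=10
parity=2

data_per_disk=$(( recordsize / (width - parity) ))                                 # 16K per disk
echo "4K sectors:   $(( data_per_disk / 4096 )) sectors of this block per disk"    # -> 4
echo "512B sectors: $(( data_per_disk / 512 )) sectors of this block per disk"     # -> 32
```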
This worries me, since even now the completion of aggregated 16KB requests causes a burst of ZIOs, stressing random code paths and locks. Multiplying that by 32 does not sound promising to me. It has to be tested, but I'm not sure we have many optimization possibilities left in the ZIO pipeline, considering how much would be needed. And I would personally target this feature not at somebody with a 10-wide RAIDZ2, but at a 3-wide RAIDZ1, i.e. SOHO. I think people with a 10-wide RAIDZ2 are just a different market, who should not bother adding a single disk, especially if they won't get the full benefit from it. Besides, we typically do not recommend RAIDZ vdevs that wide, due to very low IOPS and long rebuild times.
Yes. Instead of the (new_width - parity) / (old_width - parity) increase you would get in the offload/reload case.
Not really. The first problem might be approached by reading the whole column and then reshuffling the data, but that means an additional memory copy -- also not great. The problem of nonsequential data placement probably has no solution with this approach to expansion. |
Would a subsequent (optional?) pass to re-layout / optimise the data afterwards be feasible? |
ztest: Fix raidz children count
Convert scripts: raidz_expand_doublewrite.sh raidz_expand_resilver.sh…
Convert testcases scratch
fix long lines and formatting
Superseded by #12225 |
@ahrens Since this was superseded, does this mean contributors will no longer get credit for commits? (Granted, my part was the most minuscule thing on earth, but looks like there were a number in the commits) |
@Fmstrat The OpenZFS project generally requires PR's to be squashed to one commit, with multiple authorships preserved via the |
@ahrens Oh np at all, as I said, my contribution was about as small as one can be, I just remembered seeing other names in there when I submitted. Thanks! And thanks for all the work on this feature, so happy to see it come out of alpha. |
Yep, half the people in the storage world are going to be happy when this is released (and even more so when it's production-ready!). 😄