Backup allocation class vdev data to the pool #16073
Conversation
Just to quantify the amount of data we would be backing up a little, here's what I saw on a regular pool with a special device, default settings (but compression=off):

New data from a [...]

New data from decompressing the 6.8.4 Linux kernel tarball: [...]

So the special metadata is ~0.5-1.34% of the total data. Obviously, if you are setting [...]
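For anyone who wants to reproduce this kind of measurement, here is a minimal sketch using file-backed placeholder vdevs; the file paths, pool name, and tarball path are my assumptions, not the setup used above:

```sh
# Minimal sketch (file-backed placeholders, not the original test setup)
truncate -s 4G /var/tmp/normal.img /var/tmp/special.img
zpool create -O compression=off tank /var/tmp/normal.img special /var/tmp/special.img

# Write some data, e.g. extract a kernel source tarball
tar -xf linux-6.8.4.tar.xz -C /tank

# Compare the special vdev's ALLOC against the pool total
zpool list -v tank
```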
I haven't looked inside yet, but my first thought is that it will break the special vdev use case with dRAID, where it is not (only) about read performance, but about space efficiency.
Yeah, this seems like it really badly breaks many of the use cases involved. Like, "I would advocate this be disabled by default" level of "breaks expectations".

edit: Thinking about it a little more, I think my suggestion would be that you cut a new top-level vdev type if this is useful, since:
@amotin if there's a particular dRAID config that you're interested in, I can give it a test. I tried using a 106-disk dRAID pool with this test script. When backup to pool is enabled, I don't see the total disk space go up that much relative to the total amount of data (and weirdly the allocation on the special device drops):
I mean, I'd expect the special's allocation to drop, since you've told it to spend one allocation not on the special, and absent small blocks that's all metadata (e.g. n=2 or 3 copies), right?
@rincebrain ah ok, so irrespective of this PR, all metadata on special is copies >=2 by default, and special small blocks is copies = 1? Interesting. That brings up the question of which is safer: two copies on 'special' (probably NVMe), or one copy on special (NVMe) and one on pool (HDD)? My gut feeling would be the latter, since the data is then on two physical drives.
I mean, all metadata, no matter where, is copies >= 2, unless you twiddle [...].

The latter is technically safer, depending on the failure case you're optimizing for, but then you've lost all the performance benefits for the write path for the special vdev, which is why I'd suggest you call it something different, and not make it the default for existing use cases. (We already call different "special" vdev specializations other things, e.g. "dedup", so we could add a few more cases without that much strangeness, I think...)

IOW, I don't think you should break people's expectations for the performance characteristics of doing [...].

(I spent a little while contemplating whether you could do something evil and cute like a bit in the bp indicating that you should check an indirection table or something for DVA[2] and then you can do that write async, to let you have your cake and eat it too for something like this, but I don't personally have a use case for it, so I didn't go try implementing it...)
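For background on the copies discussion above (not part of this PR), the existing dataset-level properties that control how many copies of metadata and data ZFS keeps can be inspected with standard commands; the pool name here is a placeholder:

```sh
# Illustrative only: show the dataset properties governing metadata
# redundancy and user-data copy count, independent of this PR
zfs get redundant_metadata,copies tank
```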
@tonyhutter First make sure you have ashift=12 (AKA 512e/4Kn disks). After that, for dRAID it is not the total number of disks in the vdev that matters, but the number of disks in the redundancy group, which is 8 by default but can be increased.
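A minimal sketch of such a configuration, using file-backed placeholder vdevs, ashift=12, and a wider-than-default redundancy group (16 data disks plus double parity); this is illustrative only, not the configuration tested in the thread:

```sh
# 18 file-backed children: one draid2 group of 16 data + 2 parity, no spares,
# plus a mirrored special vdev (all paths are placeholders)
truncate -s 1G /var/tmp/d{01..18} /var/tmp/s{1,2}
zpool create -o ashift=12 tank draid2:16d:18c /var/tmp/d{01..18} \
    special mirror /var/tmp/s1 /var/tmp/s2
zpool status tank
```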
@tonyhutter do you have some specific use cases in mind for this? Any design notes? I'm not sure yet what to make of this as a concept, and I'm not sure when I'd want to use it, and would like some guidance.

(At least, from the referenced issue, it sounds like this is more on the "add disks to the pool over time" side of the line, vs. "plan and build the whole pool up front"? That's not necessarily bad, I just want to make sure I've got the right hat on.)
@tonyhutter Generally I agree that writing all copies of a block to the same vdev is a bad idea. But to me this echoes a thought that I have already voiced several times on a few other threads: we should split the concept of the special vdev into finer-grained parts. If we used the special vdev only as a metadata accelerator, your approach would make some sense. But right now the special vdev also handles small allocations, and from that perspective it does not. If we first introduced that finer-grained control, many things could be more reasonable.
@amotin I see what you're saying about dRAID. For small 4K special writes, it's going to amplify them by [blocks * (data stripe + parity)] on the dRAID pool. So on a draid2:8:1, where each single-sector block costs a full stripe of 8 data + 2 parity sectors, you're going to get 10x amplification for backing up special metadata. A partial workaround would be to print a big warning like "Note: you're probably going to waste ~X% pool space backing up your special device" when you do a [...].

Just to try to get some more real-world numbers, here's the [...]: https://gist.github.com/tonyhutter/ab3f0ee1c0c0e9a132cb802a766862ec

The pool was one normal disk and one special disk, using only the zfs defaults. I see this line: [...]

But that seems odd for 433B to be the average block size for metadata, so maybe I'm reading it wrong.
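A rough restatement of the amplification arithmetic above, under the stated assumptions (ashift=12, so each small special block is one 4K sector, and a draid2 group charging a full 8-data + 2-parity stripe per block):

```sh
# Back-of-the-envelope amplification estimate for draid2:8:1
data=8; parity=2
echo "$((data + parity))x space amplification per backed-up 4K block"
```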
It gets much, much worse the larger the ashift gets. But if you look above it, you can see:
so we have mostly directory objects, which can be [...]
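As background (not quoted from the thread), per-object-type block statistics of the kind being discussed here (average block size, directory objects, and so on) can be dumped with zdb's block traversal; the pool name is a placeholder:

```sh
# Block statistics with a per-type breakdown; repeat -b for more detail
zdb -bb tank
```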
This commit allows you to automatically back up allocation class vdevs to the pool. If the alloc class vdev is fully backed up, it can fail without the pool losing any data. This also means you can safely create pools with non-matching alloc class redundancy (like a mirrored pool with a single special device).

It works by making sure all alloc class writes have at least two DVA copies, and then having the 2nd copy always go to the pool itself. So whenever you write to an alloc class vdev, another copy of the data is also written to the pool.

This behavior is controlled via three properties:

1. feature@allow_backup_to_pool - This feature flag enables the backup subsystem. It also prevents the backed-up pool from being imported read/write on an older version of ZFS that does not support alloc class backups.

2. backup_alloc_class_to_pool - This pool property is the main on/off switch to control the backup feature. It is on by default but can be turned off at any time. Once it is turned off, all existing vdevs will no longer be considered fully backed up.

3. backup_to_pool - This is a read-only vdev property that will report "on" if all the data on the vdev is fully backed up to the pool.

Note that the backup to pool feature is now enabled by default on all new pools. This may create a performance penalty over pure alloc class writes due to the extra backup copy written to the pool. Alloc class reads should not be affected, as they always read from DVA 0 first (the copy of the data on the special device).

Closes: openzfs#15118
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Let's leave this PR on ice for now. I'm looking into a much better way of backing up the special metadata to the pool. Details to follow.
Unfortunately this is not going to work, as the dedup data is per-pool. I've since created a newer, simplified version of this PR that I think will work better: #16185. Closing this PR in favor of the new one.
Motivation and Context
Write through all allocation class data to the pool so alloc class vdevs can fail without data loss. Closes: #15118
Description
This commit allows you to automatically back up allocation class vdevs to the pool. If the alloc class vdev is fully backed up, it can fail without the pool losing any data. This also means you can safely create pools with non-matching alloc class redundancy (like a mirrored pool with a single special device).
It works by making sure all alloc class writes have at least two DVA copies, and then having the 2nd copy always go to the pool itself. So whenever you write to an alloc class vdev, another copy of the data is also written to the pool.
This behavior is controlled via three properties:
1. feature@allow_backup_to_pool - This feature flag enables the backup subsystem. It also prevents the backed-up pool from being imported read/write on an older version of ZFS that does not support alloc class backups.
2. backup_alloc_class_to_pool - This pool property is the main on/off switch to control the backup feature. It is on by default but can be turned off at any time. Once it is turned off, all existing alloc class vdevs will no longer be considered fully backed up.
3. backup_to_pool - This is a read-only vdev property that will report "on" if all the data on the vdev is fully backed up to the pool.

Note that the backup to pool feature is now enabled by default on all new pools. This may create a performance penalty over pure alloc class writes due to the extra backup copy written to the pool. Alloc class reads should not be affected, as they always read from DVA 0 first (the copy of the data on the special device).
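A hedged sketch of how these properties might be exercised from the command line; the property names come from this PR's description, but the exact invocations and the vdev name below are assumptions:

```sh
# Check the feature flag and the pool-level on/off switch
zpool get feature@allow_backup_to_pool tank
zpool get backup_alloc_class_to_pool tank

# Disable the backup behavior for the pool
zpool set backup_alloc_class_to_pool=off tank

# Check whether a given alloc class vdev is still fully backed up
# ("special-vdev" is a placeholder for the vdev name)
zpool get backup_to_pool tank special-vdev
```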
How Has This Been Tested?
Test cases added.
Types of changes
Checklist:
Signed-off-by.