
Backup allocation class vdev data to the pool #16073

Closed

Conversation

tonyhutter
Contributor

Motivation and Context

Write-through all allocation class data to the pool so alloc class vdevs can fail without data loss. Closes: #15118

Description

This commit allows you to automatically back up allocation class vdevs to the pool. If the alloc class vdev is fully backed up, it can fail without the pool losing any data. This also means you can safely create pools with non-matching alloc class redundancy (like a mirrored pool with a single special device).

It works by making sure all alloc class writes have at least two DVA copies, and then having the 2nd copy always go to the pool itself. So whenever you write to an alloc class vdev, another copy of the data is also written to the pool.

This behavior is controlled via three properties:

  1. feature@allow_backup_to_pool - This feature flag enables the backup subsystem. It also prevents the backed-up pool from being imported read/write on an older version of ZFS that does not support alloc class backups.

  2. backup_alloc_class_to_pool - This pool property is the main on/off switch for the backup feature. It is on by default but can be turned off at any time. Once it is turned off, all existing alloc class vdevs are no longer considered to be fully backed up.

  3. backup_to_pool - This is a read-only vdev property that will report "on" if all the data on the vdev is fully backed up to the pool.

Note that the backup to pool feature is now enabled by default on all new pools. This may create a performance penalty over pure alloc class writes due to the extra backup copy written to the pool. Alloc class reads should not be affected, as they always read from DVA 0 first (the copy of the data on the special device).
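
As a rough illustration of how these properties would be exercised from the command line (a sketch only, based on the property names described above; the pool layout, disk names, and exact syntax are illustrative and may differ from the final implementation):

# Create a pool whose special vdev does not match the pool's redundancy;
# with the backup feature, the single special disk is not a single point
# of failure (pool and disk names are placeholders).
zpool create tank mirror sda sdb special sdc

# The feature flag and the pool-level on/off switch described above:
zpool get feature@allow_backup_to_pool tank
zpool get backup_alloc_class_to_pool tank

# Turning the pool property off means existing alloc class vdevs are no
# longer considered fully backed up.
zpool set backup_alloc_class_to_pool=off tank

# Read-only vdev property reporting whether the vdev's data is fully
# backed up to the pool.
zpool get backup_to_pool tank sdc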

How Has This Been Tested?

Test cases added.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

tonyhutter added the Type: Feature label on Apr 9, 2024
@tonyhutter
Contributor Author

Just to quantify the amount of data we would be backing up, here's what I saw on a regular pool with a special device, using default settings (but compression=off):

New data from a git clone of zfs:

normal pool data : 392M
special vdev data: 2.04M

New data from decompressing the 6.8.4 Linux kernel tarball:

normal pool data : 2.21G
special vdev data: 28.3M

So the special vdev metadata is ~0.5-1.34% of the total data.

Obviously, if you are setting special_small_blocks then the results will be much different.

@amotin
Member

amotin commented Apr 9, 2024

I haven't looked inside yet, but my first thought is that it will break the special vdev use case with dRAID, where it is not (only) about read performance but about space efficiency.

@rincebrain
Contributor

rincebrain commented Apr 9, 2024

Yeah, this seems like it really badly breaks many of the use cases involved.

Like, "I would advocate this be disabled by default" level of "breaks expectations".

Edit: Thinking about it a little more, my suggestion would be that you cut a new top-level vdev type if this is useful, since:

  • you're ~never going to see the property of "fully backed up" become true if it was ever false, since some of that data's almost never getting rewritten
  • it makes it much more explicitly clear that the behavior is different at a glance of e.g. zpool status versus having to query the property

@tonyhutter
Contributor Author

@amotin if there's a particular dRAID config that you're interested in, I can give it a test. I tried using a 106-disk dRAID pool with this test script. When backup to pool is enabled, I don't see the total disk space go up that much relative to the total amount of data (and weirdly the allocation on the special device drops):

--- ARGS: -o feature@allow_backup_to_pool=disabled : before ---
NAME  PROPERTY   VALUE  SOURCE
tank  allocated  153K   -
NAME               PROPERTY   VALUE  SOURCE
/tmp/file-special  allocated  153K   -
--- ARGS: -o feature@allow_backup_to_pool=disabled : after ---
NAME  PROPERTY   VALUE  SOURCE
tank  allocated  9.81G  -
NAME               PROPERTY   VALUE  SOURCE
/tmp/file-special  allocated  105M   -
NAME               PROPERTY        VALUE           SOURCE
/tmp/file-special  backup_to_pool  off             -

--- ARGS: -o feature@allow_backup_to_pool=enabled : before ---
NAME  PROPERTY   VALUE  SOURCE
tank  allocated  394K   -
NAME               PROPERTY   VALUE  SOURCE
/tmp/file-special  allocated  101K   -
--- ARGS: -o feature@allow_backup_to_pool=enabled : after ---
NAME  PROPERTY   VALUE  SOURCE
tank  allocated  9.98G  -
NAME               PROPERTY   VALUE  SOURCE
/tmp/file-special  allocated  52.5M  -
NAME               PROPERTY        VALUE           SOURCE
/tmp/file-special  backup_to_pool  on              -

@rincebrain
Contributor

I mean, I'd expect the special's allocation to drop, since you've told it to spend one allocation not on the special, and absent small blocks that's all metadata (e.g. n=2 or 3 copies), right?

@tonyhutter
Copy link
Contributor Author

@rincebrain ah ok, so irrespective of this PR, all metadata on special is copies >=2 by default, and special small blocks is copies = 1? Interesting. That brings up the question of which is safer: two copies on 'special' (probably NVMe), or one copy on special (NVMe) and one on pool (HDD)? My gut feeling would be the latter, since the data is then on two physical drives.

@rincebrain
Contributor

rincebrain commented Apr 9, 2024

I mean, all metadata, no matter where, is copies >= 2, unless you twiddle redundant_metadata.

The latter is technically safer, depending on the failure case you're optimizing for, but then you've lost all the performance benefits for the write path for the special vdev, which is why I'd suggest you call it something different, and not make it the default for existing use cases. (We already call different "special" vdev specializations other things, e.g. "dedup", so we could add a few more cases without that much strangeness, I think...)

IOW, I don't think you should break people's expectations for the performance characteristics of doing zpool create mypool raidz2 disk1...4 special mirror disk5 disk6 between 2.2 and 2.3.

(I spent a little while contemplating whether you could do something evil and cute like a bit in the bp indicating that you should check an indirection table or something for DVA[2] and then you can do that write async, to let you have your cake and eat it too for something like this, but I don't personally have a use case for it, so I didn't go try implementing it...)
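
For context, the copies and redundant_metadata knobs mentioned above are existing dataset properties, independent of this PR. A minimal sketch of inspecting and adjusting them (dataset names are placeholders):

# Metadata normally gets extra copies regardless of vdev layout;
# redundant_metadata controls how much of it keeps the extra copy,
# and copies controls user data.
zfs get redundant_metadata,copies tank
zfs set redundant_metadata=most tank   # extra copies only for the most critical metadata
zfs set copies=2 tank/important        # also keep two copies of user data for this dataset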

@amotin
Member

amotin commented Apr 9, 2024

@tonyhutter First, make sure you have ashift=12 (AKA 512e/4Kn disks). After that, for dRAID it is not the total number of disks in the vdev that matters, but the number of disks in the redundancy group, which is 8 by default but can be increased.

@robn
Member

robn commented Apr 10, 2024

@tonyhutter do you have some specific use cases in mind for this? Any design notes? I'm not sure yet what to make of this as a concept, and I'm not sure when I'd want to use it, and would like some guidance.

(At least, from the referenced issue, it sounds like this is more on the "add disks to the pool over time" side of the line, vs "plan and build the whole pool up front"? That's not necessarily bad, I just want to make sure I've got the right hat on.)

@tonyhutter
Contributor Author

tonyhutter commented Apr 10, 2024

> @tonyhutter do you have some specific use cases in mind for this?

  1. It was a requested feature (Write-through special device #15118)
  2. It lowers the barrier to using alloc class devices since you don't need to match the redundancy of the pool. It makes using single-drive special vdevs very low-stakes.
  3. In the common case where the metadata is only ~1% of all data, the extra write of that 1% to the pool is pretty insignificant.
  4. We once had a case with a pool of HDDs and NVMe special devices, where the NVMe drives all disappeared and re-appeared (maybe a bad PCIe switch?). A reboot cleared it up and we didn't lose any data, but it still would have been nice to know the data was backed up.

@amotin
Member

amotin commented Apr 10, 2024

@tonyhutter Generally I agree that writing all copies of a block to the same vdev is a bad idea. But to me this echoes a thought I have already voiced several times in a few other threads: we should split the concept of the special vdev into finer-grained parts. If the special vdev were used only as a metadata accelerator, your approach would make some sense. But right now the special vdev also handles small allocations, and from that perspective it does not. If we first introduced that finer-grained control, many things could be more reasonable.

@tonyhutter
Contributor Author

tonyhutter commented Apr 10, 2024

@amotin I see what you're saying about dRAID. For small 4k special writes, it's going to amplify it by [blocks * [data stripe + parity]] to the dRAID pool. So on a draid2:8:1, you're going to get 10x amplification for backing up special metadata. A partial workaround would be to print a big warning like "Note: you're probably going to waste ~X% pool space backing up your special device" when you do a zpool create with dRAID & special.
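
A back-of-envelope check of the 10x figure above, assuming ashift=12 and that dRAID pads each backup allocation out to a full 8 data + 2 parity stripe:

echo $(( (8 + 2) * 4096 ))   # 40960 bytes allocated in the pool per 4K backup copy, i.e. ~10x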

Just to try to get some more real-world numbers, here's the zdb -PLbbbs output from a pool containing a Fedora 38 root file system:

https://gist.github.com/tonyhutter/ab3f0ee1c0c0e9a132cb802a766862ec

The pool was one normal disk and one special disk, using only the zfs defaults. I see this line:

1132772	20992134144	10042183680	10135813632	 8947	 2.09	100.00	Total
431992	2716620288	93478912	187077120	  433	29.06	  1.85	Metadata Total
(In the "Metadata Total" row, 433 is the average block size in bytes and 1.85 is the percent of the pool.)

But 433 bytes seems like an odd average block size for metadata, so maybe I'm reading it wrong.

@rincebrain
Contributor

It gets much, much worse the larger the ashift gets.

But if you look above it, you can see:

 32516	532742144	59620352	119252480	 3667	 8.94	  1.18	    L0 DMU dnode
380961	224260608	11895296	23790592	   62	18.85	  0.23	    L0 ZFS directory
 14863	1948123136	20012032	40025600	 2692	97.35	  0.39	    L1 Total
431992	2716620288	93478912	187077120	  433	29.06	  1.85	Metadata Total

so we have mostly directory objects which can be embedded_data here, 4k dnodes that compress down to 3.5k or less sometimes, and L1 blocks that compress down a fair bit.

@tonyhutter
Contributor Author

Let's leave this PR on ice for now. I'm looking into a much better way of backing up the special metadata to the pool. Details to follow.

@tonyhutter
Contributor Author

I think the allocation class backups could better be done via a dataset property. Please see #16095 for details and discussion. If you want to comment please post in #16095.

tonyhutter mentioned this pull request on May 10, 2024
@tonyhutter
Contributor Author

> I think the allocation class backups could better be done via a dataset property. Please see #16095 for details and discussion. If you want to comment please post in #16095.

Unfortunately this is not going to work, as the dedup data is per-pool.

I've since created a newer, simplified version of this PR that I think will work better: #16185. Closing this PR in favor of the new one.

tonyhutter closed this on May 10, 2024