
Backup allocation class vdev data to the pool #16073

Closed

Conversation

tonyhutter
Contributor

Motivation and Context

Write-through all allocation class data to the pool so alloc class vdevs can fail without data loss. Closes: #15118

Description

This commit allows you to automatically back up allocation class vdevs to the pool. If the alloc class vdev is fully backed up, it can fail without the pool losing any data. This also means you can safely create pools with non-matching alloc class redundancy (like a mirrored pool with a single special device).

It works by making sure all alloc class writes have at least two DVA copies, and then having the 2nd copy always go to the pool itself. So whenever you write to an alloc class vdev, another copy of the data is also written to the pool.

This behavior is controlled via three properties:

  1. feature@allow_backup_to_pool - This feature flag enables the backup subsystem. It also prevents the backed-up pool from being imported read/write on an older version of ZFS that does not support alloc class backups.

  2. backup_alloc_class_to_pool - This pool property is the main on/off switch for the backup feature. It is on by default but can be turned off at any time. Once it is turned off, all existing alloc class vdevs are no longer considered to be fully backed up.

  3. backup_to_pool - This is a read-only vdev property that will report "on" if all the data on the vdev is fully backed up to the pool.

Note that the backup to pool feature is now enabled by default on all new pools. This may create a performance penalty over pure alloc class writes due to the extra backup copy written to the pool. Alloc class reads should not be affected, as they always read from DVA 0 first (the copy of the data on the special device).
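
As a rough illustration of how these properties would be exercised from the command line (a sketch only, based on the property names described above; the pool layout, disk names, and exact syntax are illustrative and may differ from the final implementation):

# Create a pool whose special vdev does not match the pool's redundancy;
# with the backup feature, the single special disk is not a single point
# of failure (pool and disk names are placeholders).
zpool create tank mirror sda sdb special sdc

# The feature flag and the pool-level on/off switch described above:
zpool get feature@allow_backup_to_pool tank
zpool get backup_alloc_class_to_pool tank

# Turning the pool property off means existing alloc class vdevs are no
# longer considered fully backed up.
zpool set backup_alloc_class_to_pool=off tank

# Read-only vdev property reporting whether the vdev's data is fully
# backed up to the pool.
zpool get backup_to_pool tank sdc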

How Has This Been Tested?

Test cases added.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

tonyhutter added the Type: Feature label on Apr 9, 2024
@tonyhutter
Contributor Author

Just to quantify the amount of data we would be backing up, here's what I saw on a regular pool with a special device, using default settings (but compression=off):

New data from a git clone of zfs:

normal pool data : 392M
special vdev data: 2.04M

New data from decompressing the 6.8.4 Linux kernel tarball:

normal pool data : 2.21G
special vdev data: 28.3M

So the special vdev metadata is ~0.5-1.34% of the total data.

Obviously, if you are setting special_small_blocks then the results will be much different.

@amotin
Member

amotin commented Apr 9, 2024

I haven't looked inside yet, but my first thought is that it will break the special vdev use case with dRAID, where it is not (only) about read performance but about space efficiency.

@rincebrain
Contributor

rincebrain commented Apr 9, 2024

Yeah, this seems like it really badly breaks many of the use cases involved.

Like, "I would advocate this be disabled by default" level of "breaks expectations".

Edit: Thinking about it a little more, my suggestion would be that you cut a new top-level vdev type if this is useful, since:

  • you're ~never going to see the property of "fully backed up" become true if it was ever false, since some of that data's almost never getting rewritten
  • it makes it much more explicitly clear that the behavior is different at a glance of e.g. zpool status versus having to query the property

@tonyhutter
Contributor Author

@amotin if there's a particular dRAID config that you're interested in, I can give it a test. I tried using a 106-disk dRAID pool with this test script. When backup to pool is enabled, I don't see the total disk space go up that much relative to the total amount of data (and weirdly the allocation on the special device drops):

--- ARGS: -o feature@allow_backup_to_pool=disabled : before ---
NAME  PROPERTY   VALUE  SOURCE
tank  allocated  153K   -
NAME               PROPERTY   VALUE  SOURCE
/tmp/file-special  allocated  153K   -
--- ARGS: -o feature@allow_backup_to_pool=disabled : after ---
NAME  PROPERTY   VALUE  SOURCE
tank  allocated  9.81G  -
NAME               PROPERTY   VALUE  SOURCE
/tmp/file-special  allocated  105M   -
NAME               PROPERTY        VALUE           SOURCE
/tmp/file-special  backup_to_pool  off             -

--- ARGS: -o feature@allow_backup_to_pool=enabled : before ---
NAME  PROPERTY   VALUE  SOURCE
tank  allocated  394K   -
NAME               PROPERTY   VALUE  SOURCE
/tmp/file-special  allocated  101K   -
--- ARGS: -o feature@allow_backup_to_pool=enabled : after ---
NAME  PROPERTY   VALUE  SOURCE
tank  allocated  9.98G  -
NAME               PROPERTY   VALUE  SOURCE
/tmp/file-special  allocated  52.5M  -
NAME               PROPERTY        VALUE           SOURCE
/tmp/file-special  backup_to_pool  on              -

@rincebrain
Contributor

I mean, I'd expect the special's allocation to drop, since you've told it to spend one allocation not on the special, and absent small blocks that's all metadata (e.g. n=2 or 3 copies), right?

@tonyhutter
Copy link
Contributor Author

@rincebrain ah ok, so irrespective of this PR, all metadata on special is copies >=2 by default, and special small blocks is copies = 1? Interesting. That brings up the question of which is safer: two copies on 'special' (probably NVMe), or one copy on special (NVMe) and one on pool (HDD)? My gut feeling would be the latter, since the data is then on two physical drives.

@rincebrain
Contributor

rincebrain commented Apr 9, 2024

I mean, all metadata, no matter where, is copies >= 2, unless you twiddle redundant_metadata.

The latter is technically safer, depending on the failure case you're optimizing for, but then you've lost all the performance benefits for the write path for the special vdev, which is why I'd suggest you call it something different, and not make it the default for existing use cases. (We already call different "special" vdev specializations other things, e.g. "dedup", so we could add a few more cases without that much strangeness, I think...)

IOW, I don't think you should break people's expectations for the performance characteristics of doing zpool create mypool raidz2 disk1...4 special mirror disk5 disk6 between 2.2 and 2.3.

(I spent a little while contemplating whether you could do something evil and cute like a bit in the bp indicating that you should check an indirection table or something for DVA[2] and then you can do that write async, to let you have your cake and eat it too for something like this, but I don't personally have a use case for it, so I didn't go try implementing it...)
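
For context, the copies and redundant_metadata knobs mentioned above are existing dataset properties, independent of this PR. A minimal sketch of inspecting and adjusting them (dataset names are placeholders):

# Metadata normally gets extra copies regardless of vdev layout;
# redundant_metadata controls how much of it keeps the extra copy,
# and copies controls user data.
zfs get redundant_metadata,copies tank
zfs set redundant_metadata=most tank   # extra copies only for the most critical metadata
zfs set copies=2 tank/important        # also keep two copies of user data for this dataset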

@amotin
Member

amotin commented Apr 9, 2024

@tonyhutter First, make sure you have ashift=12 (AKA 512e/4Kn disks). After that, for dRAID it is not the total number of disks in the vdev that matters, but the number of disks in the redundancy group, which is 8 by default but can be increased.

@robn
Member

robn commented Apr 10, 2024

@tonyhutter do you have some specific use cases in mind for this? Any design notes? I'm not sure yet what to make of this as a concept, and I'm not sure when I'd want to use it, and would like some guidance.

(At least, from the referenced issue, it sounds like this is more on the "add disks to the pool over time" side of the line, vs "plan and build the whole pool up front"? That's not necessarily bad, I just want to make sure I've got the right hat on.)

@tonyhutter
Contributor Author

tonyhutter commented Apr 10, 2024

> @tonyhutter do you have some specific use cases in mind for this?

  1. It was a requested feature (Write-through special device #15118)
  2. It lowers the barrier to using alloc class devices since you don't need to match the redundancy of the pool. It makes using single-drive special vdevs very low-stakes.
  3. In the common case where the metadata is only ~1% of all data, the extra write of that 1% to the pool is pretty insignificant.
  4. We once had a case with a pool of HDDs and NVMe special devices, where the NVMe drives all disappeared and re-appeared (maybe a bad PCIe switch?). A reboot cleared it up and we didn't lose any data, but it still would have been nice to know the data was backed up.

@amotin
Member

amotin commented Apr 10, 2024

@tonyhutter Generally I agree that writing all copies of a block to the same vdev is a bad idea. But to me this echoes a thought I have already voiced several times in a few other threads: we should split the concept of the special vdev into finer-grained parts. If the special vdev were used only as a metadata accelerator, your approach would make some sense. But right now the special vdev also handles small allocations, and from that perspective it does not. If we first introduced that finer-grained control, many things could be more reasonable.

@tonyhutter
Contributor Author

tonyhutter commented Apr 10, 2024

@amotin I see what you're saying about dRAID. For small 4k special writes, it's going to amplify it by [blocks * [data stripe + parity]] to the dRAID pool. So on a draid2:8:1, you're going to get 10x amplification for backing up special metadata. A partial workaround would be to print a big warning like "Note: you're probably going to waste ~X% pool space backing up your special device" when you do a zpool create with dRAID & special.
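
A back-of-envelope check of the 10x figure above, assuming ashift=12 and that dRAID pads each backup allocation out to a full 8 data + 2 parity stripe:

echo $(( (8 + 2) * 4096 ))   # 40960 bytes allocated in the pool per 4K backup copy, i.e. ~10x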

Just to try to get some more real-world numbers, here's the zdb -PLbbbs output from a pool containing a Fedora 38 root file system:

https://gist.github.com/tonyhutter/ab3f0ee1c0c0e9a132cb802a766862ec

The pool was one normal disk and one special disk, using only the zfs defaults. I see this line:

1132772	20992134144	10042183680	10135813632	 8947	 2.09	100.00	Total
431992	2716620288	93478912	187077120	  433	29.06	  1.85	Metadata Total
(In the "Metadata Total" row, 433 is the average block size in bytes and 1.85 is the percent of the pool.)

But 433 bytes seems like an odd average block size for metadata, so maybe I'm reading it wrong.

@rincebrain
Contributor

It gets much, much worse the larger the ashift gets.

But if you look above it, you can see:

 32516	532742144	59620352	119252480	 3667	 8.94	  1.18	    L0 DMU dnode
380961	224260608	11895296	23790592	   62	18.85	  0.23	    L0 ZFS directory
 14863	1948123136	20012032	40025600	 2692	97.35	  0.39	    L1 Total
431992	2716620288	93478912	187077120	  433	29.06	  1.85	Metadata Total

so we have mostly directory objects which can be embedded_data here, 4k dnodes that compress down to 3.5k or less sometimes, and L1 blocks that compress down a fair bit.

@tonyhutter
Contributor Author

Let's leave this PR on ice for now. I'm looking into a much better way of backing up the special metadata to the pool. Details to follow.

@tonyhutter
Contributor Author

I think the allocation class backups could better be done via a dataset property. Please see #16095 for details and discussion. If you want to comment please post in #16095.

tonyhutter mentioned this pull request on May 10, 2024
@tonyhutter
Contributor Author

> I think the allocation class backups could better be done via a dataset property. Please see #16095 for details and discussion. If you want to comment please post in #16095.

Unfortunately this is not going to work, as the dedup data is per-pool.

I've since created a newer, simplified version of this PR that I think will work better: #16185. Closing this PR in favor of the new one.

tonyhutter closed this on May 10, 2024