Backup allocation class vdev data to the pool
This commit allows you to automatically backup allocation class vdevs to
the pool. If the alloc class vdev is fully backed up, it can fail
without the pool losing any data.  This also means you can safely create
pools with non-matching alloc class redundancy (like a mirrored pool
with a single special device).

It works by making sure all alloc class writes have at least two DVA
copies, and then having the 2nd copy always go to the pool itself. So
whenever you write to an alloc class vdev, another copy of the data is
also written to the pool.

This behavior is controlled via three properties:

1. feature@allow_backup_to_pool - This feature flag enables the backup
   subsystem.  It also prevents the backed-up pool from being imported
   read/write on an older version of ZFS that does not support alloc
   class backups.

2. backup_alloc_class_to_pool - This pool property is the main on/off
   switch to control the backup feature.  It is on by default but can be
   turned off at any time.  Once it is turned off, all existing
   vdevs will no longer be considered fully backed up.

3. backup_to_pool - This is a read-only vdev property that will report
   "on" if all the data on the vdev is fully backed up to the pool.

Note that the backup to pool feature is now enabled by default on all
new pools.  This may incur a performance penalty compared to pure alloc
class writes, due to the extra backup copy written to the pool.  Alloc
class reads should not be affected, as they always read from DVA 0
first (the copy of the data on the special device).

Closes: #15118

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
tonyhutter committed Apr 9, 2024
1 parent 162cc80 commit 8281c31
Showing 52 changed files with 2,468 additions and 277 deletions.
56 changes: 51 additions & 5 deletions cmd/zpool/zpool_vdev.c
@@ -480,13 +480,43 @@ is_raidz_draid(replication_level_t *a, replication_level_t *b)
return (B_FALSE);
}

/*
* Return true if 'props' contains either:
*
* feature@allow_backup_to_pool=disabled
*
* or
*
* backup_alloc_class_to_pool=off
*/
static boolean_t
is_backup_to_pool_disabled_in_props(nvlist_t *props)
{
const char *str = NULL;
if (nvlist_lookup_string(props, "feature@allow_backup_to_pool",
&str) == 0) {
if ((str != NULL) && strcmp(str, "disabled") == 0) {
return (B_TRUE); /* It is disabled */
}
}

if (nvlist_lookup_string(props, "backup_alloc_class_to_pool",
&str) == 0) {
if ((str != NULL) && strcmp(str, "off") == 0) {
return (B_TRUE); /* It is disabled */
}
}

return (B_FALSE);
}

/*
* Given a list of toplevel vdevs, return the current replication level. If
* the config is inconsistent, then NULL is returned. If 'fatal' is set, then
* an error message will be displayed for each self-inconsistent vdev.
*/
static replication_level_t *
get_replication(nvlist_t *nvroot, boolean_t fatal)
get_replication(nvlist_t *props, nvlist_t *nvroot, boolean_t fatal)
{
nvlist_t **top;
uint_t t, toplevels;
@@ -507,6 +537,7 @@ get_replication(nvlist_t *nvroot, boolean_t fatal)

for (t = 0; t < toplevels; t++) {
uint64_t is_log = B_FALSE;
const char *str = NULL;

nv = top[t];

@@ -518,6 +549,21 @@ get_replication(nvlist_t *nvroot, boolean_t fatal)
if (is_log)
continue;

/*
* By default, all alloc class devices have their backup to pool
* props enabled, so their replication level doesn't matter.
* However, if they're disabled for any reason, then we do need
* to force redundancy.
*/
(void) nvlist_lookup_string(nv, ZPOOL_CONFIG_ALLOCATION_BIAS,
&str);
if (str &&
((strcmp(str, VDEV_ALLOC_BIAS_SPECIAL) == 0) ||
(strcmp(str, VDEV_ALLOC_BIAS_DEDUP) == 0))) {
if (!is_backup_to_pool_disabled_in_props(props))
continue; /* We're backed up, skip redundancy */
}

/*
* Ignore holes introduced by removing aux devices, along
* with indirect vdevs introduced by previously removed
@@ -808,7 +854,7 @@ get_replication(nvlist_t *nvroot, boolean_t fatal)
* report any difference between the two.
*/
static int
check_replication(nvlist_t *config, nvlist_t *newroot)
check_replication(nvlist_t *props, nvlist_t *config, nvlist_t *newroot)
{
nvlist_t **child;
uint_t children;
@@ -825,7 +871,7 @@ check_replication(nvlist_t *config, nvlist_t *newroot)

verify(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
&nvroot) == 0);
if ((current = get_replication(nvroot, B_FALSE)) == NULL)
if ((current = get_replication(props, nvroot, B_FALSE)) == NULL)
return (0);
}
/*
@@ -850,7 +896,7 @@ check_replication(nvlist_t *config, nvlist_t *newroot)
* Get the replication level of the new vdev spec, reporting any
* inconsistencies found.
*/
if ((new = get_replication(newroot, B_TRUE)) == NULL) {
if ((new = get_replication(props, newroot, B_TRUE)) == NULL) {
free(current);
return (-1);
}
@@ -1888,7 +1934,7 @@ make_root_vdev(zpool_handle_t *zhp, nvlist_t *props, int force, int check_rep,
* found. We include the existing pool spec, if any, as we need to
* catch changes against the existing replication level.
*/
if (check_rep && check_replication(poolconfig, newroot) != 0) {
if (check_rep && check_replication(props, poolconfig, newroot) != 0) {
nvlist_free(newroot);
return (NULL);
}
4 changes: 4 additions & 0 deletions include/sys/fs/zfs.h
@@ -258,6 +258,7 @@ typedef enum {
ZPOOL_PROP_BCLONEUSED,
ZPOOL_PROP_BCLONESAVED,
ZPOOL_PROP_BCLONERATIO,
ZPOOL_PROP_BACKUP_ALLOC_CLASS_TO_POOL,
ZPOOL_NUM_PROPS
} zpool_prop_t;

@@ -368,6 +369,7 @@ typedef enum {
VDEV_PROP_RAIDZ_EXPANDING,
VDEV_PROP_SLOW_IO_N,
VDEV_PROP_SLOW_IO_T,
VDEV_PROP_BACKUP_TO_POOL,
VDEV_NUM_PROPS
} vdev_prop_t;

@@ -845,6 +847,7 @@ typedef struct zpool_load_policy {
#define ZPOOL_CONFIG_EXPANSION_TIME "expansion_time" /* not stored */
#define ZPOOL_CONFIG_REBUILD_STATS "org.openzfs:rebuild_stats"
#define ZPOOL_CONFIG_COMPATIBILITY "compatibility"
#define ZPOOL_CONFIG_BACKUP_TO_POOL "backup_to_pool"

/*
* The persistent vdev state is stored as separate values rather than a single
@@ -1604,6 +1607,7 @@ typedef enum {
ZFS_ERR_CRYPTO_NOTSUP,
ZFS_ERR_RAIDZ_EXPAND_IN_PROGRESS,
ZFS_ERR_ASHIFT_MISMATCH,
ZFS_ERR_BACKUP_DISABLED_BUT_REQUESTED,
} zfs_errno_t;

/*
3 changes: 2 additions & 1 deletion include/sys/spa.h
@@ -1113,7 +1113,8 @@ extern boolean_t spa_remap_blkptr(spa_t *spa, blkptr_t *bp,
extern uint64_t spa_get_last_removal_txg(spa_t *spa);
extern boolean_t spa_trust_config(spa_t *spa);
extern uint64_t spa_missing_tvds_allowed(spa_t *spa);
extern void spa_set_missing_tvds(spa_t *spa, uint64_t missing);
extern void spa_set_missing_tvds(spa_t *spa, uint64_t missing,
uint64_t missing_special);
extern boolean_t spa_top_vdevs_spacemap_addressable(spa_t *spa);
extern uint64_t spa_total_metaslabs(spa_t *spa);
extern boolean_t spa_multihost(spa_t *spa);
9 changes: 9 additions & 0 deletions include/sys/spa_impl.h
@@ -327,6 +327,12 @@ struct spa {
uint64_t spa_missing_tvds; /* unopenable tvds on load */
uint64_t spa_missing_tvds_allowed; /* allow loading spa? */

/*
* number of 'spa_missing_tvds' that are alloc class devices
* backed up to the pool, and thus recoverable from errors.
*/
uint64_t spa_missing_recovered_tvds;

uint64_t spa_nonallocating_dspace;
spa_removing_phys_t spa_removing_phys;
spa_vdev_removal_t *spa_vdev_removal;
@@ -465,6 +471,9 @@ struct spa {
*/
spa_config_lock_t spa_config_lock[SCL_LOCKS]; /* config changes */
zfs_refcount_t spa_refcount; /* number of opens */

/* Backup special/dedup devices data to the pool */
boolean_t spa_backup_alloc_class;
};

extern char *spa_config_path;
18 changes: 18 additions & 0 deletions include/sys/vdev.h
@@ -172,6 +172,24 @@ extern uint32_t vdev_queue_length(vdev_t *vd);
extern uint64_t vdev_queue_last_offset(vdev_t *vd);
extern uint64_t vdev_queue_class_length(vdev_t *vq, zio_priority_t p);

typedef enum {
/* (special flag) dry-run, get count only */
VDEV_ARRAY_COUNT = 1ULL << 0,

VDEV_ARRAY_ANY_LEAF = 1ULL << 1, /* match any leaf */
VDEV_ARRAY_SPECIAL_LEAF = 1ULL << 2, /* match special vdev leaves */
VDEV_ARRAY_DEDUP_LEAF = 1ULL << 3, /* match dedup vdev leaves */
} vdev_array_flag_t;

struct vdev_array
{
vdev_t **vds; /* Array of vdev_t's */
int count;
};

extern struct vdev_array *vdev_array_alloc(vdev_t *rvd, uint64_t flags);
extern void vdev_array_free(struct vdev_array *vda);

extern void vdev_config_dirty(vdev_t *vd);
extern void vdev_config_clean(vdev_t *vd);
extern int vdev_config_sync(vdev_t **svd, int svdcount, uint64_t txg);
12 changes: 12 additions & 0 deletions include/sys/vdev_impl.h
@@ -284,6 +284,13 @@ struct vdev {
uint64_t vdev_failfast; /* device failfast setting */
boolean_t vdev_rz_expanding; /* raidz is being expanded? */
boolean_t vdev_ishole; /* is a hole in the namespace */

/*
* If this is set to true, then all the data on this vdev is backed up
* to the pool. This is only used by allocation class devices.
*/
boolean_t vdev_backup_to_pool;

uint64_t vdev_top_zap;
vdev_alloc_bias_t vdev_alloc_bias; /* metaslab allocation bias */

@@ -641,6 +648,11 @@ extern int vdev_obsolete_counts_are_precise(vdev_t *vd, boolean_t *are_precise);
int vdev_checkpoint_sm_object(vdev_t *vd, uint64_t *sm_obj);
void vdev_metaslab_group_create(vdev_t *vd);
uint64_t vdev_best_ashift(uint64_t logical, uint64_t a, uint64_t b);
extern boolean_t vdev_is_fully_backed_up(vdev_t *vd);
extern boolean_t vdev_is_leaf(vdev_t *vd);
extern boolean_t vdev_is_special(vdev_t *vd);
extern boolean_t vdev_is_dedup(vdev_t *vd);
extern boolean_t vdev_is_alloc_class(vdev_t *vd);

/*
* Vdev ashift optimization tunables
1 change: 1 addition & 0 deletions include/zfeature_common.h
@@ -82,6 +82,7 @@ typedef enum spa_feature {
SPA_FEATURE_AVZ_V2,
SPA_FEATURE_REDACTION_LIST_SPILL,
SPA_FEATURE_RAIDZ_EXPANSION,
SPA_FEATURE_ALLOW_BACKUP_TO_POOL,
SPA_FEATURES
} spa_feature_t;

6 changes: 6 additions & 0 deletions lib/libzfs/libzfs_util.c
@@ -774,6 +774,12 @@ zpool_standard_error_fmt(libzfs_handle_t *hdl, int error, const char *fmt, ...)
case ZFS_ERR_ASHIFT_MISMATCH:
zfs_verror(hdl, EZFS_ASHIFT_MISMATCH, fmt, ap);
break;
case ZFS_ERR_BACKUP_DISABLED_BUT_REQUESTED:
zfs_error_aux(hdl, dgettext(TEXT_DOMAIN,
"Cannot enable backup to pool since "
"feature@allow_backup_to_pool is not active."));
zfs_verror(hdl, EZFS_IOC_NOTSUPPORTED, fmt, ap);
break;
default:
zfs_error_aux(hdl, "%s", zfs_strerror(error));
zfs_verror(hdl, EZFS_UNKNOWN, fmt, ap);
10 changes: 5 additions & 5 deletions lib/libzutil/zutil_import.c
@@ -1924,7 +1924,7 @@ zpool_find_config(libpc_handle_t *hdl, const char *target, nvlist_t **configp,

/* Return if a vdev is a leaf vdev. Note: draid spares are leaf vdevs. */
static boolean_t
vdev_is_leaf(nvlist_t *nv)
vdev_is_leaf_nv(nvlist_t *nv)
{
uint_t children = 0;
nvlist_t **child;
@@ -1937,10 +1937,10 @@ vdev_is_leaf(nv)

/* Return if a vdev is a leaf vdev and a real device (disk or file) */
static boolean_t
vdev_is_real_leaf(nvlist_t *nv)
vdev_is_real_leaf_nv(nvlist_t *nv)
{
const char *type = NULL;
if (!vdev_is_leaf(nv))
if (!vdev_is_leaf_nv(nv))
return (B_FALSE);

(void) nvlist_lookup_string(nv, ZPOOL_CONFIG_TYPE, &type);
@@ -1973,7 +1973,7 @@ __for_each_vdev_macro_helper_func(void *state, nvlist_t *nv, void *last_nv,

/* The very first entry in the NV list is a special case */
if (*((nvlist_t **)state) == (nvlist_t *)FIRST_NV) {
if (real_leaves_only && !vdev_is_real_leaf(nv))
if (real_leaves_only && !vdev_is_real_leaf_nv(nv))
return (0);

*((nvlist_t **)last_nv) = nv;
@@ -1996,7 +1996,7 @@
* we want.
*/
if (*(nvlist_t **)state == (nvlist_t *)NEXT_IS_MATCH) {
if (real_leaves_only && !vdev_is_real_leaf(nv))
if (real_leaves_only && !vdev_is_real_leaf_nv(nv))
return (0);

*((nvlist_t **)last_nv) = nv;
19 changes: 19 additions & 0 deletions man/man7/vdevprops.7
@@ -148,6 +148,25 @@ If this device should perform new allocations, used to disable a device
when it is scheduled for later removal.
See
.Xr zpool-remove 8 .
.It Sy backup_to_pool
When
.Sy backup_to_pool
is "on" it means the vdev is fully backed up to the pool.
That is, there is an extra copy of all the vdev's data on the pool itself.
This allows vdevs with
.Sy backup_to_pool=on
to fail without losing data, regardless
of their redundancy level.
.Sy backup_to_pool
is only used for alloc class devices
(special and dedup) and is controlled by the
.Sy feature@allow_backup_to_pool
feature flag and
.Sy backup_alloc_class_to_pool
pool property.
The
.Sy backup_to_pool
vdev property is read-only.
.El
.Ss User Properties
In addition to the standard native properties, ZFS supports arbitrary user
40 changes: 40 additions & 0 deletions man/man7/zpool-features.7
@@ -322,6 +322,46 @@ With device removal, it can be returned to the
.Sy enabled
state if all the dedicated allocation class vdevs are removed.
.
.feature org.zfsonlinux allow_backup_to_pool yes allocation_classes
This feature allows the
.Sy backup_alloc_class_to_pool
pool property to be used.
When the
.Sy backup_alloc_class_to_pool
pool property is set to "on", all subsequent writes to allocation class vdevs
(like special and dedup vdevs) will also generate an additional copy of the data
to be written to the pool.
This allows alloc class vdev data to be "backed up" to the pool.
A fully backed up allocation class vdev can fail without causing the pool to be
suspended, even if the alloc class device is not redundant.
.Pp
It is important to note the difference between the
.Sy allow_backup_to_pool
feature flag and a
.Sy backup_alloc_class_to_pool
pool property since they appear similar.
The
.Sy allow_backup_to_pool
feature flag is a safeguard to prevent a pool that is backed up from being
imported read/write on an older version of ZFS that does not support backup to
pool (and possibly compromising the integrity of the backup guarantees).
The pool property is what actually allows you to turn on/off the backup copy
writes.
You can think of it as if the
.Sy allow_backup_to_pool
feature "unlocks" the
.Sy backup_alloc_class_to_pool
pool property.
See the
.Sy backup_alloc_class_to_pool
pool property and
.Sy backup_to_pool
vdev property for more details.
.Pp
This feature becomes
.Sy active
by default on new pools (unless explicitly disabled at zpool creation time).
.
.feature com.delphix async_destroy yes
Destroying a file system requires traversing all of its data in order to
return its used space to the pool.
22 changes: 18 additions & 4 deletions man/man7/zpoolconcepts.7
@@ -180,17 +180,31 @@ For more information, see the
section.
.It Sy dedup
A device solely dedicated for deduplication tables.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one dedup device is specified, then
allocations are load-balanced between those devices.
Dedup vdevs only need to match the redundancy level of the normal devices
if they are not being backed up to the pool (backup is the default).
See the
.Sy feature@allow_backup_to_pool
feature flag,
.Sy backup_alloc_class_to_pool
pool property and
.Sy backup_to_pool
vdev property for more details.
.It Sy special
A device dedicated solely for allocating various kinds of internal metadata,
and optionally small file blocks.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one special device is specified, then
allocations are load-balanced between those devices.
Special vdevs only need to match the redundancy level of the normal devices
if they are not being backed up to the pool (backup is the default).
See the
.Sy feature@allow_backup_to_pool
feature flag,
.Sy backup_alloc_class_to_pool
pool property and
.Sy backup_to_pool
vdev property for more details.
.Pp
For more information on special allocations, see the
.Sx Special Allocation Class