zpool: allow sharing of spare device among pools #7999
Conversation
cmd/zpool/zpool_vdev.c (outdated)
@@ -423,7 +423,7 @@ check_disk(const char *path, blkid_cache cache, int force,
 	if (!iswholedisk)
 		return (check_slice(path, cache, force, isspare));
 
-	if ((fd = open(path, O_RDONLY|O_DIRECT|O_EXCL)) < 0) {
+	if ((fd = open(path, O_RDONLY|O_DIRECT)) < 0) {
From 8c54ddd:
* zpool_add_005_pos.ksh - ... Add O_EXCL to the in-use check to prevent the -f (force) option from working for mounted filesystems and improve the resulting error.
We'll need to find some other way to handle this if O_EXCL is removed.
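For context, here is a minimal stand-alone sketch (an illustration only, not the actual check_disk() code) of what O_EXCL buys us: opening a whole-disk block device with O_EXCL fails with EBUSY while one of its partitions is mounted or the disk is otherwise claimed, so the in-use check can reject the device even when -f is given.

#define _GNU_SOURCE		/* for O_DIRECT on Linux */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	/* device path is just an example; pass any whole-disk device */
	const char *path = argc > 1 ? argv[1] : "/dev/sdc";
	int fd = open(path, O_RDONLY | O_DIRECT | O_EXCL);

	if (fd < 0) {
		/* a mounted partition typically yields EBUSY (16) here */
		printf("%s is in use: %s (errno=%d)\n",
		    path, strerror(errno), errno);
		return (1);
	}
	printf("%s opened exclusively; not otherwise in use\n", path);
	(void) close(fd);
	return (0);
}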
It seems O_EXCL is only buying us a slightly better error message here, with and without -f:
System status (pools and mounted fs on /dev/sdc1):
root@linux:/usr/src/zfs# zpool status
pool: p1
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
p1 ONLINE 0 0 0
sda ONLINE 0 0 0
errors: No known data errors
pool: p2
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
p2 ONLINE 0 0 0
sdb ONLINE 0 0 0
errors: No known data errors
root@linux:/usr/src/zfs# df -h /dev/sdc1
Filesystem Size Used Avail Use% Mounted on
/dev/sdc1 120M 1.6M 110M 2% /mnt
root@linux:/usr/src/zfs#
Previous behaviour:
root@linux:/usr/src/zfs# zpool add p1 spare sdc
/dev/sdc is in use and contains a unknown filesystem.
root@linux:/usr/src/zfs# zpool add p1 spare sdc1
invalid vdev specification
use '-f' to override the following errors:
/dev/sdc1 contains a filesystem of type 'ext4'
root@linux:/usr/src/zfs#
root@linux:/usr/src/zfs#
root@linux:/usr/src/zfs# zpool add -f p1 spare sdc
/dev/sdc is in use and contains a unknown filesystem.
root@linux:/usr/src/zfs# zpool add -f p1 spare sdc1
cannot open '/dev/sdc1': Device or resource busy
cannot add to 'p1': one or more vdevs refer to the same device
root@linux:/usr/src/zfs#
Patched behaviour:
root@linux:/usr/src/zfs# zpool add p1 spare sdc
cannot label 'sdc': cannot label '/dev/sdc': unable to open device: 16
root@linux:/usr/src/zfs# zpool add p1 spare sdc1
invalid vdev specification
use '-f' to override the following errors:
/dev/sdc1 contains a filesystem of type 'ext4'
root@linux:/usr/src/zfs#
root@linux:/usr/src/zfs#
root@linux:/usr/src/zfs# zpool add -f p1 spare sdc
cannot label 'sdc': cannot label '/dev/sdc': unable to open device: 16
root@linux:/usr/src/zfs# zpool add -f p1 spare sdc1
cannot open '/dev/sdc1': Device or resource busy
cannot add to 'p1': one or more vdevs refer to the same device
root@linux:/usr/src/zfs#
I am going to try to keep the error "is in use and contains a unknown filesystem", though if I had to choose I would prefer a misleading error message (plus an optional "documentation" issue asking to improve it) over the missing functionality (shared spares).
Also, it seems this could use a new test in the "inuse" group.
Agree, it'd be nice to have both but we can live with a slightly worse error message to restore the missing functionality.
Force-pushed 0f35f38 to d738022.
@@ -547,7 +552,7 @@ is_spare(nvlist_t *config, const char *path)
 	uint_t i, nspares;
 	boolean_t inuse;
 
-	if ((fd = open(path, O_RDONLY)) < 0)
+	if ((fd = open(path, O_RDONLY|O_DIRECT)) < 0)
This seems to fix the spurious failures. I reproduced the issue on my CentOS 7 home builder. Sometimes the spare cannot be added to the second pool ("testpool1"):
[root@linux ~]# zpool status
pool: testpool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
testpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/var/tmp/file-safe-dev1 ONLINE 0 0 0
/var/tmp/file-fail-dev1 ONLINE 0 0 0
spares
sdd AVAIL
errors: No known data errors
pool: testpool1
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
testpool1 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
/var/tmp/file-safe-dev2 ONLINE 0 0 0
/var/tmp/file-fail-dev2 ONLINE 0 0 0
errors: No known data errors
[root@linux ~]# zpool add testpool1 spare sdd
cannot label 'sdd': cannot label '/dev/sdd': unable to open device: 16
[root@linux ~]# zpool add testpool1 spare sdd
cannot label 'sdd': cannot label '/dev/sdd': unable to open device: 16
[root@linux ~]# zpool add testpool1 spare sdd
cannot label 'sdd': cannot label '/dev/sdd': unable to open device: 16
[root@linux ~]#
The error is coming from zpool_label_disk(). We are (erroneously) trying to label the spare device, which cannot be opened with O_EXCL:
Breakpoint 1, 0x00007ffff5bf4cc0 in write () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff5bf4cc0 in write () from /lib64/libc.so.6
#1 0x00007ffff5b7f0c3 in _IO_new_file_write () from /lib64/libc.so.6
#2 0x00007ffff5b7f960 in __GI__IO_file_xsputn () from /lib64/libc.so.6
#3 0x00007ffff5b52d6d in buffered_vfprintf () from /lib64/libc.so.6
#4 0x00007ffff5b4d6be in vfprintf () from /lib64/libc.so.6
#5 0x00007ffff5b58297 in fprintf () from /lib64/libc.so.6
#6 0x00007ffff777ce32 in zfs_verror (hdl=0x62c080, error=2044, fmt=<optimized out>, ap=<optimized out>) at libzfs_util.c:328
#7 0x00007ffff777d172 in zfs_error_fmt (hdl=hdl@entry=0x62c080, error=error@entry=2044, fmt=fmt@entry=0x7ffff779a97a "%s") at libzfs_util.c:349
#8 0x00007ffff777d191 in zfs_error (hdl=hdl@entry=0x62c080, error=error@entry=2044, msg=msg@entry=0x7fffffff46d0 "cannot label 'sdd'") at libzfs_util.c:338
#9 0x00007ffff7771906 in zpool_label_disk (hdl=0x62c080, zhp=zhp@entry=0x62df30, name=name@entry=0x7fffffff6bc5 "sdd") at libzfs_pool.c:4579
#10 0x0000000000417aca in make_disks (zhp=zhp@entry=0x62df30, nv=0x62e438) at zpool_vdev.c:1302
#11 0x0000000000417c85 in make_disks (zhp=zhp@entry=0x62df30, nv=nv@entry=0x62e2d0) at zpool_vdev.c:1346
#12 0x00000000004196b6 in make_root_vdev (zhp=zhp@entry=0x62df30, props=0x0, force=force@entry=0, check_rep=check_rep@entry=1, replacing=replacing@entry=B_FALSE, dryrun=dryrun@entry=B_FALSE, argc=argc@entry=2,
argv=argv@entry=0x62ddf8) at zpool_vdev.c:1901
#13 0x0000000000411fcb in zpool_do_add (argc=2, argv=0x62ddf8) at zpool_main.c:739
#14 0x0000000000405d2f in main (argc=5, argv=0x7fffffffe588) at zpool_main.c:8659
(gdb) frame 9
#9 0x00007ffff7771906 in zpool_label_disk (hdl=0x62c080, zhp=zhp@entry=0x62df30, name=name@entry=0x7fffffff6bc5 "sdd") at libzfs_pool.c:4579
4579 return (zfs_error(hdl, EZFS_OPENFAILED, errbuf));
(gdb) list
4574 * This shouldn't happen. We've long since verified that this
4575 * is a valid device.
4576 */
4577 zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, "cannot "
4578 "label '%s': unable to open device: %d"), path, errno);
4579 return (zfs_error(hdl, EZFS_OPENFAILED, errbuf));
4580 }
4581
4582 if (efi_alloc_and_init(fd, EFI_NUMPAR, &vtoc) != 0) {
4583 /*
(gdb) p errno
$1 = 16
(gdb)
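For orientation, make_disks() only skips labeling a disk when is_spare() recognizes it as an in-use spare; a condensed paraphrase of that decision (reconstructed from the frames above, not copied literally from the source) looks roughly like this:

	/* make_disks(), zpool_vdev.c around lines 1288-1302 (paraphrased) */
	if (!is_spare(NULL, udevpath)) {
		/* not recognized as an in-use spare -> try to (re)label it */
		if (zpool_label_disk(g_zfs, zhp, name) == -1)
			return (errno);	/* here: EBUSY (16), device held by the kernel */
	}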
We should not be trying to label this device, since it's already attached as a spare to "testpool". The underlying issue is that we cannot find a valid label on this spare device (zpool_read_label() returns an empty config): this leads us to believe the device is not a valid (and "inuse") spare:
(gdb) bt
#0 zpool_read_label (fd=fd@entry=11, config=config@entry=0x7fffffff5a10, num_labels=num_labels@entry=0x0) at libzfs_import.c:1370
#1 0x00007ffff776641a in zpool_in_use (hdl=0x62c080, fd=fd@entry=11, state=state@entry=0x7fffffff5af4, namestr=namestr@entry=0x7fffffff5ae8, inuse=inuse@entry=0x7fffffff5ab8) at libzfs_import.c:2472
#2 0x00000000004174d6 in is_spare (config=config@entry=0x0, path=path@entry=0x7fffffff5bc0 "/dev/sdd1") at zpool_vdev.c:558
#3 0x00000000004179ec in make_disks (zhp=zhp@entry=0x62df30, nv=0x62e438) at zpool_vdev.c:1288
#4 0x0000000000417c85 in make_disks (zhp=zhp@entry=0x62df30, nv=nv@entry=0x62e2d0) at zpool_vdev.c:1346
#5 0x00000000004196b6 in make_root_vdev (zhp=zhp@entry=0x62df30, props=0x0, force=force@entry=0, check_rep=check_rep@entry=1, replacing=replacing@entry=B_FALSE, dryrun=dryrun@entry=B_FALSE, argc=argc@entry=2,
argv=argv@entry=0x62ddf8) at zpool_vdev.c:1901
#6 0x0000000000411fcb in zpool_do_add (argc=2, argv=0x62ddf8) at zpool_main.c:739
#7 0x0000000000405d2f in main (argc=5, argv=0x7fffffffe588) at zpool_main.c:8659
(gdb) fin
Run till exit from #0 zpool_read_label (fd=fd@entry=11, config=config@entry=0x7fffffff5a10, num_labels=num_labels@entry=0x0) at libzfs_import.c:1370
0x00007ffff776641a in zpool_in_use (hdl=0x62c080, fd=fd@entry=11, state=state@entry=0x7fffffff5af4, namestr=namestr@entry=0x7fffffff5ae8, inuse=inuse@entry=0x7fffffff5ab8) at libzfs_import.c:2472
2472 if (zpool_read_label(fd, &config, NULL) != 0) {
Value returned is $1 = 0
(gdb) list
2467 aux_cbdata_t cb = { 0 };
2468 boolean_t isactive;
2469
2470 *inuse = B_FALSE;
2471
2472 if (zpool_read_label(fd, &config, NULL) != 0) {
2473 (void) no_memory(hdl);
2474 return (-1);
2475 }
2476
(gdb) p config
$2 = (nvlist_t *) 0x0
(gdb) n
2477 if (config == NULL)
(gdb)
2628 }
(gdb)
is_spare (config=config@entry=0x0, path=path@entry=0x7fffffff5bc0 "/dev/sdd1") at zpool_vdev.c:559
559 !inuse ||
(gdb) p inuse
$3 = B_FALSE
(gdb) p state
$4 = POOL_STATE_ACTIVE
(gdb) n
558 if (zpool_in_use(g_zfs, fd, &state, &name, &inuse) != 0 ||
(gdb)
562 free(name);
(gdb)
563 (void) close(fd);
(gdb)
564 return (B_FALSE);
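In other words, the failing path boils down to the following (a condensed paraphrase of the code quoted in the gdb session above, not a literal copy): a buffered read of the label returns all zeroes, so zpool_in_use() reports the device as not in use, and is_spare() returns B_FALSE.

	/* zpool_in_use(), libzfs_import.c around line 2472 (paraphrased) */
	if (zpool_read_label(fd, &config, NULL) != 0) {
		(void) no_memory(hdl);
		return (-1);
	}
	if (config == NULL)		/* stale cached zeroes: "no label found" */
		goto done;		/* *inuse is left at B_FALSE */

	/* is_spare(), zpool_vdev.c around line 558 (paraphrased) */
	if (zpool_in_use(g_zfs, fd, &state, &name, &inuse) != 0 ||
	    !inuse ||
	    state != POOL_STATE_SPARE) {
		free(name);
		(void) close(fd);
		return (B_FALSE);	/* caller then tries to label the disk */
	}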
Trying to read the first label with dd shows the data being all zeroes:
[root@linux ~]# dd if=/dev/sdd1 bs=256K count=1 | hexdump -C
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
1+0 records in
1+0 records out
262144 bytes (262 kB) copied, 0.000906013 s, 289 MB/s
00040000
[root@linux ~]#
Ultimately, using O_DIRECT lets us read the real label:
[root@linux ~]# dd if=/dev/sdd1 bs=256K count=1 iflag=direct | hexdump -C
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00003fd0 00 00 00 00 00 00 00 00 11 7a 0c b1 7a da 10 02 |.........z..z...|
00003fe0 3f 2a 6e 7f 80 8f f4 97 fc ce aa 58 16 9f 90 af |?*n........X....|
00003ff0 8b b4 6d ff 57 ea d1 cb ab 5f 46 0d db 92 c6 6e |..m.W...._F....n|
00004000 01 01 00 00 00 00 00 00 00 00 00 01 00 00 00 24 |...............$|
00004010 00 00 00 20 00 00 00 07 76 65 72 73 69 6f 6e 00 |... ....version.|
00004020 00 00 00 08 00 00 00 01 00 00 00 00 00 00 13 88 |................|
00004030 00 00 00 24 00 00 00 20 00 00 00 05 73 74 61 74 |...$... ....stat|
00004040 65 00 00 00 00 00 00 08 00 00 00 01 00 00 00 00 |e...............|
00004050 00 00 00 03 00 00 00 20 00 00 00 20 00 00 00 04 |....... ... ....|
00004060 67 75 69 64 00 00 00 08 00 00 00 01 16 7a 0c 8b |guid.........z..|
00004070 6b aa 06 4b 00 00 00 00 00 00 00 00 00 00 00 00 |k..K............|
00004080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
0001ffd0 00 00 00 00 00 00 00 00 11 7a 0c b1 7a da 10 02 |.........z..z...|
0001ffe0 5a 71 40 b9 d8 aa df a0 38 cc 29 ef 28 82 6f b9 |Zq@.....8.).(.o.|
0001fff0 28 4d b0 fb 59 f7 3b 4d 2b 0b 0e e4 d9 74 61 7c |(M..Y.;M+....ta||
00020000 0c b1 ba 00 00 00 00 00 88 13 00 00 00 00 00 00 |................|
00020010 00 00 00 00 00 00 00 00 a9 95 9d ee c2 cf b7 af |................|
00020020 97 46 c2 5b 00 00 00 00 01 00 00 00 00 00 00 00 |.F.[............|
00020030 68 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 |h...............|
00020040 68 80 00 00 00 00 00 00 01 00 00 00 00 00 00 00 |h...............|
00020050 41 00 01 00 00 00 00 00 07 00 00 00 0f 07 0b 80 |A...............|
00020060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00020070 00 00 00 00 00 00 00 00 06 00 00 00 00 00 00 00 |................|
00020080 29 00 00 00 00 00 00 00 b1 9e 04 d1 07 00 00 00 |)...............|
00020090 a5 d1 cf 92 35 03 00 00 24 26 85 d4 e6 ab 00 00 |....5...$&......|
000200a0 4f 6e 26 93 21 71 18 00 88 13 00 00 00 00 00 00 |On&.!q..........|
000200b0 11 ea 1c a1 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000200c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
1+0 records in
1+0 records out
262144 bytes (262 kB) copied, 0.00151793 s, 173 MB/s
0003ffd0 00 00 00 00 00 00 00 00 11 7a 0c b1 7a da 10 02 |.........z..z...|
0003ffe0 d0 a4 4d 33 f5 34 84 4d 9d df d1 63 0a 65 ec 3c |..M3.4.M...c.e.<|
0003fff0 39 44 6c 6d 7c ab d4 d6 ed e6 84 b6 e0 6e c0 4e |9Dlm|........n.N|
00040000
[root@linux ~]#
Sorry for the long text; I spent hours debugging the issue.
That makes sense: the label would have been written in the kernel, which bypasses the buffer cache. Passing O_DIRECT ensures we get the blocks from the disk and never a potentially stale cached version.
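As an illustration of that point, here is a minimal stand-alone sketch (not libzfs code; the 4 KiB alignment and the 256 KiB label-sized read are assumptions matching the dd commands above) of reading the device with O_DIRECT so the data comes from the disk rather than the page cache:

#define _GNU_SOURCE		/* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	/* device path is just an example; pass the spare partition to inspect */
	const char *path = argc > 1 ? argv[1] : "/dev/sdd1";
	size_t len = 256 * 1024;	/* size of one ZFS label */
	void *buf;
	int fd;

	/* O_DIRECT requires an aligned buffer and transfer size */
	if (posix_memalign(&buf, 4096, len) != 0)
		return (1);

	if ((fd = open(path, O_RDONLY | O_DIRECT)) < 0) {
		perror("open");
		return (1);
	}

	/* bypasses the page cache: never a stale copy of the label */
	if (pread(fd, buf, len, 0) != (ssize_t)len)
		perror("pread");

	(void) close(fd);
	free(buf);
	return (0);
}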
Force-pushed d738022 to 2cf3d3b.
Thanks for running this down!
for vdev in $SAFE_FILEDEVPOOL1 $SAFE_FILEDEVPOOL2 $FAIL_FILEDEVPOOL1 \
    $FAIL_FILEDEVPOOL2 $SPARE_FILEDEV; do
	truncate -s $SPA_MINDEVSIZE $vdev
Suggestion: log_must truncate -s $SPA_MINDEVSIZE $vdev
verify_runnable "both"

if is_linux; then
	load_scsi_debug $SDSIZE 1 1 1 '512b'
Should it be $SPA_MINDEVSIZE instead of $SDSIZE here?
$SPA_MINDEVSIZE is a number of bytes; scsi_debug wants the size parameter in MB (the default $SDSIZE is 256). I am going to copy "auto_spare_ashift.ksh":
OK, add the comment from there too:
# 2. Add one 512e spare device (4Kn would generate IO errors on replace)
# NOTE: must be larger than the existing 512b devices, add 32m of fudge
ZFS allows, by default, sharing of spare devices among different pools; this commit simply restores this functionality for disk devices and adds an additional test case to the ZFS Test Suite to prevent future regressions.
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Force-pushed 2cf3d3b to 081c049.
ZFS allows, by default, sharing of spare devices among different pools; this commit simply restores this functionality for disk devices and adds an additional test case to the ZFS Test Suite to prevent future regressions.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes openzfs#7999
Motivation and Context
The following issue has been reported on the "zfs-discuss" mailing-list: https://list.zfsonlinux.org/pipermail/zfs-discuss/2018-September/032352.html
Description
ZFS allows, by default, sharing of spare devices among different pools; this commit simply restores this functionality for disk devices and adds an additional test case to the ZFS Test Suite to prevent future regressions.
How Has This Been Tested?
Test added to the ZFS Test Suite. Pushing as "work in progress" because said test fails occasionally on my home builders.
Types of changes
Checklist:
Signed-off-by.