zpool: allow sharing of spare device among pools #7999

Merged · 1 commit · Oct 17, 2018
9 changes: 7 additions & 2 deletions cmd/zpool/zpool_vdev.c
@@ -419,11 +419,16 @@ check_disk(const char *path, blkid_cache cache, int force,
 	char slice_path[MAXPATHLEN];
 	int err = 0;
 	int fd, i;
+	int flags = O_RDONLY|O_DIRECT;
 
 	if (!iswholedisk)
 		return (check_slice(path, cache, force, isspare));
 
-	if ((fd = open(path, O_RDONLY|O_DIRECT|O_EXCL)) < 0) {
+	/* only spares can be shared, other devices require exclusive access */
+	if (!isspare)
+		flags |= O_EXCL;
+
+	if ((fd = open(path, flags)) < 0) {
 		char *value = blkid_get_tag_value(cache, "TYPE", path);
 		(void) fprintf(stderr, gettext("%s is in use and contains "
 		    "a %s filesystem.\n"), path, value ? value : "unknown");
@@ -547,7 +552,7 @@ is_spare(nvlist_t *config, const char *path)
 	uint_t i, nspares;
 	boolean_t inuse;
 
-	if ((fd = open(path, O_RDONLY)) < 0)
+	if ((fd = open(path, O_RDONLY|O_DIRECT)) < 0)
Contributor Author
This seems to fix the spurious failures. I reproduced the issue on my CentOS 7 home builder. Sometimes the spare cannot be added to the second pool ("testpool1"):

[root@linux ~]# zpool status
  pool: testpool
 state: ONLINE
  scan: none requested
config:

	NAME                         STATE     READ WRITE CKSUM
	testpool                     ONLINE       0     0     0
	  mirror-0                   ONLINE       0     0     0
	    /var/tmp/file-safe-dev1  ONLINE       0     0     0
	    /var/tmp/file-fail-dev1  ONLINE       0     0     0
	spares
	  sdd                        AVAIL   

errors: No known data errors

  pool: testpool1
 state: ONLINE
  scan: none requested
config:

	NAME                         STATE     READ WRITE CKSUM
	testpool1                    ONLINE       0     0     0
	  mirror-0                   ONLINE       0     0     0
	    /var/tmp/file-safe-dev2  ONLINE       0     0     0
	    /var/tmp/file-fail-dev2  ONLINE       0     0     0

errors: No known data errors
[root@linux ~]# zpool add testpool1 spare sdd
cannot label 'sdd': cannot label '/dev/sdd': unable to open device: 16
[root@linux ~]# zpool add testpool1 spare sdd
cannot label 'sdd': cannot label '/dev/sdd': unable to open device: 16
[root@linux ~]# zpool add testpool1 spare sdd
cannot label 'sdd': cannot label '/dev/sdd': unable to open device: 16
[root@linux ~]# 

The error comes from zpool_label_disk(): we are (erroneously) trying to label the spare device, which cannot be opened with O_EXCL:

Breakpoint 1, 0x00007ffff5bf4cc0 in write () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff5bf4cc0 in write () from /lib64/libc.so.6
#1  0x00007ffff5b7f0c3 in _IO_new_file_write () from /lib64/libc.so.6
#2  0x00007ffff5b7f960 in __GI__IO_file_xsputn () from /lib64/libc.so.6
#3  0x00007ffff5b52d6d in buffered_vfprintf () from /lib64/libc.so.6
#4  0x00007ffff5b4d6be in vfprintf () from /lib64/libc.so.6
#5  0x00007ffff5b58297 in fprintf () from /lib64/libc.so.6
#6  0x00007ffff777ce32 in zfs_verror (hdl=0x62c080, error=2044, fmt=<optimized out>, ap=<optimized out>) at libzfs_util.c:328
#7  0x00007ffff777d172 in zfs_error_fmt (hdl=hdl@entry=0x62c080, error=error@entry=2044, fmt=fmt@entry=0x7ffff779a97a "%s") at libzfs_util.c:349
#8  0x00007ffff777d191 in zfs_error (hdl=hdl@entry=0x62c080, error=error@entry=2044, msg=msg@entry=0x7fffffff46d0 "cannot label 'sdd'") at libzfs_util.c:338
#9  0x00007ffff7771906 in zpool_label_disk (hdl=0x62c080, zhp=zhp@entry=0x62df30, name=name@entry=0x7fffffff6bc5 "sdd") at libzfs_pool.c:4579
#10 0x0000000000417aca in make_disks (zhp=zhp@entry=0x62df30, nv=0x62e438) at zpool_vdev.c:1302
#11 0x0000000000417c85 in make_disks (zhp=zhp@entry=0x62df30, nv=nv@entry=0x62e2d0) at zpool_vdev.c:1346
#12 0x00000000004196b6 in make_root_vdev (zhp=zhp@entry=0x62df30, props=0x0, force=force@entry=0, check_rep=check_rep@entry=1, replacing=replacing@entry=B_FALSE, dryrun=dryrun@entry=B_FALSE, argc=argc@entry=2, 
    argv=argv@entry=0x62ddf8) at zpool_vdev.c:1901
#13 0x0000000000411fcb in zpool_do_add (argc=2, argv=0x62ddf8) at zpool_main.c:739
#14 0x0000000000405d2f in main (argc=5, argv=0x7fffffffe588) at zpool_main.c:8659
(gdb) frame 9
#9  0x00007ffff7771906 in zpool_label_disk (hdl=0x62c080, zhp=zhp@entry=0x62df30, name=name@entry=0x7fffffff6bc5 "sdd") at libzfs_pool.c:4579
4579			return (zfs_error(hdl, EZFS_OPENFAILED, errbuf));
(gdb) list
4574			 * This shouldn't happen.  We've long since verified that this
4575			 * is a valid device.
4576			 */
4577			zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, "cannot "
4578			    "label '%s': unable to open device: %d"), path, errno);
4579			return (zfs_error(hdl, EZFS_OPENFAILED, errbuf));
4580		}
4581	
4582		if (efi_alloc_and_init(fd, EFI_NUMPAR, &vtoc) != 0) {
4583			/*
(gdb) p errno
$1 = 16
(gdb) 
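For reference, errno 16 is EBUSY: on Linux, opening a block device with O_EXCL requests an exclusive claim on it, which fails with EBUSY while the kernel already holds the device (here, as an active spare of "testpool"). A minimal standalone sketch of the failing open; the device path is just an example:

#define _GNU_SOURCE	/* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "/dev/sdd";
	int fd;

	/*
	 * O_EXCL on a block device asks for an exclusive open: it fails
	 * with EBUSY (16) while the kernel already holds the device,
	 * e.g. as a vdev or hot spare of an imported pool.
	 */
	if ((fd = open(path, O_RDONLY | O_DIRECT | O_EXCL)) < 0) {
		fprintf(stderr, "open(%s): %s (errno=%d)\n",
		    path, strerror(errno), errno);
		return (1);
	}
	(void) close(fd);
	printf("%s is not held exclusively\n", path);
	return (0);
}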

We should not be trying to label this device, since it's already attached as a spare to "testpool". The underlying issue is that we cannot find a valid label on this spare device (zpool_read_label() returns an empty config), which leads us to believe the device is not a valid (and "inuse") spare:

(gdb) bt
#0  zpool_read_label (fd=fd@entry=11, config=config@entry=0x7fffffff5a10, num_labels=num_labels@entry=0x0) at libzfs_import.c:1370
#1  0x00007ffff776641a in zpool_in_use (hdl=0x62c080, fd=fd@entry=11, state=state@entry=0x7fffffff5af4, namestr=namestr@entry=0x7fffffff5ae8, inuse=inuse@entry=0x7fffffff5ab8) at libzfs_import.c:2472
#2  0x00000000004174d6 in is_spare (config=config@entry=0x0, path=path@entry=0x7fffffff5bc0 "/dev/sdd1") at zpool_vdev.c:558
#3  0x00000000004179ec in make_disks (zhp=zhp@entry=0x62df30, nv=0x62e438) at zpool_vdev.c:1288
#4  0x0000000000417c85 in make_disks (zhp=zhp@entry=0x62df30, nv=nv@entry=0x62e2d0) at zpool_vdev.c:1346
#5  0x00000000004196b6 in make_root_vdev (zhp=zhp@entry=0x62df30, props=0x0, force=force@entry=0, check_rep=check_rep@entry=1, replacing=replacing@entry=B_FALSE, dryrun=dryrun@entry=B_FALSE, argc=argc@entry=2, 
    argv=argv@entry=0x62ddf8) at zpool_vdev.c:1901
#6  0x0000000000411fcb in zpool_do_add (argc=2, argv=0x62ddf8) at zpool_main.c:739
#7  0x0000000000405d2f in main (argc=5, argv=0x7fffffffe588) at zpool_main.c:8659
(gdb) fin
Run till exit from #0  zpool_read_label (fd=fd@entry=11, config=config@entry=0x7fffffff5a10, num_labels=num_labels@entry=0x0) at libzfs_import.c:1370
0x00007ffff776641a in zpool_in_use (hdl=0x62c080, fd=fd@entry=11, state=state@entry=0x7fffffff5af4, namestr=namestr@entry=0x7fffffff5ae8, inuse=inuse@entry=0x7fffffff5ab8) at libzfs_import.c:2472
2472		if (zpool_read_label(fd, &config, NULL) != 0) {
Value returned is $1 = 0
(gdb) list
2467		aux_cbdata_t cb = { 0 };
2468		boolean_t isactive;
2469	
2470		*inuse = B_FALSE;
2471	
2472		if (zpool_read_label(fd, &config, NULL) != 0) {
2473			(void) no_memory(hdl);
2474			return (-1);
2475		}
2476	
(gdb) p config
$2 = (nvlist_t *) 0x0
(gdb) n
2477		if (config == NULL)
(gdb) 
2628	}
(gdb) 
is_spare (config=config@entry=0x0, path=path@entry=0x7fffffff5bc0 "/dev/sdd1") at zpool_vdev.c:559
559		    !inuse ||
(gdb) p inuse 
$3 = B_FALSE
(gdb) p state
$4 = POOL_STATE_ACTIVE
(gdb) n
558		if (zpool_in_use(g_zfs, fd, &state, &name, &inuse) != 0 ||
(gdb) 
562			free(name);
(gdb) 
563			(void) close(fd);
(gdb) 
564			return (B_FALSE);

Reading the first label with dd shows the data is all zeroes:

[root@linux ~]# dd if=/dev/sdd1 bs=256K count=1 | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
1+0 records in
1+0 records out
262144 bytes (262 kB) copied, 0.000906013 s, 289 MB/s
00040000
[root@linux ~]# 

Ultimately, using O_DIRECT lets us read the real label:

[root@linux ~]# dd if=/dev/sdd1 bs=256K count=1 iflag=direct | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003fd0  00 00 00 00 00 00 00 00  11 7a 0c b1 7a da 10 02  |.........z..z...|
00003fe0  3f 2a 6e 7f 80 8f f4 97  fc ce aa 58 16 9f 90 af  |?*n........X....|
00003ff0  8b b4 6d ff 57 ea d1 cb  ab 5f 46 0d db 92 c6 6e  |..m.W...._F....n|
00004000  01 01 00 00 00 00 00 00  00 00 00 01 00 00 00 24  |...............$|
00004010  00 00 00 20 00 00 00 07  76 65 72 73 69 6f 6e 00  |... ....version.|
00004020  00 00 00 08 00 00 00 01  00 00 00 00 00 00 13 88  |................|
00004030  00 00 00 24 00 00 00 20  00 00 00 05 73 74 61 74  |...$... ....stat|
00004040  65 00 00 00 00 00 00 08  00 00 00 01 00 00 00 00  |e...............|
00004050  00 00 00 03 00 00 00 20  00 00 00 20 00 00 00 04  |....... ... ....|
00004060  67 75 69 64 00 00 00 08  00 00 00 01 16 7a 0c 8b  |guid.........z..|
00004070  6b aa 06 4b 00 00 00 00  00 00 00 00 00 00 00 00  |k..K............|
00004080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0001ffd0  00 00 00 00 00 00 00 00  11 7a 0c b1 7a da 10 02  |.........z..z...|
0001ffe0  5a 71 40 b9 d8 aa df a0  38 cc 29 ef 28 82 6f b9  |Zq@.....8.).(.o.|
0001fff0  28 4d b0 fb 59 f7 3b 4d  2b 0b 0e e4 d9 74 61 7c  |(M..Y.;M+....ta||
00020000  0c b1 ba 00 00 00 00 00  88 13 00 00 00 00 00 00  |................|
00020010  00 00 00 00 00 00 00 00  a9 95 9d ee c2 cf b7 af  |................|
00020020  97 46 c2 5b 00 00 00 00  01 00 00 00 00 00 00 00  |.F.[............|
00020030  68 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00  |h...............|
00020040  68 80 00 00 00 00 00 00  01 00 00 00 00 00 00 00  |h...............|
00020050  41 00 01 00 00 00 00 00  07 00 00 00 0f 07 0b 80  |A...............|
00020060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00020070  00 00 00 00 00 00 00 00  06 00 00 00 00 00 00 00  |................|
00020080  29 00 00 00 00 00 00 00  b1 9e 04 d1 07 00 00 00  |)...............|
00020090  a5 d1 cf 92 35 03 00 00  24 26 85 d4 e6 ab 00 00  |....5...$&......|
000200a0  4f 6e 26 93 21 71 18 00  88 13 00 00 00 00 00 00  |On&.!q..........|
000200b0  11 ea 1c a1 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000200c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
1+0 records in
1+0 records out
262144 bytes (262 kB) copied, 0.00151793 s, 173 MB/s
0003ffd0  00 00 00 00 00 00 00 00  11 7a 0c b1 7a da 10 02  |.........z..z...|
0003ffe0  d0 a4 4d 33 f5 34 84 4d  9d df d1 63 0a 65 ec 3c  |..M3.4.M...c.e.<|
0003fff0  39 44 6c 6d 7c ab d4 d6  ed e6 84 b6 e0 6e c0 4e  |9Dlm|........n.N|
00040000
[root@linux ~]# 
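As a sanity check, these offsets line up with the on-disk layout of a ZFS vdev label (each device carries four 256 KB labels, two at the start and two at the end). A sketch from memory of vdev_label_t in include/sys/vdev_impl.h; treat the field names as approximations:

/*
 * 0x00000    8 KB   blank space (vl_pad1)
 * 0x02000    8 KB   boot block header (vl_pad2)
 * 0x04000  112 KB   XDR-encoded nvlist config (vl_vdev_phys)
 * 0x20000  128 KB   uberblock ring (vl_uberblock)
 *
 * The nvlist ("version", "state", "guid", ...) indeed starts at
 * offset 0x4000 in the dump above, and 0x20000 opens with the
 * uberblock magic 0x00bab10c stored little-endian ("0c b1 ba 00").
 */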

Sorry for the long text; I spent hours debugging the issue.

Contributor

That makes sense: the label would have been written in the kernel, which bypasses the buffer cache. Passing O_DIRECT ensures we get the blocks from the disk and never a potentially stale cached version.
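For illustration, a minimal standalone sketch of an O_DIRECT label read; this is not the libzfs implementation, and the path and 4 KB buffer alignment are assumptions:

#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define	LABEL_SIZE	(256 * 1024)	/* one vdev label */

int
main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "/dev/sdd1";
	void *buf;
	int fd;

	/* O_DIRECT transfers need an aligned buffer and size. */
	if (posix_memalign(&buf, 4096, LABEL_SIZE) != 0)
		return (1);

	/*
	 * O_DIRECT bypasses the page cache, so the read returns the
	 * label the kernel wrote to disk rather than a stale cached
	 * copy of the pre-label zeroes.
	 */
	if ((fd = open(path, O_RDONLY | O_DIRECT)) < 0) {
		perror("open");
		return (1);
	}
	if (pread(fd, buf, LABEL_SIZE, 0) != LABEL_SIZE)
		perror("pread");

	(void) close(fd);
	free(buf);
	return (0);
}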

 		return (B_FALSE);
 
 	if (zpool_in_use(g_zfs, fd, &state, &name, &inuse) != 0 ||
3 changes: 2 additions & 1 deletion tests/runfiles/linux.run
@@ -541,7 +541,8 @@ tags = ['functional', 'exec']
 [tests/functional/fault]
 tests = ['auto_online_001_pos', 'auto_replace_001_pos', 'auto_spare_001_pos',
     'auto_spare_002_pos', 'auto_spare_ashift', 'auto_spare_multiple',
-    'scrub_after_resilver', 'decrypt_fault', 'decompress_fault']
+    'auto_spare_shared', 'scrub_after_resilver', 'decrypt_fault',
+    'decompress_fault']
 tags = ['functional', 'fault']
 
 [tests/functional/features/async_destroy]
1 change: 1 addition & 0 deletions tests/zfs-tests/include/libtest.shlib
@@ -3190,6 +3190,7 @@ function wait_scrubbed
 {
 	typeset pool=${1:-$TESTPOOL}
 	typeset iter=${2:-10}
+	typeset -i i=0
 	for i in {1..$iter} ; do
 		if is_pool_scrubbed $pool ; then
 			return 0
1 change: 1 addition & 0 deletions tests/zfs-tests/tests/functional/fault/Makefile.am
@@ -8,6 +8,7 @@ dist_pkgdata_SCRIPTS = \
 	auto_spare_002_pos.ksh \
 	auto_spare_ashift.ksh \
 	auto_spare_multiple.ksh \
+	auto_spare_shared.ksh \
 	decrypt_fault.ksh \
 	decompress_fault.ksh \
 	scrub_after_resilver.ksh
119 changes: 119 additions & 0 deletions tests/zfs-tests/tests/functional/fault/auto_spare_shared.ksh
@@ -0,0 +1,119 @@
#!/bin/ksh -p
#
# CDDL HEADER START
#
# This file and its contents are supplied under the terms of the
# Common Development and Distribution License ("CDDL"), version 1.0.
# You may only use this file in accordance with the terms of version
# 1.0 of the CDDL.
#
# A full copy of the text of the CDDL should have accompanied this
# source. A copy of the CDDL is also available via the Internet at
# http://www.illumos.org/license/CDDL.
#
# CDDL HEADER END
#

#
# Copyright 2018, loli10K <ezomori.nozomu@gmail.com>. All rights reserved.
#

. $STF_SUITE/include/libtest.shlib
. $STF_SUITE/include/math.shlib
. $STF_SUITE/tests/functional/fault/fault.cfg

#
# DESCRIPTION:
# Spare devices (both files and disks) can be shared among different ZFS pools.
#
# STRATEGY:
# 1. Create two pools
# 2. Add the same spare device to different pools
# 3. Inject IO errors with a zinject error handler
# 4. Start a scrub
# 5. Verify the ZED kicks in a hot spare and check pool/device status
# 6. Clear the fault
# 7. Verify the hot spare is available and check pool/device status
#

verify_runnable "both"

if is_linux; then
	# Add one 512b spare device (4Kn would generate IO errors on replace)
	# NOTE: must be larger than other "file" vdevs and minimum SPA devsize:
	# add 32m of fudge
	load_scsi_debug $(($SPA_MINDEVSIZE/1024/1024+32)) 1 1 1 '512b'
else
	log_unsupported "scsi debug module unsupported"
fi

function cleanup
{
	log_must zinject -c all
	destroy_pool $TESTPOOL
	destroy_pool $TESTPOOL1
	unload_scsi_debug
	rm -f $SAFE_FILEDEVPOOL1 $SAFE_FILEDEVPOOL2 $FAIL_FILEDEVPOOL1 \
	    $FAIL_FILEDEVPOOL2 $SPARE_FILEDEV
}

log_assert "Spare devices can be shared among different ZFS pools"
log_onexit cleanup

# Clear events from previous runs
zed_events_drain

SAFE_FILEDEVPOOL1="$TEST_BASE_DIR/file-safe-dev1"
FAIL_FILEDEVPOOL1="$TEST_BASE_DIR/file-fail-dev1"
SAFE_FILEDEVPOOL2="$TEST_BASE_DIR/file-safe-dev2"
FAIL_FILEDEVPOOL2="$TEST_BASE_DIR/file-fail-dev2"
SPARE_FILEDEV="$TEST_BASE_DIR/file-spare-dev"
SPARE_DISKDEV="$(get_debug_device)"

for vdev in $SAFE_FILEDEVPOOL1 $SAFE_FILEDEVPOOL2 $FAIL_FILEDEVPOOL1 \
    $FAIL_FILEDEVPOOL2 $SPARE_FILEDEV; do
	log_must truncate -s $SPA_MINDEVSIZE $vdev
done

for spare in $SPARE_FILEDEV $SPARE_DISKDEV; do
	# 1. Create two pools
	log_must zpool create -f $TESTPOOL mirror $SAFE_FILEDEVPOOL1 $FAIL_FILEDEVPOOL1
	log_must zpool create -f $TESTPOOL1 mirror $SAFE_FILEDEVPOOL2 $FAIL_FILEDEVPOOL2

	# 2. Add the same spare device to different pools
	log_must_busy zpool add $TESTPOOL spare $spare
	log_must_busy zpool add $TESTPOOL1 spare $spare
	log_must wait_hotspare_state $TESTPOOL $spare "AVAIL"
	log_must wait_hotspare_state $TESTPOOL1 $spare "AVAIL"

	# 3. Inject IO errors with a zinject error handler
	log_must zinject -d $FAIL_FILEDEVPOOL1 -e io -T all -f 100 $TESTPOOL
	log_must zinject -d $FAIL_FILEDEVPOOL2 -e io -T all -f 100 $TESTPOOL1

	# 4. Start a scrub
	log_must zpool scrub $TESTPOOL
	log_must zpool scrub $TESTPOOL1

	# 5. Verify the ZED kicks in a hot spare and check pool/device status
	log_note "Wait for ZED to auto-spare"
	log_must wait_vdev_state $TESTPOOL $FAIL_FILEDEVPOOL1 "FAULTED" 60
	log_must wait_vdev_state $TESTPOOL $spare "ONLINE" 60
	log_must wait_hotspare_state $TESTPOOL $spare "INUSE"
	log_must check_state $TESTPOOL "" "DEGRADED"

	# 6. Clear the fault
	log_must zinject -c all
	log_must zpool clear $TESTPOOL $FAIL_FILEDEVPOOL1

	# 7. Verify the hot spare is available and check pool/device status
	log_must wait_vdev_state $TESTPOOL $FAIL_FILEDEVPOOL1 "ONLINE" 60
	log_must wait_hotspare_state $TESTPOOL $spare "AVAIL"
	log_must is_pool_resilvered $TESTPOOL
	log_must check_state $TESTPOOL "" "ONLINE"

	# Cleanup
	destroy_pool $TESTPOOL
	destroy_pool $TESTPOOL1
done

log_pass "Spare devices can be shared among different ZFS pools"
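With the runfile and Makefile.am entries above, the new case should also be runnable on its own via the test-suite wrapper, e.g. ./scripts/zfs-tests.sh -t tests/zfs-tests/tests/functional/fault/auto_spare_shared.ksh (the exact -t path format is an assumption worth verifying).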