zpool: allow sharing of spare device among pools #7999

Merged: behlendorf merged 1 commit into openzfs:master from loli10K:one_spare_many_pools on Oct 17, 2018

Conversation

loli10K (Contributor) commented Oct 8, 2018

Motivation and Context

The following issue has been reported on the "zfs-discuss" mailing-list: https://list.zfsonlinux.org/pipermail/zfs-discuss/2018-September/032352.html

Hi,
I'm running ZFS 0.7.10-1 under Redhat 6.10 and am unable to add a hot spare to multiple pools. It previously worked under zfs-0.6.5.8-1, but stopped working under zfs 0.7.2 and hasn't worked since then.

I can add a spare to any pool and it works, but when I try to add it to a second pool it fails with an error about it already being in use.

#zpool add zpool7 spare scsi-2001b4d20d7859d4b
/dev/disk/by-id/scsi-2001b4d20d7859d4b is in use and contains a unknown
filesystem.

Description

ZFS allows, by default, sharing of spare devices among different pools; this commit simply restores this functionality for disk devices and adds an additional test case to the ZFS Test Suite to prevent future regression.

How Has This Been Tested?

Test added to the ZFS Test Suite. Pushing as "work in progress" because said test fails occasionally on my home builders.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@loli10K loli10K added the Status: Work in Progress Not yet ready for general review label Oct 8, 2018
@@ -423,7 +423,7 @@ check_disk(const char *path, blkid_cache cache, int force,
 	if (!iswholedisk)
 		return (check_slice(path, cache, force, isspare));
 
-	if ((fd = open(path, O_RDONLY|O_DIRECT|O_EXCL)) < 0) {
+	if ((fd = open(path, O_RDONLY|O_DIRECT)) < 0) {
Contributor commented:

From 8c54ddd.

    * zpool_add_005_pos.ksh - ...  Add O_EXCL to the in-use check to prevent the -f (force) option from
      working for mounted filesystems and improve the resulting error.

We'll need to find some other way to handle this if O_EXCL is removed.
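
For reference, a minimal standalone sketch of what the O_EXCL open provides (illustrative only, not the libzfs code; the helper name is made up): on Linux, open(2) with O_EXCL on a block device requests an exclusive claim and fails with EBUSY while the kernel already holds the device, e.g. because a filesystem is mounted on it or it belongs to an imported pool. That EBUSY is the "in use" signal the check_disk() hunk above was relying on.

/*
 * Illustrative sketch only (not the libzfs implementation).
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the whole-disk device looks "in use". */
static int
device_in_use(const char *path)
{
	int fd = open(path, O_RDONLY | O_EXCL);

	if (fd < 0)
		return (errno == EBUSY);	/* kernel holds it: mounted fs, pool member, ... */
	(void) close(fd);
	return (0);
}

int
main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s /dev/<disk>\n", argv[0]);
		return (1);
	}
	printf("%s is %s\n", argv[1],
	    device_in_use(argv[1]) ? "in use" : "not in use");
	return (0);
}

The flip side, and what this PR is restoring, is that a disk already attached as a spare to another active pool is also held by the kernel, so with O_EXCL in the check the shared-spare case can apparently never get past it.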

Contributor Author (loli10K) commented:

It seems O_EXCL is only buying us a slightly better error message here, with and without -f:

System status (pools and mounted fs on /dev/sdc1):

root@linux:/usr/src/zfs# zpool status 
  pool: p1
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	p1          ONLINE       0     0     0
	  sda       ONLINE       0     0     0

errors: No known data errors

  pool: p2
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	p2          ONLINE       0     0     0
	  sdb       ONLINE       0     0     0

errors: No known data errors
root@linux:/usr/src/zfs# df -h /dev/sdc1
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc1       120M  1.6M  110M   2% /mnt
root@linux:/usr/src/zfs# 

Previous behaviour:

root@linux:/usr/src/zfs# zpool add p1 spare sdc
/dev/sdc is in use and contains a unknown filesystem.
root@linux:/usr/src/zfs# zpool add p1 spare sdc1
invalid vdev specification
use '-f' to override the following errors:
/dev/sdc1 contains a filesystem of type 'ext4'
root@linux:/usr/src/zfs# 
root@linux:/usr/src/zfs# 
root@linux:/usr/src/zfs# zpool add -f p1 spare sdc
/dev/sdc is in use and contains a unknown filesystem.
root@linux:/usr/src/zfs# zpool add -f p1 spare sdc1
cannot open '/dev/sdc1': Device or resource busy
cannot add to 'p1': one or more vdevs refer to the same device
root@linux:/usr/src/zfs# 

Patched behaviour:

root@linux:/usr/src/zfs# zpool add p1 spare sdc
cannot label 'sdc': cannot label '/dev/sdc': unable to open device: 16
root@linux:/usr/src/zfs# zpool add p1 spare sdc1
invalid vdev specification
use '-f' to override the following errors:
/dev/sdc1 contains a filesystem of type 'ext4'
root@linux:/usr/src/zfs# 
root@linux:/usr/src/zfs# 
root@linux:/usr/src/zfs# zpool add -f p1 spare sdc
cannot label 'sdc': cannot label '/dev/sdc': unable to open device: 16
root@linux:/usr/src/zfs# zpool add -f p1 spare sdc1
cannot open '/dev/sdc1': Device or resource busy
cannot add to 'p1': one or more vdevs refer to the same device
root@linux:/usr/src/zfs#

I am going to try to keep the error "is in use and contains a unknown filesystem", though if I had to choose I would prefer a misleading error message (plus an optional "documentation" issue requesting to improve it) over this missing functionality (shared spares).

Also it seems this could use a new test in the "inuse" group.

Contributor commented:

Agree, it'd be nice to have both, but we can live with a slightly worse error message to restore the missing functionality.

@loli10K loli10K force-pushed the one_spare_many_pools branch 2 times, most recently from 0f35f38 to d738022 Compare October 13, 2018 16:41
@@ -547,7 +552,7 @@ is_spare(nvlist_t *config, const char *path)
 	uint_t i, nspares;
 	boolean_t inuse;
 
-	if ((fd = open(path, O_RDONLY)) < 0)
+	if ((fd = open(path, O_RDONLY|O_DIRECT)) < 0)
Contributor Author (loli10K) commented:

This seems to fix the spurious failures. I reproduced the issue on my CentOS 7 home builder. Sometimes the spare cannot be added to the second pool ("testpool1"):

[root@linux ~]# zpool status
  pool: testpool
 state: ONLINE
  scan: none requested
config:

	NAME                         STATE     READ WRITE CKSUM
	testpool                     ONLINE       0     0     0
	  mirror-0                   ONLINE       0     0     0
	    /var/tmp/file-safe-dev1  ONLINE       0     0     0
	    /var/tmp/file-fail-dev1  ONLINE       0     0     0
	spares
	  sdd                        AVAIL   

errors: No known data errors

  pool: testpool1
 state: ONLINE
  scan: none requested
config:

	NAME                         STATE     READ WRITE CKSUM
	testpool1                    ONLINE       0     0     0
	  mirror-0                   ONLINE       0     0     0
	    /var/tmp/file-safe-dev2  ONLINE       0     0     0
	    /var/tmp/file-fail-dev2  ONLINE       0     0     0

errors: No known data errors
[root@linux ~]# zpool add testpool1 spare sdd
cannot label 'sdd': cannot label '/dev/sdd': unable to open device: 16
[root@linux ~]# zpool add testpool1 spare sdd
cannot label 'sdd': cannot label '/dev/sdd': unable to open device: 16
[root@linux ~]# zpool add testpool1 spare sdd
cannot label 'sdd': cannot label '/dev/sdd': unable to open device: 16
[root@linux ~]# 

The error is coming from zpool_label_disk(). We are (erroneously) trying to label the spare device, which cannot be opened with O_EXCL:

Breakpoint 1, 0x00007ffff5bf4cc0 in write () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff5bf4cc0 in write () from /lib64/libc.so.6
#1  0x00007ffff5b7f0c3 in _IO_new_file_write () from /lib64/libc.so.6
#2  0x00007ffff5b7f960 in __GI__IO_file_xsputn () from /lib64/libc.so.6
#3  0x00007ffff5b52d6d in buffered_vfprintf () from /lib64/libc.so.6
#4  0x00007ffff5b4d6be in vfprintf () from /lib64/libc.so.6
#5  0x00007ffff5b58297 in fprintf () from /lib64/libc.so.6
#6  0x00007ffff777ce32 in zfs_verror (hdl=0x62c080, error=2044, fmt=<optimized out>, ap=<optimized out>) at libzfs_util.c:328
#7  0x00007ffff777d172 in zfs_error_fmt (hdl=hdl@entry=0x62c080, error=error@entry=2044, fmt=fmt@entry=0x7ffff779a97a "%s") at libzfs_util.c:349
#8  0x00007ffff777d191 in zfs_error (hdl=hdl@entry=0x62c080, error=error@entry=2044, msg=msg@entry=0x7fffffff46d0 "cannot label 'sdd'") at libzfs_util.c:338
#9  0x00007ffff7771906 in zpool_label_disk (hdl=0x62c080, zhp=zhp@entry=0x62df30, name=name@entry=0x7fffffff6bc5 "sdd") at libzfs_pool.c:4579
#10 0x0000000000417aca in make_disks (zhp=zhp@entry=0x62df30, nv=0x62e438) at zpool_vdev.c:1302
#11 0x0000000000417c85 in make_disks (zhp=zhp@entry=0x62df30, nv=nv@entry=0x62e2d0) at zpool_vdev.c:1346
#12 0x00000000004196b6 in make_root_vdev (zhp=zhp@entry=0x62df30, props=0x0, force=force@entry=0, check_rep=check_rep@entry=1, replacing=replacing@entry=B_FALSE, dryrun=dryrun@entry=B_FALSE, argc=argc@entry=2, 
    argv=argv@entry=0x62ddf8) at zpool_vdev.c:1901
#13 0x0000000000411fcb in zpool_do_add (argc=2, argv=0x62ddf8) at zpool_main.c:739
#14 0x0000000000405d2f in main (argc=5, argv=0x7fffffffe588) at zpool_main.c:8659
(gdb) frame 9
#9  0x00007ffff7771906 in zpool_label_disk (hdl=0x62c080, zhp=zhp@entry=0x62df30, name=name@entry=0x7fffffff6bc5 "sdd") at libzfs_pool.c:4579
4579			return (zfs_error(hdl, EZFS_OPENFAILED, errbuf));
(gdb) list
4574			 * This shouldn't happen.  We've long since verified that this
4575			 * is a valid device.
4576			 */
4577			zfs_error_aux(hdl, dgettext(TEXT_DOMAIN, "cannot "
4578			    "label '%s': unable to open device: %d"), path, errno);
4579			return (zfs_error(hdl, EZFS_OPENFAILED, errbuf));
4580		}
4581	
4582		if (efi_alloc_and_init(fd, EFI_NUMPAR, &vtoc) != 0) {
4583			/*
(gdb) p errno
$1 = 16
(gdb) 

We should not be trying to label this device since it's already attached as a spare to "testpool". The underlying issue is that we cannot find a valid label on this spare device (zpool_read_label() returns an empty config), which leads us to believe the device is not a valid (and "inuse") spare:

(gdb) bt
#0  zpool_read_label (fd=fd@entry=11, config=config@entry=0x7fffffff5a10, num_labels=num_labels@entry=0x0) at libzfs_import.c:1370
#1  0x00007ffff776641a in zpool_in_use (hdl=0x62c080, fd=fd@entry=11, state=state@entry=0x7fffffff5af4, namestr=namestr@entry=0x7fffffff5ae8, inuse=inuse@entry=0x7fffffff5ab8) at libzfs_import.c:2472
#2  0x00000000004174d6 in is_spare (config=config@entry=0x0, path=path@entry=0x7fffffff5bc0 "/dev/sdd1") at zpool_vdev.c:558
#3  0x00000000004179ec in make_disks (zhp=zhp@entry=0x62df30, nv=0x62e438) at zpool_vdev.c:1288
#4  0x0000000000417c85 in make_disks (zhp=zhp@entry=0x62df30, nv=nv@entry=0x62e2d0) at zpool_vdev.c:1346
#5  0x00000000004196b6 in make_root_vdev (zhp=zhp@entry=0x62df30, props=0x0, force=force@entry=0, check_rep=check_rep@entry=1, replacing=replacing@entry=B_FALSE, dryrun=dryrun@entry=B_FALSE, argc=argc@entry=2, 
    argv=argv@entry=0x62ddf8) at zpool_vdev.c:1901
#6  0x0000000000411fcb in zpool_do_add (argc=2, argv=0x62ddf8) at zpool_main.c:739
#7  0x0000000000405d2f in main (argc=5, argv=0x7fffffffe588) at zpool_main.c:8659
(gdb) fin
Run till exit from #0  zpool_read_label (fd=fd@entry=11, config=config@entry=0x7fffffff5a10, num_labels=num_labels@entry=0x0) at libzfs_import.c:1370
0x00007ffff776641a in zpool_in_use (hdl=0x62c080, fd=fd@entry=11, state=state@entry=0x7fffffff5af4, namestr=namestr@entry=0x7fffffff5ae8, inuse=inuse@entry=0x7fffffff5ab8) at libzfs_import.c:2472
2472		if (zpool_read_label(fd, &config, NULL) != 0) {
Value returned is $1 = 0
(gdb) list
2467		aux_cbdata_t cb = { 0 };
2468		boolean_t isactive;
2469	
2470		*inuse = B_FALSE;
2471	
2472		if (zpool_read_label(fd, &config, NULL) != 0) {
2473			(void) no_memory(hdl);
2474			return (-1);
2475		}
2476	
(gdb) p config
$2 = (nvlist_t *) 0x0
(gdb) n
2477		if (config == NULL)
(gdb) 
2628	}
(gdb) 
is_spare (config=config@entry=0x0, path=path@entry=0x7fffffff5bc0 "/dev/sdd1") at zpool_vdev.c:559
559		    !inuse ||
(gdb) p inuse 
$3 = B_FALSE
(gdb) p state
$4 = POOL_STATE_ACTIVE
(gdb) n
558		if (zpool_in_use(g_zfs, fd, &state, &name, &inuse) != 0 ||
(gdb) 
562			free(name);
(gdb) 
563			(void) close(fd);
(gdb) 
564			return (B_FALSE);

Trying to read the first label with dd shows the data as all zeroes:

[root@linux ~]# dd if=/dev/sdd1 bs=256K count=1 | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
1+0 records in
1+0 records out
262144 bytes (262 kB) copied, 0.000906013 s, 289 MB/s
00040000
[root@linux ~]# 

Ultimately, using O_DIRECT lets us read the real label:

[root@linux ~]# dd if=/dev/sdd1 bs=256K count=1 iflag=direct | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003fd0  00 00 00 00 00 00 00 00  11 7a 0c b1 7a da 10 02  |.........z..z...|
00003fe0  3f 2a 6e 7f 80 8f f4 97  fc ce aa 58 16 9f 90 af  |?*n........X....|
00003ff0  8b b4 6d ff 57 ea d1 cb  ab 5f 46 0d db 92 c6 6e  |..m.W...._F....n|
00004000  01 01 00 00 00 00 00 00  00 00 00 01 00 00 00 24  |...............$|
00004010  00 00 00 20 00 00 00 07  76 65 72 73 69 6f 6e 00  |... ....version.|
00004020  00 00 00 08 00 00 00 01  00 00 00 00 00 00 13 88  |................|
00004030  00 00 00 24 00 00 00 20  00 00 00 05 73 74 61 74  |...$... ....stat|
00004040  65 00 00 00 00 00 00 08  00 00 00 01 00 00 00 00  |e...............|
00004050  00 00 00 03 00 00 00 20  00 00 00 20 00 00 00 04  |....... ... ....|
00004060  67 75 69 64 00 00 00 08  00 00 00 01 16 7a 0c 8b  |guid.........z..|
00004070  6b aa 06 4b 00 00 00 00  00 00 00 00 00 00 00 00  |k..K............|
00004080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0001ffd0  00 00 00 00 00 00 00 00  11 7a 0c b1 7a da 10 02  |.........z..z...|
0001ffe0  5a 71 40 b9 d8 aa df a0  38 cc 29 ef 28 82 6f b9  |Zq@.....8.).(.o.|
0001fff0  28 4d b0 fb 59 f7 3b 4d  2b 0b 0e e4 d9 74 61 7c  |(M..Y.;M+....ta||
00020000  0c b1 ba 00 00 00 00 00  88 13 00 00 00 00 00 00  |................|
00020010  00 00 00 00 00 00 00 00  a9 95 9d ee c2 cf b7 af  |................|
00020020  97 46 c2 5b 00 00 00 00  01 00 00 00 00 00 00 00  |.F.[............|
00020030  68 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00  |h...............|
00020040  68 80 00 00 00 00 00 00  01 00 00 00 00 00 00 00  |h...............|
00020050  41 00 01 00 00 00 00 00  07 00 00 00 0f 07 0b 80  |A...............|
00020060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00020070  00 00 00 00 00 00 00 00  06 00 00 00 00 00 00 00  |................|
00020080  29 00 00 00 00 00 00 00  b1 9e 04 d1 07 00 00 00  |)...............|
00020090  a5 d1 cf 92 35 03 00 00  24 26 85 d4 e6 ab 00 00  |....5...$&......|
000200a0  4f 6e 26 93 21 71 18 00  88 13 00 00 00 00 00 00  |On&.!q..........|
000200b0  11 ea 1c a1 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000200c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
1+0 records in
1+0 records out
262144 bytes (262 kB) copied, 0.00151793 s, 173 MB/s
0003ffd0  00 00 00 00 00 00 00 00  11 7a 0c b1 7a da 10 02  |.........z..z...|
0003ffe0  d0 a4 4d 33 f5 34 84 4d  9d df d1 63 0a 65 ec 3c  |..M3.4.M...c.e.<|
0003fff0  39 44 6c 6d 7c ab d4 d6  ed e6 84 b6 e0 6e c0 4e  |9Dlm|........n.N|
00040000
[root@linux ~]# 

Sorry for the long text; I spent hours debugging the issue.

Contributor commented:

That makes sense: the label would have been written from within the kernel, which bypasses the buffer cache. Passing O_DIRECT ensures we get the blocks from the disk and never a potentially stale cached version.
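
For illustration, here is a minimal sketch of a direct read (not the libzfs code; the 4096-byte alignment and 256 KiB read size are assumptions mirroring the dd example above). O_DIRECT requires the buffer, offset, and length to be suitably aligned, hence the posix_memalign():

/*
 * Minimal sketch, not the libzfs implementation.  Reading with O_DIRECT
 * bypasses the page cache, so the data comes from the device itself and
 * never from a stale cached copy.
 */
#define _GNU_SOURCE			/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define READ_SIZE	(256 * 1024)	/* one label-sized chunk, as in the dd example */
#define ALIGNMENT	4096		/* assumed logical block size */

int
main(int argc, char **argv)
{
	void *buf = NULL;
	ssize_t n;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s /dev/<disk>\n", argv[0]);
		return (1);
	}
	if (posix_memalign(&buf, ALIGNMENT, READ_SIZE) != 0) {
		perror("posix_memalign");
		return (1);
	}
	if ((fd = open(argv[1], O_RDONLY | O_DIRECT)) < 0) {
		perror("open");
		return (1);
	}
	if ((n = read(fd, buf, READ_SIZE)) < 0)	/* served from the disk, not the cache */
		perror("read");
	else
		printf("read %zd bytes directly from %s\n", n, argv[1]);
	free(buf);
	(void) close(fd);
	return (0);
}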

@loli10K loli10K force-pushed the one_spare_many_pools branch from d738022 to 2cf3d3b Compare October 13, 2018 20:12
@behlendorf (Contributor) left a comment:

Thanks for running this down!

@behlendorf behlendorf added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Oct 14, 2018
@behlendorf behlendorf requested a review from tonyhutter October 15, 2018 22:41

for vdev in $SAFE_FILEDEVPOOL1 $SAFE_FILEDEVPOOL2 $FAIL_FILEDEVPOOL1 \
    $FAIL_FILEDEVPOOL2 $SPARE_FILEDEV; do
	truncate -s $SPA_MINDEVSIZE $vdev
Contributor commented:

Use log_must here:

log_must truncate -s $SPA_MINDEVSIZE $vdev

verify_runnable "both"

if is_linux; then
	load_scsi_debug $SDSIZE 1 1 1 '512b'
Contributor commented:

Should it be $SPA_MINDEVSIZE instead of $SDSIZE here?

Contributor Author (loli10K) commented:

$SPA_MINDEVSIZE is a number in bytes, while scsi_debug wants its size parameter in MB (the default $SDSIZE is 256). I am going to copy "auto_spare_ashift.ksh":

https://github.com/zfsonlinux/zfs/blob/b8a90418f3a9c23b89c5d2c729a4dd0fea644508/tests/zfs-tests/tests/functional/fault/auto_spare_ashift.ksh#L74

Contributor commented:

Ok, add the comment from there in too:

# 2. Add one 512e spare device (4Kn would generate IO errors on replace)
# NOTE: must be larger than the existing 512b devices, add 32m of fudge

ZFS allows, by default, sharing of spare devices among different pools;
this commit simply restores this functionality for disk devices and
adds an additional test case to the ZFS Test Suite to prevent future
regression.

Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
@loli10K loli10K force-pushed the one_spare_many_pools branch from 2cf3d3b to 081c049 Compare October 16, 2018 19:28
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Oct 17, 2018
@behlendorf behlendorf merged commit 2e55034 into openzfs:master Oct 17, 2018
BrainSlayer pushed a commit to BrainSlayer/zfs that referenced this pull request Oct 18, 2018

ZFS allows, by default, sharing of spare devices among different pools;
this commit simply restores this functionality for disk devices and
adds an additional test case to the ZFS Test Suite to prevent future
regression.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes openzfs#7999

ghfields pushed a commit to ghfields/zfs that referenced this pull request Oct 29, 2018
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019