
zpool reports 16E expandsize on disks with oddball number of sectors #8391

Merged
merged 1 commit into openzfs:master on Feb 22, 2019

Conversation

loli10K
Contributor

@loli10K loli10K commented Feb 10, 2019

Motivation and Context

Fix #1468

Description

The issue is caused by a small discrepancy between how userland creates the partition layout and how the kernel estimates the available space:

  • zpool command: subtract 9M from the usable device size, then align to 1M boundary. 9M is the sum of 1M "start" partition alignment + 8M EFI "reserved" partition.

  • kernel module: subtract 10M from the device size. 10M is the sum of 1M "start" partition alignment + 1M "end" partition alignment + 8M EFI "reserved" partition.

For devices where the number of sectors is not a multiple of the alignment size, the zpool command will create a partition layout which reserves less than 1M after the 8M EFI "reserved" partition:

  Disk /dev/sda: 1024 MiB, 1073739776 bytes, 2097148 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 512 bytes
  I/O size (minimum/optimal): 512 bytes / 512 bytes
  Disklabel type: gpt
  Disk identifier: 49811D40-16F4-4E41-84A9-387703950D7F

  Device       Start     End Sectors  Size Type
  /dev/sda1     2048 2078719 2076672 1014M Solaris /usr & Apple ZFS
  /dev/sda9  2078720 2095103   16384    8M Solaris reserved 1

When the kernel module opens the device in vdev_open(), its max_asize ends up being slightly smaller than its asize: this results in a huge number (16E) being reported by metaslab_class_expandable_space().
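
To make the numbers concrete, here is a small standalone sketch (not code from the PR; the real asize/max_asize are derived with a few more steps in vdev_open()) that redoes the arithmetic for the 2097148-sector disk from the fdisk output above and shows where the ~16E figure comes from:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main(void)
{
	const uint64_t sector = 512;
	const uint64_t mib = 1024 * 1024;

	/* 1 GiB disk that is 4 sectors short: 2097148 sectors (see fdisk output above) */
	uint64_t disk = 2097148ULL * sector;			/* 1073739776 bytes */

	/* userland (zpool): subtract 9M, then align down to a 1M boundary */
	uint64_t asize = ((disk - 9 * mib) / mib) * mib;	/* 1063256064 (1014M) */

	/* kernel module: subtract 10M flat, with no further alignment */
	uint64_t max_asize = disk - 10 * mib;			/* 1063254016 (1014M - 2048) */

	/* the unsigned difference wraps around to ~16E */
	printf("asize=%" PRIu64 " max_asize=%" PRIu64 " expand=%" PRIu64 "\n",
	    asize, max_asize, max_asize - asize);
	return (0);
}

The userland layout leaves only 2044 sectors (less than 1M) after the reserved partition, so the kernel's flat 10M subtraction lands 2048 bytes below the partition size and the unsigned subtraction wraps around.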

This change prevents bdev_max_capacity() from returning a size smaller than bdev_capacity().
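
In other words, the whole-disk branch now clamps the estimate (a simplified sketch; the actual one-line hunk, using the same available/psize/bdev_capacity() names, appears in the review below):

	/* never report an expandable size smaller than the current capacity */
	psize = MAX(available, bdev_capacity(bdev));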

How Has This Been Tested?

This change has been tested manually with virtual disks on a Debian builder: unfortunately this issue affects only wholedisk devices whose size is not a multiple of 1M, so we cannot use device mapper/loop/scsi_debug devices to test this in the ZFS Test Suite.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@loli10K loli10K added the "Status: Code Review Needed" (Ready for review and testing) label on Feb 10, 2019
@ahrens ahrens requested a review from grwilson February 10, 2019 16:27
Contributor

@behlendorf behlendorf left a comment


This makes good sense, thank you for the clear explanation and concise fix. As for a ZTS test case we could probably work something up, possibly using a zvol and some debug code, but I agree it's not critical.

@behlendorf behlendorf requested a review from shartse February 12, 2019 20:17
-			psize = available;
-		else
-			psize = bdev_capacity(bdev);
+		psize = MAX(available, bdev_capacity(bdev));
Member


Your analysis looks correct and I think it would be good to document that here.

@shartse
Contributor

shartse commented Feb 12, 2019

Nice fix! Could you update the comment in bdev_max_capacity with the reasoning about why the function should always return at least bdev_capacity?

@behlendorf behlendorf added the "Status: Revision Needed" (Changes are required for the PR to be accepted) label and removed the "Status: Code Review Needed" label on Feb 13, 2019
@loli10K
Contributor Author

loli10K commented Feb 15, 2019

Does it make sense to also ASSERT(asize <= max_asize) in vdev_disk_open() or vdev_open()?

@behlendorf
Contributor

@loli10K that should always be true, but I'd be reluctant to add it as an ASSERT since it is based on values returned to us by the underlying devices. If those devices are buggy/damaged it would be best not to assert, and instead simply consider them as unavailable. How about adding a normal error exit path for this in vdev_open() instead, i.e. if (osize > max_osize) { ... }?

@loli10K
Contributor Author

loli10K commented Feb 16, 2019

With the following changes on a local builder

diff --git a/module/zfs/vdev.c b/module/zfs/vdev.c
index 81c34da07..343f0cbf8 100644
--- a/module/zfs/vdev.c
+++ b/module/zfs/vdev.c
@@ -1642,6 +1642,17 @@ vdev_open(vdev_t *vd)
 
        error = vd->vdev_ops->vdev_op_open(vd, &osize, &max_osize, &ashift);
 
+       /*
+        * Physical volume size should never be larger than its max size, unless
+        * the disk has shrunk while we were reading it or the device is buggy
+        * or damaged.
+        */
+       if (osize > max_osize) {
+               vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
+                   VDEV_AUX_OPEN_FAILED);
+               return (SET_ERROR(EOVERFLOW));
+       }
+
        /*
         * Reset the vdev_reopening flag so that we actually close
         * the vdev on error.

I am simulating a disk reporting osize=10G and max_osize=1G: zpool import doesn't seem to deal with EOVERFLOW and dumps core:

root@linux:~# zpool import -d /dev/ testpool
internal error: Value too large for defined data type
Aborted (core dumped)
root@linux:~# gdb -q zpool /var/crash/zpool.16588 
Reading symbols from zpool...done.
[New LWP 16588]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `zpool import -d /dev/ testpool'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f6758496067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56	../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007f6758496067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f6758497448 in __GI_abort () at abort.c:89
#2  0x00007f67595dd17d in zfs_verror (hdl=0x2523060, error=2083, fmt=0x7f675960e834 "%s", ap=0x7ffe35b8c338) at libzfs_util.c:332
#3  0x00007f67595ddfeb in zpool_standard_error_fmt (hdl=0x2523060, error=75, fmt=0x7f675960e834 "%s") at libzfs_util.c:603
#4  0x00007f67595dd8f1 in zpool_standard_error (hdl=0x2523060, error=75, msg=0x7ffe35b8c570 "cannot import 'testpool'") at libzfs_util.c:487
#5  0x00007f67595c8dac in zpool_import_props (hdl=0x2523060, config=0x2527378, newname=0x0, props=0x0, flags=0) at libzfs_pool.c:2065
#6  0x000000000040c821 in do_import (config=0x2527378, newname=0x0, mntopts=0x0, props=0x0, flags=0) at zpool_main.c:2702
#7  0x000000000040db62 in zpool_do_import (argc=1, argv=0x2524d20) at zpool_main.c:3279
#8  0x000000000041a28e in main (argc=5, argv=0x7ffe35b93d08) at zpool_main.c:9037
(gdb) frame 5
#5  0x00007f67595c8dac in zpool_import_props (hdl=0x2523060, config=0x2527378, newname=0x0, props=0x0, flags=0) at libzfs_pool.c:2065
2065				(void) zpool_standard_error(hdl, error, desc);
(gdb) list
2060				    "new name of at least one dataset is longer than "
2061				    "the maximum allowable length"));
2062				(void) zfs_error(hdl, EZFS_NAMETOOLONG, desc);
2063				break;
2064			default:
2065				(void) zpool_standard_error(hdl, error, desc);
2066				zpool_explain_recover(hdl,
2067				    newname ? origname : thename, -error, nv);
2068				break;
2069			}
(gdb) p error
$1 = 75
(gdb) 

	if (osize > max_osize) {
		vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
		    VDEV_AUX_OPEN_FAILED);
		return (SET_ERROR(EOVERFLOW));
Contributor


Rather than add EOVERFLOW to zpool_standard_error_fmt(), why don't we return ENXIO here instead to maintain compatibility? EOVERFLOW is more descriptive, but for this very unlikely case I think the generic error, "one or more devices is currently unavailable", would be acceptable.
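
For reference, that suggestion amounts to keeping the same exit path with only the errno swapped (a sketch, not the verbatim final hunk):

	if (osize > max_osize) {
		vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
		    VDEV_AUX_OPEN_FAILED);
		return (SET_ERROR(ENXIO));
	}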

The issue is caused by a small discrepancy between how userland creates
the partition layout and how the kernel estimates the available space:

 * zpool command: subtract 9M from the usable device size, then align
   to 1M boundary. 9M is the sum of 1M "start" partition alignment + 8M
   EFI "reserved" partition.

 * kernel module: subtract 10M from the device size. 10M is the sum of
   1M "start" partition alignment + 1M "end" partition alignment + 8M
   EFI "reserved" partition.

For devices where the number of sectors is not a multiple of the
alignment size the zpool command will create a partition layout which
reserves less than 1M after the 8M EFI "reserved" partition:

  Disk /dev/sda: 1024 MiB, 1073739776 bytes, 2097148 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 512 bytes
  I/O size (minimum/optimal): 512 bytes / 512 bytes
  Disklabel type: gpt
  Disk identifier: 49811D40-16F4-4E41-84A9-387703950D7F

  Device       Start     End Sectors  Size Type
  /dev/sda1     2048 2078719 2076672 1014M Solaris /usr & Apple ZFS
  /dev/sda9  2078720 2095103   16384    8M Solaris reserved 1

When the kernel module opens the device in vdev_open(), its max_asize
ends up being slightly smaller than its asize: this results in a huge
number (16E) being reported by metaslab_class_expandable_space().

This change prevents bdev_max_capacity() from returning a size smaller
than bdev_capacity().

Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
@behlendorf behlendorf added the "Status: Accepted" (Ready to integrate, reviewed and tested) label and removed the "Status: Revision Needed" label on Feb 20, 2019
@codecov

codecov bot commented Feb 21, 2019

Codecov Report

Merging #8391 into master will decrease coverage by 0.01%.
The diff coverage is 50%.


@@            Coverage Diff             @@
##           master    #8391      +/-   ##
==========================================
- Coverage   78.55%   78.53%   -0.02%     
==========================================
  Files         380      380              
  Lines      116004   116005       +1     
==========================================
- Hits        91125    91106      -19     
- Misses      24879    24899      +20
Flag      Coverage Δ
#kernel   79.03% <50%> (+0.06%) ⬆️
#user     67.04% <33.33%> (-0.29%) ⬇️


@behlendorf
Contributor

@shartse @grwilson this was updated with your review feedback. Would you mind giving it another look?

Contributor

@shartse shartse left a comment


Looks great - thanks for adding the clear explanations

@behlendorf behlendorf merged commit 0c637f3 into openzfs:master Feb 22, 2019