
zpool reports 16E expandsize on disks with oddball number of sectors #8391

Merged
merged 1 commit into openzfs:master on Feb 22, 2019

Conversation

loli10K
Contributor

@loli10K loli10K commented Feb 10, 2019

Motivation and Context

Fix #1468

Description

The issue is caused by a small discrepancy between how userland creates the partition layout and how the kernel estimates the available space:

  • zpool command: subtract 9M from the usable device size, then align to 1M boundary. 9M is the sum of 1M "start" partition alignment + 8M EFI "reserved" partition.

  • kernel module: subtract 10M from the device size. 10M is the sum of 1M "start" partition alignment + 1M "end" partition alignment + 8M EFI "reserved" partition.

For devices where the number of sectors is not a multiple of the alignment size, the zpool command will create a partition layout which reserves less than 1M after the 8M EFI "reserved" partition:

  Disk /dev/sda: 1024 MiB, 1073739776 bytes, 2097148 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 512 bytes
  I/O size (minimum/optimal): 512 bytes / 512 bytes
  Disklabel type: gpt
  Disk identifier: 49811D40-16F4-4E41-84A9-387703950D7F

  Device       Start     End Sectors  Size Type
  /dev/sda1     2048 2078719 2076672 1014M Solaris /usr & Apple ZFS
  /dev/sda9  2078720 2095103   16384    8M Solaris reserved 1

When the kernel module opens the device in vdev_open(), its max_asize ends up being slightly smaller than its asize: this results in a huge number (16E) being reported by metaslab_class_expandable_space().
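
To make the numbers concrete, here is a small standalone sketch (not code from the PR; the real asize/max_asize are derived with a few more steps in vdev_open()) that redoes the arithmetic for the 2097148-sector disk from the fdisk output above and shows where the ~16E figure comes from:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main(void)
{
	const uint64_t sector = 512;
	const uint64_t mib = 1024 * 1024;

	/* 1 GiB disk that is 4 sectors short: 2097148 sectors (see fdisk output above) */
	uint64_t disk = 2097148ULL * sector;			/* 1073739776 bytes */

	/* userland (zpool): subtract 9M, then align down to a 1M boundary */
	uint64_t asize = ((disk - 9 * mib) / mib) * mib;	/* 1063256064 (1014M) */

	/* kernel module: subtract 10M flat, with no further alignment */
	uint64_t max_asize = disk - 10 * mib;			/* 1063254016 (1014M - 2048) */

	/* the unsigned difference wraps around to ~16E */
	printf("asize=%" PRIu64 " max_asize=%" PRIu64 " expand=%" PRIu64 "\n",
	    asize, max_asize, max_asize - asize);
	return (0);
}

The userland layout leaves only 2044 sectors (less than 1M) after the reserved partition, so the kernel's flat 10M subtraction lands 2048 bytes below the partition size and the unsigned subtraction wraps around.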

This change prevents bdev_max_capacity() from returning a size smaller than bdev_capacity().
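
In other words, the whole-disk branch now clamps the estimate (a simplified sketch; the actual one-line hunk, using the same available/psize/bdev_capacity() names, appears in the review below):

	/* never report an expandable size smaller than the current capacity */
	psize = MAX(available, bdev_capacity(bdev));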

How Has This Been Tested?

This change has been tested manually with virtual disks on a Debian builder: unfortunately this issue affects only wholedisk devices whose size is not a multiple of 1M, so we cannot use device mapper/loop/scsi_debug devices to test this in the ZFS Test Suite.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@loli10K loli10K added the "Status: Code Review Needed" (Ready for review and testing) label on Feb 10, 2019
@ahrens ahrens requested a review from grwilson February 10, 2019 16:27
Contributor

@behlendorf behlendorf left a comment


This makes good sense, thank you for the clear explanation and concise fix. As for a ZTS test case we could probably work something up, possibly using a zvol and some debug code, but I agree it's not critical.

@behlendorf behlendorf requested a review from shartse February 12, 2019 20:17
-			psize = available;
-		else
-			psize = bdev_capacity(bdev);
+		psize = MAX(available, bdev_capacity(bdev));
Member


Your analysis looks correct and I think it would be good to document that here.

@shartse
Contributor

shartse commented Feb 12, 2019

Nice fix! Could you update the comment in bdev_max_capacity with the reasoning about why the function should always return at least bdev_capacity?

@behlendorf behlendorf added the "Status: Revision Needed" (Changes are required for the PR to be accepted) label and removed the "Status: Code Review Needed" label on Feb 13, 2019
@loli10K
Contributor Author

loli10K commented Feb 15, 2019

Does it make sense to also ASSERT(asize <= max_asize) in vdev_disk_open() or vdev_open()?

@behlendorf
Contributor

@loli10K that should always be true, but I'd be reluctant to add it as an ASSERT since it is based on values returned to us by the underlying devices. If those devices are buggy/damaged it would be best not to assert, and instead simply consider them as unavailable. How about adding a normal error exit path for this in vdev_open() instead, i.e. if (osize > max_osize) { ... }?

@loli10K
Contributor Author

loli10K commented Feb 16, 2019

With the following changes on a local builder

diff --git a/module/zfs/vdev.c b/module/zfs/vdev.c
index 81c34da07..343f0cbf8 100644
--- a/module/zfs/vdev.c
+++ b/module/zfs/vdev.c
@@ -1642,6 +1642,17 @@ vdev_open(vdev_t *vd)
 
        error = vd->vdev_ops->vdev_op_open(vd, &osize, &max_osize, &ashift);
 
+       /*
+        * Physical volume size should never be larger than its max size, unless
+        * the disk has shrunk while we were reading it or the device is buggy
+        * or damaged.
+        */
+       if (osize > max_osize) {
+               vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
+                   VDEV_AUX_OPEN_FAILED);
+               return (SET_ERROR(EOVERFLOW));
+       }
+
        /*
         * Reset the vdev_reopening flag so that we actually close
         * the vdev on error.

I am simulating a disk reporting osize=10G and max_osize=1G: zpool import doesn't seem to deal with EOVERFLOW and dumps core:

root@linux:~# zpool import -d /dev/ testpool
internal error: Value too large for defined data type
Aborted (core dumped)
root@linux:~# gdb -q zpool /var/crash/zpool.16588 
Reading symbols from zpool...done.
[New LWP 16588]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `zpool import -d /dev/ testpool'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f6758496067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56	../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007f6758496067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f6758497448 in __GI_abort () at abort.c:89
#2  0x00007f67595dd17d in zfs_verror (hdl=0x2523060, error=2083, fmt=0x7f675960e834 "%s", ap=0x7ffe35b8c338) at libzfs_util.c:332
#3  0x00007f67595ddfeb in zpool_standard_error_fmt (hdl=0x2523060, error=75, fmt=0x7f675960e834 "%s") at libzfs_util.c:603
#4  0x00007f67595dd8f1 in zpool_standard_error (hdl=0x2523060, error=75, msg=0x7ffe35b8c570 "cannot import 'testpool'") at libzfs_util.c:487
#5  0x00007f67595c8dac in zpool_import_props (hdl=0x2523060, config=0x2527378, newname=0x0, props=0x0, flags=0) at libzfs_pool.c:2065
#6  0x000000000040c821 in do_import (config=0x2527378, newname=0x0, mntopts=0x0, props=0x0, flags=0) at zpool_main.c:2702
#7  0x000000000040db62 in zpool_do_import (argc=1, argv=0x2524d20) at zpool_main.c:3279
#8  0x000000000041a28e in main (argc=5, argv=0x7ffe35b93d08) at zpool_main.c:9037
(gdb) frame 5
#5  0x00007f67595c8dac in zpool_import_props (hdl=0x2523060, config=0x2527378, newname=0x0, props=0x0, flags=0) at libzfs_pool.c:2065
2065				(void) zpool_standard_error(hdl, error, desc);
(gdb) list
2060				    "new name of at least one dataset is longer than "
2061				    "the maximum allowable length"));
2062				(void) zfs_error(hdl, EZFS_NAMETOOLONG, desc);
2063				break;
2064			default:
2065				(void) zpool_standard_error(hdl, error, desc);
2066				zpool_explain_recover(hdl,
2067				    newname ? origname : thename, -error, nv);
2068				break;
2069			}
(gdb) p error
$1 = 75
(gdb) 

	if (osize > max_osize) {
		vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
		    VDEV_AUX_OPEN_FAILED);
		return (SET_ERROR(EOVERFLOW));
Contributor


Rather than add EOVERFLOW to zpool_standard_error_fmt(), why don't we return ENXIO here instead to maintain compatibility? EOVERFLOW is more descriptive, but for this very unlikely case I think the generic error, "one or more devices is currently unavailable", would be acceptable.
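
For reference, that suggestion amounts to keeping the same exit path with only the errno swapped (a sketch, not the verbatim final hunk):

	if (osize > max_osize) {
		vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
		    VDEV_AUX_OPEN_FAILED);
		return (SET_ERROR(ENXIO));
	}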

The issue is caused by a small discrepancy between how userland creates
the partition layout and how the kernel estimates the available space:

 * zpool command: subtract 9M from the usable device size, then align
   to 1M boundary. 9M is the sum of 1M "start" partition alignment + 8M
   EFI "reserved" partition.

 * kernel module: subtract 10M from the device size. 10M is the sum of
   1M "start" partition alignment + 1M "end" partition alignment + 8M
   EFI "reserved" partition.

For devices where the number of sectors is not a multiple of the
alignment size the zpool command will create a partition layout which
reserves less than 1M after the 8M EFI "reserved" partition:

  Disk /dev/sda: 1024 MiB, 1073739776 bytes, 2097148 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 512 bytes
  I/O size (minimum/optimal): 512 bytes / 512 bytes
  Disklabel type: gpt
  Disk identifier: 49811D40-16F4-4E41-84A9-387703950D7F

  Device       Start     End Sectors  Size Type
  /dev/sda1     2048 2078719 2076672 1014M Solaris /usr & Apple ZFS
  /dev/sda9  2078720 2095103   16384    8M Solaris reserved 1

When the kernel module opens the device in vdev_open(), its max_asize
ends up being slightly smaller than its asize: this results in a huge
number (16E) being reported by metaslab_class_expandable_space().

This change prevents bdev_max_capacity() from returning a size smaller
than bdev_capacity().

Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
@behlendorf behlendorf added the "Status: Accepted" (Ready to integrate, reviewed and tested) label and removed the "Status: Revision Needed" label on Feb 20, 2019
@codecov

codecov bot commented Feb 21, 2019

Codecov Report

Merging #8391 into master will decrease coverage by 0.01%.
The diff coverage is 50%.


@@            Coverage Diff             @@
##           master    #8391      +/-   ##
==========================================
- Coverage   78.55%   78.53%   -0.02%     
==========================================
  Files         380      380              
  Lines      116004   116005       +1     
==========================================
- Hits        91125    91106      -19     
- Misses      24879    24899      +20
Flag      Coverage Δ
#kernel   79.03% <50%> (+0.06%) ⬆️
#user     67.04% <33.33%> (-0.29%) ⬇️


@behlendorf
Contributor

@shartse @grwilson this was updated with your review feedback. Would you mind giving it another look?

Contributor

@shartse shartse left a comment


Looks great - thanks for adding the clear explanations

@behlendorf behlendorf merged commit 0c637f3 into openzfs:master Feb 22, 2019