pool resize disk removal unknown internal error and no UI counterpart. Fixes #1722 #2010

phillxnet · 2019-01-27T17:46:20Z

Fix disk removal timeout failure re "Unknown internal error doing a PUT .../remove" by asynchronously executing 'btrfs dev remove'. The pool_balance model was extended to accommodate for what are arbitrarily named (within Rockstor) 'internal' balances: those automatically initiated upon every 'btrfs dev delete' by the btrfs subsystem itself. A complication of 'internal' balances is their invisibility via 'btrfs balance status'. An inference mechanism was thus constructed to 'fake' the output of a regular balance status so that our existing Web-UI balance surfacing mechanisms could be extended to serve these 'internal' variants similarly. The new state of device 'in removal' and the above mentioned inference mechanism required that we now track and update devid and per device allocation. These were added as disk model fields and surfaced appropriately at the pool details level within the Web-UI.

Akin to regular balances, btrfs dev delete 'internal' balances were found to negatively impact Web-UI interactivity. This was in part alleviated by refactoring the lowest levels of our disk/pool scan mechanisms. In essence this refactoring significantly reduces the number of system and python calls required to attain the same system wide dev / pool info and simplifies low level device name handling. Existing unit tests were employed to aid in this refactoring. Minor additional code was required to account for regressions (predominantly in LUKS device name handling) that were introduced by these low level device name code changes.

Summary:

Execute device removal asynchronously.
Monitor the consequent 'internal' balances by existing mechanisms where possible.
Only remove pool members pool associations once their associated 'internal' balance has finished.
Improve low level efficiency/clarity re device/pool scanning by moving to a single call of the lighter get_dev_pool_info() rather than calling the slower get_pool_info() btrfs disk count times; get_pool_info() is retained for pool import duties as it’s structure is ideally suited to that task. Multiple prior temp_name/by-id conversions are also avoided.
Improve user messaging re system performance / Web-UI responsiveness during a balance operation, regular or 'internal'.
Fix bug re reliance on "None" disk label removing a fragility concerning disk pool association within the Web-UI.
Improve auto pool labeling subsystem by abstracting and generalising ready for pool renaming capability.
Improve pool uuid tracking and add related Web-UI element.
Add Web-UI element in balance status tab to identify regular or 'internal' balance type.
Add devid tracking and related Web-UI element.
Use devid Disk model info to ascertain pool info for detached disks.
Add per device allocation tracking and related Web-UI element.
Fix prior TODO: re btrfs in partition failure point introduced in git tag 3.9.2-32.
Fix prior TODO: re unlabeled pools caveat.
Add pool details disks table ‘Page refresh required’ indicator keyed from devid=0.
Add guidance on common detached disk removal reboot requirement (only affects older kernels).
Remove a low level special case for LUKS dev matching (mapped devices) which affected the performance of all dev name by-id look-ups.
Add TODO re removing legacy formatted disk raid role pre openSUSE move.
Update scan_disks() unit tests for new 'path included' output.
Address TODO in scan_disks() unit tests and normalise on pre-sort method.

Fixes #1722
And by way of a trivial application of the added per device allocation:
Fixes #1918
"Incorrect size calculation while removing disk from disk pool"

@suman Ready for review.
Please note that this pr assumes the prior merge of:
"regression in unit tests - environment outdated since 3.9.2-45. Fixes #1993" #1994 (Fixes unit tests)
"pin python-engineio to 2.3.2 as recent 3.0.0 update breaks gevent. Fixes #1995" #1996 (Fixes basic build fail)
and:
"Implement Add Labels feature for already-installed Rock-Ons. Fixes #1998" #1999 (has a prior storageadmin db migration 0007_auto_20181210_0740.py - I’m trying to keep our migrations path simple)

Testing:

All existing osi and btrfs unit test were confirmed to pass prior to and post pr (given #1994) however as indicated above the scan_disks() unit tests required modification but only to accommodate the new behaviour introduced in scan_disks() where we request from lsblk all device paths. From the osi unit tests point of view this was a cosmetic change in test data: and no functional changes were made bar a trivial robustness improvement by way of an existing TODO.

Many of the system configurations used to originally generate the osi unit test data were also tested in their install instance counterparts (ie bios raid system disk, LUKS, btrfs in partition, etc) and were also used during development to help ensure minimal regression.

A full functional test on real hardware was also conducted over multiple cycles of removing (and re-adding a post 'wipefs -a' disk where appropriate). These tests are details in the comments below and indicate expected behaviour in both legacy CentOS and openSUSE (Tumbleweed in this case) installs.

Caveats:

Our keying from devid = 0 (for 'Page refresh required' UI element) may cause confusion during a disk replace (as yet unimplemented: see issue #1611 ) as it is understood that currently within btrfs one of the two disks involved during a 'btrfs replace start ...' operation is temporarily assigned a devid of 0. The cited issue can address this as and when needed.

…ockstor#1722 Fix disk removal timeout failure re "Unknown internal error doing a PUT .../remove" by asynchronously executing 'btrfs dev remove'. The pool_balance model was extending to accommodate for what are arbitrarily named (within Rockstor) 'internal' balances: those automatically initiated upon every 'btrfs dev delete' by the btrfs subsystem itself. A complication of 'internal' balances is their invisibility via 'btrfs balance status'. An inference mechanism was thus constructed to 'fake' the output of a regular balance status so that our existing Web-UI balance surfacing mechanisms could be extended to serve these 'internal' variants similarly. The new state of device 'in removal' and the above mentioned inference mechanism required that we now track and update devid and per device allocation. These were added as disk model fields and surfaced appropriately at the pool details level within the Web-UI. Akin to regular balances, btrfs dev delete 'internal' balances were found to negatively impact Web-UI interactivity. This was in part alleviated by refactoring the lowest levels of our disk/pool scan mechanisms. In essence this refactoring significantly reduces the number of system and python calls required to attain the same system wide dev / pool info and simplifies low level device name handling. Existing unit tests were employed to aid in this refactoring. Minor additional code was required to account for regressions (predominantly in LUKS device name handling) that were introduced by these low level device name code changes. Summary: - Execute device removal asynchronously. - Monitor the consequent 'internal' balances by existing mechanisms where possible. - Only remove pool members pool associations once their associated 'internal' balance has finished. - Improve low level efficiency/clarity re device/pool scanning by moving to a single call of the lighter get_dev_pool_info() rather than calling the slower get_pool_info() btrfs disk count times; get_pool_info() is retained for pool import duties as it’s structure is ideally suited to that task. Multiple prior temp_name/by-id conversions are also avoided. - Improve user messaging re system performance / Web-UI responsiveness during a balance operation, regular or 'internal'. - Fix bug re reliance on "None" disk label removing a fragility concerning disk pool association within the Web-UI. - Improve auto pool labeling subsystem by abstracting and generalising ready for pool renaming capability. - Improve pool uuid tracking and add related Web-UI element. - Add Web-UI element in balance status tab to identify regular or 'internal' balance type. - Add devid tracking and related Web-UI element. - Use devid Disk model info to ascertain pool info for detached disks. - Add per device allocation tracking and related Web-UI element. - Fix prior TODO: re btrfs in partition failure point introduced in git tag 3.9.2-32. - Fix prior TODO: re unlabeled pools caveat. - Add pool details disks table ‘Page refresh required’ indicator keyed from devid=0. - Add guidance on common detached disk removal reboot requirement. - Remove a low level special case for LUKS dev matching (mapped devices) which affected the performance of all dev name by-id lookups. - Add TODO re removing legacy formatted disk raid role pre openSUSE move. - Update scan_disks() unit tests for new 'path included' output. - Address todo in scan_disks() unit tests and normalise on pre sort method.

phillxnet · 2019-01-27T17:50:59Z

CentOS test
images 1-10
Clean build - import existing 3 x 500GB pool
pool resize -> remove disk
check pool details page and balance record during and after
pool resize -> add (now blank disk) back
check pool details page and balance records (current and last) during and after.

phillxnet · 2019-01-27T17:53:30Z

CentOS install continued:
Power off and physically remove a disk
images 11-16
degraded,rw options and reboot (as advised from Web-UI due to known detached not found quirk - N.B. no required in Tumbleweed)
pool resize -> remove detached disk
Remove rw,degraded (pool ok now).
Remove detached disk from Disks page (it now has no pool association), and power of.
Removed disk is now wiped (wipefs -a another system) so it can be used again to check adding a ‘new’ disk back to the pool.

phillxnet · 2019-01-27T17:55:18Z

CentOS continued:
Attach our ‘new’ blank disk and power up.
images 17-21
add our 'new' disk to our pool.
show “Page refresh required” Btrfs DevID notice in Pool details page
ongoing balances table report
ongoing pool details disks section

phillxnet · 2019-01-27T18:00:18Z

Tumbleweed install 20190115 (4.20.0-1-default kernel) on same machine as above CentOS install.
Import pool
All as for CentOS test above but this time we start with a regular balance.

FroggyFlox

Confirmed to work as intended on Rockstor CentOS as well as OpenSUSE Leap 15.1 (build 420).
I could reproduce all steps listed in the tests in a VM.

phillxnet · 2019-07-09T14:10:31Z

@FroggyFlox A very belated thanks for double checking my post pr observed behaviour here.
Merging as this has been waiting quite some time and makes quite a few changes that we need in place to minimise the number of re-bases that may be required on pending and future pull requests.

FroggyFlox approved these changes Feb 28, 2019

View reviewed changes

This was referenced Mar 1, 2019

Consider supporting partitions and not just whole drives. #509

Closed

intended internal disk update fails #1623

Open

phillxnet merged commit 5d2f9b6 into rockstor:master Jul 9, 2019

phillxnet deleted the 1722_pool_resize_disk_removal_unknown_internal_error_and_no_UI_counterpart branch September 25, 2019 10:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pool resize disk removal unknown internal error and no UI counterpart. Fixes #1722 #2010

pool resize disk removal unknown internal error and no UI counterpart. Fixes #1722 #2010

phillxnet commented Jan 27, 2019

phillxnet commented Jan 27, 2019

phillxnet commented Jan 27, 2019

phillxnet commented Jan 27, 2019 •

edited

Loading

phillxnet commented Jan 27, 2019

FroggyFlox left a comment

phillxnet commented Jul 9, 2019

pool resize disk removal unknown internal error and no UI counterpart. Fixes #1722 #2010

pool resize disk removal unknown internal error and no UI counterpart. Fixes #1722 #2010

Conversation

phillxnet commented Jan 27, 2019

phillxnet commented Jan 27, 2019

phillxnet commented Jan 27, 2019

phillxnet commented Jan 27, 2019 • edited Loading

phillxnet commented Jan 27, 2019

FroggyFlox left a comment

Choose a reason for hiding this comment

phillxnet commented Jul 9, 2019

phillxnet commented Jan 27, 2019 •

edited

Loading