Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pool resize disk removal unknown internal error and no UI counterpart. Fixes #1722 #2010

Conversation

phillxnet
Copy link
Member

Fix disk removal timeout failure re "Unknown internal error doing a PUT .../remove" by asynchronously executing 'btrfs dev remove'. The pool_balance model was extended to accommodate for what are arbitrarily named (within Rockstor) 'internal' balances: those automatically initiated upon every 'btrfs dev delete' by the btrfs subsystem itself. A complication of 'internal' balances is their invisibility via 'btrfs balance status'. An inference mechanism was thus constructed to 'fake' the output of a regular balance status so that our existing Web-UI balance surfacing mechanisms could be extended to serve these 'internal' variants similarly. The new state of device 'in removal' and the above mentioned inference mechanism required that we now track and update devid and per device allocation. These were added as disk model fields and surfaced appropriately at the pool details level within the Web-UI.

Akin to regular balances, btrfs dev delete 'internal' balances were found to negatively impact Web-UI interactivity. This was in part alleviated by refactoring the lowest levels of our disk/pool scan mechanisms. In essence this refactoring significantly reduces the number of system and python calls required to attain the same system wide dev / pool info and simplifies low level device name handling. Existing unit tests were employed to aid in this refactoring. Minor additional code was required to account for regressions (predominantly in LUKS device name handling) that were introduced by these low level device name code changes.

Summary:

  • Execute device removal asynchronously.
  • Monitor the consequent 'internal' balances by existing mechanisms where possible.
  • Only remove pool members pool associations once their associated 'internal' balance has finished.
  • Improve low level efficiency/clarity re device/pool scanning by moving to a single call of the lighter get_dev_pool_info() rather than calling the slower get_pool_info() btrfs disk count times; get_pool_info() is retained for pool import duties as it’s structure is ideally suited to that task. Multiple prior temp_name/by-id conversions are also avoided.
  • Improve user messaging re system performance / Web-UI responsiveness during a balance operation, regular or 'internal'.
  • Fix bug re reliance on "None" disk label removing a fragility concerning disk pool association within the Web-UI.
  • Improve auto pool labeling subsystem by abstracting and generalising ready for pool renaming capability.
  • Improve pool uuid tracking and add related Web-UI element.
  • Add Web-UI element in balance status tab to identify regular or 'internal' balance type.
  • Add devid tracking and related Web-UI element.
  • Use devid Disk model info to ascertain pool info for detached disks.
  • Add per device allocation tracking and related Web-UI element.
  • Fix prior TODO: re btrfs in partition failure point introduced in git tag 3.9.2-32.
  • Fix prior TODO: re unlabeled pools caveat.
  • Add pool details disks table ‘Page refresh required’ indicator keyed from devid=0.
  • Add guidance on common detached disk removal reboot requirement (only affects older kernels).
  • Remove a low level special case for LUKS dev matching (mapped devices) which affected the performance of all dev name by-id look-ups.
  • Add TODO re removing legacy formatted disk raid role pre openSUSE move.
  • Update scan_disks() unit tests for new 'path included' output.
  • Address TODO in scan_disks() unit tests and normalise on pre-sort method.

Fixes #1722
And by way of a trivial application of the added per device allocation:
Fixes #1918
"Incorrect size calculation while removing disk from disk pool"

@suman Ready for review.
Please note that this pr assumes the prior merge of:
"regression in unit tests - environment outdated since 3.9.2-45. Fixes #1993" #1994 (Fixes unit tests)
"pin python-engineio to 2.3.2 as recent 3.0.0 update breaks gevent. Fixes #1995" #1996 (Fixes basic build fail)
and:
"Implement Add Labels feature for already-installed Rock-Ons. Fixes #1998" #1999 (has a prior storageadmin db migration 0007_auto_20181210_0740.py - I’m trying to keep our migrations path simple)

Testing:

All existing osi and btrfs unit test were confirmed to pass prior to and post pr (given #1994) however as indicated above the scan_disks() unit tests required modification but only to accommodate the new behaviour introduced in scan_disks() where we request from lsblk all device paths. From the osi unit tests point of view this was a cosmetic change in test data: and no functional changes were made bar a trivial robustness improvement by way of an existing TODO.

Many of the system configurations used to originally generate the osi unit test data were also tested in their install instance counterparts (ie bios raid system disk, LUKS, btrfs in partition, etc) and were also used during development to help ensure minimal regression.

A full functional test on real hardware was also conducted over multiple cycles of removing (and re-adding a post 'wipefs -a' disk where appropriate). These tests are details in the comments below and indicate expected behaviour in both legacy CentOS and openSUSE (Tumbleweed in this case) installs.

Caveats:

Our keying from devid = 0 (for 'Page refresh required' UI element) may cause confusion during a disk replace (as yet unimplemented: see issue #1611 ) as it is understood that currently within btrfs one of the two disks involved during a 'btrfs replace start ...' operation is temporarily assigned a devid of 0. The cited issue can address this as and when needed.

…ockstor#1722

Fix disk removal timeout failure re "Unknown internal error doing a PUT
.../remove" by asynchronously executing 'btrfs dev remove'. The
pool_balance model was extending to accommodate for what are arbitrarily
named (within Rockstor) 'internal' balances: those automatically
initiated upon every 'btrfs dev delete' by the btrfs subsystem itself.
A complication of 'internal' balances is  their invisibility via 'btrfs
balance status'. An inference mechanism was thus constructed to 'fake'
the output of a regular balance status so that our existing Web-UI
balance surfacing mechanisms could be extended to serve these 'internal'
variants similarly. The new state of device 'in removal' and the above
mentioned inference mechanism required that we now track and update
devid and per device allocation. These were added as disk model fields
and surfaced appropriately at the pool details level within the Web-UI.

Akin to regular balances, btrfs dev delete 'internal' balances were
found to negatively impact Web-UI interactivity. This was in part
alleviated by refactoring the lowest levels of our disk/pool scan
mechanisms. In essence this refactoring significantly reduces the number
of system and python calls required to attain the same system wide dev /
pool info and simplifies low level device name handling. Existing unit
tests were employed to aid in this refactoring. Minor additional code
was required to account for regressions (predominantly in LUKS device
name handling) that were introduced by these low level device name code
changes.

Summary:

- Execute device removal asynchronously.
- Monitor the consequent 'internal' balances by existing mechanisms
where possible.
- Only remove pool members pool associations once their associated
'internal' balance has finished.
- Improve low level efficiency/clarity re device/pool scanning by moving
to a single call of the lighter get_dev_pool_info() rather than calling
the slower get_pool_info() btrfs disk count times; get_pool_info() is
retained for pool import duties as it’s structure is ideally suited to
that task. Multiple prior temp_name/by-id conversions are also avoided.
- Improve user messaging re system performance / Web-UI responsiveness
during a balance operation, regular or 'internal'.
- Fix bug re reliance on "None" disk label removing a fragility
concerning disk pool association within the Web-UI.
- Improve auto pool labeling subsystem by abstracting and generalising
ready for pool renaming capability.
- Improve pool uuid tracking and add related Web-UI element.
- Add Web-UI element in balance status tab to identify regular or
'internal' balance type.
- Add devid tracking and related Web-UI element.
- Use devid Disk model info to ascertain pool info for detached disks.
- Add per device allocation tracking and related Web-UI element.
- Fix prior TODO: re btrfs in partition failure point introduced in git
tag 3.9.2-32.
- Fix prior TODO: re unlabeled pools caveat.
- Add pool details disks table ‘Page refresh required’ indicator keyed
from devid=0.
- Add guidance on common detached disk removal reboot requirement.
- Remove a low level special case for LUKS dev matching (mapped devices)
which affected the performance of all dev name by-id lookups.
- Add TODO re removing legacy formatted disk raid role pre openSUSE
move.
- Update scan_disks() unit tests for new 'path included' output.
- Address todo in scan_disks() unit tests and normalise on pre sort
method.
@phillxnet
Copy link
Member Author

CentOS test
images 1-10
Clean build - import existing 3 x 500GB pool
pool resize -> remove disk
check pool details page and balance record during and after
pool resize -> add (now blank disk) back
check pool details page and balance records (current and last) during and after.
1_3disks-balanced
2_devid2-beginning-removal
3_balances-report-tab-running
4_mostly-done
5_attached-disk-removal-finished-in-balance-tab
6_devid-2-now-removed-by-_update_disk_state
7_adding-same-now-blank-disk-back-to-pool
8_associated-regular-balance-on-adding-bland-dev-back
9_returned-blank-drive-now-balanced
10_regular-add-balance-finished

@phillxnet
Copy link
Member Author

CentOS install continued:
Power off and physically remove a disk
images 11-16
degraded,rw options and reboot (as advised from Web-UI due to known detached not found quirk - N.B. no required in Tumbleweed)
pool resize -> remove detached disk
Remove rw,degraded (pool ok now).
Remove detached disk from Disks page (it now has no pool association), and power of.
Removed disk is now wiped (wipefs -a another system) so it can be used again to check adding a ‘new’ disk back to the pool.
11_degraded rw-detached-zero-size
12_detached-removal-running-status-balances-tab
13_detached-removal-mid-removal-less-allocated
14_detached-btrfs-fs-show_and-btrfs-dev-usage-terminals
15_detached-devid1-successfully-removed
16_detached-removal-finished-status-balances-tab

@phillxnet
Copy link
Member Author

phillxnet commented Jan 27, 2019

CentOS continued:
Attach our ‘new’ blank disk and power up.
images 17-21
add our 'new' disk to our pool.
show “Page refresh required” Btrfs DevID notice in Pool details page
ongoing balances table report
ongoing pool details disks section

17_fresh-disk-add-page-refresh-required-notice
18_add-disk-has-regular-balance-entry-running
19_fi-show_and_dev-usage-regular-disk-add-balance
20_new-disk-add-allocation-increasing-and-new-devid
21_new-disk-add-regular-balance-percentage-feedback

@phillxnet
Copy link
Member Author

Tumbleweed install 20190115 (4.20.0-1-default kernel) on same machine as above CentOS install.
Import pool
All as for CentOS test above but this time we start with a regular balance.
1ng_3disks-balanced
2ng_divid5-beginning-removal
3ng_balance-report-tab-running
4ng_mostly-done
5ng_attached-disk-removal-finished-in-balance-tab
6ng-devid-5-now-removed-by-_update_disk_state
7ng_adding-same-now-blank-disk-back-to-pool-refresh-needed-text
8ng_associated-regular-balance-on-adding-blank-dev-back
9ng_returned-blank-drive-mid-balance
10ng_regular-add-balance-finished
11a_ng_bootup-unmounted-as-no-degraded-mount
11b_ng_degraded rw-detached-zero-size
12ng_detached-removal-running-status-balances-tab
13ng_detached-removal-mid-removal-less-allocated
14ng_detached-btrfs-fi-show_and-btrfs-dev-usage-terminals
15ng_detached-devid3-successfully-removed
16ng_detached-removal-finished-status-balances-tab
17ng_fresh-disk-add-page-refresh-required-notice
18ng_add-disk-has-regular-balance-entry-running
19ng_fi-show_and_dev-usage-regular-disk-add-balance
20ng_new-disk-add-allocation-increasing-and-new-devid
21ng_new-disk-add-regular-balance-percentage-feedback
22ng_new-disk-add-regular-balance-done
23ng_new-disk-add-balance-done-pool-details-disks

Copy link
Member

@FroggyFlox FroggyFlox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed to work as intended on Rockstor CentOS as well as OpenSUSE Leap 15.1 (build 420).
I could reproduce all steps listed in the tests in a VM.

@phillxnet
Copy link
Member Author

@FroggyFlox A very belated thanks for double checking my post pr observed behaviour here.
Merging as this has been waiting quite some time and makes quite a few changes that we need in place to minimise the number of re-bases that may be required on pending and future pull requests.

@phillxnet phillxnet merged commit 5d2f9b6 into rockstor:master Jul 9, 2019
@phillxnet phillxnet deleted the 1722_pool_resize_disk_removal_unknown_internal_error_and_no_UI_counterpart branch September 25, 2019 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment