pool resize disk removal unknown internal error and no UI counterpart #1722
Please update the above referenced forum thread with this issue's resolution.
I am working on the following related issue: once the changes associated with the above referenced issue #1700 have been committed, we can move to treating our resize_pool() as we currently do our balance, i.e. via an async task with a monitoring wrapper. Some progress has been made towards the design of this arrangement, which is likely to mirror what we have for balance / scrub quite closely, bar the inability to query the btrfs subsystem for the progress / status of this 'btrfs dev delete' initiated internal balance operation. So more work/research is required in that direction for this issue to be fully addressed.
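The "async task with monitoring wrapper" arrangement described above might be sketched roughly as follows. This is a minimal illustration only: `run_async_with_monitor`, its callables, and the polling interval are hypothetical stand-ins, not Rockstor's actual task API.

```python
import threading
import time


def run_async_with_monitor(task, poll_status, interval=0.01, timeout=5.0):
    """Run `task` in a background thread and poll `poll_status` until the
    task completes (or `timeout` elapses), mirroring the balance/scrub
    monitoring-wrapper idea.

    `task` stands in for executing 'btrfs dev delete'; `poll_status` stands
    in for inferring the progress of its internal balance.
    Returns the list of observed status values, latest last.
    """
    worker = threading.Thread(target=task)
    worker.start()
    deadline = time.monotonic() + timeout
    history = []
    while worker.is_alive() and time.monotonic() < deadline:
        history.append(poll_status())  # periodic progress observation
        time.sleep(interval)
    worker.join(timeout=max(0.0, deadline - time.monotonic()))
    history.append(poll_status())  # final status after completion
    return history
```

The point of the wrapper is that the HTTP request returns immediately while the long-running removal proceeds in the background, avoiding the PUT timeout described in this issue.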
A counter-example of this issue is where the disk to be removed presents little in the way of a balance requirement prior to its removal (via 'btrfs device delete <dev> <mount-point>'), e.g.:
Here we are removing /dev/sdc, which has "used 0.00B". In situations such as this our existing 'treatment' (although the tested code had the consequent Rockstor-initiated balance removed) works as intended, due to the internal 'btrfs dev delete' initiated balance completing in a sufficiently short time: hence no timeouts.
And our db representation is also updated accordingly.
Linking to another recent forum thread where juchong also reported this issue:
I am currently working on this issue.
Please also update the following user, cferra, in the following thread with this issue's resolution:
Code/PR for this issue's fix is in final testing/preparation.
…ockstor#1722 Fix disk removal timeout failure re "Unknown internal error doing a PUT .../remove" by asynchronously executing 'btrfs dev remove'.

The pool_balance model was extended to accommodate what are (within Rockstor) arbitrarily named 'internal' balances: those automatically initiated by the btrfs subsystem itself upon every 'btrfs dev delete'. A complication of 'internal' balances is their invisibility via 'btrfs balance status'. An inference mechanism was therefore constructed to 'fake' the output of a regular balance status so that our existing Web-UI balance surfacing mechanisms could be extended to serve these 'internal' variants similarly. The new device state of 'in removal' and the above-mentioned inference mechanism required that we now track and update devid and per-device allocation. These were added as Disk model fields and surfaced appropriately at the pool details level within the Web-UI.

Akin to regular balances, 'btrfs dev delete' initiated 'internal' balances were found to negatively impact Web-UI interactivity. This was in part alleviated by refactoring the lowest levels of our disk/pool scan mechanisms. In essence, this refactoring significantly reduces the number of system and Python calls required to attain the same system-wide dev/pool info, and simplifies low-level device name handling. Existing unit tests were employed to aid in this refactoring. Minor additional code was required to address regressions (predominantly in LUKS device name handling) introduced by these low-level device name code changes.

Summary:
- Execute device removal asynchronously.
- Monitor the consequent 'internal' balances by existing mechanisms where possible.
- Only remove pool members' pool associations once their associated 'internal' balance has finished.
- Improve low-level efficiency/clarity re device/pool scanning by moving to a single call of the lighter get_dev_pool_info() rather than calling the slower get_pool_info() once per btrfs disk; get_pool_info() is retained for pool import duties as its structure is ideally suited to that task. Multiple prior temp_name/by-id conversions are also avoided.
- Improve user messaging re system performance / Web-UI responsiveness during a balance operation, regular or 'internal'.
- Fix bug re reliance on "None" disk label, removing a fragility concerning disk/pool association within the Web-UI.
- Improve auto pool labeling subsystem by abstracting and generalising, ready for a pool renaming capability.
- Improve pool uuid tracking and add a related Web-UI element.
- Add Web-UI element in the balance status tab to identify regular or 'internal' balance type.
- Add devid tracking and a related Web-UI element.
- Use devid Disk model info to ascertain pool info for detached disks.
- Add per-device allocation tracking and a related Web-UI element.
- Fix prior TODO re btrfs-in-partition failure point introduced in git tag 3.9.2-32.
- Fix prior TODO re unlabeled pools caveat.
- Add pool details disks table 'Page refresh required' indicator keyed from devid=0.
- Add guidance on the common reboot requirement when removing a detached disk.
- Remove a low-level special case for LUKS dev matching (mapped devices) which affected the performance of all dev name by-id lookups.
- Add TODO re removing legacy formatted-disk raid role pre openSUSE move.
- Update scan_disks() unit tests for new 'path included' output.
- Address TODO in scan_disks() unit tests and normalise on pre-sort method.
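The "inference mechanism" mentioned above could, in outline, derive a balance-status-like result from the departing device's shrinking allocation. The function below is a hypothetical sketch under that assumption: the field names, thresholds, and return shape are illustrative only, not Rockstor's actual implementation.

```python
def fake_internal_balance_status(start_bytes, current_bytes):
    """Synthesise a 'btrfs balance status'-style result for an 'internal'
    balance (one triggered by 'btrfs dev delete'), which the real
    'btrfs balance status' command cannot see.

    Progress is inferred from the departing device's allocation:
    start_bytes is its allocation when removal began, current_bytes now.
    """
    if current_bytes <= 0:
        # All data has been migrated off the device: removal has finished.
        return {"status": "finished", "percent_done": 100}
    if start_bytes <= 0:
        # No baseline to infer from.
        return {"status": "unknown", "percent_done": 0}
    percent = int(100 * (start_bytes - current_bytes) / start_bytes)
    # Cap at 99 while running; only an empty device counts as done.
    return {"status": "running", "percent_done": max(0, min(percent, 99))}
```

With the 5 GB to 3 GB reduction seen later in this issue, such a helper would report roughly 40% done while the internal balance is still running.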
…_unknown_internal_error_and_no_UI_counterpart pool resize disk removal unknown internal error and no UI counterpart. Fixes #1722
Thanks to forum member Noggin for highlighting this behaviour. Occasionally, when removing a disk from a pool, there can be a UI timeout directly after the last dialog, entitled "Resize Pool / Change RAID level for ...", which acts as the final confirmation of the configured operation:
There is then no 'balance' indicated in the UI while the removal is in progress, yet the UI reports that a balance is in progress when one is attempted (only attempted by Noggin; I did not attempt to execute a balance whilst the removal was in progress).
The pool resize is, however, indicated by the requested disks having their size 'demoted' to zero and showing reduced usage across subsequent executions of btrfs fi show:
and then later on:
As can be seen, devid 4 is having its pool usage reduced (from 5 to 3 GB) between runs. In the above example the disk removal completed successfully; however, there was never a UI indication of its 'in progress' nature, or any record of a balance having taken place at that time.
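The per-device usage shown by btrfs fi show can be tracked programmatically to observe this reduction. A minimal sketch follows; the regex and return shape are my own illustration, not Rockstor's actual get_dev_pool_info() parsing.

```python
import re

# Matches device lines emitted by 'btrfs fi show', e.g.:
#   devid    4 size 8.00GiB used 3.00GiB path /dev/sdd
DEVID_LINE = re.compile(
    r"devid\s+(?P<devid>\d+)\s+size\s+(?P<size>\S+)"
    r"\s+used\s+(?P<used>\S+)\s+path\s+(?P<path>\S+)"
)


def parse_fi_show_devices(output):
    """Return {devid: {'size': ..., 'used': ..., 'path': ...}} parsed
    from 'btrfs fi show' output."""
    return {
        int(m.group("devid")): {
            "size": m.group("size"),
            "used": m.group("used"),
            "path": m.group("path"),
        }
        for m in DEVID_LINE.finditer(output)
    }
```

Comparing the 'used' figure for the departing devid across successive runs (5.00GiB, then 3.00GiB, and so on down to 0.00B) is exactly the signal that could surface the otherwise invisible internal balance in the UI.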
Reference to Noggin's forum thread, suspected of indicating the same as my observations during final testing of pr #1716, which also led to this issue's creation (details of the preceding steps are available in that pr):
https://forum.rockstor.com/t/cant-remove-failed-drive-from-pool-rebalance-in-progress/3319
where a 3.8.16-16 (3.9.0 iso install) version exhibited the same behaviour (pre #1716 merge).