pool resize disk removal unknown internal error and no UI counterpart #1722

Closed
phillxnet opened this issue Jun 1, 2017 · 8 comments · Fixed by #2010

Comments

@phillxnet
Member

Thanks to forum member Noggin for highlighting this behaviour. Occasionally, when removing a disk from a pool, there can be a UI time-out directly after the last dialog, entitled "Resize Pool / Change RAID level for ...", which acts as the final confirmation of the configured operation:

harmless-put-timeout-on-dev-remove

There is then no 'balance in progress' indication in the UI while the removal is under way, yet if a balance is attempted the UI reports that one is already in progress (only Noggin attempted this; I did not try to execute a balance while the removal was in progress).

btrfs balance status /mnt2/time_machine_pool/
No balance found on '/mnt2/time_machine_pool/'

The pool resize is, however, indicated by the requested disk having its size 'demoted' to zero and showing reduced usage in subsequent executions of btrfs fi show:

Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
	Total devices 4 FS bytes used 31.57GiB
	devid    3 size 149.05GiB used 17.03GiB path /dev/sdd
	devid    4 size 0.00B used 5.00GiB path /dev/sda
	devid    5 size 149.05GiB used 23.03GiB path /dev/mapper/luks-d36d39ea-c0b3-4355-b0c5-bd3248e6bbfe
	devid    6 size 149.05GiB used 23.00GiB path /dev/mapper/luks-d7524e90-4d9e-4772-932f-d1407b6b5fe7

and then later on:

Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
	Total devices 4 FS bytes used 32.57GiB
	devid    3 size 149.05GiB used 18.03GiB path /dev/sdd
	devid    4 size 0.00B used 2.00GiB path /dev/sda
	devid    5 size 149.05GiB used 24.03GiB path /dev/mapper/luks-d36d39ea-c0b3-4355-b0c5-bd3248e6bbfe
	devid    6 size 149.05GiB used 24.00GiB path /dev/mapper/luks-d7524e90-4d9e-4772-932f-d1407b6b5fe7

As can be seen, devid 4 is having its pool usage reduced (from 5.00 GiB to 2.00 GiB) between runs. In the above example the disk removal completed successfully; however there was never a UI indication of its 'in progress' nature, or any record of a balance having taken place at that time.
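The 'size demoted to zero' signature above is itself enough to detect a device mid-removal. As a minimal sketch (a hypothetical helper, not Rockstor's actual code), the btrfs fi show text can be parsed and any devid reporting size 0.00B flagged:

```python
import re

# Hypothetical helper: parse 'btrfs fi show' output and flag devices
# whose reported size has been 'demoted' to zero, i.e. those mid-removal.
DEVID_RE = re.compile(
    r"devid\s+(?P<devid>\d+)\s+size\s+(?P<size>\S+)\s+"
    r"used\s+(?P<used>\S+)\s+path\s+(?P<path>\S+)"
)


def devices_in_removal(fi_show_output):
    """Return (devid, path, used) tuples for devices with size 0.00B."""
    in_removal = []
    for match in DEVID_RE.finditer(fi_show_output):
        if match.group("size") == "0.00B":
            in_removal.append(
                (int(match.group("devid")),
                 match.group("path"),
                 match.group("used"))
            )
    return in_removal


sample = """\
Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
\tTotal devices 4 FS bytes used 31.57GiB
\tdevid    3 size 149.05GiB used 17.03GiB path /dev/sdd
\tdevid    4 size 0.00B used 5.00GiB path /dev/sda
"""
print(devices_in_removal(sample))  # [(4, '/dev/sda', '5.00GiB')]
```

Re-running this between polls, and watching the 'used' figure shrink, gives exactly the in-progress signal that the UI currently lacks.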

Reference to Noggin's forum thread, suspected of indicating the same behaviour as my observations during final testing of pr #1716, which also led to this issue's creation (details of the preceding steps are available in that pr):
https://forum.rockstor.com/t/cant-remove-failed-drive-from-pool-rebalance-in-progress/3319
where a 3.8.16-16 (3.9.0 iso install) version exhibited the same behaviour (pre #1716 merge).

@phillxnet
Member Author

Please update the above referenced forum thread with this issue's resolution.

@phillxnet
Member Author

In working on the following related issue:
"Implement a delete missing disk in pool UI" #1700
I have been able to assess the cause of this behaviour. Essentially, running 'btrfs device delete (dev-name and/or missing)' initiates an internal balance. In almost all 'real hardware' cases this causes the indicated error: essentially a time-out, as a balance will typically take hours, and we currently fail to run the associated resize_pool() code as an async task as we do with regular balance operations. Our db commit code is therefore 'frozen' until such time as the 'btrfs device delete' completes its internal balance. Note also that, prior to the as-yet uncommitted changes to be associated with #1700, we then also attempted to run a redundant balance after the remove (btrfs dev delete) operation.

Once the changes associated with the above referenced issue #1700 have been committed, we can move to treating our resize_pool() as we currently do our balance, i.e. via an async task with a monitoring wrapper. Some progress has been made towards the design of this arrangement; it is likely to mirror what we have for balance / scrub quite closely, bar the inability to query the btrfs subsystem for the progress / status of this 'btrfs dev delete' initiated internal balance operation. So more work/research is required in that direction for this issue to be addressed fully.
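The proposed 'async task' treatment can be sketched as follows. This is a minimal illustration only, with hypothetical helper names (remove_device_cmd, run_async) rather than Rockstor's actual resize_pool() API: the point is simply that 'btrfs device delete' runs off the request thread, so the web/db layer is not frozen for the hours the internal balance can take.

```python
import subprocess
import threading

def remove_device_cmd(mount_point, device):
    # 'btrfs device delete' initiates an internal balance before the
    # device is finally released from the pool.
    return ["btrfs", "device", "delete", device, mount_point]

def run_async(cmd, on_done):
    """Run cmd in a worker thread; call on_done(returncode) when finished."""
    def worker():
        proc = subprocess.run(cmd, capture_output=True, text=True)
        # Only update pool membership in the db once the delete (and
        # hence its internal balance) has actually completed.
        on_done(proc.returncode)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread  # caller can poll thread.is_alive() as a crude status

# Usage sketch:
#   run_async(remove_device_cmd("/mnt2/rock-pool", "/dev/sdc"), callback)
```

In Rockstor proper this would sit behind the same task/monitoring wrapper used for regular balances, rather than a raw thread.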

@phillxnet
Member Author

A counter example of this issue is where the disk to be removed presents little in the way of a balance requirement prior to its removal (via btrfs device delete mount-point), i.e.:

Label: 'rock-pool'  uuid: 8adf7f0b-65ec-4e00-83cc-7f5855201185
	Total devices 3 FS bytes used 39.22GiB
	devid    1 size 465.76GiB used 42.03GiB path /dev/sda
	devid    2 size 465.76GiB used 42.03GiB path /dev/sdb
	devid    3 size 465.76GiB used 0.00B path /dev/sdc

Where we are to remove /dev/sdc which has "used 0.00B".

In situations such as this our existing 'treatment' (although the tested code had the consequent Rockstor-initiated balance removed) works as intended, due to the internal 'btrfs dev delete' initiated balance completing in a sufficiently short time: hence no time-outs.

Label: 'rock-pool'  uuid: 8adf7f0b-65ec-4e00-83cc-7f5855201185
	Total devices 2 FS bytes used 39.22GiB
	devid    1 size 465.76GiB used 42.03GiB path /dev/sda
	devid    2 size 465.76GiB used 42.03GiB path /dev/sdb

And our db representation is also updated accordingly.

@phillxnet
Member Author

Linking to another recent forum thread where juchong also reported this issue:
https://forum.rockstor.com/t/cant-remove-disk-even-though-i-have-enough-space/5287/6

@phillxnet
Member Author

I am currently working on this issue.

@phillxnet
Member Author

Please also update the following thread, from user cferra, with this issue's resolution:
https://forum.rockstor.com/t/problems-replacing-upgrading-a-disk/5515

@phillxnet
Member Author

Some visible elements added by the pending code that addresses this issue (awaiting current pr merges so that I can rebase on pending db migrations, as this code also requires some db additions):

Track and surface Pool UUID:
uuid-added-to-pool-details-page-section

detached disk devid and allocated message:
detached-disk-devid-and-allocated-message

Enhanced final message re balance initialisation:

enhanced-final-message-re-balance-initialisation

Freshly added disk pre allocation devid scan required notice:

freshly-added-disk-pre-allocation-devid-scan-required-notice

last stage of disk removal attached:
last-stage-of-disk-removal-attached

running finished failed disk removal internal balance report:
running-finished-failed-disk-removal-internal-balance-report

Notice that in the above we also surface devid and per-device allocated (%).
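The per-device allocated (%) figure surfaced above can be derived directly from the size/used quantities btrfs reports. A minimal sketch (hypothetical helpers, not the pending code itself), treating a size of 0.00B — a device mid-removal — as 0% allocated:

```python
def parse_btrfs_qty(qty):
    """Convert a btrfs quantity such as '149.05GiB' or '0.00B' to bytes."""
    units = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2,
             "GiB": 1024 ** 3, "TiB": 1024 ** 4}
    # Check longer suffixes first so e.g. 'GiB' is not matched as 'B'.
    for suffix in sorted(units, key=len, reverse=True):
        if qty.endswith(suffix):
            return int(float(qty[: -len(suffix)]) * units[suffix])
    raise ValueError("unrecognised btrfs quantity: %s" % qty)


def allocated_percent(size, used):
    """Per-device allocation (%); 0 for a device mid-removal (size 0)."""
    size_bytes = parse_btrfs_qty(size)
    if size_bytes == 0:
        return 0.0
    return round(100.0 * parse_btrfs_qty(used) / size_bytes, 1)


print(allocated_percent("149.05GiB", "17.03GiB"))  # 11.4
```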

@phillxnet
Member Author

Code/pr for this issue's fix is in final testing/preparation.

phillxnet added a commit to phillxnet/rockstor-core that referenced this issue Jan 27, 2019
…ockstor#1722

Fix disk removal timeout failure re "Unknown internal error doing a PUT
.../remove" by asynchronously executing 'btrfs dev remove'. The
pool_balance model was extended to accommodate what are arbitrarily
named (within Rockstor) 'internal' balances: those automatically
initiated upon every 'btrfs dev delete' by the btrfs subsystem itself.
A complication of 'internal' balances is their invisibility via 'btrfs
balance status'. An inference mechanism was thus constructed to 'fake'
the output of a regular balance status so that our existing Web-UI
balance surfacing mechanisms could be extended to serve these 'internal'
variants similarly. The new state of device 'in removal' and the above
mentioned inference mechanism required that we now track and update
devid and per device allocation. These were added as disk model fields
and surfaced appropriately at the pool details level within the Web-UI.

Akin to regular balances, btrfs dev delete 'internal' balances were
found to negatively impact Web-UI interactivity. This was in part
alleviated by refactoring the lowest levels of our disk/pool scan
mechanisms. In essence this refactoring significantly reduces the number
of system and python calls required to attain the same system wide dev /
pool info and simplifies low level device name handling. Existing unit
tests were employed to aid in this refactoring. Minor additional code
was required to account for regressions (predominantly in LUKS device
name handling) that were introduced by these low level device name code
changes.

Summary:

- Execute device removal asynchronously.
- Monitor the consequent 'internal' balances by existing mechanisms
where possible.
- Only remove pool members' pool associations once their associated
'internal' balance has finished.
- Improve low level efficiency/clarity re device/pool scanning by moving
to a single call of the lighter get_dev_pool_info() rather than calling
the slower get_pool_info() btrfs disk count times; get_pool_info() is
retained for pool import duties as its structure is ideally suited to
that task. Multiple prior temp_name/by-id conversions are also avoided.
- Improve user messaging re system performance / Web-UI responsiveness
during a balance operation, regular or 'internal'.
- Fix bug re reliance on "None" disk label removing a fragility
concerning disk pool association within the Web-UI.
- Improve auto pool labeling subsystem by abstracting and generalising
ready for pool renaming capability.
- Improve pool uuid tracking and add related Web-UI element.
- Add Web-UI element in balance status tab to identify regular or
'internal' balance type.
- Add devid tracking and related Web-UI element.
- Use devid Disk model info to ascertain pool info for detached disks.
- Add per device allocation tracking and related Web-UI element.
- Fix prior TODO: re btrfs in partition failure point introduced in git
tag 3.9.2-32.
- Fix prior TODO: re unlabeled pools caveat.
- Add pool details disks table ‘Page refresh required’ indicator keyed
from devid=0.
- Add guidance on common detached disk removal reboot requirement.
- Remove a low level special case for LUKS dev matching (mapped devices)
which affected the performance of all dev name by-id lookups.
- Add TODO re removing legacy formatted disk raid role pre openSUSE
move.
- Update scan_disks() unit tests for new 'path included' output.
- Address todo in scan_disks() unit tests and normalise on pre sort
method.
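The 'inference mechanism' the commit message describes — faking a regular balance status for an internal balance that 'btrfs balance status' cannot see — can be sketched as follows. This is a hypothetical minimal version, assuming (as above) that progress is inferred from the shrinking allocation on the departing device:

```python
# 'btrfs balance status' shows nothing for the internal balance that a
# 'btrfs dev delete' runs, so progress is inferred from the departing
# device's allocation and presented in the shape of a regular balance
# status entry, which existing Web-UI surfacing code can then consume.

def fake_balance_status(start_used_bytes, current_used_bytes):
    """Infer internal-balance progress from the departing device's usage."""
    if start_used_bytes <= 0 or current_used_bytes <= 0:
        # Nothing (left) to move: the internal balance is done.
        return {"status": "finished", "percent_done": 100}
    done = 100 * (start_used_bytes - current_used_bytes) // start_used_bytes
    # Cap at 99: only a zero 'used' figure proves actual completion.
    return {"status": "running", "percent_done": max(0, min(99, done))}
```

For example, a device that started its removal with 5 GiB allocated and now shows 2 GiB would report a running balance at 60% done.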
phillxnet added a commit that referenced this issue Jul 9, 2019
…_unknown_internal_error_and_no_UI_counterpart

pool resize disk removal unknown internal error and no UI counterpart. Fixes #1722
FroggyFlox pushed a commit to FroggyFlox/rockstor-core that referenced this issue Jul 9, 2019
…ockstor#1722
