pool resize disk removal unknown internal error and no UI counterpart #1722

Closed
phillxnet opened this issue Jun 1, 2017 · 8 comments · Fixed by #2010

Comments

@phillxnet
Member

Thanks to forum member Noggin for highlighting this behaviour. Occasionally, when removing a disk from a pool, there can be a UI time-out directly after the last dialog, entitled "Resize Pool / Change RAID level for ...", which acts as the final confirmation of the configured operation:

harmless-put-timeout-on-dev-remove

There is then no 'balance in progress' indication in the UI while the removal is under way, yet if a balance is attempted the UI reports that one is already in progress (only Noggin attempted this; I did not try to execute a balance while the removal was in progress).

btrfs balance status /mnt2/time_machine_pool/
No balance found on '/mnt2/time_machine_pool/'

The pool resize is, however, indicated by the requested disk having its size 'demoted' to zero and showing reduced usage in subsequent executions of btrfs fi show:

Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
	Total devices 4 FS bytes used 31.57GiB
	devid    3 size 149.05GiB used 17.03GiB path /dev/sdd
	devid    4 size 0.00B used 5.00GiB path /dev/sda
	devid    5 size 149.05GiB used 23.03GiB path /dev/mapper/luks-d36d39ea-c0b3-4355-b0c5-bd3248e6bbfe
	devid    6 size 149.05GiB used 23.00GiB path /dev/mapper/luks-d7524e90-4d9e-4772-932f-d1407b6b5fe7

and then later on:

Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
	Total devices 4 FS bytes used 32.57GiB
	devid    3 size 149.05GiB used 18.03GiB path /dev/sdd
	devid    4 size 0.00B used 2.00GiB path /dev/sda
	devid    5 size 149.05GiB used 24.03GiB path /dev/mapper/luks-d36d39ea-c0b3-4355-b0c5-bd3248e6bbfe
	devid    6 size 149.05GiB used 24.00GiB path /dev/mapper/luks-d7524e90-4d9e-4772-932f-d1407b6b5fe7

As can be seen, devid 4 is having its pool usage reduced (from 5.00 GiB to 2.00 GiB) between runs. In the above example the disk removal completed successfully; however there was never a UI indication of its 'in progress' nature, or any record of a balance having taken place at that time.
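The 'size demoted to zero' signature above is itself enough to detect a device mid-removal. As a minimal sketch (a hypothetical helper, not Rockstor's actual code), the btrfs fi show text can be parsed and any devid reporting size 0.00B flagged:

```python
import re

# Hypothetical helper: parse 'btrfs fi show' output and flag devices
# whose reported size has been 'demoted' to zero, i.e. those mid-removal.
DEVID_RE = re.compile(
    r"devid\s+(?P<devid>\d+)\s+size\s+(?P<size>\S+)\s+"
    r"used\s+(?P<used>\S+)\s+path\s+(?P<path>\S+)"
)


def devices_in_removal(fi_show_output):
    """Return (devid, path, used) tuples for devices with size 0.00B."""
    in_removal = []
    for match in DEVID_RE.finditer(fi_show_output):
        if match.group("size") == "0.00B":
            in_removal.append(
                (int(match.group("devid")),
                 match.group("path"),
                 match.group("used"))
            )
    return in_removal


sample = """\
Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
\tTotal devices 4 FS bytes used 31.57GiB
\tdevid    3 size 149.05GiB used 17.03GiB path /dev/sdd
\tdevid    4 size 0.00B used 5.00GiB path /dev/sda
"""
print(devices_in_removal(sample))  # [(4, '/dev/sda', '5.00GiB')]
```

Re-running this between polls, and watching the 'used' figure shrink, gives exactly the in-progress signal that the UI currently lacks.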

Reference to Noggin's forum thread, suspected of indicating the same behaviour as my observations during final testing of pr #1716, which also led to this issue's creation (details of the preceding steps are available in that pr):
https://forum.rockstor.com/t/cant-remove-failed-drive-from-pool-rebalance-in-progress/3319
where a 3.8.16-16 (3.9.0 iso install) version exhibited the same behaviour (pre #1716 merge).

@phillxnet
Member Author

Please update the above referenced forum thread with this issue's resolution.

@phillxnet
Member Author

In working on the following related issue:
"Implement a delete missing disk in pool UI" #1700
I have been able to assess the cause of this behaviour. Essentially, running 'btrfs device delete (dev-name and/or missing)' initiates an internal balance. In almost all 'real hardware' cases this causes the indicated error: essentially a time-out, as a balance will typically take hours, and we currently fail to run the associated resize_pool() code as an async task as we do with regular balance operations. Our db commit code is therefore 'frozen' until such time as the 'btrfs device delete' completes its internal balance. Note also that, prior to the as-yet uncommitted changes to be associated with #1700, we then also attempted to run a redundant balance after the remove (btrfs dev delete) operation.

Once the changes associated with the above referenced issue #1700 have been committed, we can move to treating our resize_pool() as we currently do our balance, i.e. via an async task with a monitoring wrapper. Some progress has been made towards the design of this arrangement; it is likely to mirror what we have for balance / scrub quite closely, bar the inability to query the btrfs subsystem for the progress / status of this 'btrfs dev delete' initiated internal balance operation. So more work/research is required in that direction for this issue to be addressed fully.
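The proposed 'async task' treatment can be sketched as follows. This is a minimal illustration only, with hypothetical helper names (remove_device_cmd, run_async) rather than Rockstor's actual resize_pool() API: the point is simply that 'btrfs device delete' runs off the request thread, so the web/db layer is not frozen for the hours the internal balance can take.

```python
import subprocess
import threading

def remove_device_cmd(mount_point, device):
    # 'btrfs device delete' initiates an internal balance before the
    # device is finally released from the pool.
    return ["btrfs", "device", "delete", device, mount_point]

def run_async(cmd, on_done):
    """Run cmd in a worker thread; call on_done(returncode) when finished."""
    def worker():
        proc = subprocess.run(cmd, capture_output=True, text=True)
        # Only update pool membership in the db once the delete (and
        # hence its internal balance) has actually completed.
        on_done(proc.returncode)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread  # caller can poll thread.is_alive() as a crude status

# Usage sketch:
#   run_async(remove_device_cmd("/mnt2/rock-pool", "/dev/sdc"), callback)
```

In Rockstor proper this would sit behind the same task/monitoring wrapper used for regular balances, rather than a raw thread.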

@phillxnet
Member Author

A counter example of this issue is where the disk to be removed presents little in the way of a balance requirement prior to its removal (via btrfs device delete mount-point), i.e.:

Label: 'rock-pool'  uuid: 8adf7f0b-65ec-4e00-83cc-7f5855201185
	Total devices 3 FS bytes used 39.22GiB
	devid    1 size 465.76GiB used 42.03GiB path /dev/sda
	devid    2 size 465.76GiB used 42.03GiB path /dev/sdb
	devid    3 size 465.76GiB used 0.00B path /dev/sdc

Where we are to remove /dev/sdc which has "used 0.00B".

In situations such as this our existing 'treatment' (although the tested code had the consequent Rockstor-initiated balance removed) works as intended, due to the internal 'btrfs dev delete' initiated balance completing in a sufficiently short time: hence no time-outs.

Label: 'rock-pool'  uuid: 8adf7f0b-65ec-4e00-83cc-7f5855201185
	Total devices 2 FS bytes used 39.22GiB
	devid    1 size 465.76GiB used 42.03GiB path /dev/sda
	devid    2 size 465.76GiB used 42.03GiB path /dev/sdb

And our db representation is also updated accordingly.

@phillxnet
Member Author

Linking to another recent forum thread where juchong also reported this issue:
https://forum.rockstor.com/t/cant-remove-disk-even-though-i-have-enough-space/5287/6

@phillxnet
Member Author

I am currently working on this issue.

@phillxnet
Member Author

Please also update the following thread, from user cferra, with this issue's resolution:
https://forum.rockstor.com/t/problems-replacing-upgrading-a-disk/5515

@phillxnet
Member Author

Some visible elements added by the pending code that addresses this issue (awaiting current pr merges so that I can rebase on pending db migrations, as this code also requires some db additions):

Track and surface Pool UUID:
uuid-added-to-pool-details-page-section

detached disk devid and allocated message:
detached-disk-devid-and-allocated-message

Enhanced final message re balance initialisation:

enhanced-final-message-re-balance-initialisation

Freshly added disk pre allocation devid scan required notice:

freshly-added-disk-pre-allocation-devid-scan-required-notice

last stage of disk removal attached:
last-stage-of-disk-removal-attached

running finished failed disk removal internal balance report:
running-finished-failed-disk-removal-internal-balance-report

Notice that in the above we also surface devid and per-device allocated (%).
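The per-device allocated (%) figure surfaced above can be derived directly from the size/used quantities btrfs reports. A minimal sketch (hypothetical helpers, not the pending code itself), treating a size of 0.00B — a device mid-removal — as 0% allocated:

```python
def parse_btrfs_qty(qty):
    """Convert a btrfs quantity such as '149.05GiB' or '0.00B' to bytes."""
    units = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2,
             "GiB": 1024 ** 3, "TiB": 1024 ** 4}
    # Check longer suffixes first so e.g. 'GiB' is not matched as 'B'.
    for suffix in sorted(units, key=len, reverse=True):
        if qty.endswith(suffix):
            return int(float(qty[: -len(suffix)]) * units[suffix])
    raise ValueError("unrecognised btrfs quantity: %s" % qty)


def allocated_percent(size, used):
    """Per-device allocation (%); 0 for a device mid-removal (size 0)."""
    size_bytes = parse_btrfs_qty(size)
    if size_bytes == 0:
        return 0.0
    return round(100.0 * parse_btrfs_qty(used) / size_bytes, 1)


print(allocated_percent("149.05GiB", "17.03GiB"))  # 11.4
```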

@phillxnet
Member Author

Code/pr for this issue's fix is in final testing/preparation.

phillxnet added a commit to phillxnet/rockstor-core that referenced this issue Jan 27, 2019
…ockstor#1722

Fix disk removal timeout failure re "Unknown internal error doing a PUT
.../remove" by asynchronously executing 'btrfs dev remove'. The
pool_balance model was extended to accommodate what are arbitrarily
named (within Rockstor) 'internal' balances: those automatically
initiated upon every 'btrfs dev delete' by the btrfs subsystem itself.
A complication of 'internal' balances is their invisibility via 'btrfs
balance status'. An inference mechanism was thus constructed to 'fake'
the output of a regular balance status so that our existing Web-UI
balance surfacing mechanisms could be extended to serve these 'internal'
variants similarly. The new state of device 'in removal' and the above
mentioned inference mechanism required that we now track and update
devid and per device allocation. These were added as disk model fields
and surfaced appropriately at the pool details level within the Web-UI.

Akin to regular balances, btrfs dev delete 'internal' balances were
found to negatively impact Web-UI interactivity. This was in part
alleviated by refactoring the lowest levels of our disk/pool scan
mechanisms. In essence this refactoring significantly reduces the number
of system and python calls required to attain the same system wide dev /
pool info and simplifies low level device name handling. Existing unit
tests were employed to aid in this refactoring. Minor additional code
was required to account for regressions (predominantly in LUKS device
name handling) that were introduced by these low level device name code
changes.

Summary:

- Execute device removal asynchronously.
- Monitor the consequent 'internal' balances by existing mechanisms
where possible.
- Only remove pool members' pool associations once their associated
'internal' balance has finished.
- Improve low level efficiency/clarity re device/pool scanning by moving
to a single call of the lighter get_dev_pool_info() rather than calling
the slower get_pool_info() btrfs disk count times; get_pool_info() is
retained for pool import duties as its structure is ideally suited to
that task. Multiple prior temp_name/by-id conversions are also avoided.
- Improve user messaging re system performance / Web-UI responsiveness
during a balance operation, regular or 'internal'.
- Fix bug re reliance on "None" disk label removing a fragility
concerning disk pool association within the Web-UI.
- Improve auto pool labeling subsystem by abstracting and generalising
ready for pool renaming capability.
- Improve pool uuid tracking and add related Web-UI element.
- Add Web-UI element in balance status tab to identify regular or
'internal' balance type.
- Add devid tracking and related Web-UI element.
- Use devid Disk model info to ascertain pool info for detached disks.
- Add per device allocation tracking and related Web-UI element.
- Fix prior TODO: re btrfs in partition failure point introduced in git
tag 3.9.2-32.
- Fix prior TODO: re unlabeled pools caveat.
- Add pool details disks table ‘Page refresh required’ indicator keyed
from devid=0.
- Add guidance on common detached disk removal reboot requirement.
- Remove a low level special case for LUKS dev matching (mapped devices)
which affected the performance of all dev name by-id lookups.
- Add TODO re removing legacy formatted disk raid role pre openSUSE
move.
- Update scan_disks() unit tests for new 'path included' output.
- Address todo in scan_disks() unit tests and normalise on pre sort
method.
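The 'inference mechanism' the commit message describes — faking a regular balance status for an internal balance that 'btrfs balance status' cannot see — can be sketched as follows. This is a hypothetical minimal version, assuming (as above) that progress is inferred from the shrinking allocation on the departing device:

```python
# 'btrfs balance status' shows nothing for the internal balance that a
# 'btrfs dev delete' runs, so progress is inferred from the departing
# device's allocation and presented in the shape of a regular balance
# status entry, which existing Web-UI surfacing code can then consume.

def fake_balance_status(start_used_bytes, current_used_bytes):
    """Infer internal-balance progress from the departing device's usage."""
    if start_used_bytes <= 0 or current_used_bytes <= 0:
        # Nothing (left) to move: the internal balance is done.
        return {"status": "finished", "percent_done": 100}
    done = 100 * (start_used_bytes - current_used_bytes) // start_used_bytes
    # Cap at 99: only a zero 'used' figure proves actual completion.
    return {"status": "running", "percent_done": max(0, min(99, done))}
```

For example, a device that started its removal with 5 GiB allocated and now shows 2 GiB would report a running balance at 60% done.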
phillxnet added a commit that referenced this issue Jul 9, 2019
…_unknown_internal_error_and_no_UI_counterpart

pool resize disk removal unknown internal error and no UI counterpart. Fixes #1722
FroggyFlox pushed a commit to FroggyFlox/rockstor-core that referenced this issue Jul 9, 2019
…ockstor#1722
