Shards don't get assigned when the primary gets removed and only two units are left #327
reneradoi added a commit that referenced this issue on Jun 11, 2024:
## Issue

When attaching an existing storage to a new unit, two issues occur:
- Snap install fails because of permissions / ownership of directories
- snap_common gets completely deleted

## Solution

- Bump the snap version and use the fixed one (the fixed revision is 47; this is already outdated, as a newer version of the snap is available and was merged to main prior to this PR)
- Enhance test coverage for integration tests

## Integration Testing

Tests for attaching existing storage can be found in integration/ha/test_storage.py. There are now three test cases:

1. test_storage_reuse_after_scale_down: remove one unit from the deployment, then add a new one re-using the storage from the removed unit. Check that the continuous writes are ok and that a test file created initially is still there.
2. test_storage_reuse_after_scale_to_zero: remove both units from the deployment, keep the application, then add two new units using the storage again. Check the continuous writes.
3. test_storage_reuse_in_new_cluster_after_app_removal: from a cluster of three units, remove all of them and remove the application. Deploy a new application (with one unit) to the same model, attach the storage, then add two more units with the other storage volumes. Check the continuous writes.

## Other Issues

- As part of this PR, another issue is addressed: #306. It is resolved with commit 19f843c.
- Furthermore, problems with acquiring the OpenSearch lock are worked around in this PR, especially when the shards for the locking index within OpenSearch are not assigned to a new primary after the former primary is removed. This was also reported in #243 and will be further investigated in #327.
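For illustration, a minimal sketch of the storage-reuse flow from the first test case above, driving the juju CLI from Python. This is not the repository's test code; the application name, unit number, and storage id are placeholders.

```python
# Hypothetical sketch of re-using detached storage on a new unit.
# Application name, unit number and storage id are placeholders.
import subprocess

APP = "opensearch"                 # assumed application name
STORAGE_ID = "opensearch-data/1"   # assumed storage id of the removed unit


def juju(*args: str) -> str:
    """Run a juju CLI command and return its stdout."""
    return subprocess.run(
        ["juju", *args], check=True, capture_output=True, text=True
    ).stdout


# 1. Remove one unit; its storage is detached rather than destroyed
#    (assuming the default juju behavior of not passing --destroy-storage).
juju("remove-unit", f"{APP}/1")

# 2. Add a new unit that re-uses the detached storage volume.
juju("add-unit", APP, "--attach-storage", STORAGE_ID)

# 3. The test would then verify the continuous writes and that the
#    initially created test file is still present on the re-attached volume.
```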
This could be linked to #324.
Suggested resolution to this: when nodes are removed from the peer relation and only two nodes remain, one of them should be added to the voting exclusions in order to avoid split-brain situations.
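As an illustration of that suggested resolution (not the charm's actual implementation), a sketch of adding one of the two remaining nodes to the voting exclusions via OpenSearch's REST API; the host, credentials, and node name are placeholders:

```python
# Sketch only: exclude one of the two remaining nodes from voting so the
# other node can keep a quorum on its own. Host/credentials are placeholders.
import requests

HOST = "https://10.19.29.219:9200"   # assumed unit address
AUTH = ("admin", "admin-password")   # assumed charm-managed credentials
NODE = "opensearch-1.e42"            # node to exclude from voting

# Add the node to the cluster's voting configuration exclusions.
resp = requests.post(
    f"{HOST}/_cluster/voting_config_exclusions",
    params={"node_names": NODE, "timeout": "30s"},
    auth=AUTH,
    verify=False,  # self-signed certs in test environments; adjust as needed
)
resp.raise_for_status()

# Once the cluster is scaled back up, the exclusions should be cleared again.
requests.delete(
    f"{HOST}/_cluster/voting_config_exclusions",
    params={"wait_for_removal": "false"},
    auth=AUTH,
    verify=False,
)
```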
reneradoi added a commit that referenced this issue on Aug 9, 2024:
## Issue

This PR addresses the issues #327 and #324. When OpenSearch is in the process of shutting down, the operator currently does not wait for data to be moved away from the stopping unit. This may result in shards not being assigned and could cause loss of data. In cases where the index `.charm_node_lock` is impacted, the operator can no longer acquire the lock to start or stop OpenSearch. This results in `503` errors in the log file.

The behavior can be seen in this CI run, with some additional logging information added for debugging: https://github.com/canonical/opensearch-operator/actions/runs/10269420444/job/28415611890?pr=387

```
Shards before relocation: [...{'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '1', 'store': '8.7kb', 'ip': '10.19.29.239', 'node': 'opensearch-1.e42'}]

Shards after relocation: [... {'index': '.opendistro_security', 'shard': '0', 'prirep': 'p', 'state': 'RELOCATING', 'docs': '10', 'store': '61.5kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42 -> 10.19.29.239 yt3jiuSZRTCY8NnoeIni5w opensearch-1.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}]
```

Shortly after, the error appears:

```
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: https://10.19.29.239:9200 "GET /.charm_node_lock/_source/0 HTTP/1.1" 503 287
unit-opensearch-0: 16:25:11 ERROR unit.opensearch/0.juju-log opensearch-peers:1: Error checking which unit has OpenSearch lock
Traceback (most recent call last):
  ...
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 263, in acquired
    unit = self._unit_with_lock(host)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 225, in _unit_with_lock
    document_data = self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 306, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503 self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: Lock to start opensearch not acquired. Will retry next event
```

## Solution

When stopping OpenSearch, the operator should wait for the shard relocation to be completed. This should happen right after adding the currently stopping unit to the allocation exclusions. The check should be blocking, meaning that OpenSearch must not stop until the relocation is finished. This will look something like this:

```
unit-opensearch-0: 10:07:28 DEBUG unit.opensearch/0.juju-log Shards before relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.228', 'node': 'opensearch-0.ccc'}, ...]
[...]
unit-opensearch-0: 10:07:32 DEBUG unit.opensearch/0.juju-log Shards after relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.28', 'node': 'opensearch-1.ccc'} ...]
```

To check whether there are still moving shards, the `_cluster/health` API can be queried for `"relocating_shards"`. If this is not `0`, the stop process should be halted. Depending on the amount of data, this can take quite some time; a maximum waiting time of 15 minutes has been added, after which an error is raised.
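A minimal sketch of such a blocking wait, assuming direct REST access with placeholder host and credentials (this is not the operator's actual code; it only relies on the `relocating_shards` field of `_cluster/health` mentioned above):

```python
# Sketch: block until the cluster reports no relocating shards, or time out
# after 15 minutes as described in the PR. Host/credentials are placeholders.
import time

import requests

HOST = "https://10.19.29.219:9200"  # assumed unit address
AUTH = ("admin", "admin-password")  # assumed charm-managed credentials
MAX_WAIT = 15 * 60                  # 15-minute cap, as in the PR description
POLL_INTERVAL = 10                  # seconds between checks (assumed)


def wait_for_shard_relocation() -> None:
    """Block until _cluster/health reports relocating_shards == 0."""
    deadline = time.monotonic() + MAX_WAIT
    while time.monotonic() < deadline:
        health = requests.get(
            f"{HOST}/_cluster/health", auth=AUTH, verify=False
        ).json()
        if health.get("relocating_shards", 0) == 0:
            return  # safe to continue stopping the unit
        time.sleep(POLL_INTERVAL)
    raise TimeoutError("Shard relocation did not finish within 15 minutes")
```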
Was fixed with #387.
Steps to reproduce
Remove the primary unit so that only two units are left. The shards then go into the `UNASSIGNED` state and don't get assigned anymore.
Expected behavior
If there are only two units left, one should become the primary for the unassigned shards.
Actual behavior
No unit takes over the primary role for the unassigned shards, so in particular the index `.charm_node_lock` may stay unavailable and no further locks for scaling down (or up) can be acquired.
Log output
Juju debug log:
Additional context
Also see this issue, where some investigations and workarounds are discussed.