[DPE-4931] fix locking with unassigned shards (#387)
## Issue

This PR addresses issues #327 and #324.

When OpenSearch is in the process of shutting down, the operator currently does not wait for data to be moved away from the stopping unit. This may leave shards unassigned and could cause data loss. If the `.charm_node_lock` index is affected, the operator can no longer acquire the lock to start or stop OpenSearch, which shows up as `503` errors in the log.

The behavior can be seen in this CI run, with some additional logging added for debugging: https://github.com/canonical/opensearch-operator/actions/runs/10269420444/job/28415611890?pr=387

```
Shards before relocation: [...{'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '1', 'store': '8.7kb', 'ip': '10.19.29.239', 'node': 'opensearch-1.e42'}]

Shards after relocation: [... {'index': '.opendistro_security', 'shard': '0', 'prirep': 'p', 'state': 'RELOCATING', 'docs': '10', 'store': '61.5kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42 -> 10.19.29.239 yt3jiuSZRTCY8NnoeIni5w opensearch-1.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}]
```

Shortly after, the error appears:

```
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: https://10.19.29.239:9200 "GET /.charm_node_lock/_source/0 HTTP/11" 503 287
unit-opensearch-0: 16:25:11 ERROR unit.opensearch/0.juju-log opensearch-peers:1: Error checking which unit has OpenSearch lock
Traceback (most recent call last):
  ...
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 263, in acquired
    unit = self._unit_with_lock(host)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 225, in _unit_with_lock
    document_data = self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 306, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503 self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: Lock to start opensearch not acquired. Will retry next event
```

## Solution

When stopping OpenSearch, the operator should wait for shard relocation to complete. This should happen right after the stopping unit is added to the allocation exclusions. The check must be blocking: OpenSearch must not stop until relocation has finished. The result looks like this:

```
unit-opensearch-0: 10:07:28 DEBUG unit.opensearch/0.juju-log Shards before relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.228', 'node': 'opensearch-0.ccc'}, ...]
[...]
unit-opensearch-0: 10:07:32 DEBUG unit.opensearch/0.juju-log Shards after relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.28', 'node': 'opensearch-1.ccc'} ...]
```

To check whether shards are still moving, the `_cluster/health` API can be queried for `"relocating_shards"`. If this is not `0`, the stop process is halted. Depending on the amount of data, relocation can take quite some time, so a maximum waiting time of 15 minutes has been added; after that, an error is raised.
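For illustration, here is a minimal sketch of that blocking wait, assuming a plain `requests` session against the cluster's HTTPS endpoint. The function names (`exclude_node_from_allocation`, `wait_for_shard_relocation`), the 10-second polling interval, and the use of transient cluster settings are assumptions for this example, not the operator's actual helpers.

```python
# Illustrative sketch only -- helper names, the polling interval, and the use
# of `requests` are assumptions, not the charm's actual implementation.
import time

import requests

RELOCATION_TIMEOUT = 15 * 60  # cap the blocking wait at 15 minutes


def exclude_node_from_allocation(base_url: str, node_name: str, session: requests.Session) -> None:
    """Ask the cluster to move all shards away from the stopping node."""
    resp = session.put(
        f"{base_url}/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.exclude._name": node_name}},
    )
    resp.raise_for_status()


def wait_for_shard_relocation(base_url: str, session: requests.Session) -> None:
    """Block until `_cluster/health` reports zero relocating shards."""
    deadline = time.monotonic() + RELOCATION_TIMEOUT
    while True:
        health = session.get(f"{base_url}/_cluster/health").json()
        if health.get("relocating_shards", 0) == 0:
            return  # safe to stop the service now
        if time.monotonic() >= deadline:
            raise TimeoutError("shards were still relocating after 15 minutes")
        time.sleep(10)  # poll again shortly
```

In the charm itself the wait sits on the stop path, right after the unit is added to the allocation exclusions, so the service only shuts down once `relocating_shards` drops to `0`.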