Shards don't get assigned when the primary gets removed and only two units are left #327

Closed
reneradoi opened this issue Jun 11, 2024 · 4 comments
Labels: 24.10-2/beta (Bugs targeted to be fixed for 2/beta on 24.10), bug (Something isn't working)

Comments

@reneradoi
Contributor

Steps to reproduce

  • start a cluster with three nodes
  • remove the application
  • after the first unit is removed, occasionally (e.g. when the application leader is removed first) some shards end up in UNASSIGNED state and never get assigned again (see the check sketched after this list)
  • the application does not scale down any further because the remaining units can't acquire the OpenSearch lock (this is currently worked around with this patch)
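
For reference, the unassigned shards can be inspected with the `_cat/shards` API; a minimal sketch (host and credentials are placeholders, not part of the charm):

```python
# Minimal sketch (not charm code): list the shards of the lock index and flag
# any that are UNASSIGNED. Host and credentials are placeholders.
import requests

HOST = "https://192.168.235.252:9200"  # address taken from the log output below

resp = requests.get(
    f"{HOST}/_cat/shards/.charm_node_lock",
    params={"format": "json", "h": "index,shard,prirep,state,node"},
    auth=("admin", "<password>"),  # placeholder credentials
    verify=False,  # illustration only; the charm verifies TLS in practice
)
resp.raise_for_status()
for shard in resp.json():
    if shard["state"] == "UNASSIGNED":
        print(f"unassigned: shard {shard['shard']} ({shard['prirep']}) of {shard['index']}")
```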

Expected behavior

If there are only two units left, one of them should take over the primary role for the unassigned shards.

Actual behavior

No unit takes over the primary role for the unassigned shards; in particular, the index .charm_node_lock may stay unavailable, so no further locks for scaling down (or up) can be acquired.
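
The reason the cluster refuses to allocate the primary can be queried via the `_cluster/allocation/explain` API; a minimal sketch, with host and credentials as placeholders:

```python
# Minimal sketch: ask the cluster why the primary of .charm_node_lock shard 0
# is not allocated. Host and credentials are placeholders.
import requests

HOST = "https://192.168.235.252:9200"

resp = requests.post(
    f"{HOST}/_cluster/allocation/explain",
    json={"index": ".charm_node_lock", "shard": 0, "primary": True},
    auth=("admin", "<password>"),  # placeholder credentials
    verify=False,  # illustration only
)
resp.raise_for_status()
explanation = resp.json()
# For unassigned shards these fields explain why allocation is refused.
print(explanation.get("allocate_explanation"), explanation.get("unassigned_info"))
```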

Log output

Juju debug log:

```
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 272, in request
    resp = call(urls[0])
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 224, in call
    for attempt in Retrying(
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 347, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 251, in call
    response.raise_for_status()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://192.168.235.252:9200/.charm_node_lock/_source/0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-1/charm/./src/charm.py", line 94, in <module>
    main(OpenSearchOperatorCharm)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 509, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 143, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 467, in _on_opensearch_data_storage_detaching
    self.node_lock.release()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 327, in release
    if self._unit_with_lock(host) == self._charm.unit.name:
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 199, in _unit_with_lock
    document_data = self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 284, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503
self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}
```

Additional context

Also see this issue, where some investigations and workarounds are discussed.

@reneradoi reneradoi added the bug (Something isn't working) label on Jun 11, 2024

reneradoi added a commit that referenced this issue Jun 11, 2024
## Issue
When attaching existing storage to a new unit, two issues occur:

- The snap install fails because of permissions / ownership of directories
- snap_common gets completely deleted

## Solution
- bump the snap version to the fixed revision (revision 47; this is already
outdated, as a newer snap revision was merged to main prior to this PR)
- enhance test coverage for integration tests

## Integration Testing
Tests for attaching existing storage can be found in
integration/ha/test_storage.py. There are now three test cases:
1. test_storage_reuse_after_scale_down: remove one unit from the
deployment, then add a new one re-using the storage from the removed
unit. Check that the continuous writes are ok and that a test file
created initially is still there.
2. test_storage_reuse_after_scale_to_zero: remove both units from the
deployment while keeping the application, then add two new units
re-using the storage. Check the continuous writes.
3. test_storage_reuse_in_new_cluster_after_app_removal: from a cluster
of three units, remove all of them and remove the application. Deploy a
new application (with one unit) to the same model, attach the storage,
then add two more units with the other storage volumes. Check the
continuous writes.

## Other Issues
- As part of this PR, another issue is addressed:
#306. It is
resolved with this commit:
19f843c
- Furthermore, problems with acquiring the OpenSearch lock are worked around in this PR, especially the case where the shards of the locking index are not assigned to a new primary when the former primary is removed. This was also reported in #243 and will be investigated further in #327.
@phvalguima
Contributor

This could be linked to #324

@reneradoi
Contributor Author

Suggested resolution: when nodes are removed from the peer relation and only two nodes remain, one of them should be added to the voting exclusions to avoid split-brain situations. A sketch of this is shown below.
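
A minimal sketch of what adding a node to the voting exclusions looks like against the cluster API (host, credentials and node name are placeholders; this is not the charm's implementation):

```python
# Minimal sketch of the suggested resolution (not the charm's implementation):
# exclude one of the two remaining nodes from voting to keep an unambiguous
# quorum. Host, credentials and node name are placeholders.
import requests

HOST = "https://192.168.235.252:9200"
NODE_TO_EXCLUDE = "opensearch-1"  # hypothetical node name

# Add the node to the voting configuration exclusions.
requests.post(
    f"{HOST}/_cluster/voting_config_exclusions",
    params={"node_names": NODE_TO_EXCLUDE, "timeout": "30s"},
    auth=("admin", "<password>"),  # placeholder credentials
    verify=False,  # illustration only
).raise_for_status()

# Once the topology is stable again, the exclusions have to be cleared,
# otherwise future scaling of the voting configuration is blocked.
requests.delete(
    f"{HOST}/_cluster/voting_config_exclusions",
    params={"wait_for_removal": "false"},
    auth=("admin", "<password>"),
    verify=False,
).raise_for_status()
```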

reneradoi added a commit that referenced this issue Aug 9, 2024
## Issue
This PR addresses the issues
#327 and
#324.

When OpenSearch is in the process of shutting down, the operator
currently does not wait for data to be moved away from the stopping
unit. This may result in shards not being assigned and could cause data
loss. In cases where the index `.charm_node_lock` is impacted, the
operator can no longer acquire the lock to start or stop OpenSearch,
which results in `503` errors in the log.

The behavior can be seen in this CI run, with some additional logging
information added for debugging:
https://github.com/canonical/opensearch-operator/actions/runs/10269420444/job/28415611890?pr=387
```
Shards before relocation: [...{'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '1', 'store': '8.7kb', 'ip': '10.19.29.239', 'node': 'opensearch-1.e42'}]

Shards after relocation: [... {'index': '.opendistro_security', 'shard': '0', 'prirep': 'p', 'state': 'RELOCATING', 'docs': '10', 'store': '61.5kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42 -> 10.19.29.239 yt3jiuSZRTCY8NnoeIni5w opensearch-1.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}]
```

Shortly after, the error appears:
```
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: https://10.19.29.239:9200 "GET /.charm_node_lock/_source/0 HTTP/11" 503 287
unit-opensearch-0: 16:25:11 ERROR unit.opensearch/0.juju-log opensearch-peers:1: Error checking which unit has OpenSearch lock
Traceback (most recent call last):
...
File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 263, in acquired
    unit = self._unit_with_lock(host)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 225, in _unit_with_lock
    document_data = self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 306, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503
self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: Lock to start opensearch not acquired. Will retry next event
```

## Solution
When stopping OpenSearch, the operator should wait for the shard
relocation to complete. This should happen right after adding the
currently stopping unit to the allocation exclusions. The check should
be blocking, meaning that OpenSearch must not stop until the relocation
is finished.

This will look something like this:
```
unit-opensearch-0: 10:07:28 DEBUG unit.opensearch/0.juju-log Shards before relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.228', 'node': 'opensearch-0.ccc'}, ...]

[...]

unit-opensearch-0: 10:07:32 DEBUG unit.opensearch/0.juju-log Shards after relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.28', 'node': 'opensearch-1.ccc'} ...]
```
To check whether shards are still being moved, the `_cluster/health` API
can be queried for `"relocating_shards"`. If this is not `0`, the stop
process is halted. Depending on the amount of data, this can take quite
some time, so a maximum waiting time of 15 minutes has been added, after
which an error is raised. A sketch of such a wait is shown below.
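
A minimal sketch of such a blocking wait, using tenacity (which the charm already vendors); host, credentials and the poll interval are assumptions, and this is not the code that was merged:

```python
# Minimal sketch of the blocking wait described above (not the merged fix):
# poll _cluster/health until no shards are relocating, for at most 15 minutes.
# Host, credentials and the poll interval are placeholders.
import requests
from tenacity import retry, retry_if_result, stop_after_delay, wait_fixed

HOST = "https://192.168.235.252:9200"

@retry(
    retry=retry_if_result(lambda relocating: relocating != 0),  # retry while shards move
    stop=stop_after_delay(15 * 60),  # give up after 15 minutes
    wait=wait_fixed(10),  # poll every 10 seconds
)
def wait_for_relocation() -> int:
    resp = requests.get(
        f"{HOST}/_cluster/health",
        auth=("admin", "<password>"),  # placeholder credentials
        verify=False,  # illustration only
    )
    resp.raise_for_status()
    return resp.json()["relocating_shards"]

wait_for_relocation()  # raises tenacity.RetryError if shards still move after 15 minutes
```
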
@reneradoi
Contributor Author

Was fixed with #387.
