Shards don't get assigned when the primary gets removed and only two units are left #327

Closed
reneradoi opened this issue Jun 11, 2024 · 4 comments
Labels: 24.10-2/beta (Bugs targeted to be fixed for 2/beta on 24.10), bug (Something isn't working)

Comments

@reneradoi
Contributor

Steps to reproduce

  • start a cluster with three nodes
  • remove the application
  • after the first unit is removed, occasionally (e.g. when the application leader is removed first) some shards end up in UNASSIGNED state and never get assigned again (see the check sketched after this list)
  • the application does not scale down any further because the remaining units can't acquire the OpenSearch lock (this is currently worked around with this patch)
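
For reference, the unassigned shards can be inspected with the `_cat/shards` API; a minimal sketch (host and credentials are placeholders, not part of the charm):

```python
# Minimal sketch (not charm code): list the shards of the lock index and flag
# any that are UNASSIGNED. Host and credentials are placeholders.
import requests

HOST = "https://192.168.235.252:9200"  # address taken from the log output below

resp = requests.get(
    f"{HOST}/_cat/shards/.charm_node_lock",
    params={"format": "json", "h": "index,shard,prirep,state,node"},
    auth=("admin", "<password>"),  # placeholder credentials
    verify=False,  # illustration only; the charm verifies TLS in practice
)
resp.raise_for_status()
for shard in resp.json():
    if shard["state"] == "UNASSIGNED":
        print(f"unassigned: shard {shard['shard']} ({shard['prirep']}) of {shard['index']}")
```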

Expected behavior

If there are only two units left, one of them should take over the primary role for the unassigned shards.

Actual behavior

No unit takes over the primary role for the unassigned shards; in particular, the index .charm_node_lock may stay unavailable, so no further locks for scaling down (or up) can be acquired.
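
The reason the cluster refuses to allocate the primary can be queried via the `_cluster/allocation/explain` API; a minimal sketch, with host and credentials as placeholders:

```python
# Minimal sketch: ask the cluster why the primary of .charm_node_lock shard 0
# is not allocated. Host and credentials are placeholders.
import requests

HOST = "https://192.168.235.252:9200"

resp = requests.post(
    f"{HOST}/_cluster/allocation/explain",
    json={"index": ".charm_node_lock", "shard": 0, "primary": True},
    auth=("admin", "<password>"),  # placeholder credentials
    verify=False,  # illustration only
)
resp.raise_for_status()
explanation = resp.json()
# For unassigned shards these fields explain why allocation is refused.
print(explanation.get("allocate_explanation"), explanation.get("unassigned_info"))
```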

Log output

Juju debug log:

```
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 272, in request
    resp = call(urls[0])
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 224, in call
    for attempt in Retrying(
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 347, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 251, in call
    response.raise_for_status()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://192.168.235.252:9200/.charm_node_lock/_source/0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-1/charm/./src/charm.py", line 94, in <module>
    main(OpenSearchOperatorCharm)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 509, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/main.py", line 143, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 467, in _on_opensearch_data_storage_detaching
    self.node_lock.release()
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 327, in release
    if self._unit_with_lock(host) == self._charm.unit.name:
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 199, in _unit_with_lock
    document_data = self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-1/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 284, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503
self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}
```

Additional context

Also see this issue, where some investigations and workarounds are discussed.

@reneradoi reneradoi added the bug (Something isn't working) label on Jun 11, 2024

reneradoi added a commit that referenced this issue Jun 11, 2024
## Issue
When attaching existing storage to a new unit, two issues occur:

- The snap install fails because of permissions / ownership of directories
- snap_common gets completely deleted

## Solution
- bump the snap version to the fixed revision (revision 47; this is already
outdated, as a newer snap revision was merged to main prior to this PR)
- enhance test coverage for integration tests

## Integration Testing
Tests for attaching existing storage can be found in
integration/ha/test_storage.py. There are now three test cases:
1. test_storage_reuse_after_scale_down: remove one unit from the
deployment, then add a new one re-using the storage from the removed
unit. Check that the continuous writes are ok and that a test file
created initially is still there.
2. test_storage_reuse_after_scale_to_zero: remove both units from the
deployment while keeping the application, then add two new units
re-using the storage. Check the continuous writes.
3. test_storage_reuse_in_new_cluster_after_app_removal: from a cluster
of three units, remove all of them and remove the application. Deploy a
new application (with one unit) to the same model, attach the storage,
then add two more units with the other storage volumes. Check the
continuous writes.

## Other Issues
- As part of this PR, another issue is addressed:
#306. It is
resolved with this commit:
19f843c
- Furthermore, problems with acquiring the OpenSearch lock are worked around in this PR, especially the case where the shards of the locking index are not assigned to a new primary when the former primary is removed. This was also reported in #243 and will be investigated further in #327.
@phvalguima
Contributor

This could be linked to #324

@reneradoi
Contributor Author

Suggested resolution: when nodes are removed from the peer relation and only two nodes remain, one of them should be added to the voting exclusions to avoid split-brain situations. A sketch of this is shown below.
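
A minimal sketch of what adding a node to the voting exclusions looks like against the cluster API (host, credentials and node name are placeholders; this is not the charm's implementation):

```python
# Minimal sketch of the suggested resolution (not the charm's implementation):
# exclude one of the two remaining nodes from voting to keep an unambiguous
# quorum. Host, credentials and node name are placeholders.
import requests

HOST = "https://192.168.235.252:9200"
NODE_TO_EXCLUDE = "opensearch-1"  # hypothetical node name

# Add the node to the voting configuration exclusions.
requests.post(
    f"{HOST}/_cluster/voting_config_exclusions",
    params={"node_names": NODE_TO_EXCLUDE, "timeout": "30s"},
    auth=("admin", "<password>"),  # placeholder credentials
    verify=False,  # illustration only
).raise_for_status()

# Once the topology is stable again, the exclusions have to be cleared,
# otherwise future scaling of the voting configuration is blocked.
requests.delete(
    f"{HOST}/_cluster/voting_config_exclusions",
    params={"wait_for_removal": "false"},
    auth=("admin", "<password>"),
    verify=False,
).raise_for_status()
```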

reneradoi added a commit that referenced this issue Aug 9, 2024
## Issue
This PR addresses the issues
#327 and
#324.

When OpenSearch is in the process of shutting down, the operator
currently does not wait for data to be moved away from the stopping
unit. This may result in shards not being assigned and could cause data
loss. In cases where the index `.charm_node_lock` is impacted, the
operator can no longer acquire the lock to start or stop OpenSearch,
which results in `503` errors in the log.

The behavior can be seen in this CI run, with some additional logging
information added for debugging:
https://github.com/canonical/opensearch-operator/actions/runs/10269420444/job/28415611890?pr=387
```
Shards before relocation: [...{'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '1', 'store': '8.7kb', 'ip': '10.19.29.239', 'node': 'opensearch-1.e42'}]

Shards after relocation: [... {'index': '.opendistro_security', 'shard': '0', 'prirep': 'p', 'state': 'RELOCATING', 'docs': '10', 'store': '61.5kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42 -> 10.19.29.239 yt3jiuSZRTCY8NnoeIni5w opensearch-1.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}]
```

Shortly after, the error appears:
```
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: https://10.19.29.239:9200 "GET /.charm_node_lock/_source/0 HTTP/11" 503 287
unit-opensearch-0: 16:25:11 ERROR unit.opensearch/0.juju-log opensearch-peers:1: Error checking which unit has OpenSearch lock
Traceback (most recent call last):
...
File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 263, in acquired
    unit = self._unit_with_lock(host)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 225, in _unit_with_lock
    document_data = self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 306, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503
self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: Lock to start opensearch not acquired. Will retry next event
```

## Solution
When stopping OpenSearch, the operator should wait for the shard
relocation to complete. This should happen right after adding the
currently stopping unit to the allocation exclusions. The check should
be blocking, meaning that OpenSearch must not stop until the relocation
is finished.

This will look something like this:
```
unit-opensearch-0: 10:07:28 DEBUG unit.opensearch/0.juju-log Shards before relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.228', 'node': 'opensearch-0.ccc'}, ...]

[...]

unit-opensearch-0: 10:07:32 DEBUG unit.opensearch/0.juju-log Shards after relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.28', 'node': 'opensearch-1.ccc'} ...]
```
To check whether shards are still being moved, the `_cluster/health` API
can be queried for `"relocating_shards"`. If this is not `0`, the stop
process is halted. Depending on the amount of data, this can take quite
some time, so a maximum waiting time of 15 minutes has been added, after
which an error is raised. A sketch of such a wait is shown below.
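
A minimal sketch of such a blocking wait, using tenacity (which the charm already vendors); host, credentials and the poll interval are assumptions, and this is not the code that was merged:

```python
# Minimal sketch of the blocking wait described above (not the merged fix):
# poll _cluster/health until no shards are relocating, for at most 15 minutes.
# Host, credentials and the poll interval are placeholders.
import requests
from tenacity import retry, retry_if_result, stop_after_delay, wait_fixed

HOST = "https://192.168.235.252:9200"

@retry(
    retry=retry_if_result(lambda relocating: relocating != 0),  # retry while shards move
    stop=stop_after_delay(15 * 60),  # give up after 15 minutes
    wait=wait_fixed(10),  # poll every 10 seconds
)
def wait_for_relocation() -> int:
    resp = requests.get(
        f"{HOST}/_cluster/health",
        auth=("admin", "<password>"),  # placeholder credentials
        verify=False,  # illustration only
    )
    resp.raise_for_status()
    return resp.json()["relocating_shards"]

wait_for_relocation()  # raises tenacity.RetryError if shards still move after 15 minutes
```
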
@reneradoi
Contributor Author

Was fixed with #387.
