One or more replica shards... #324

Closed
juditnovak opened this issue Jun 7, 2024 · 9 comments · Fixed by #319
Labels: bug (Something isn't working)

Comments

@juditnovak
Contributor

Steps to reproduce

  1. A particularly loaded host system
  2. Check out opensearch-dashboards-operator and run the pipeline:
     `tox run -e integration -- tests/integration/test_upgrade.py --model testing --keep-models`

Expected behavior

No errors

Actual behavior

See the attached screenshots. The problem was permanent; the system did not recover (as the timestamps at the top indicate).

Screenshot from 2024-06-07 15-31-41

Screenshot from 2024-06-07 15-45-11

Versions

Operating system: jammy

Juju CLI: 3.1.8-genericlinux-amd64

Juju agent: 3.1.8

Charm revision: most likely 90 or 99 (in case caching is applied on Charmhub, 98 is possible too)

LXD: 5.0.3 (?)

Log output

Screenshot from 2024-06-07 15-47-50
Screenshot from 2024-06-07 15-44-24
Screenshot from 2024-06-07 15-41-58
Screenshot from 2024-06-07 15-39-24
Screenshot from 2024-06-07 15-36-06
Screenshot from 2024-06-07 15-35-13
Screenshot from 2024-06-07 15-33-37
Screenshot from 2024-06-07 15-21-01

Additional context

juditnovak added the bug (Something isn't working) label on Jun 7, 2024

@phvalguima
Contributor

I am seeing the same problem with upgrades. I believe this is caused by GH runner disk usage and OpenSearch's disk watermark threshold when allocating unassigned shards. Check this comment: #319 (comment)
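
For anyone wanting to verify the disk-usage suspicion, here is a rough sketch using the standard `_cat/allocation` and cluster settings APIs (not charm code; `<IP>` and `<PWD>` are placeholders for a unit address and the admin password):

```python
# Sketch: inspect per-node disk usage and the configured watermark thresholds.
# verify=False mirrors curl's -k flag for the self-signed certificates.
import requests

host = "https://<IP>:9200"   # placeholder: a unit IP
auth = ("admin", "<PWD>")    # placeholder: admin password

# Disk usage per node. OpenSearch stops assigning new shards to nodes above
# the low watermark and relocates shards away above the high watermark.
allocation = requests.get(f"{host}/_cat/allocation?format=json", auth=auth, verify=False)
print(allocation.json())

# Effective watermark settings (defaults included, flattened keys).
settings = requests.get(
    f"{host}/_cluster/settings?include_defaults=true&flat_settings=true",
    auth=auth,
    verify=False,
).json()
for section in ("persistent", "transient", "defaults"):
    for key, value in settings.get(section, {}).items():
        if "watermark" in key:
            print(section, key, value)
```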

@phvalguima
Contributor

Sorry, the merge above should not have closed this issue. I want to investigate it further.

@phvalguima
Contributor

Hi @juditnovak, I tried this test scenario twice and cannot reproduce it on my own machine. If you are able to reproduce it, can you provide two pieces of information:

  1. Shard status: `curl -sk -u admin:<PWD> https://<IP>:9200/_cat/shards`
  2. Cluster allocation explain, especially for any unassigned shards seen above: `curl -XGET -H 'Content-Type: application/json' -sk -u admin:<PWD> https://<IP>:9200/_cluster/allocation/explain -d '{ "index": "TARGET_INDEX" }'`
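
For convenience, the same two diagnostics as a small Python sketch, in case it is easier to run from a script (placeholders for host and password; it simply mirrors the curl calls above):

```python
# Sketch of the two diagnostic calls requested above. <IP> and <PWD> are
# placeholders; verify=False mirrors curl's -k flag.
import requests

host = "https://<IP>:9200"
auth = ("admin", "<PWD>")

# 1. Shard status
shards = requests.get(f"{host}/_cat/shards?format=json", auth=auth, verify=False)
print(shards.json())

# 2. Allocation explanation for a specific (e.g. unassigned) index
explain = requests.get(
    f"{host}/_cluster/allocation/explain",
    json={"index": "TARGET_INDEX"},  # replace with an index from the output above
    auth=auth,
    verify=False,
)
print(explain.json())
```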

@juditnovak
Contributor Author

Sure, I'll totally do that. I foresee running similar pipelines locally quite a bit, so we can confirm whether the issue occurs again.

@phvalguima
Contributor

Thanks @juditnovak. Let's leave this issue open for now, so we can come back here if we ever see this same issue happening somewhere.

juditnovak added a commit to canonical/opensearch-dashboards-operator that referenced this issue Aug 2, 2024
juditnovak added a commit to canonical/opensearch-dashboards-operator that referenced this issue Aug 2, 2024
@juditnovak
Contributor Author

Even worse... it's happening for 3-unit installations :-( (latest revision still 120)

https://github.com/canonical/opensearch-dashboards-operator/actions/runs/10212627790/job/28256884240#step:26:112

juditnovak added a commit to canonical/opensearch-dashboards-operator that referenced this issue Aug 2, 2024
reneradoi added a commit that referenced this issue Aug 9, 2024
## Issue
This PR addresses issues #327 and #324.

When Opensearch is in the process of shutting down, the operator
currently does not wait for data to be moved away from the stopping
unit. This may result in shards not being assigned and could cause loss
of data. In cases where the index `.charm_node_lock` is impacted, the
operator can no longer acquire the lock to start or stop Opensearch.
This will result in `503` errors in the logfile.

The behavior can be seen in this CI run, with some additional logging
information added for debugging:
https://github.com/canonical/opensearch-operator/actions/runs/10269420444/job/28415611890?pr=387
```
Shards before relocation: [...{'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '1', 'store': '8.7kb', 'ip': '10.19.29.239', 'node': 'opensearch-1.e42'}]

Shards after relocation: [... {'index': '.opendistro_security', 'shard': '0', 'prirep': 'p', 'state': 'RELOCATING', 'docs': '10', 'store': '61.5kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42 -> 10.19.29.239 yt3jiuSZRTCY8NnoeIni5w opensearch-1.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}]
```

Shortly after, the error appears:
```
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: https://10.19.29.239:9200 "GET /.charm_node_lock/_source/0 HTTP/11" 503 287
unit-opensearch-0: 16:25:11 ERROR unit.opensearch/0.juju-log opensearch-peers:1: Error checking which unit has OpenSearch lock
Traceback (most recent call last):
...
File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 263, in acquired
    unit = self._unit_with_lock(host)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 225, in _unit_with_lock
    document_data = self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 306, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503
self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: Lock to start opensearch not acquired. Will retry next event
```

## Solution
When stopping Opensearch, the operator should wait for shard relocation to complete. This should happen right after adding the currently stopping unit to the allocation exclusions (see the sketch below). The check should be blocking, meaning that Opensearch must not stop until the relocation is finished.
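
For illustration, a rough sketch of what adding the stopping node to the allocation exclusions looks like against the standard OpenSearch cluster settings API (the charm wraps this in its own helpers; host and credentials below are placeholders, and the node name is taken from the logs above only as an example):

```python
# Sketch: exclude the stopping node from shard allocation so OpenSearch
# starts relocating its shards to the remaining nodes. Placeholders only;
# not the charm's actual implementation.
import requests

host = "https://<IP>:9200"
auth = ("admin", "<PWD>")

requests.put(
    f"{host}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._name": "opensearch-0.e42"}},
    auth=auth,
    verify=False,
)
```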

In the unit logs, the relocation will look something like this:
```
unit-opensearch-0: 10:07:28 DEBUG unit.opensearch/0.juju-log Shards before relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.228', 'node': 'opensearch-0.ccc'}, ...]

[...]

unit-opensearch-0: 10:07:32 DEBUG unit.opensearch/0.juju-log Shards after relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.28', 'node': 'opensearch-1.ccc'} ...]
```
To check if there are still shards being moved, the `_cluster/health` API can be queried for `"relocating_shards"`. If this is not `0`, the stopping process should be halted. Depending on the amount of data, this can take quite some time; a reasonable maximum waiting time of 15 minutes has been added, after which an error will be raised.
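
A minimal sketch of that blocking wait, polling `_cluster/health` for `relocating_shards` with a 15-minute cap (placeholders for host and credentials; the charm's real implementation lives in its own libraries and may differ):

```python
# Sketch: block until shard relocation has finished, or give up after 15 minutes.
import time

import requests

host = "https://<IP>:9200"
auth = ("admin", "<PWD>")
DEADLINE = time.time() + 15 * 60  # 15-minute maximum waiting time

while True:
    health = requests.get(f"{host}/_cluster/health", auth=auth, verify=False).json()
    if health.get("relocating_shards", 0) == 0:
        break  # no shards moving any more: safe to stop OpenSearch on this unit
    if time.time() > DEADLINE:
        raise TimeoutError("shard relocation did not complete within 15 minutes")
    time.sleep(10)  # poll every 10 seconds
```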
@reneradoi
Contributor

This was fixed with #387: the operator will now wait for all shards to be moved to other nodes before shutting down Opensearch.
