One or more replica shards... #324

Closed
juditnovak opened this issue Jun 7, 2024 · 9 comments · Fixed by #319
Labels: bug (Something isn't working)

Comments

@juditnovak
Contributor

Steps to reproduce

  1. A particularly loaded host system
  2. Check out opensearch-dashboards-operator and run the pipeline:
     `tox run -e integration -- tests/integration/test_upgrade.py --model testing --keep-models`

Expected behavior

No errors

Actual behavior

See the attached screenshots. The problem was permanent; the system did not recover (as the timestamps at the top indicate).

Screenshot from 2024-06-07 15-31-41

Screenshot from 2024-06-07 15-45-11

Versions

Operating system: jammy

Juju CLI: 3.1.8-genericlinux-amd64

Juju agent: 3.1.8

Charm revision: most likely 90 or 99 (in case caching is applied on Charmhub, 98 is possible too)

LXD: 5.0.3 (?)

Log output

Screenshot from 2024-06-07 15-47-50
Screenshot from 2024-06-07 15-44-24
Screenshot from 2024-06-07 15-41-58
Screenshot from 2024-06-07 15-39-24
Screenshot from 2024-06-07 15-36-06
Screenshot from 2024-06-07 15-35-13
Screenshot from 2024-06-07 15-33-37
Screenshot from 2024-06-07 15-21-01

Additional context

juditnovak added the bug (Something isn't working) label on Jun 7, 2024

@phvalguima
Contributor

I am seeing the same problem with upgrades. I believe this is caused by GH runner disk usage and OpenSearch's disk watermark threshold when allocating unassigned shards. Check this comment: #319 (comment)
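
For anyone wanting to verify the disk-usage suspicion, here is a rough sketch using the standard `_cat/allocation` and cluster settings APIs (not charm code; `<IP>` and `<PWD>` are placeholders for a unit address and the admin password):

```python
# Sketch: inspect per-node disk usage and the configured watermark thresholds.
# verify=False mirrors curl's -k flag for the self-signed certificates.
import requests

host = "https://<IP>:9200"   # placeholder: a unit IP
auth = ("admin", "<PWD>")    # placeholder: admin password

# Disk usage per node. OpenSearch stops assigning new shards to nodes above
# the low watermark and relocates shards away above the high watermark.
allocation = requests.get(f"{host}/_cat/allocation?format=json", auth=auth, verify=False)
print(allocation.json())

# Effective watermark settings (defaults included, flattened keys).
settings = requests.get(
    f"{host}/_cluster/settings?include_defaults=true&flat_settings=true",
    auth=auth,
    verify=False,
).json()
for section in ("persistent", "transient", "defaults"):
    for key, value in settings.get(section, {}).items():
        if "watermark" in key:
            print(section, key, value)
```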

@phvalguima
Contributor

Sorry, the merge above should not have closed this issue. I want to investigate it further.

@phvalguima
Contributor

Hi @juditnovak, I tried this test scenario twice and cannot reproduce it on my own machine. If you are able to reproduce it, can you provide two pieces of information:

  1. Shard status: `curl -sk -u admin:<PWD> https://<IP>:9200/_cat/shards`
  2. Cluster allocation explain, especially for any unassigned shards seen above: `curl -XGET -H 'Content-Type: application/json' -sk -u admin:<PWD> https://<IP>:9200/_cluster/allocation/explain -d '{ "index": "TARGET_INDEX" }'`
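
For convenience, the same two diagnostics as a small Python sketch, in case it is easier to run from a script (placeholders for host and password; it simply mirrors the curl calls above):

```python
# Sketch of the two diagnostic calls requested above. <IP> and <PWD> are
# placeholders; verify=False mirrors curl's -k flag.
import requests

host = "https://<IP>:9200"
auth = ("admin", "<PWD>")

# 1. Shard status
shards = requests.get(f"{host}/_cat/shards?format=json", auth=auth, verify=False)
print(shards.json())

# 2. Allocation explanation for a specific (e.g. unassigned) index
explain = requests.get(
    f"{host}/_cluster/allocation/explain",
    json={"index": "TARGET_INDEX"},  # replace with an index from the output above
    auth=auth,
    verify=False,
)
print(explain.json())
```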

@juditnovak
Contributor Author

Sure, I'll totally do that. I foresee running similar pipelines locally quite a bit, so we can confirm whether the issue occurs again.

@phvalguima
Contributor

Thanks @juditnovak. Let's leave this issue open for now, so we can come back here if we ever see this same issue happening somewhere.

juditnovak added a commit to canonical/opensearch-dashboards-operator that referenced this issue Aug 2, 2024
juditnovak added a commit to canonical/opensearch-dashboards-operator that referenced this issue Aug 2, 2024
@juditnovak
Contributor Author

Even worse... it's happening for 3-unit installations :-( (latest revision still 120)

https://github.com/canonical/opensearch-dashboards-operator/actions/runs/10212627790/job/28256884240#step:26:112

juditnovak added a commit to canonical/opensearch-dashboards-operator that referenced this issue Aug 2, 2024
reneradoi added a commit that referenced this issue Aug 9, 2024
## Issue
This PR addresses issues #327 and #324.

When Opensearch is in the process of shutting down, the operator
currently does not wait for data to be moved away from the stopping
unit. This may result in shards not being assigned and could cause loss
of data. In cases where the index `.charm_node_lock` is impacted, the
operator can no longer acquire the lock to start or stop Opensearch.
This will result in `503` errors in the logfile.

The behavior can be seen in this CI run, with some additional logging
information added for debugging:
https://github.com/canonical/opensearch-operator/actions/runs/10269420444/job/28415611890?pr=387
```
Shards before relocation: [...{'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '1', 'store': '8.7kb', 'ip': '10.19.29.239', 'node': 'opensearch-1.e42'}]

Shards after relocation: [... {'index': '.opendistro_security', 'shard': '0', 'prirep': 'p', 'state': 'RELOCATING', 'docs': '10', 'store': '61.5kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42 -> 10.19.29.239 yt3jiuSZRTCY8NnoeIni5w opensearch-1.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}]
```

Shortly after, the error appears:
```
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: https://10.19.29.239:9200 "GET /.charm_node_lock/_source/0 HTTP/11" 503 287
unit-opensearch-0: 16:25:11 ERROR unit.opensearch/0.juju-log opensearch-peers:1: Error checking which unit has OpenSearch lock
Traceback (most recent call last):
...
File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 263, in acquired
    unit = self._unit_with_lock(host)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 225, in _unit_with_lock
    document_data = self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 306, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503
self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: Lock to start opensearch not acquired. Will retry next event
```

## Solution
When stopping Opensearch, the operator should wait for shard relocation to complete. This should happen right after adding the currently stopping unit to the allocation exclusions (see the sketch below). The check should be blocking, meaning that Opensearch must not stop until the relocation is finished.
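
For illustration, a rough sketch of what adding the stopping node to the allocation exclusions looks like against the standard OpenSearch cluster settings API (the charm wraps this in its own helpers; host and credentials below are placeholders, and the node name is taken from the logs above only as an example):

```python
# Sketch: exclude the stopping node from shard allocation so OpenSearch
# starts relocating its shards to the remaining nodes. Placeholders only;
# not the charm's actual implementation.
import requests

host = "https://<IP>:9200"
auth = ("admin", "<PWD>")

requests.put(
    f"{host}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._name": "opensearch-0.e42"}},
    auth=auth,
    verify=False,
)
```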

In the unit logs, the relocation will look something like this:
```
unit-opensearch-0: 10:07:28 DEBUG unit.opensearch/0.juju-log Shards before relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.228', 'node': 'opensearch-0.ccc'}, ...]

[...]

unit-opensearch-0: 10:07:32 DEBUG unit.opensearch/0.juju-log Shards after relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.28', 'node': 'opensearch-1.ccc'} ...]
```
To check if there are still shards being moved, the `_cluster/health` API can be queried for `"relocating_shards"`. If this is not `0`, the stopping process should be halted. Depending on the amount of data, this can take quite some time; a reasonable maximum waiting time of 15 minutes has been added, after which an error will be raised.
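
A minimal sketch of that blocking wait, polling `_cluster/health` for `relocating_shards` with a 15-minute cap (placeholders for host and credentials; the charm's real implementation lives in its own libraries and may differ):

```python
# Sketch: block until shard relocation has finished, or give up after 15 minutes.
import time

import requests

host = "https://<IP>:9200"
auth = ("admin", "<PWD>")
DEADLINE = time.time() + 15 * 60  # 15-minute maximum waiting time

while True:
    health = requests.get(f"{host}/_cluster/health", auth=auth, verify=False).json()
    if health.get("relocating_shards", 0) == 0:
        break  # no shards moving any more: safe to stop OpenSearch on this unit
    if time.time() > DEADLINE:
        raise TimeoutError("shard relocation did not complete within 15 minutes")
    time.sleep(10)  # poll every 10 seconds
```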
@reneradoi
Contributor

This was fixed with #387: the operator will now wait for all shards to be moved to other nodes before shutting down Opensearch.
