[BUG] [Remote State] Remote state auto restore fails when process restarts before successful master election. #10776

Closed
linuxpi opened this issue Oct 20, 2023 · 0 comments · Fixed by #10748
Assignees
Labels
bug Something isn't working Cluster Manager Search:Remote Search Storage Issues and PRs relating to data and metadata storage v2.12.0 Issues and PRs related to version 2.12.0

Comments

@linuxpi
Collaborator

linuxpi commented Oct 20, 2023

Describe the bug

  • With Remote State, we auto-restore the remote state when none of the cluster manager nodes have local disk state.
  • When creating the new cluster state with remote state, the cluster UUID is still UNKNOWN_UUID.
  • Right after constructing this state, we commit it to local disk via LucenePersistedState.
  • Then bootstrapping continues. Note that cluster manager has not been elected yet.
  • In certain scenarios, for the cluster manager election to succeed, we might have to update the yml with initial_master_nodes (seeding).
  • For these new yml values to take effect, we need to restart the OpenSearch process.
  • After the restart, GatewayMetaState looks for the last written state on disk and finds the state we wrote just after constructing it from remote state. This state still has UNKNOWN_UUID as its cluster UUID, but its metadata is populated from remote state.
  • Since the cluster UUID is still UNKNOWN_UUID, we trigger the remote state restore flow again, with the state loaded from local disk as the current state.
  • Local disk already contains the metadata restored from remote state (refer to the 3rd point above), so the restore finds the same indices already present. This leads to a conflict in RemoteStoreRestoreService.validate.
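
The sequence above can be modelled with a small toy program. This is only a sketch of the control flow as described in this issue, not OpenSearch code: `RestoreConflictDemo`, `State`, `start`, and `restore` are hypothetical names, the `"_na_"` value is a placeholder for the unknown cluster UUID, and the validation is reduced to a name check on a map of index name to index UUID.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the restart scenario; none of these types are OpenSearch classes. */
class RestoreConflictDemo {
    // Placeholder standing in for the "unknown cluster UUID" marker.
    static final String UNKNOWN_UUID = "_na_";

    /** Minimal stand-in for a persisted cluster state: cluster UUID plus index name -> index UUID. */
    static final class State {
        final String clusterUuid;
        final Map<String, String> indices;

        State(String clusterUuid, Map<String, String> indices) {
            this.clusterUuid = clusterUuid;
            this.indices = Map.copyOf(indices);
        }
    }

    /** Simplified validation: refuse to restore an index that the current state already has. */
    static State restore(State current, Map<String, String> remoteIndices) {
        for (String name : remoteIndices.keySet()) {
            if (current.indices.containsKey(name)) {
                throw new IllegalStateException(
                    "cannot restore index [" + name + "] because an open index with same name/uuid already exists in the cluster."
                );
            }
        }
        Map<String, String> merged = new HashMap<>(current.indices);
        merged.putAll(remoteIndices);
        // No cluster manager has been elected yet, so the cluster UUID stays unknown.
        return new State(current.clusterUuid, merged);
    }

    /** Startup: a state with an unknown cluster UUID triggers a restore on top of that state. */
    static State start(State onDisk, Map<String, String> remoteIndices) {
        if (UNKNOWN_UUID.equals(onDisk.clusterUuid)) {
            return restore(onDisk, remoteIndices);
        }
        return onDisk;
    }

    public static void main(String[] args) {
        Map<String, String> remote = Map.of("logs-221998", "uuid-1");

        // First start: nothing on disk, so the remote metadata is restored and then persisted.
        State persisted = start(new State(UNKNOWN_UUID, Map.of()), remote);

        // Process restarts before any election: the persisted state is read back and restore runs again.
        try {
            start(persisted, remote);
        } catch (IllegalStateException e) {
            System.out.println("second start failed: " + e.getMessage());
        }
    }
}
```

In this model the second call to `start` fails for the same reason given in the last bullet: the base state handed to the restore already carries the indices that the restore is about to add.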

Stacktrace

[2023-10-19T12:31:35,836][INFO ][o.o.t.TransportService   ] [5b051c2d18809bb66d027e72fc3f2384] Remote clusters initialized successfully.
[2023-10-19T12:31:36,843][INFO ][o.o.i.r.RemoteStoreRestoreService] [5b051c2d18809bb66d027e72fc3f2384] cannot restore index [logs-221998] because an open index with same name/uuid already exists in the cluster.
[2023-10-19T12:31:36,843][ERROR][o.o.b.Bootstrap          ] [5b051c2d18809bb66d027e72fc3f2384] Exception
java.lang.IllegalStateException: cannot restore index [logs-221998] because an open index with same name/uuid already exists in the cluster.
        at org.opensearch.index.recovery.RemoteStoreRestoreService.validate(RemoteStoreRestoreService.java:336)
        at org.opensearch.index.recovery.RemoteStoreRestoreService.restore(RemoteStoreRestoreService.java:166)
        at org.opensearch.gateway.GatewayMetaState.start(GatewayMetaState.java:178)
        at org.opensearch.node.Node.start(Node.java:1444)
        at org.opensearch.bootstrap.Bootstrap.start(Bootstrap.java:339)
        at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:413)
        at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:180)
        at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171)
        at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104)
        at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
        at org.opensearch.cli.Command.main(Command.java:101)
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137)
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103)
[2023-10-19T12:31:36,843][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [5b051c2d18809bb66d027e72fc3f2384] uncaught exception in thread [main]
java.lang.IllegalStateException: cannot restore index [logs-221998] because an open index with same name/uuid already exists in the cluster.
        at org.opensearch.index.recovery.RemoteStoreRestoreService.validate(RemoteStoreRestoreService.java:336)
        at org.opensearch.index.recovery.RemoteStoreRestoreService.restore(RemoteStoreRestoreService.java:166)
        at org.opensearch.gateway.GatewayMetaState.start(GatewayMetaState.java:178)
        at org.opensearch.node.Node.start(Node.java:1444)
        at org.opensearch.bootstrap.Bootstrap.start(Bootstrap.java:339)
        at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:413)
        at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:180)
        at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171)
        at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104)
        at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138)
        at org.opensearch.cli.Command.main(Command.java:101)
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137)
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103)

To Reproduce
Steps to reproduce the behavior: Refer to the description above.

Expected behavior
Remote state restore should not fail if local disk contains any metadata with UNKNOWN_UUID cluster uuid.
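
One way to picture the expected behavior is a startup path that treats local metadata as disposable whenever the cluster UUID is still unknown, so the restore always starts from an empty base. The sketch below is an assumption about how such a guard could look, not a description of the actual fix: `RestoreOverrideSketch`, `State`, `start`, and `restore` are hypothetical names, and `"_na_"` is a placeholder for the unknown cluster UUID.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the expected behavior; none of these types are OpenSearch classes. */
class RestoreOverrideSketch {
    // Placeholder standing in for the "unknown cluster UUID" marker.
    static final String UNKNOWN_UUID = "_na_";

    /** Minimal stand-in for a persisted cluster state: cluster UUID plus index name -> index UUID. */
    static final class State {
        final String clusterUuid;
        final Map<String, String> indices;

        State(String clusterUuid, Map<String, String> indices) {
            this.clusterUuid = clusterUuid;
            this.indices = Map.copyOf(indices);
        }
    }

    /** Same strict validation as before: an index already in the base state is a conflict. */
    static State restore(State base, Map<String, String> remoteIndices) {
        for (String name : remoteIndices.keySet()) {
            if (base.indices.containsKey(name)) {
                throw new IllegalStateException("cannot restore index [" + name + "]");
            }
        }
        Map<String, String> merged = new HashMap<>(base.indices);
        merged.putAll(remoteIndices);
        return new State(base.clusterUuid, merged);
    }

    /** Startup: with an unknown cluster UUID, remote metadata overrides whatever is on local disk. */
    static State start(State onDisk, Map<String, String> remoteIndices) {
        if (!UNKNOWN_UUID.equals(onDisk.clusterUuid)) {
            return onDisk;
        }
        // Drop the local metadata before restoring, so a previous partial restore cannot conflict.
        State emptyBase = new State(onDisk.clusterUuid, Map.of());
        return restore(emptyBase, remoteIndices);
    }
}
```

With this guard, calling `start` repeatedly before an election yields the same restored state each time, because the validation never sees the metadata left behind by an earlier restore.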

@linuxpi linuxpi added bug Something isn't working untriaged labels Oct 20, 2023
@linuxpi linuxpi added Storage Issues and PRs relating to data and metadata storage Cluster Manager Search:Remote Search v2.12.0 Issues and PRs related to version 2.12.0 and removed untriaged labels Oct 23, 2023
@linuxpi linuxpi self-assigned this Oct 23, 2023