Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protect replicated data streams against local rollovers #64710

Merged
merged 28 commits into from
Dec 8, 2020

Conversation

martijnvg
Copy link
Member

@martijnvg martijnvg commented Nov 6, 2020

When a data stream is being auto followed then a rollover in a local cluster can break auto following,
if the local cluster performs a rollover then it creates a new write index and if then later the remote
cluster rolls over as well then that new write index can't be replicated, because it has the same name
as in the write index in the local cluster, which was created earlier.

If a data stream is managed by ccr, then the local cluster should not do a rollover for those data streams.
The data stream should be rolled over in the remote cluster and that change should replicate to the local
cluster. Performing a rollover in the local cluster is an operation that the data stream support in ccr should
perform.

To protect against rolling over a replicated data stream, this PR adds a replicate field to DataStream class.
The rollover api will fail with an error in case a data stream is being rolled over and the targeted data stream is
a replicated data stream. When the put follow api creates a data stream in the local cluster then the replicate flag
is set to true. There should be a way to turn a replicated data stream into a regular data stream when for example
during disaster recovery. The newly added api in this pr (promote data stream api) is doing that. After a replicated
data stream is promoted to a regular data stream then the local data stream can be rolled over, so that the new
write index is no longer a follower index. Also if the put follow api is attempting to update this data stream
(for example to attempt to resume auto following) then that with fail, because the data stream is no longer a
replicated data stream.

Today with time based indices behind an alias, the is_write_index property isn't replicated from remote cluster
to the local cluster, so when attempting to rollover the alias in the local cluster the rollover fails, because the
alias doesn't have a write index. The added replicated field in the DataStream class and added validation
achieve the same kind of protection, but in a more robust way.

A followup from #61993.

(labelling this PR as non-issue, since data stream support for CCR hasn't been released yet)

@martijnvg martijnvg added :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features :Data Management/Data streams Data streams and their lifecycles labels Nov 6, 2020
@martijnvg martijnvg requested a review from dnhatn November 6, 2020 13:45
throw new ResourceNotFoundException("data stream [" + request.getName() + "] does not exist");
}

DataStream promotedDataStream = dataStream.promoteDataStream();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we embed this in the unfollow api? I'm not sure, because that api is centered around unfollowing a regular index and this operation is different and that would add ambiguity to the unfollow api. However the argument can also be made in reverse, in that this is a related operation.

@dnhatn
Copy link
Member

dnhatn commented Nov 9, 2020

Thanks Martijn. I wonder how we use bi-directional replication with data streams. Today CCR does not require manual intervention for bi-directional replication user cases.

@martijnvg
Copy link
Member Author

I wonder how we use bi-directional replication with data streams.

As far as I know, that doesn't require manual intervention as well. Auto following a data stream is similar to following an index pattern. Auto following a data stream also updates the data stream in the local cluster, when auto following an index pattern just IndexMetadata is added/updated.

@martijnvg martijnvg marked this pull request as ready for review November 23, 2020 10:48
@elasticmachine elasticmachine added Team:Data Management Meta label for data/management team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Nov 23, 2020
}
}

public void testDataStreamsBiDirectionalReplication() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martijnvg Sorry, I think I didn't explain well my concern about supporting bi-directional replication. What I meant was JasonZ's blog. In his setup, we can send the same indexing request (uses the same write alias) to either cluster. In this test, we use different data streams for indexing requests. That means users can't simply reroute all indexing to a single cluster when another is not available.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently an alias can't point to a data stream or its backing indices. So the test can't completely follow the bi directional scenario. I did add a few logs-http* searches to mimic reading from a logs-http alias, as best effort replacement for the fact aliases can't point to data streams.

We're planning to add alias support to data streams. These aliases would only be able point to data streams and not to a data stream's backing indices, other indices or other aliases. Like aliases defined on indices, aliases on data stream could also have a write flag, which indicates to which data stream write requests are redirected to. I will add a TODO here, that these searches on logs-http* pattern should be replaced with searches and writes via aliases when alias support for data streams has landed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're planning to add alias support to data streams.

Thanks for explaning + adding the TODO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dnhatn I've opened: #66163

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @martijnvg.

"indices.promote_data_stream":{
"documentation":{
"url":"https://www.elastic.co/guide/en/elasticsearch/reference/master/data-streams.html",
"description":"Promotes a data stream from a replicate data stream managed by CCR to a regular data stream"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/replicate/replicated

}
}

public void testDataStreamsBiDirectionalReplication() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're planning to add alias support to data streams.

Thanks for explaning + adding the TODO.

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks so much for all iterations @martijnvg!

@martijnvg martijnvg merged commit 52afaf2 into elastic:master Dec 8, 2020
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Dec 8, 2020
Backporting elastic#64710 to the 7.x branch.

When a data stream is being auto followed then a rollover in a local cluster can break auto following,
if the local cluster performs a rollover then it creates a new write index and if then later the remote
cluster rolls over as well then that new write index can't be replicated, because it has the same name
as in the write index in the local cluster, which was created earlier.

If a data stream is managed by ccr, then the local cluster should not do a rollover for those data streams.
The data stream should be rolled over in the remote cluster and that change should replicate to the local
cluster. Performing a rollover in the local cluster is an operation that the data stream support in ccr should
perform.

To protect against rolling over a replicated data stream, this PR adds a replicate field to DataStream class.
The rollover api will fail with an error in case a data stream is being rolled over and the targeted data stream is
a replicated data stream. When the put follow api creates a data stream in the local cluster then the replicate flag
is set to true. There should be a way to turn a replicated data stream into a regular data stream when for example
during disaster recovery. The newly added api in this pr (promote data stream api) is doing that. After a replicated
data stream is promoted to a regular data stream then the local data stream can be rolled over, so that the new
write index is no longer a follower index. Also if the put follow api is attempting to update this data stream
(for example to attempt to resume auto following) then that with fail, because the data stream is no longer a
replicated data stream.

Today with time based indices behind an alias, the is_write_index property isn't replicated from remote cluster
to the local cluster, so when attempting to rollover the alias in the local cluster the rollover fails, because the
alias doesn't have a write index. The added replicated field in the DataStream class and added validation
achieve the same kind of protection, but in a more robust way.

A followup from elastic#61993.
martijnvg added a commit that referenced this pull request Dec 8, 2020
Backporting #64710 to the 7.x branch.

When a data stream is being auto followed then a rollover in a local cluster can break auto following,
if the local cluster performs a rollover then it creates a new write index and if then later the remote
cluster rolls over as well then that new write index can't be replicated, because it has the same name
as in the write index in the local cluster, which was created earlier.

If a data stream is managed by ccr, then the local cluster should not do a rollover for those data streams.
The data stream should be rolled over in the remote cluster and that change should replicate to the local
cluster. Performing a rollover in the local cluster is an operation that the data stream support in ccr should
perform.

To protect against rolling over a replicated data stream, this PR adds a replicate field to DataStream class.
The rollover api will fail with an error in case a data stream is being rolled over and the targeted data stream is
a replicated data stream. When the put follow api creates a data stream in the local cluster then the replicate flag
is set to true. There should be a way to turn a replicated data stream into a regular data stream when for example
during disaster recovery. The newly added api in this pr (promote data stream api) is doing that. After a replicated
data stream is promoted to a regular data stream then the local data stream can be rolled over, so that the new
write index is no longer a follower index. Also if the put follow api is attempting to update this data stream
(for example to attempt to resume auto following) then that with fail, because the data stream is no longer a
replicated data stream.

Today with time based indices behind an alias, the is_write_index property isn't replicated from remote cluster
to the local cluster, so when attempting to rollover the alias in the local cluster the rollover fails, because the
alias doesn't have a write index. The added replicated field in the DataStream class and added validation
achieve the same kind of protection, but in a more robust way.

A followup from #61993
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Dec 8, 2020
martijnvg added a commit that referenced this pull request Dec 8, 2020
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Dec 8, 2020
martijnvg added a commit that referenced this pull request Dec 8, 2020
Backporting #66004 to 7.x branch.
Relates to #64710
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Data streams Data streams and their lifecycles :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >non-issue Team:Data Management Meta label for data/management team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v7.11.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants