Add Sequence Numbers to write operations #10708
Comments
I'd argue the first use case is making replication semantics more sound :)
Every shard group in Elasticsearch has a selected copy called a primary. When a primary shard fails, a new primary is selected from the existing replica copies. This PR introduces `primary terms` to track the number of times this has happened. This will allow us, as follow-up work and among other things, to identify operations that come from old, stale primaries. It is also the first step on the road towards sequence numbers. Relates to #10708 Closes #14062
Adds a counter to each write operation on a shard. This sequence number is indexed into lucene using doc values, for now (we will probably require indexing to support range searches in the future). On top of this, primary term semantics are enforced and shards will refuse write operations coming from an older primary. Other notes:
- The added SequenceServiceNumber is just a skeleton and will be replaced with a much heavier one, once we have all the building blocks (i.e., checkpoints).
- I completely ignored recovery - for this we will need checkpoints as well.
- A new base class is introduced for all single doc write operations. This is handy to unify common logic (like toXContent).
- For now, we don't use seq# as versioning. We could in the future.
Relates to #10708 Closes #14651
…operations The work for elastic#10708 requires tighter integration with the current shard routing of a shard. As such, we need to make sure it is set before the IndexService exposes the shard to external operations.
It's not clear as to what would happen in the following split brain scenario (scenario-1):
In this case we need a strategy for reconciling the differences in the indexes, if there were change operations in both the networks. Does a strategy like that exist today? So far it seems like this situation is preventable by using min_master_nodes. However in case min_master_nodes is not set appropriately, some default strategy should come into effect I would think. An example strategy could be:
Another interesting situation (scenario-2) to consider:
This would happen if there is no reconciliation strategy in effect. I do see that the sequence numbering method will keep shards that have connectivity to both the networks, in integral state, in the case of scenario-1. In the case of scenario-2, it is possible that the same shard gets operations with same I am still trying to understand Elasticsearch's cluster behavior. It's possible that I might have made assumptions that aren't correct. |
The current strategy, which seq# will keep enforcing but in easier/faster way, is that all replicas are "reset" to be an exact copy of the primary currently chosen by the master. As you noted, this falls apart when there are two residing masters in the cluster. Indeed, the only way to prevent this is by setting minimum master nodes - which is the number one most important setting to set in ES (tell it what the expected cluster size is) If min master nodes is not set and a split brain occurs, resolution will come when one of the masters steps down (either by manual intervention or by detecting the other one). In that case all replicas will "reset" to the primary designated by the left over master.
This is similar to what ES does - nodes with no master will only serve read requests and block writes (by default, it can be configured to block reads).
If the term is the same from both primaries, the replica will accept them according to the current plan. The situation will be resolved when the network is restored and the leftover primary and replica sync, but indeed there are potential troubles there. I have some ideas on how to fix this specific secondary failure (split brain is the true issue, after which all bets are off) but there are bigger fish to catch first :)
Thank you very much for your clarification. I rather enjoy all these discussions and your comments.
I would like to clearly understand the reset/sync scenarios. What triggers reset/sync? I can think of a couple of "normal" operation scenarios
In the case of split brain, with multi-network replicas (assuming min master nodes is set), primary-1 has been assuming that this replica R (on this third node, say N-3) has been failing (because of its allegiance to primary-2 ) but still is in the network. Hence it would attempt sync/reset. How does this protocol work? Should master-1 attempt to decommission R at some point, going by assumption (2)? This problem will occur in a loop if R is decommissioned but another replica is installed on N-3 in its place, by the same protocol. There will be contention on N-3 for "reset"-ing replica shards by both the masters. I suppose one way to resolve this is by letting a node choose a master if there are multiple masters. If we did this, then whenever a node loses its master, it would choose the other master, and there will be a sync/reset and all is well. However if the node chooses its master, the other master will lose quorum, and hence cease to exist, which is a good resolution for this issue in my opinion. |
The two issues you mention indeed trigger a primary/replica sync. I'm not sure I follow the rest, so I would like to ask you to continue the discussion on discuss.elastic.co. We try to keep github for issues and work items. Thx!
Sure. Posted it here: https://discuss.elastic.co/t/sequence-numbers-to-write-ops-split-brain-scenario/43748
Any plan to release this?
@makeyang this will be released as soon as it is done. There's still a lot of work to do.
ES is currently CP and will stay so for the foreseeable future. If a node is partitioned away from the cluster it will serve read requests (configurable) but will block writes, in which case we drop availability. Of course, in the future there are many options, but currently there are no concrete plans to make it any different.
…y control (#37857) The delete and update by query APIs both offer protection against overriding concurrent user changes to the documents they touch. They currently use internal versioning. This PR changes that to rely on sequence numbers and primary terms. Relates #37639 Relates #36148 Relates #10708
…ic#37872) The update request has lesser-known support for a one-off update of a known document version. This PR adds a seq# based alternative to power these operations. Relates elastic#36148 Relates elastic#10708
…37872 (#38155)
* Move update and delete by query to use seq# for optimistic concurrency control (#37857) The delete and update by query APIs both offer protection against overriding concurrent user changes to the documents they touch. They currently use internal versioning. This PR changes that to rely on sequence numbers and primary terms. Relates #37639 Relates #36148 Relates #10708
* Add Seq# based optimistic concurrency control to UpdateRequest (#37872) The update request has lesser-known support for a one-off update of a known document version. This PR adds a seq# based alternative to power these operations. Relates #36148 Relates #10708
* Move watcher to use seq# and primary term for concurrency control (#37977)
* Adapt minimum versions for seq# power operations After backporting #37977, #37857 and #37872
The work prescribed in this issue is now completed and will be part of the coming 6.7 and 7.0 releases. There are still some small follow-ups we want to do, but they do not need to be tracked as part of this issue. We now consider this completed.
Introduction
An Elasticsearch shard can receive indexing, update, and delete commands. Those changes are applied first on the primary shard, maintaining per doc semantics, and are then replicated to all the replicas. All these operations happen concurrently. While we maintain ordering on a per doc basis using versioning support, there is no way to order operations on different documents with respect to each other. Having such a per shard operation ordering will enable us to implement higher level features such as a Changes API (follow changes to documents in a shard and index) and a Reindexing API (take all data from a shard and reindex it into another, potentially mutating the data). Internally we could use this ordering to speed up shard recoveries, by identifying which specific operations need to be replayed to the recovering replica instead of falling back to a file based sync.
To get such ordering, each operation will be assigned a unique and ever increasing Sequence Number (in short, seq#). This sequence number will be assigned on the primary and replicated to all replicas. Seq# are to be indexed in Lucene to allow sorting, range filtering etc.
Warning, research ahead
What follows in this ticket is the current thinking about how to best implement this feature. It may change in subtle or major ways as the work continues. It is important to implement this infrastructure in a way that is correct, resilient to failures, and without slowing down indexing speed. We feel confident with the approach described below, but we may have to backtrack or change the approach completely.
What is a Sequence
Applying an operation order on a primary is a simple question of incrementing a local counter for every operation. However, this is not sufficient to guarantee global uniqueness and monotonicity under error conditions where the primary shard can be isolated by a network partition. For those, the identity of the current primary needs to be baked into each operation. For example, late to arrive operations from an old primary can be detected and rejected.
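The rejection of late operations from an old primary could be sketched as follows. This is a minimal illustration, not the Elasticsearch implementation: `ReplicaShard`, `StalePrimaryError` and `apply` are hypothetical names, and the sketch assumes a replica simply adopts any newer term it sees on an incoming operation.

```python
class StalePrimaryError(Exception):
    """Raised when an operation comes from a primary with an older term."""


class ReplicaShard:
    """Hypothetical replica copy that tracks the highest primary term it has seen."""

    def __init__(self, primary_term: int) -> None:
        self.primary_term = primary_term
        self.applied = []  # (term, seq_no, payload) tuples, in arrival order

    def apply(self, term: int, seq_no: int, payload: str) -> None:
        # Operations stamped with an older term come from a stale primary: reject them.
        if term < self.primary_term:
            raise StalePrimaryError(
                f"operation term {term} is older than current term {self.primary_term}"
            )
        # Assumption of this sketch: a newer term on an operation is adopted directly.
        self.primary_term = term
        self.applied.append((term, seq_no, payload))


replica = ReplicaShard(primary_term=2)
replica.apply(term=2, seq_no=0, payload="index doc A")
replica.apply(term=3, seq_no=1, payload="index doc B")  # a new primary took over
```

After the second call, any further operation stamped with term 2 would raise `StalePrimaryError`, because it can only have come from the isolated old primary.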
In short, each operation is assigned two numbers:
- `term` - this number is incremented with every primary assignment and is determined by the cluster master. This is very similar to the notion of a `term` in Raft, a `view-number` in Viewstamped Replication or an `epoch` in Zab.
- `seq#` - this number is incremented by the primary with each operation it processes.
To achieve ordering, when comparing two operations, `o1` & `o2`, we say that `o1` < `o2` if and only if `s1.seq#` < `s2.seq#` or (`s1.seq#` == `s2.seq#` and `s1.term` < `s2.term`). Equality and greater than are defined in a similar fashion.
For reasons explained later on, we maintain for each shard copy two special seq#:
- `local checkpoint#` - this is the highest seq# for which all lower seq# have been processed. Note that this is not the highest seq# the shard has processed due to the concurrent indexing, which means that some changes can be processed while previous more heavy ones can still be on going.
- `global checkpoint#` (or just `checkpoint#`) - the highest seq# for which the local shard can guarantee that all previous (included) seq# have been processed on all active shard copies (i.e., primary and replicas).
Those two numbers will be maintained in memory but also persisted in the metadata of every lucene commit.
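As an illustration of the two definitions above, here is a small sketch of the operation ordering and of a local checkpoint that only advances over a gap-free prefix of seq#. The names `OpId`, `precedes` and `LocalCheckpointTracker` are hypothetical, and using -1 to mean "nothing processed yet" is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpId:
    seq_no: int
    term: int


def precedes(o1: OpId, o2: OpId) -> bool:
    """o1 < o2 iff seq# is lower, or seq# is equal and term is lower."""
    # Tuple comparison is lexicographic, which matches the definition exactly.
    return (o1.seq_no, o1.term) < (o2.seq_no, o2.term)


class LocalCheckpointTracker:
    """Tracks the highest seq# for which all lower seq# have been processed."""

    def __init__(self) -> None:
        self.checkpoint = -1   # assumption: -1 means no operation processed yet
        self._pending = set()  # processed seq# that are still above a gap

    def mark_processed(self, seq_no: int) -> None:
        if seq_no <= self.checkpoint:
            return  # already covered by the checkpoint
        self._pending.add(seq_no)
        # Advance while the next seq# in line has been processed.
        while self.checkpoint + 1 in self._pending:
            self.checkpoint += 1
            self._pending.remove(self.checkpoint)


tracker = LocalCheckpointTracker()
tracker.mark_processed(0)
tracker.mark_processed(2)  # finished before the heavier operation 1
tracker.mark_processed(1)  # the gap closes, so the checkpoint can jump ahead
```

Between the second and third call the checkpoint stays at 0 even though seq# 2 has been processed, which is the concurrent-indexing case the definition warns about.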
Changes to indexing flow on primaries
Here is a sketch of the indexing code on primaries. Much of it is identical to the current logic. Changes or additions are marked in bold.
- … `local checkpoint#`.
- … `checkpoint#` to replicas (can be folded into a heartbeat/next index req).
Changes to indexing flow on replicas
As above, this is a sketch of the indexing code on replicas. Changes from the current logic are marked in bold.
- … `local checkpoint#`
Global Checkpoint# increment on replicas
The primary advances its global `checkpoint#` based on its knowledge of its local and replicas' `local checkpoint#`. Periodically it shares its knowledge with the replicas:
- … `global checkpoint#` were processed and local checkpoint# is of the same primary term. If not, fail shard.
- … `global checkpoint#`, if it's lower than the incoming global checkpoint.
Note that the global checkpoint is local knowledge that is updated under the mandate of the primary. It may be that the primary's information is lagging compared to a replica. This can happen when a replica is promoted to a primary (but still has stale info).
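The checkpoint propagation described above might look like the sketch below. It is one reading of the text rather than the actual algorithm: taking the minimum of the known local checkpoints is an assumption derived from the definition of the global checkpoint, and `compute_global_checkpoint` and `ReplicaCheckpointState` are hypothetical names. The per-term validation step is not modelled.

```python
def compute_global_checkpoint(local_checkpoints: dict) -> int:
    """Primary side: highest seq# known to be processed on all active copies.

    `local_checkpoints` maps a copy name to the local checkpoint# the primary
    currently knows for it (its own included).
    """
    return min(local_checkpoints.values())


class ReplicaCheckpointState:
    """Replica side: keeps its own, possibly lagging, view of the global checkpoint."""

    def __init__(self) -> None:
        self.global_checkpoint = -1  # assumption: -1 means nothing known yet

    def on_global_checkpoint(self, incoming: int) -> None:
        # Only move forward: update if the local value is lower than the incoming one.
        if self.global_checkpoint < incoming:
            self.global_checkpoint = incoming


known = {"primary": 10, "replica-1": 7, "replica-2": 9}
replica_state = ReplicaCheckpointState()
replica_state.on_global_checkpoint(compute_global_checkpoint(known))
replica_state.on_global_checkpoint(5)  # stale info from a lagging primary is ignored
```

Ignoring lower incoming values covers the case mentioned in the note, where a newly promoted primary still has stale information.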
First use case - faster replica recovery
Having an ordering of operations allows us to speed up the recovery process of an existing replica and its synchronization with the primary. At the moment, we do a file based sync, which typically results in over-copying of data. Having a clearly marked `checkpoint#` allows us to limit the sync operation to just those documents that have changed subsequently. In many cases we expect to have no documents to sync at all. This improvement will be tracked in a separate issue.
Road map
Basic infra
- … `TransportWriteAction`) should keep the current behavior. (@dakrone) Change certain replica failures not to fail the replica shard #22874
- … `IndexShard#verifyPrimary`) (@jasontedor, Introduce primary context #25122)
Replica recovery (no rollback)
A best effort doc based replica recovery, based on local last commit. By best effort we refer to having no guarantees on the primary translog state and the likelihood of doc based recovery to succeed and not requiring a file sync
We currently have no guarantee that all ops above the local checkpoint baked into the commit will be replayed. That means that delete operations with a seq# > local checkpoint will not be replayed. To work around it (for now), we will move the local checkpoint artificially (at the potential expense of correctness) (@jasontedor)
Review correctness of POC and extract requirements for the primary side (@jasontedor) replaced with TLA+ work
Translog seq# based API
Currently the translog keeps all operations that are not persisted into the last lucene commit. This doesn't imply that it can serve all operations from a given seq# and up. We want to move to seq# based recovery, where a lucene commit indicates which seq# are fully baked into it and the translog recovers from there.
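A rough sketch of what a seq# based translog lookup could involve is shown below. The translog is modelled as a plain list of `(seq_no, payload)` tuples, and `ops_to_replay` and `can_serve_from` are hypothetical helpers, not the Elasticsearch translog API.

```python
def ops_to_replay(translog: list, commit_local_checkpoint: int) -> list:
    """Return the operations above the checkpoint baked into the commit, in seq# order."""
    newer = [op for op in translog if op[0] > commit_local_checkpoint]
    return sorted(newer, key=lambda op: op[0])


def can_serve_from(translog: list, from_seq_no: int, up_to_seq_no: int) -> bool:
    """True if every seq# in [from_seq_no, up_to_seq_no] is present in the translog."""
    present = {seq_no for seq_no, _ in translog}
    return all(s in present for s in range(from_seq_no, up_to_seq_no + 1))


# Operations can land in the translog out of seq# order due to concurrent indexing.
translog = [(3, "c"), (1, "a"), (4, "d"), (2, "b")]
to_replay = ops_to_replay(translog, commit_local_checkpoint=2)
```

The `can_serve_from` check reflects the caveat in the paragraph above: holding every operation missing from the last commit does not by itself mean the translog has a gap-free range starting at an arbitrary seq#.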
Primary recovery (no rollback)
Primary promotion
Live replica/primary sync (no rollback)
Add a task that streams all operations from the primary's global checkpoint to all shards. (@ywelsch Remove TranslogRecoveryPerformer #24858)
When a replica shard increases its primary term under the mandate of a new primary, it should also update its global checkpoint; this gives us the guarantee that its global checkpoint is at least as high as the new primary and gives a starting point for the primary/replica resync (@ywelsch, Update global checkpoint when increasing primary term on replica #25422)
Replicas should throw back their local checkpoint to the global checkpoint when they detect a new primary. (@jasontedor, Throw back replica local checkpoint on new primary #25452)
Primary recovery with rollback
Needed to deal with discrepancies between the translog and the commit point that can result from a failure during primary/replica sync
Replica recovery with rollback
Needed to throw away potential wrong doc versions that ended up in lucene. Those "wrong doc versions" may still be in the translog of the replica but since we ignore the translog on replica recovery they will be removed.
Live replica/primary sync with rollback
Seq# as versioning
Shrunk indices
Shrunk indices have mixed histories.
- … `max_unsafe_auto_id_timestamp` (@jasontedor, Initialize max unsafe auto ID timestamp on shrink #25356)
Adopt Me
Introduce new shard states to indicate an ongoing primary sync on promotion. See Live primary-replica resync (no rollback) #24841 (review). We now have an alternative plan for this - see Introduce promoting index shard state #28004 (comment)
TBD
Completed Miscellaneous
If the minimum of all local checkpoints is less than the global checkpoint on the primary, do we fail the shard? No, this can happen when replicas pull back their local checkpoints to their version of the global checkpoint.
Fail shards whose local checkpoint is lagging by more than 10000 (?) ops behind the primary. This is a temporary measure to allow merging into master without closing translog gaps during primary promotion on a live shard. Those will require the replicas to pick them up, which will take a replica/primary live sync.