Block older operations on primary term transition #24779

jasontedor · 2017-05-18T20:52:36Z

Today a replica learns of a new primary term via a cluster state update and there is not a clean transition between the older primary term and the newer primary term. This commit modifies this situation so that:

- a replica shard learns of a new primary term via replication operations executed under the mandate of the new primary
- when a replica shard learns of a new primary term, it blocks operations on older terms from reaching the engine, with a clear transition point between the operations on the older term and the operations on the newer term

This work paves the way for a primary/replica sync on primary promotion. Future work will also ensure a clean transition point on a promoted primary, and prepare a replica shard for a sync with the promoted primary.

Relates #10708
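As a rough illustration of the behavior described above, here is a minimal sketch of a term gate. It is not the `IndexShard` implementation: the class `ReplicaTermGate`, its methods, and the use of a read/write lock are assumptions made for this example. Operations hold a read lock while they run, so a term bump (which takes the write lock) waits for in-flight operations and gives a single transition point.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Hypothetical model of a replica shard; names and locking strategy are assumptions.
final class ReplicaTermGate {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private long primaryTerm; // guarded by lock

    long currentTerm() {
        lock.readLock().lock();
        try {
            return primaryTerm;
        } finally {
            lock.readLock().unlock();
        }
    }

    <T> T execute(long operationPrimaryTerm, Supplier<T> operation) {
        if (operationPrimaryTerm > currentTerm()) {
            // The write lock is granted only after in-flight operations have
            // released their read locks: this is the transition point.
            lock.writeLock().lock();
            try {
                // Re-check, since another thread may have raised the term already.
                if (operationPrimaryTerm > primaryTerm) {
                    primaryTerm = operationPrimaryTerm;
                }
            } finally {
                lock.writeLock().unlock();
            }
        }
        lock.readLock().lock();
        try {
            if (operationPrimaryTerm < primaryTerm) {
                throw new IllegalStateException("operation term [" + operationPrimaryTerm
                        + "] is too old (current [" + primaryTerm + "])");
            }
            return operation.get();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

In this sketch an operation on term 1 that arrives after the gate has moved to term 2 is rejected before it can reach the (modelled) engine.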


ywelsch · 2017-05-19T07:51:00Z

core/src/main/java/org/elasticsearch/index/shard/IndexShard.java

-            throw new IllegalArgumentException(LoggerMessageFormat.format("{} operation term [{}] is too old (current [{}])",
-                shardId, opPrimaryTerm, primaryTerm));
+        if (operationPrimaryTerm > primaryTerm
+                && pendingPrimaryTerm.accumulateAndGet(operationPrimaryTerm, Math::max) == operationPrimaryTerm) {


If there are many incoming operations with higher term, each one of them will go into this branch and invoke blockOperations (until one completes). This can create additional contention when the first blockOperations is completed and subsequent operations unnecessarily call blockOperations. I've adapted your code in 7ff4a7c so that only the first operation with higher term calls blockOperations.
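To make the contention concrete, the sketch below compares the check from the diff with one possible stricter check. The helper class `PendingTermCheck` and both method names are hypothetical, and the second method is only an assumption about how "only the first operation" could be expressed; it is not taken from 7ff4a7c.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration only.
final class PendingTermCheck {
    private final AtomicLong pendingPrimaryTerm = new AtomicLong();

    // The check from the diff: true for every operation whose term equals
    // the resulting maximum, including repeats of the same higher term.
    boolean equalsPendingMax(long operationPrimaryTerm) {
        return pendingPrimaryTerm.accumulateAndGet(operationPrimaryTerm, Math::max) == operationPrimaryTerm;
    }

    // Assumed alternative: true only for the operation that actually raises
    // the pending term, because it compares against the previous value.
    boolean raisesPendingTerm(long operationPrimaryTerm) {
        return pendingPrimaryTerm.getAndAccumulate(operationPrimaryTerm, Math::max) < operationPrimaryTerm;
    }
}
```

With the first check, two operations on term 1 both pass until `primaryTerm` itself has been raised; with the second, only the first of them does.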


bleskes

Looks great. I think the main discussion point is the concurrency control. I will reach out to discuss in another channel.

bleskes · 2017-05-19T11:12:07Z

core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

@@ -180,7 +178,7 @@ protected void resolveRequest(final IndexMetaData indexMetaData, final Request r

    /**
     * Synchronous replica operation on nodes with replica copies. This is done under the lock form


nit: this is done while having (under?) a permit

bleskes · 2017-05-19T11:26:51Z

core/src/main/java/org/elasticsearch/index/shard/IndexShard.java

+                && pendingPrimaryTerm.accumulateAndGet(operationPrimaryTerm, Math::max) == operationPrimaryTerm) {
+            try {
+                indexShardOperationPermits.blockOperations(30, TimeUnit.MINUTES, () -> {
+                    if (operationPrimaryTerm > primaryTerm) {


Can you add a comment as to how it's possible that the term will not be higher (i.e., the race condition between checking pendingPrimaryTerm and submitting the blockOperations call)?

bleskes · 2017-05-19T11:38:11Z

core/src/main/java/org/elasticsearch/index/shard/IndexShard.java

-            throw new IllegalArgumentException(LoggerMessageFormat.format("{} operation term [{}] is too old (current [{}])",
-                shardId, opPrimaryTerm, primaryTerm));
+        if (operationPrimaryTerm > primaryTerm
+                && pendingPrimaryTerm.accumulateAndGet(operationPrimaryTerm, Math::max) == operationPrimaryTerm) {


This solution has the following problem: if primaryTerm == 0 and an operation comes in with operationPrimaryTerm == 1, then another operation comes in with operationPrimaryTerm == 2, and then another operation comes in with operationPrimaryTerm == 1, it may be that the last operation is processed before primaryTerm was incremented to 1 (or 2). This can happen if the first operation passed the check but didn't yet submit its block. The second operation incremented pendingPrimaryTerm but didn't submit its block either, and then the third operation just passes through without waiting.
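The interleaving can be replayed sequentially to see which operations take the blocking branch. This is only a trace of the condition from the diff under the assumption that no block has run yet, so `primaryTerm` stays at 0; the class `TermRaceTrace` and its `replay` method are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical trace of the check in the diff; not production code.
final class TermRaceTrace {
    static List<String> replay(long... operationTerms) {
        // Assumption: no block has been submitted yet, so the term is never bumped.
        final long primaryTerm = 0;
        final AtomicLong pendingPrimaryTerm = new AtomicLong(primaryTerm);
        final List<String> trace = new ArrayList<>();
        for (long term : operationTerms) {
            boolean entersBlockBranch = term > primaryTerm
                    && pendingPrimaryTerm.accumulateAndGet(term, Math::max) == term;
            trace.add("term " + term + (entersBlockBranch ? ": enters block branch" : ": skips block branch"));
        }
        return trace;
    }
}
```

Calling `TermRaceTrace.replay(1, 2, 1)` reports that the first two operations enter the block branch and the third skips it, even though `primaryTerm` is still 0 at that point.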

bleskes · 2017-05-19T11:44:39Z

core/src/main/java/org/elasticsearch/index/shard/IndexShard.java

-            throw new IllegalArgumentException(LoggerMessageFormat.format("{} operation term [{}] is too old (current [{}])",
-                shardId, opPrimaryTerm, primaryTerm));
+        if (operationPrimaryTerm > primaryTerm
+                && pendingPrimaryTerm.accumulateAndGet(operationPrimaryTerm, Math::max) == operationPrimaryTerm) {


And now I see that we guard against it with if (operationPrimaryTerm == currentPrimaryTerm) later on, so the third operation will be failed, but with the wrong message (we will say it's too old and give a currentPrimaryTerm of 0 while the operation's term is 1). I think this is all just too complex and isn't worth it given how rare primary promotions are.

bleskes · 2017-05-19T11:45:36Z

core/src/main/java/org/elasticsearch/index/shard/IndexShard.java

+                        public void onResponse(final Releasable releasable) {
+                            assert operationPrimaryTerm <= primaryTerm
+                                    : "operation primary term [" + operationPrimaryTerm + "] should be at most [" + primaryTerm + "]";
+                            if (operationPrimaryTerm < primaryTerm) {


add a comment please on how this can happen...

bleskes · 2017-05-19T11:45:54Z

core/src/main/java/org/elasticsearch/index/shard/IndexShardOperationPermits.java

     */
    public void blockOperations(long timeout, TimeUnit timeUnit, Runnable onBlocked) throws InterruptedException, TimeoutException {
        if (closed) {
            throw new IndexShardClosedException(shardId);
        }
        try {
            if (semaphore.tryAcquire(TOTAL_PERMITS, timeout, timeUnit)) {
+                assert semaphore.availablePermits() == 0;
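For context on the assertion, here is a small standalone sketch of blocking by acquiring every permit of a semaphore. It follows the shape of the hunk above but is not the `IndexShardOperationPermits` class: the class name `AllPermitsBlocker`, the helper methods, and the value chosen for `TOTAL_PERMITS` are assumptions.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in used only to illustrate the all-permits technique.
final class AllPermitsBlocker {
    static final int TOTAL_PERMITS = Integer.MAX_VALUE; // assumed value
    private final Semaphore semaphore = new Semaphore(TOTAL_PERMITS, true);

    // A regular operation holds a single permit while it runs.
    boolean tryAcquireOperation() {
        return semaphore.tryAcquire();
    }

    void releaseOperation() {
        semaphore.release();
    }

    int availablePermits() {
        return semaphore.availablePermits();
    }

    // Taking all permits succeeds only once no operation holds one,
    // so onBlocked runs with no operations in flight.
    void blockOperations(long timeout, TimeUnit timeUnit, Runnable onBlocked)
            throws InterruptedException, TimeoutException {
        if (semaphore.tryAcquire(TOTAL_PERMITS, timeout, timeUnit)) {
            try {
                onBlocked.run();
            } finally {
                semaphore.release(TOTAL_PERMITS);
            }
        } else {
            throw new TimeoutException("timed out waiting for in-flight operations");
        }
    }
}
```

While `onBlocked` runs, `availablePermits()` is 0 and `tryAcquireOperation()` fails; if an operation still holds a permit when the timeout elapses, the call throws `TimeoutException` instead.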


bleskes · 2017-05-19T11:47:33Z

core/src/test/java/org/elasticsearch/cluster/routing/allocation/ShardStateIT.java

+        ensureYellow();
+
+        // this forces the primary term to propagate to the replicas
+        client().index(new IndexRequest("test", "type", "1").source("{ \"f\": \"1\"}", XContentType.JSON)).get();


how do we make sure we change it/only do it sometimes once we can?

bleskes · 2017-05-19T13:07:14Z

core/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java

@@ -561,6 +561,7 @@ private void updateShard(DiscoveryNodes nodes, ShardRouting shardRouting, Shard
                        allocationIdsForShardsOnNodesThatUnderstandSeqNos(indexShardRoutingTable.activeShards(), nodes);
                final Set<String> initializingIds =
                        allocationIdsForShardsOnNodesThatUnderstandSeqNos(indexShardRoutingTable.getAllInitializingShards(), nodes);
+                shard.updatePrimaryTerm(clusterState.metaData().index(shard.shardId().getIndex()).primaryTerm(shard.shardId().id()));


I'm wondering if we should move this if clause to before updating the routing entry. @ywelsch this class is your baby, any thoughts?

Let's discuss this separately and proceed in a follow-up if needed.

bleskes · 2017-05-19T13:14:02Z

core/src/test/java/org/elasticsearch/index/shard/IndexShardTests.java

+                    ThreadPool.Names.INDEX);
+        };
+
+        final Thread first = new Thread(function.apply(randomBoolean()));


can we lock down the expected end term based on these booleans and assert for that?

bleskes · 2017-05-19T14:17:57Z

core/src/test/java/org/elasticsearch/index/shard/IndexShardTests.java

+                        @Override
+                        public void onResponse(Releasable releasable) {
+                            counter.incrementAndGet();
+                            latch.countDown();


can we add a check on the term here?

bleskes

LGTM. Thanks @jasontedor

With #24779 in place, we can now guarantee that a single translog generation file will never have a sequence number conflict that needs to be resolved by looking at primary terms. These conflicts can occur when a replica contains an operation which isn't part of the history of a newly promoted primary. That primary can then assign a different operation to the same slot and replicate it to the replica. PS. Knowing that each generation file is conflict free will simplify repairing these conflicts when we read from the translog. PPS. This PR also fixes some bugs in the piping of primary terms in the bulk shard action. These bugs are a result of the legacy of IndexRequest/DeleteRequest being a ReplicationRequest. We need to change that as a follow-up. Relates to #10708

jasontedor added :Sequence IDs v6.0.0 labels May 18, 2017

jasontedor requested review from bleskes and ywelsch May 18, 2017 20:52

jasontedor force-pushed the block-party branch 3 times, most recently from 5110598 to bf5ab75 May 18, 2017 21:14

jasontedor force-pushed the block-party branch 2 times, most recently from 64bc8c3 to 857fcdb May 19, 2017 02:09

ywelsch reviewed May 19, 2017

bleskes suggested changes May 19, 2017

bleskes mentioned this pull request May 19, 2017

Add Sequence Numbers to write operations #10708

Closed

64 tasks

jasontedor added 5 commits May 19, 2017 11:58

Fix comment

0690601

Concurrency control

5251a98

More assertions

0275c8d

Remove dead variable

6df88c3

Mutex

3057f95

bleskes approved these changes May 19, 2017

jasontedor added 2 commits May 19, 2017 16:14

Take that test

2df1106

Fix typo

ab45e7b

jasontedor merged commit 4cd70cf into elastic:master May 19, 2017

jasontedor deleted the block-party branch May 19, 2017 20:18

clintongormley added the >enhancement label May 22, 2017

bleskes mentioned this pull request May 22, 2017

Guarantee that translog generations are seqNo conflict free #24825

Merged

clintongormley added v6.0.0-beta1 and removed v6.0.0 labels Jul 25, 2017

clintongormley added the :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. label Feb 14, 2018

clintongormley removed the :Sequence IDs label Feb 14, 2018
