Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Throw back replica local checkpoint on new primary #25452

Merged
merged 18 commits into from
Jul 5, 2017

Conversation

jasontedor
Copy link
Member

@jasontedor jasontedor commented Jun 28, 2017

This commit causes a replica to throw back its local checkpoint to the global checkpoint when learning of a new primary through a replica operation.

Relates #10708, relates #25355

This commit causes a replica to throwback its local checkpoint to the
global checkpoint when learning of a new primary through a replica
operation.
@jasontedor jasontedor force-pushed the local-checkpoint-throwback branch from f09a99f to 2115f4a Compare June 28, 2017 18:39
@jasontedor jasontedor changed the title Throwback replica local checkpoint on new primary Throw back replica local checkpoint on new primary Jun 28, 2017
assert checkpoint <= this.checkpoint;
processedSeqNo.clear();
firstProcessedSeqNo = checkpoint + 1;
nextSeqNo = checkpoint + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that resetting nextSeqNo is incorrect. Assume that the primary-replica resync fails and that the shard here would be promoted to primary, in that case it would reuse the sequence numbers to override stuff it already had. I'll reach out to discuss.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a very long discussion about this. The solution here is fine if we add a follow-up that resets the local checkpoint tracker state on a primary during promotion (the newly promoted primary needs to reset its local checkpoint and mark the sequence numbers in its translog as completed to reestablish the state of the local checkpoint tracker, it has to do this before filling the gaps).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, such a follow-up will introduce a test that captures the problem here, namely that if we do not do something as outlined above, in this scenario a newly promoted primary can overwrite history.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this some more, I agree with the assessment we had, except for one thing: We should not reset the nextSeqNo variable which is exposed as getMaxSeqNo. Otherwise when writing out segments, this max sequence number information which we take from the local checkpoint tracker would be incorrect, i.e. there could be a document in the segment where the sequence number would be above max.

Put differently, nextSeqNo is not tied to the bit set (which represents the pending confirmation marker). Instead it tracks the actual translog.

getLocalCheckpoint(),
globalCheckpoint,
globalCheckpoint);
getEngine().seqNoService().resetLocalCheckpoint(globalCheckpoint);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The global checkpoint that is provided by the new primary might be lower than the global checkpoint that we currently have (e.g. the failed primary did communicate the latest global checkpoint to us, but not to the newly appointed primary).
First we have to update the global checkpoint, then use the newly computed global checkpoint to reset the local checkpoint, otherwise the local checkpoint could end up below the global checkpoint.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 5e9d79f.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I requested a change in how nextSeqNo is updated and left 2 nits.

assert checkpoint <= this.checkpoint;
processedSeqNo.clear();
firstProcessedSeqNo = checkpoint + 1;
nextSeqNo = checkpoint + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this some more, I agree with the assessment we had, except for one thing: We should not reset the nextSeqNo variable which is exposed as getMaxSeqNo. Otherwise when writing out segments, this max sequence number information which we take from the local checkpoint tracker would be incorrect, i.e. there could be a document in the segment where the sequence number would be above max.

Put differently, nextSeqNo is not tied to the bit set (which represents the pending confirmation marker). Instead it tracks the actual translog.

@@ -2057,7 +2057,16 @@ public void acquireReplicaOperationPermit(final long operationPrimaryTerm, final
assert operationPrimaryTerm > primaryTerm :
"shard term already update. op term [" + operationPrimaryTerm + "], shardTerm [" + primaryTerm + "]";
primaryTerm = operationPrimaryTerm;
logger.trace(
"detected new primary with primary term [{}], "
+ "resetting local checkpoint from [{}] to [{}], "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log line is incorrect, we don't know the value yet at this point towards which we are going to reset the local checkpoint. It is only determined after setting the global checkpoint in the line below. I think it's easiest to move the logging one line below and use getGlobalCheckpoint(). I would also leave out the part which says "updating global checkpoint to {}" as the given value might be below the current global checkpoint, which might be misleading in this message (we already have trace logging for the global checkpoint updates).

.build()),
SequenceNumbersService.NO_OPS_PERFORMED,
SequenceNumbersService.NO_OPS_PERFORMED
IndexSettingsModule.newIndexSettings(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why reformat?

@jasontedor
Copy link
Member Author

Thanks @ywelsch, I have addressed your feedback.

* master: (52 commits)
  Include shared/attributes.asciidoc from docs master
  Fixed page breaks for ICU Collation Keyword Fields
  Remove QueryParseContext (elastic#25486)
  [Test] Use a common testing class for all XContent filtering tests (elastic#25491)
  Tests fix - Significant terms/text aggs  (elastic#25499)
  [DOCS] add docs for REST high level client index method (elastic#25501)
  Tests: Add Debian 9 (Stretch) to the packaging tests
  test: Run flush before upgrade and refresh after upgrade.
  Fix third party audit for repository-hdfs
  [TEST] Expect nodes getting disconnected quickly
  testPrimaryFailureIncreasesTerm should use assertBusy to wait for yellow
  Cleanup network / transport related settings (elastic#25489)
  Fix repository-hdfs plugin packaging test
  Remove allocation id from replica replication response (elastic#25488)
  Adjust BWC version on bad allocation request test
  Upgrading HDFS Repository Plugin to use HDFS 2.8.1 Client (elastic#25497)
  Adjust status on bad allocation explain requests
  Preliminary support for ARM
  Add doc note regarding explicit publish host
  Fix typo in name of test
  ...
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is one more edge-case that needs to be covered (when global checkpoint is SequenceNumbersService.UNASSIGNED_SEQ_NO), otherwise PR looks good to me.


public void testResetCheckpoint() {
final int operations = 1024 - scaledRandomIntBetween(0, 1024);
int maxSeqNo = Math.toIntExact(SequenceNumbersService.NO_OPS_PERFORMED);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat, I did not know about this method

operationPrimaryTerm,
getLocalCheckpoint(),
getGlobalCheckpoint());
getEngine().seqNoService().resetLocalCheckpoint(getGlobalCheckpoint());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we are feeding this method the current global checkpoint, which could be still unknown, is it possible that we call resetLocalCheckpoint with SequenceNumbersService.UNASSIGNED_SEQ_NO? If so, I think that that would be bad. The method resetLocalCheckpoint should have an assertion, similar to the constructor. Also we need to make sure to special-case this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this is possible after we address #25415. A newly created primary will update its local checkpoint to -1 and calculate a global checkpoint of -1. Replicas that recover from this primary will receive a global checkpoint of -1 that they would maintain if promoted. Similarly for relocation. Thus I think that we will never see -2 here.

I think we should only add an assertion here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To recap a discussion we had via another channel, we do have to worry about -2 here in the case when a primary on 5.x dies and a replica on 6.x is promoted and initiates are re-sync to another 6.x replica. I pushed a d1e0ec2.

* master:
  [Analysis] Support normalizer in request param (elastic#24767)
  Remove deprecated IdsQueryBuilder constructor (elastic#25529)
  Adds check for negative search request size (elastic#25397)
  test: also inspect the upgrade api response to check whether the upgrade really ran
  [DOCS] restructure java clients docs pages (elastic#25517)
@jasontedor
Copy link
Member Author

@ywelsch I addressed your feedback, would you look again?

@jasontedor jasontedor requested a review from ywelsch July 4, 2017 16:30
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@jasontedor jasontedor merged commit 7dcd81b into elastic:master Jul 5, 2017
@jasontedor jasontedor deleted the local-checkpoint-throwback branch July 5, 2017 13:17
@clintongormley clintongormley added :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Sequence IDs labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v6.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants