-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Throw back replica local checkpoint on new primary #25452
Throw back replica local checkpoint on new primary #25452
Conversation
This commit causes a replica to throwback its local checkpoint to the global checkpoint when learning of a new primary through a replica operation.
f09a99f
to
2115f4a
Compare
assert checkpoint <= this.checkpoint; | ||
processedSeqNo.clear(); | ||
firstProcessedSeqNo = checkpoint + 1; | ||
nextSeqNo = checkpoint + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that resetting nextSeqNo is incorrect. Assume that the primary-replica resync fails and that the shard here would be promoted to primary, in that case it would reuse the sequence numbers to override stuff it already had. I'll reach out to discuss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had a very long discussion about this. The solution here is fine if we add a follow-up that resets the local checkpoint tracker state on a primary during promotion (the newly promoted primary needs to reset its local checkpoint and mark the sequence numbers in its translog as completed to reestablish the state of the local checkpoint tracker, it has to do this before filling the gaps).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, such a follow-up will introduce a test that captures the problem here, namely that if we do not do something as outlined above, in this scenario a newly promoted primary can overwrite history.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this some more, I agree with the assessment we had, except for one thing: We should not reset the nextSeqNo
variable which is exposed as getMaxSeqNo
. Otherwise when writing out segments, this max sequence number information which we take from the local checkpoint tracker would be incorrect, i.e. there could be a document in the segment where the sequence number would be above max.
Put differently, nextSeqNo is not tied to the bit set (which represents the pending confirmation marker). Instead it tracks the actual translog.
getLocalCheckpoint(), | ||
globalCheckpoint, | ||
globalCheckpoint); | ||
getEngine().seqNoService().resetLocalCheckpoint(globalCheckpoint); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The global checkpoint that is provided by the new primary might be lower than the global checkpoint that we currently have (e.g. the failed primary did communicate the latest global checkpoint to us, but not to the newly appointed primary).
First we have to update the global checkpoint, then use the newly computed global checkpoint to reset the local checkpoint, otherwise the local checkpoint could end up below the global checkpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 5e9d79f.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I requested a change in how nextSeqNo
is updated and left 2 nits.
assert checkpoint <= this.checkpoint; | ||
processedSeqNo.clear(); | ||
firstProcessedSeqNo = checkpoint + 1; | ||
nextSeqNo = checkpoint + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this some more, I agree with the assessment we had, except for one thing: We should not reset the nextSeqNo
variable which is exposed as getMaxSeqNo
. Otherwise when writing out segments, this max sequence number information which we take from the local checkpoint tracker would be incorrect, i.e. there could be a document in the segment where the sequence number would be above max.
Put differently, nextSeqNo is not tied to the bit set (which represents the pending confirmation marker). Instead it tracks the actual translog.
@@ -2057,7 +2057,16 @@ public void acquireReplicaOperationPermit(final long operationPrimaryTerm, final | |||
assert operationPrimaryTerm > primaryTerm : | |||
"shard term already update. op term [" + operationPrimaryTerm + "], shardTerm [" + primaryTerm + "]"; | |||
primaryTerm = operationPrimaryTerm; | |||
logger.trace( | |||
"detected new primary with primary term [{}], " | |||
+ "resetting local checkpoint from [{}] to [{}], " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log line is incorrect, we don't know the value yet at this point towards which we are going to reset the local checkpoint. It is only determined after setting the global checkpoint in the line below. I think it's easiest to move the logging one line below and use getGlobalCheckpoint()
. I would also leave out the part which says "updating global checkpoint to {}" as the given value might be below the current global checkpoint, which might be misleading in this message (we already have trace logging for the global checkpoint updates).
.build()), | ||
SequenceNumbersService.NO_OPS_PERFORMED, | ||
SequenceNumbersService.NO_OPS_PERFORMED | ||
IndexSettingsModule.newIndexSettings( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why reformat?
Thanks @ywelsch, I have addressed your feedback. |
* master: (52 commits) Include shared/attributes.asciidoc from docs master Fixed page breaks for ICU Collation Keyword Fields Remove QueryParseContext (elastic#25486) [Test] Use a common testing class for all XContent filtering tests (elastic#25491) Tests fix - Significant terms/text aggs (elastic#25499) [DOCS] add docs for REST high level client index method (elastic#25501) Tests: Add Debian 9 (Stretch) to the packaging tests test: Run flush before upgrade and refresh after upgrade. Fix third party audit for repository-hdfs [TEST] Expect nodes getting disconnected quickly testPrimaryFailureIncreasesTerm should use assertBusy to wait for yellow Cleanup network / transport related settings (elastic#25489) Fix repository-hdfs plugin packaging test Remove allocation id from replica replication response (elastic#25488) Adjust BWC version on bad allocation request test Upgrading HDFS Repository Plugin to use HDFS 2.8.1 Client (elastic#25497) Adjust status on bad allocation explain requests Preliminary support for ARM Add doc note regarding explicit publish host Fix typo in name of test ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is one more edge-case that needs to be covered (when global checkpoint is SequenceNumbersService.UNASSIGNED_SEQ_NO
), otherwise PR looks good to me.
|
||
public void testResetCheckpoint() { | ||
final int operations = 1024 - scaledRandomIntBetween(0, 1024); | ||
int maxSeqNo = Math.toIntExact(SequenceNumbersService.NO_OPS_PERFORMED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neat, I did not know about this method
operationPrimaryTerm, | ||
getLocalCheckpoint(), | ||
getGlobalCheckpoint()); | ||
getEngine().seqNoService().resetLocalCheckpoint(getGlobalCheckpoint()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we are feeding this method the current global checkpoint, which could be still unknown, is it possible that we call resetLocalCheckpoint
with SequenceNumbersService.UNASSIGNED_SEQ_NO
? If so, I think that that would be bad. The method resetLocalCheckpoint
should have an assertion, similar to the constructor. Also we need to make sure to special-case this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think this is possible after we address #25415. A newly created primary will update its local checkpoint to -1 and calculate a global checkpoint of -1. Replicas that recover from this primary will receive a global checkpoint of -1 that they would maintain if promoted. Similarly for relocation. Thus I think that we will never see -2 here.
I think we should only add an assertion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To recap a discussion we had via another channel, we do have to worry about -2 here in the case when a primary on 5.x dies and a replica on 6.x is promoted and initiates are re-sync to another 6.x replica. I pushed a d1e0ec2.
* master: [Analysis] Support normalizer in request param (elastic#24767) Remove deprecated IdsQueryBuilder constructor (elastic#25529) Adds check for negative search request size (elastic#25397) test: also inspect the upgrade api response to check whether the upgrade really ran [DOCS] restructure java clients docs pages (elastic#25517)
@ywelsch I addressed your feedback, would you look again? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
This commit causes a replica to throw back its local checkpoint to the global checkpoint when learning of a new primary through a replica operation.
Relates #10708, relates #25355