Throw back replica local checkpoint on new primary #25452
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2057,6 +2057,15 @@ public void acquireReplicaOperationPermit(final long operationPrimaryTerm, final | |
assert operationPrimaryTerm > primaryTerm : | ||
"shard term already update. op term [" + operationPrimaryTerm + "], shardTerm [" + primaryTerm + "]"; | ||
primaryTerm = operationPrimaryTerm; | ||
logger.trace( | ||
"detected new primary with primary term [{}], " | ||
+ "resetting local checkpoint from [{}] to [{}], " | ||
This log line is incorrect: at this point we do not yet know the value to which the local checkpoint is going to be reset. That value is only determined after the global checkpoint is set in the line below. I think it's easiest to move the logging one line below and use |
||
+ "updating global checkpoint to [{}]", | ||
operationPrimaryTerm, | ||
getLocalCheckpoint(), | ||
globalCheckpoint, | ||
globalCheckpoint); | ||
getEngine().seqNoService().resetLocalCheckpoint(globalCheckpoint); | ||
The global checkpoint that is provided by the new primary might be lower than the global checkpoint that we currently have (e.g. the failed primary communicated the latest global checkpoint to us, but not to the newly appointed primary).
I pushed 5e9d79f. |
||
updateGlobalCheckpointOnReplica(globalCheckpoint); | ||
getEngine().getTranslog().rollGeneration(); | ||
}); | ||
|
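The two review comments on this hunk can be illustrated with a minimal, self-contained sketch. It is not the Elasticsearch implementation: `ReplicaCheckpointsSketch` and its methods are hypothetical stand-ins, and the sketch assumes that a replica's global checkpoint never moves backwards. It shows the ordering the reviewers ask for: apply the global checkpoint from the new primary first, then reset the local checkpoint to the resulting value, and only then build the log message.

```java
// Hypothetical stand-in for the replica shard state; not the Elasticsearch classes.
final class ReplicaCheckpointsSketch {
    private long localCheckpoint;
    private long globalCheckpoint;

    ReplicaCheckpointsSketch(long localCheckpoint, long globalCheckpoint) {
        this.localCheckpoint = localCheckpoint;
        this.globalCheckpoint = globalCheckpoint;
    }

    long getLocalCheckpoint() {
        return localCheckpoint;
    }

    long getGlobalCheckpoint() {
        return globalCheckpoint;
    }

    // Assumption: the replica keeps the highest global checkpoint it has seen,
    // so a lower value from a newly appointed primary is ignored.
    void updateGlobalCheckpointOnReplica(long globalCheckpointFromPrimary) {
        globalCheckpoint = Math.max(globalCheckpoint, globalCheckpointFromPrimary);
    }

    // Returns the trace message so the ordering is easy to check.
    String onNewPrimaryTerm(long newPrimaryTerm, long globalCheckpointFromPrimary) {
        // 1. Update the global checkpoint first.
        updateGlobalCheckpointOnReplica(globalCheckpointFromPrimary);
        // 2. The reset target is now known: the replica's own global checkpoint.
        final long previousLocalCheckpoint = localCheckpoint;
        localCheckpoint = globalCheckpoint;
        // 3. Log after the reset target has been determined.
        return String.format(
            "detected new primary with primary term [%d], resetting local checkpoint from [%d] to [%d]",
            newPrimaryTerm, previousLocalCheckpoint, localCheckpoint);
    }
}
```

With a replica at local checkpoint 10 and global checkpoint 7, a new primary that only knows global checkpoint 5 leaves the replica's global checkpoint at 7 and resets its local checkpoint to 7 rather than 5.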
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -38,8 +38,10 @@ | |
import java.util.stream.Collectors; | ||
import java.util.stream.IntStream; | ||
|
||
import static org.hamcrest.Matchers.empty; | ||
import static org.hamcrest.Matchers.equalTo; | ||
import static org.hamcrest.Matchers.isOneOf; | ||
import static org.hamcrest.Matchers.not; | ||
|
||
public class LocalCheckpointTrackerTests extends ESTestCase { | ||
|
||
|
@@ -49,14 +51,14 @@ public class LocalCheckpointTrackerTests extends ESTestCase { | |
|
||
public static LocalCheckpointTracker createEmptyTracker() { | ||
return new LocalCheckpointTracker( | ||
IndexSettingsModule.newIndexSettings( | ||
"test", | ||
Settings | ||
.builder() | ||
.put(LocalCheckpointTracker.SETTINGS_BIT_ARRAYS_SIZE.getKey(), SMALL_CHUNK_SIZE) | ||
.build()), | ||
SequenceNumbersService.NO_OPS_PERFORMED, | ||
SequenceNumbersService.NO_OPS_PERFORMED | ||
IndexSettingsModule.newIndexSettings( | ||
why reformat? |
||
"test", | ||
Settings | ||
.builder() | ||
.put(LocalCheckpointTracker.SETTINGS_BIT_ARRAYS_SIZE.getKey(), SMALL_CHUNK_SIZE) | ||
.build()), | ||
SequenceNumbersService.NO_OPS_PERFORMED, | ||
SequenceNumbersService.NO_OPS_PERFORMED | ||
); | ||
} | ||
|
||
|
@@ -236,4 +238,24 @@ public void testWaitForOpsToComplete() throws BrokenBarrierException, Interrupte | |
|
||
thread.join(); | ||
} | ||
|
||
public void testResetCheckpoint() { | ||
final int operations = 1024 - scaledRandomIntBetween(0, 1024); | ||
for (int i = 0; i < operations; i++) { | ||
if (!rarely()) { | ||
tracker.markSeqNoAsCompleted(i); | ||
} | ||
} | ||
|
||
final int localCheckpoint = | ||
randomIntBetween(Math.toIntExact(SequenceNumbersService.NO_OPS_PERFORMED), Math.toIntExact(tracker.getCheckpoint())); | ||
tracker.resetCheckpoint(localCheckpoint); | ||
assertThat(tracker.getCheckpoint(), equalTo((long) localCheckpoint)); | ||
assertThat(tracker.getMaxSeqNo(), equalTo((long) localCheckpoint)); | ||
assertThat(tracker.processedSeqNo, empty()); | ||
assertThat(tracker.generateSeqNo(), equalTo((long) (localCheckpoint + 1))); | ||
tracker.markSeqNoAsCompleted((long) (localCheckpoint + 1)); | ||
assertThat(tracker.processedSeqNo, not(empty())); | ||
assertThat(tracker.processedSeqNo.peek().get(0), equalTo(true)); | ||
} | ||
} |
I think that resetting nextSeqNo is incorrect. Assume that the primary-replica resync fails and that the shard here would be promoted to primary, in that case it would reuse the sequence numbers to override stuff it already had. I'll reach out to discuss.
We had a very long discussion about this. The solution here is fine if we add a follow-up that resets the local checkpoint tracker state on a primary during promotion (the newly promoted primary needs to reset its local checkpoint and mark the sequence numbers in its translog as completed to reestablish the state of the local checkpoint tracker, it has to do this before filling the gaps).
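The promotion step described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: `PromotionSketch`, `reestablishLocalCheckpoint`, and `gapsToFill` are hypothetical names, the translog is modelled as a plain set of sequence numbers, and the reset target is assumed to be the global checkpoint. The point is the ordering: the local checkpoint is re-established from the translog contents before any gaps are filled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper; the translog is modelled as a set of sequence numbers.
final class PromotionSketch {

    // Start from the reset point and advance over every contiguous
    // sequence number that the translog already contains.
    static long reestablishLocalCheckpoint(long globalCheckpoint, Set<Long> translogSeqNos) {
        long checkpoint = globalCheckpoint;
        while (translogSeqNos.contains(checkpoint + 1)) {
            checkpoint++;
        }
        return checkpoint;
    }

    // Only after the tracker state is re-established can the real gaps be
    // computed: sequence numbers above the local checkpoint, up to maxSeqNo,
    // that the translog does not contain.
    static List<Long> gapsToFill(long localCheckpoint, long maxSeqNo, Set<Long> translogSeqNos) {
        final List<Long> gaps = new ArrayList<>();
        for (long seqNo = localCheckpoint + 1; seqNo <= maxSeqNo; seqNo++) {
            if (!translogSeqNos.contains(seqNo)) {
                gaps.add(seqNo);
            }
        }
        return gaps;
    }
}
```

For a global checkpoint of 3 and a translog holding sequence numbers 2, 3, 4, 5, 7 and 8, the re-established local checkpoint is 5 and the only gap to fill is 6. Filling gaps straight from the reset checkpoint, without consulting the translog first, would treat 4 and 5 as gaps and overwrite history.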
Also, such a follow-up will introduce a test that captures the problem here, namely that if we do not do something as outlined above, in this scenario a newly promoted primary can overwrite history.
Thinking about this some more, I agree with the assessment we had, except for one thing: we should not reset the `nextSeqNo` variable, which is exposed as `getMaxSeqNo`. Otherwise, when writing out segments, the max sequence number information that we take from the local checkpoint tracker would be incorrect, i.e. there could be a document in the segment whose sequence number is above the max. Put differently, `nextSeqNo` is not tied to the bit set (which represents the pending confirmation marker). Instead it tracks the actual translog.
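A minimal sketch of the behaviour argued for here, assuming a much simplified tracker backed by a single `java.util.BitSet`. `CheckpointTrackerSketch` is a hypothetical class, not the `LocalCheckpointTracker` from this PR. Resetting the checkpoint clears the pending-completion bits and throws the checkpoint back, but leaves `nextSeqNo` (and therefore `getMaxSeqNo()`) untouched.

```java
import java.util.BitSet;

// Hypothetical, simplified tracker; not the LocalCheckpointTracker from this PR.
final class CheckpointTrackerSketch {
    private long checkpoint;
    private long nextSeqNo;
    // Bit i set means sequence number (checkpoint + 1 + i) has completed.
    private final BitSet processed = new BitSet();

    CheckpointTrackerSketch(long maxSeqNo, long checkpoint) {
        this.checkpoint = checkpoint;
        this.nextSeqNo = maxSeqNo + 1;
    }

    long generateSeqNo() {
        return nextSeqNo++;
    }

    long getMaxSeqNo() {
        return nextSeqNo - 1;
    }

    long getCheckpoint() {
        return checkpoint;
    }

    void markSeqNoAsCompleted(long seqNo) {
        if (seqNo >= nextSeqNo) {
            nextSeqNo = seqNo + 1; // keep the max in step with operations seen
        }
        if (seqNo <= checkpoint) {
            return; // already covered by the checkpoint
        }
        processed.set(Math.toIntExact(seqNo - checkpoint - 1));
        // Advance the checkpoint over the contiguous completed prefix.
        final int contiguous = processed.nextClearBit(0);
        if (contiguous > 0) {
            final BitSet shifted = processed.get(contiguous, processed.length());
            processed.clear();
            processed.or(shifted);
            checkpoint += contiguous;
        }
    }

    void resetCheckpoint(long newCheckpoint) {
        if (newCheckpoint > checkpoint) {
            throw new IllegalArgumentException("checkpoint can only be thrown back");
        }
        processed.clear();
        checkpoint = newCheckpoint;
        // nextSeqNo is deliberately left alone: it reflects operations already
        // written, not the pending-completion markers in the bit set.
    }
}
```

Under this variant the `getMaxSeqNo()` assertion in `testResetCheckpoint` above would no longer hold after a reset, because the max sequence number stays at the highest operation seen.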