-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce sequence-number-aware translog #22822
Introduce sequence-number-aware translog #22822
Conversation
public abstract int totalOperations(); | ||
public abstract int totalOperations(); | ||
|
||
abstract long getMinSeqNo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be nice to have javadocs for these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed abd491cbf3a62f74d2c391876aee7b3e57166b0c.
this.offset = offset; | ||
this.numOps = numOps; | ||
this.generation = generation; | ||
this.minSeqNo = minSeqNo; | ||
this.maxSeqNo = maxSeqNo; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth checking that minSeqNo <= maxSeqNo
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe something like minSeqNo <= maxSeqNo || minSeqNo == Long.MAX_VALUE && maxSeqNo == Long.MIN_VALUE
if we want to keep the merging easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for assertions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed e9370c84925e00d11d74e6785004c9de7f9feefc.
@@ -198,7 +198,7 @@ public Translog( | |||
logger.debug("wipe translog location - creating new translog"); | |||
Files.createDirectories(location); | |||
final long generation = 1; | |||
Checkpoint checkpoint = new Checkpoint(0, 0, generation, globalCheckpointSupplier.getAsLong()); | |||
Checkpoint checkpoint = new Checkpoint(0, 0, generation, Long.MAX_VALUE, Long.MIN_VALUE, globalCheckpointSupplier.getAsLong()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Negative values for translog seem to have a special meaning. Maybe this should be 0, 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or one of the constants in SequenceNumberService
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to constants. I'll comment about it at the end.
@@ -419,22 +419,21 @@ public Location add(Operation operation) throws IOException { | |||
out.writeInt(operationSize); | |||
out.seek(end); | |||
final ReleasablePagedBytesReference bytes = out.bytes(); | |||
try (ReleasableLock lock = readLock.acquire()) { | |||
try (final ReleasableLock ignored = readLock.acquire()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are implicitly final already. I do like the name change though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #22960 so this travesty does not happen again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't go so far as to say it is a travesty, but, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 4311c1fc4bfeb9830e688d47ac3729b9d7acf8e3.
protected final AtomicBoolean closed = new AtomicBoolean(false); | ||
|
||
/** | ||
* Create a reader of translog file channel. The length parameter should be consistent with totalOperations and point | ||
* at the end of the last operation in this snapshot. | ||
*/ | ||
public TranslogReader(long generation, FileChannel channel, Path path, long firstOperationOffset, long length, int totalOperations) { | ||
public TranslogReader( | ||
final long generation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you indent these so they don't look like method body?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ I hate this - sorry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 8786360f5f4f40fa31ab273464128abb34c440a4 and 48403f4baa62d1bd6223ea26a547ccb6b63ef6eb.
cool.. I quickly read through it and the basics look good. I will give it a careful look tomorrow morning (I want to look critically at the initial min/max values and also w.r.t the BWC situation where incoming seqnos can be -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some comments
public abstract int totalOperations(); | ||
public abstract int totalOperations(); | ||
|
||
abstract long getMinSeqNo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
@@ -35,11 +36,13 @@ | |||
import java.nio.file.OpenOption; | |||
import java.nio.file.Path; | |||
|
|||
class Checkpoint { | |||
final class Checkpoint { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
} | ||
|
||
// reads a checksummed checkpoint introduced in ES 5.0.0 | ||
static Checkpoint readChecksummedV1(DataInput in) throws IOException { | ||
return new Checkpoint(in.readLong(), in.readInt(), in.readLong(), SequenceNumbersService.UNASSIGNED_SEQ_NO); | ||
return new Checkpoint( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we wrap this in 2 lines instead of N this makes my eyes bleed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds horrible, you should really get that checked out. 😛
I pushed a commit that should address your concerns.
@@ -778,6 +777,8 @@ public static Type fromId(byte id) { | |||
|
|||
Source getSource(); | |||
|
|||
long seqNo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be getSeqNo
?
@@ -1147,6 +1150,7 @@ public String toString() { | |||
private final long primaryTerm; | |||
private final String reason; | |||
|
|||
@Override | |||
public long seqNo() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe getSeqNo
?
protected final AtomicBoolean closed = new AtomicBoolean(false); | ||
|
||
/** | ||
* Create a reader of translog file channel. The length parameter should be consistent with totalOperations and point | ||
* at the end of the last operation in this snapshot. | ||
*/ | ||
public TranslogReader(long generation, FileChannel channel, Path path, long firstOperationOffset, long length, int totalOperations) { | ||
public TranslogReader( | ||
final long generation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ I hate this - sorry
@@ -116,7 +143,15 @@ public static TranslogReader open(FileChannel channel, Path path, Checkpoint che | |||
throw new TranslogCorruptedException("expected shard UUID " + uuidBytes + " but got: " + ref + | |||
" this translog file belongs to a different translog. path:" + path); | |||
} | |||
return new TranslogReader(checkpoint.generation, channel, path, ref.length + CodecUtil.headerLength(TranslogWriter.TRANSLOG_CODEC) + Integer.BYTES, checkpoint.offset, checkpoint.numOps); | |||
return new TranslogReader( | |||
checkpoint.generation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please stop doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe make the constructor accept a checkpoint and that will make many of these parameters go away..
channel, | ||
path, | ||
new Checkpoint( | ||
Files.size(path), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is better now.
ex.addSuppressed(inner); | ||
} | ||
throw ex; | ||
} | ||
totalOffset += data.length(); | ||
operationCounter++; | ||
minSeqNo = Math.min(minSeqNo, seqNo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a comment that the order we assign these values matters to ensure we always have a >= relationship?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to take BWC aspect here into account. The incoming minSeqNo can be unassigned (if the ops comes from an old primary) but as soon as we start receiving valid seqNo we should never go back. At least, I hope that's where end up being and I want to start building the assertions based on that.
@@ -191,6 +205,14 @@ public int totalOperations() { | |||
return operationCounter; | |||
} | |||
|
|||
public long getMinSeqNo() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's possible to read these out of order and therefore get inconsistent values? like if you read max first and then min? Also I wonder if the default value can be confusing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be really consistent I think we need to return a tuple and read it under the lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can wait and see the usage (but document the caveat - other stats are not consistent either). At the moment it's only used for snapshots which is created under lock. I think that the future for recovery/translog trimming will not care about the the writer's values. An alternative is to not expose these but rather the last checkpoint, which is consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd prefer not to expose these unless we're sure we need them because of the concurrency issues. Or just expose them for testing if we need them there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jasontedor it seems the consensus leans towards exposing/use the last checkpoint. I presume you had a reason to leave as is. Can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed bdbbe83399a22d2df91aa5ed684d86d3579dc023.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a more detailed pass and left a bunch of comments.
On top of it, I think we should extend the translog stats with the min max (which might need the lock/tuple that Simon discussed)
About those constants. Without BWC I would argue we should initialize the min/max with the no ops performed constants. The tricky part that with BWC the translog can up full with ops and still the min/max would be no_ops_performed which is strange and will lead to bugs imo. As such, I would recommend using unassigned until the first operation with a seq# enters the system, at which point we switch to normal min/max.
this.offset = offset; | ||
this.numOps = numOps; | ||
this.generation = generation; | ||
this.minSeqNo = minSeqNo; | ||
this.maxSeqNo = maxSeqNo; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for assertions.
in.readLong(), | ||
in.readInt(), | ||
in.readLong(), | ||
SequenceNumbersService.NO_OPS_PERFORMED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use unassigned here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is addressed.
} | ||
|
||
// reads checkpoint from ES < 5.0.0 | ||
static Checkpoint readNonChecksummed(DataInput in) throws IOException { | ||
return new Checkpoint(in.readLong(), in.readInt(), in.readLong(), SequenceNumbersService.UNASSIGNED_SEQ_NO); | ||
return new Checkpoint( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can trash this in master, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 8e9091bf41d2d28cd46485e760c235b895e8a65f.
@@ -198,7 +198,7 @@ public Translog( | |||
logger.debug("wipe translog location - creating new translog"); | |||
Files.createDirectories(location); | |||
final long generation = 1; | |||
Checkpoint checkpoint = new Checkpoint(0, 0, generation, globalCheckpointSupplier.getAsLong()); | |||
Checkpoint checkpoint = new Checkpoint(0, 0, generation, Long.MAX_VALUE, Long.MIN_VALUE, globalCheckpointSupplier.getAsLong()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to constants. I'll comment about it at the end.
@@ -116,7 +143,15 @@ public static TranslogReader open(FileChannel channel, Path path, Checkpoint che | |||
throw new TranslogCorruptedException("expected shard UUID " + uuidBytes + " but got: " + ref + | |||
" this translog file belongs to a different translog. path:" + path); | |||
} | |||
return new TranslogReader(checkpoint.generation, channel, path, ref.length + CodecUtil.headerLength(TranslogWriter.TRANSLOG_CODEC) + Integer.BYTES, checkpoint.offset, checkpoint.numOps); | |||
return new TranslogReader( | |||
checkpoint.generation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe make the constructor accept a checkpoint and that will make many of these parameters go away..
@@ -191,6 +205,14 @@ public int totalOperations() { | |||
return operationCounter; | |||
} | |||
|
|||
public long getMinSeqNo() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can wait and see the usage (but document the caveat - other stats are not consistent either). At the moment it's only used for snapshots which is created under lock. I think that the future for recovery/translog trimming will not care about the the writer's values. An alternative is to not expose these but rather the last checkpoint, which is consistent.
@@ -295,7 +331,7 @@ public boolean syncUpTo(long offset) throws IOException { | |||
try { | |||
channel.force(false); | |||
checkpoint = | |||
writeCheckpoint(channelFactory, offsetToSync, opsCounter, globalCheckpoint, path.getParent(), generation); | |||
writeCheckpoint(channelFactory, offsetToSync, opsCounter, minSeqNo, maxSeqNo, globalCheckpoint, path.getParent(), generation); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should capture these under lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 7802d6ec2d2f3fdb2dbb1c79c60b3fe6c605291d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done in 7802d6ec2d2f3fdb2dbb1c79c60b3fe6c605291d.
@@ -168,7 +168,8 @@ protected void execute(Terminal terminal, OptionSet options, Environment env) th | |||
|
|||
/** Write a checkpoint file to the given location with the given generation */ | |||
public static void writeEmptyCheckpoint(Path filename, int translogLength, long translogGeneration) throws IOException { | |||
Checkpoint emptyCheckpoint = new Checkpoint(translogLength, 0, translogGeneration, SequenceNumbersService.UNASSIGNED_SEQ_NO); | |||
Checkpoint emptyCheckpoint = | |||
new Checkpoint(translogLength, 0, translogGeneration, Long.MAX_VALUE, Long.MIN_VALUE, SequenceNumbersService.UNASSIGNED_SEQ_NO); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use a static method on the translog class to make sure this is consistent with what the translog does when it creates an empty translog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed e6b356d885d22524bc899987525d6e655fffd31a although I left unsettled the issue of the default values for the min and max sequence number (I will address these later).
@@ -1208,12 +1220,12 @@ public void testRecoveryUncommittedCorruptedCheckpoint() throws IOException { | |||
TranslogConfig config = translog.getConfig(); | |||
Path ckp = config.getTranslogPath().resolve(Translog.CHECKPOINT_FILE_NAME); | |||
Checkpoint read = Checkpoint.read(ckp); | |||
Checkpoint corrupted = new Checkpoint(0, 0, 0, SequenceNumbersService.UNASSIGNED_SEQ_NO); | |||
Checkpoint corrupted = new Checkpoint(0, 0, 0, SequenceNumbersService.NO_OPS_PERFORMED, SequenceNumbersService.NO_OPS_PERFORMED, SequenceNumbersService.UNASSIGNED_SEQ_NO); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this another example we need the "empty" checkpoint utility - this is different than what the translog does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed e6b356d885d22524bc899987525d6e655fffd31a although I left unsettled the issue of the default values for the min and max sequence number (I will address these later).
for (int i = 0; i < numOps; i++) { | ||
out.reset(bytes); | ||
out.writeInt(i); | ||
writer.add(new BytesArray(bytes)); | ||
long seqNo; | ||
do { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
once we add it, can we test that we do the right things with unassigned seq nos due to BWC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 390ee829db21bae606446a5393c35af055a47c02.
fd7b4a6
to
4dfa75d
Compare
// reads checkpoint from ES < 5.0.0 | ||
static Checkpoint readNonChecksummed(DataInput in) throws IOException { | ||
return new Checkpoint(in.readLong(), in.readInt(), in.readLong(), SequenceNumbersService.UNASSIGNED_SEQ_NO); | ||
final long minSeqNo = Translog.INITIAL_MIN_SEQ_NO; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should have the same number as an empty checkpoint. That way we can tell the difference between a checkpoint that didn't have any sequence numbers and one that was empty. I guess we could just look at the size, but still, I think it'd be more clear to reserve some int for this.
out.writeLong(globalCheckpoint); | ||
} | ||
|
||
static Checkpoint emptyTranslogCheckpoint(final long offset, final long generation, final long globalCheckpoint) { | ||
return new Checkpoint(offset, 0, generation, Translog.INITIAL_MIN_SEQ_NO, Translog.INITIAL_MAX_SEQ_NO, globalCheckpoint); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm - now that you have it this way and those are the same numbers I think it'd be more clear to use UNASSIGNED_SEQ_NO
. I think it is confusing to have three names for -2
because they'll all be unexpectedly equal, you know?
ex.addSuppressed(inner); | ||
} | ||
throw ex; | ||
} | ||
totalOffset += data.length(); | ||
operationCounter++; | ||
|
||
if (minSeqNo == Translog.INITIAL_MIN_SEQ_NO) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be happy with Long.MAX_VALUE
here so long as we're careful not to let them escape from this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we assert operation counter is 0? (and move the counter increment?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed e8f11578cade5be8fdbb9c96d3478e07c0532ebf.
minSeqNo = Math.min(minSeqNo, seqNo); | ||
} | ||
|
||
if (maxSeqNo == Translog.INITIAL_MAX_SEQ_NO) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed e8f11578cade5be8fdbb9c96d3478e07c0532ebf.
@@ -191,6 +205,14 @@ public int totalOperations() { | |||
return operationCounter; | |||
} | |||
|
|||
public long getMinSeqNo() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd prefer not to expose these unless we're sure we need them because of the concurrency issues. Or just expose them for testing if we need them there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx @jasontedor . I went through it again and I think we're close. Like Nik, I have concerns around have the new constants INITIAL_MIN_SEQ_NO
. If we go the extra mile (which I think is good), I would go with setting the initial values to the intuitive NO_OPS_PERFORMED
. Any incoming unassigned seq# should set it to UNASSINGED_SEQ_NO
and I think that from that point on all the logic works.
Also - did you see my ask for extending the translog stats?
final TranslogWriter writer = | ||
new TranslogWriter(channelFactory, shardId, checkpoint, channel, file, bufferSize, globalCheckpointSupplier); | ||
return writer; | ||
writeCheckpoint(channelFactory, headerLength, 0, Translog.INITIAL_MIN_SEQ_NO, Translog.INITIAL_MAX_SEQ_NO, globalCheckpointSupplier.getAsLong(), file.getParent(), fileGeneration); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we sue the createEmptyCheckpoint method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed fc40b8f8e77b8cab1dd0544334e9d01f57b5f837.
ex.addSuppressed(inner); | ||
} | ||
throw ex; | ||
} | ||
totalOffset += data.length(); | ||
operationCounter++; | ||
|
||
if (minSeqNo == Translog.INITIAL_MIN_SEQ_NO) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we assert operation counter is 0? (and move the counter increment?)
minSeqNo = seqNo; | ||
} | ||
} else { | ||
assert seqNo != SequenceNumbersService.UNASSIGNED_SEQ_NO; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is confusing (I'm leaving my original comment below, to show how I got confused). IMO we should always set the minSeqNo / maxSeqNo upon the first value we see. If it is UNASSIGNED_SEQ_NO, then so be it. It tells us that the translog contains operations without seq#, which is currently impossible to know.
hmm... is this correct? what want to check here is that if the incoming value is UNASSIGNED_SEQ_NO then the current value of minSeqNo must be UNASSIGNED_SEQ_NO . This is to say - once seq# are added to the mix, we never go back (we may need to change some logic somewhere else to do so, but I think it's the right move).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed e8f11578cade5be8fdbb9c96d3478e07c0532ebf.
maxSeqNo = seqNo; | ||
} | ||
} else { | ||
assert seqNo != SequenceNumbersService.UNASSIGNED_SEQ_NO; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comments rely to here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed e8f11578cade5be8fdbb9c96d3478e07c0532ebf.
@@ -191,6 +205,14 @@ public int totalOperations() { | |||
return operationCounter; | |||
} | |||
|
|||
public long getMinSeqNo() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jasontedor it seems the consensus leans towards exposing/use the last checkpoint. I presume you had a reason to leave as is. Can you elaborate?
@@ -1008,42 +1032,6 @@ public void testTranslogWriter() throws IOException { | |||
IOUtils.close(writer); | |||
} | |||
|
|||
public void testFailWriterWhileClosing() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good that we have one thing less to worry about, but I wonder if it's a good idea to remove the test. I mean these transitions are always tricky and it's good to have it well tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test was testing a specific scenario: constructing the new translog reader could fail because we did a file operation (reading the channel position) in that construction. That file operation was removed, so that failure scenario can not occur anymore. Are you saying that you want the test to remain, just without testing for the failure scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I get why you removed it but It think it's indeed good to keep around - it seems we don't have any other test for the closeIntoReader method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed introduce-sequence-number-aware-translog.
long minSeqNo = Translog.INITIAL_MIN_SEQ_NO; | ||
long maxSeqNo = Translog.INITIAL_MAX_SEQ_NO; | ||
final Set<Long> seenSeqNos = new HashSet<>(); | ||
boolean opsHaveValidSequenceNumbers = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we randomize the start to simulate "no BWC mode"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed e8f11578cade5be8fdbb9c96d3478e07c0532ebf.
} while (seenSeqNos.contains(seqNo)); | ||
if (seqNo != SequenceNumbersService.UNASSIGNED_SEQ_NO) { | ||
seenSeqNos.add(seqNo); | ||
if (minSeqNo == Translog.INITIAL_MIN_SEQ_NO) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be simpler to calculate those in the end based on seenSeqNos? we can also wrap this whole thing in a opsHaveValidSequenceNumbers chcek and then the inner loop won't need to change for UNASSIGNED_SEQ_NO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed e8f11578cade5be8fdbb9c96d3478e07c0532ebf.
10cc2a8
to
7c50088
Compare
Certainly this makes sense at the shard level, but it does not make sense when we report node level or index level stats. Thus, I think they should be left out? |
test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx @jasontedor for the hard work. This LGTM. I left some minor comments but there is no need for an extra review on my end. @s1monw do you want to do another pass?
Re the stats - I see what you're saying. I'm not happy with it as I think this will give us a way to debug this and it will be good to expose, but I agree there is no clear cut easy solution that's worth pushing for in this PR. Potentially we should just remove translog stats from common stats, I'm not sure how much value the aggregation per index/node/cluster adds here.
final long maxSeqNo = Translog.INITIAL_MAX_SEQ_NO; | ||
return new Checkpoint(in.readLong(), in.readInt(), in.readLong(), minSeqNo, maxSeqNo, SequenceNumbersService.UNASSIGNED_SEQ_NO); | ||
static Checkpoint readChecksummedV1(final DataInput in) throws IOException { | ||
final long minSeqNo = SequenceNumbersService.NO_OPS_PERFORMED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should be UNASSIGNED_SEQ_NO no? (unless maybe if total count is 0, but I'm not sure it's worth the code song and dance it will require)
writer.add(new BytesArray(bytes), randomNonNegativeLong()); | ||
} | ||
writer.sync(); | ||
try (TranslogReader reader = writer.closeIntoReader()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to check the transfer of the checkpoint as well, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 48f02c2bedbc08e10099605030b92b8ca1bd34cd.
protected final long length; | ||
private final int totalOperations; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it's worth having these extra fields now that we have a checkpoint? code that needs it can either call the getter methods or access the checkpoint directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I thought about removing them and decided the code read easier with them than with checkpoint.numOps
and checkpoint.offset
everywhere so reverted it. I can remove them if you feel strongly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to remove the fields but I'm fine with doing it in a dead simple followup. I like not doing it in this PR to keep it smaller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'm fine with it being a follow-up.
return new Checkpoint(in.readLong(), in.readInt(), in.readLong(), SequenceNumbersService.UNASSIGNED_SEQ_NO); | ||
// reads a checksummed checkpoint introduced in ES 5.0.0 | ||
static Checkpoint readChecksummedV1(final DataInput in) throws IOException { | ||
final long minSeqNo = SequenceNumbersService.NO_OPS_PERFORMED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copying the commit comment as the code moved around:
these should be UNASSIGNED_SEQ_NO no? (unless maybe if total count is 0, but I'm not sure it's worth the code song and dance it will require)
return getLastSyncedCheckpoint(); | ||
} | ||
|
||
@Override | ||
public long sizeInBytes() { | ||
return totalOffset; | ||
} | ||
|
||
/** | ||
* closes this writer and transfers it's underlying file channel to a new immutable reader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed ca0a0adf1820c68f096d9167b9f6f1efdac2c51d.
@@ -80,6 +85,8 @@ public TranslogWriter( | |||
this.outputStream = new BufferedChannelOutputStream(java.nio.channels.Channels.newOutputStream(channel), bufferSize.bytesAsInt()); | |||
this.lastSyncedCheckpoint = initialCheckpoint; | |||
this.totalOffset = initialCheckpoint.offset; | |||
this.minSeqNo = initialCheckpoint.minSeqNo; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is right but it is confusing me some. Maybe you can explain it or maybe add a comment?
So my problem is that on figure reading it looks like you are keeping the minimum sequence number as made by some previous writer. So if you make more than one of these writers then this minimum value will stay the minimum value from the first writer. Would it make more sense to use the NO_OPS_PERFORMED
constant here? Or maybe add a comment about what it when why we want this "dragging backwards" behavior sometimes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only ever called with a new empty checkpoint. We can assert, would that make it better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please. That way I don't start thinking about crazy stuff that we're not doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 52f68e1979b80f9ae4a3c7e462546bd779110963.
retest this please |
test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments nothing major
// reads a checksummed checkpoint introduced in ES 5.0.0 | ||
static Checkpoint readChecksummedV1(DataInput in) throws IOException { | ||
return new Checkpoint(in.readLong(), in.readInt(), in.readLong(), SequenceNumbersService.UNASSIGNED_SEQ_NO); | ||
static Checkpoint readChecksummedV2(final DataInput in) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a comment when this version was introduced. Maybe instead of V1 and V2 we use the version in the method name like Pre6_0_0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a commit that does this.
CodecUtil.checksumEntireFile(indexInput); | ||
final int fileVersion = CodecUtil.checkHeader(indexInput, CHECKPOINT_CODEC, INITIAL_VERSION, CURRENT_VERSION); | ||
if (fileVersion == INITIAL_VERSION) { | ||
assert indexInput.length() == V1_FILE_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have the actual length in the message? this would be aweful if we didn't have it. same below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a commit that does this.
assert indexInput.length() == FILE_SIZE; | ||
return Checkpoint.readChecksummedV2(indexInput); | ||
} | ||
assert fileVersion == CURRENT_VERSION; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we get the version in the message
*/ | ||
public static TranslogReader open(FileChannel channel, Path path, Checkpoint checkpoint, String translogUUID) throws IOException { | ||
public static TranslogReader open( | ||
final FileChannel channel, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugh can we have one line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a commit just for you and only for you that does this.
@@ -138,6 +158,11 @@ public int totalOperations() { | |||
return totalOperations; | |||
} | |||
|
|||
@Override | |||
Checkpoint getCheckpoint() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this method be final?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I pushed a commit that does this.
@@ -56,6 +55,11 @@ public int totalOperations() { | |||
} | |||
|
|||
@Override | |||
Checkpoint getCheckpoint() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this method be final?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A final
modifier here would be redundant because the class is final.
@@ -80,6 +85,10 @@ public TranslogWriter( | |||
this.outputStream = new BufferedChannelOutputStream(java.nio.channels.Channels.newOutputStream(channel), bufferSize.bytesAsInt()); | |||
this.lastSyncedCheckpoint = initialCheckpoint; | |||
this.totalOffset = initialCheckpoint.offset; | |||
assert initialCheckpoint.minSeqNo == SequenceNumbersService.NO_OPS_PERFORMED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we get the actual value in the message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I pushed a commit that does this.
ex.addSuppressed(inner); | ||
} | ||
throw ex; | ||
} | ||
totalOffset += data.length(); | ||
|
||
if (minSeqNo == SequenceNumbersService.NO_OPS_PERFORMED) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any chance we can get this logic as SequenceNumbersService#max
and SequenceNumbersService#min
where we can document and test it separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I pushed a commit that does this (and added tests).
Today, the relationship between Lucene and the translog is rather simple: every document not in Lucene is guaranteed to be in the translog. We need a stronger guarantee from the translog though, namely that it can replay all operations after a certain sequence number. For this to be possible, the translog has to made sequence-number aware. As a first step, we introduce the min and max sequence numbers into the translog so that each generation knows the possible range of operations contained in the generation. This will enable future work to keep around all generations containing operations after a certain sequence number (e.g., the global checkpoint).
48f02c2
to
772a513
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@jasontedor looks awesome |
* master: Mark IP range aggregator test as awaits fix Add note and link to 'tune for disk usage' (elastic#23252)
Today, the relationship between Lucene and the translog is rather simple: every document not in Lucene is guaranteed to be in the translog. We need a stronger guarantee from the translog though, namely that it can replay all operations after a certain sequence number. For this to be possible, the translog has to made sequence-number aware. As a first step, we introduce the min and max sequence numbers into the translog so that each generation knows the possible range of operations contained in the generation. This will enable future work to keep around all generations containing operations after a certain sequence number (e.g., the global checkpoint).
Relates #10708