Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce sequence-number-aware translog #22822

Merged

Conversation

jasontedor
Copy link
Member

Today, the relationship between Lucene and the translog is rather simple: every document not in Lucene is guaranteed to be in the translog. We need a stronger guarantee from the translog though, namely that it can replay all operations after a certain sequence number. For this to be possible, the translog has to made sequence-number aware. As a first step, we introduce the min and max sequence numbers into the translog so that each generation knows the possible range of operations contained in the generation. This will enable future work to keep around all generations containing operations after a certain sequence number (e.g., the global checkpoint).

Relates #10708

public abstract int totalOperations();
public abstract int totalOperations();

abstract long getMinSeqNo();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be nice to have javadocs for these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed abd491cbf3a62f74d2c391876aee7b3e57166b0c.

this.offset = offset;
this.numOps = numOps;
this.generation = generation;
this.minSeqNo = minSeqNo;
this.maxSeqNo = maxSeqNo;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth checking that minSeqNo <= maxSeqNo?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe something like minSeqNo <= maxSeqNo || minSeqNo == Long.MAX_VALUE && maxSeqNo == Long.MIN_VALUE if we want to keep the merging easier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for assertions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed e9370c84925e00d11d74e6785004c9de7f9feefc.

@@ -198,7 +198,7 @@ public Translog(
logger.debug("wipe translog location - creating new translog");
Files.createDirectories(location);
final long generation = 1;
Checkpoint checkpoint = new Checkpoint(0, 0, generation, globalCheckpointSupplier.getAsLong());
Checkpoint checkpoint = new Checkpoint(0, 0, generation, Long.MAX_VALUE, Long.MIN_VALUE, globalCheckpointSupplier.getAsLong());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Negative values for translog seem to have a special meaning. Maybe this should be 0, 0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or one of the constants in SequenceNumberService?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to constants. I'll comment about it at the end.

@@ -419,22 +419,21 @@ public Location add(Operation operation) throws IOException {
out.writeInt(operationSize);
out.seek(end);
final ReleasablePagedBytesReference bytes = out.bytes();
try (ReleasableLock lock = readLock.acquire()) {
try (final ReleasableLock ignored = readLock.acquire()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are implicitly final already. I do like the name change though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #22960 so this travesty does not happen again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't go so far as to say it is a travesty, but, thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 4311c1fc4bfeb9830e688d47ac3729b9d7acf8e3.

protected final AtomicBoolean closed = new AtomicBoolean(false);

/**
* Create a reader of translog file channel. The length parameter should be consistent with totalOperations and point
* at the end of the last operation in this snapshot.
*/
public TranslogReader(long generation, FileChannel channel, Path path, long firstOperationOffset, long length, int totalOperations) {
public TranslogReader(
final long generation,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you indent these so they don't look like method body?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ I hate this - sorry

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 8786360f5f4f40fa31ab273464128abb34c440a4 and 48403f4baa62d1bd6223ea26a547ccb6b63ef6eb.

@bleskes
Copy link
Contributor

bleskes commented Jan 26, 2017

cool.. I quickly read through it and the basics look good. I will give it a careful look tomorrow morning (I want to look critically at the initial min/max values and also w.r.t the BWC situation where incoming seqnos can be -1)

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some comments

public abstract int totalOperations();
public abstract int totalOperations();

abstract long getMinSeqNo();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

@@ -35,11 +36,13 @@
import java.nio.file.OpenOption;
import java.nio.file.Path;

class Checkpoint {
final class Checkpoint {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

}

// reads a checksummed checkpoint introduced in ES 5.0.0
static Checkpoint readChecksummedV1(DataInput in) throws IOException {
return new Checkpoint(in.readLong(), in.readInt(), in.readLong(), SequenceNumbersService.UNASSIGNED_SEQ_NO);
return new Checkpoint(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we wrap this in 2 lines instead of N this makes my eyes bleed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds horrible, you should really get that checked out. 😛

I pushed a commit that should address your concerns.

@@ -778,6 +777,8 @@ public static Type fromId(byte id) {

Source getSource();

long seqNo();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be getSeqNo?

@@ -1147,6 +1150,7 @@ public String toString() {
private final long primaryTerm;
private final String reason;

@Override
public long seqNo() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe getSeqNo?

protected final AtomicBoolean closed = new AtomicBoolean(false);

/**
* Create a reader of translog file channel. The length parameter should be consistent with totalOperations and point
* at the end of the last operation in this snapshot.
*/
public TranslogReader(long generation, FileChannel channel, Path path, long firstOperationOffset, long length, int totalOperations) {
public TranslogReader(
final long generation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ I hate this - sorry

@@ -116,7 +143,15 @@ public static TranslogReader open(FileChannel channel, Path path, Checkpoint che
throw new TranslogCorruptedException("expected shard UUID " + uuidBytes + " but got: " + ref +
" this translog file belongs to a different translog. path:" + path);
}
return new TranslogReader(checkpoint.generation, channel, path, ref.length + CodecUtil.headerLength(TranslogWriter.TRANSLOG_CODEC) + Integer.BYTES, checkpoint.offset, checkpoint.numOps);
return new TranslogReader(
checkpoint.generation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please stop doing this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe make the constructor accept a checkpoint and that will make many of these parameters go away..

channel,
path,
new Checkpoint(
Files.size(path),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better now.

ex.addSuppressed(inner);
}
throw ex;
}
totalOffset += data.length();
operationCounter++;
minSeqNo = Math.min(minSeqNo, seqNo);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a comment that the order we assign these values matters to ensure we always have a >= relationship?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to take BWC aspect here into account. The incoming minSeqNo can be unassigned (if the ops comes from an old primary) but as soon as we start receiving valid seqNo we should never go back. At least, I hope that's where end up being and I want to start building the assertions based on that.

@@ -191,6 +205,14 @@ public int totalOperations() {
return operationCounter;
}

public long getMinSeqNo() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's possible to read these out of order and therefore get inconsistent values? like if you read max first and then min? Also I wonder if the default value can be confusing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be really consistent I think we need to return a tuple and read it under the lock?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can wait and see the usage (but document the caveat - other stats are not consistent either). At the moment it's only used for snapshots which is created under lock. I think that the future for recovery/translog trimming will not care about the the writer's values. An alternative is to not expose these but rather the last checkpoint, which is consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd prefer not to expose these unless we're sure we need them because of the concurrency issues. Or just expose them for testing if we need them there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasontedor it seems the consensus leans towards exposing/use the last checkpoint. I presume you had a reason to leave as is. Can you elaborate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed bdbbe83399a22d2df91aa5ed684d86d3579dc023.

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a more detailed pass and left a bunch of comments.

On top of it, I think we should extend the translog stats with the min max (which might need the lock/tuple that Simon discussed)

About those constants. Without BWC I would argue we should initialize the min/max with the no ops performed constants. The tricky part that with BWC the translog can up full with ops and still the min/max would be no_ops_performed which is strange and will lead to bugs imo. As such, I would recommend using unassigned until the first operation with a seq# enters the system, at which point we switch to normal min/max.

this.offset = offset;
this.numOps = numOps;
this.generation = generation;
this.minSeqNo = minSeqNo;
this.maxSeqNo = maxSeqNo;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for assertions.

in.readLong(),
in.readInt(),
in.readLong(),
SequenceNumbersService.NO_OPS_PERFORMED,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use unassigned here too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is addressed.

}

// reads checkpoint from ES < 5.0.0
static Checkpoint readNonChecksummed(DataInput in) throws IOException {
return new Checkpoint(in.readLong(), in.readInt(), in.readLong(), SequenceNumbersService.UNASSIGNED_SEQ_NO);
return new Checkpoint(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can trash this in master, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 8e9091bf41d2d28cd46485e760c235b895e8a65f.

@@ -198,7 +198,7 @@ public Translog(
logger.debug("wipe translog location - creating new translog");
Files.createDirectories(location);
final long generation = 1;
Checkpoint checkpoint = new Checkpoint(0, 0, generation, globalCheckpointSupplier.getAsLong());
Checkpoint checkpoint = new Checkpoint(0, 0, generation, Long.MAX_VALUE, Long.MIN_VALUE, globalCheckpointSupplier.getAsLong());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to constants. I'll comment about it at the end.

@@ -116,7 +143,15 @@ public static TranslogReader open(FileChannel channel, Path path, Checkpoint che
throw new TranslogCorruptedException("expected shard UUID " + uuidBytes + " but got: " + ref +
" this translog file belongs to a different translog. path:" + path);
}
return new TranslogReader(checkpoint.generation, channel, path, ref.length + CodecUtil.headerLength(TranslogWriter.TRANSLOG_CODEC) + Integer.BYTES, checkpoint.offset, checkpoint.numOps);
return new TranslogReader(
checkpoint.generation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe make the constructor accept a checkpoint and that will make many of these parameters go away..

@@ -191,6 +205,14 @@ public int totalOperations() {
return operationCounter;
}

public long getMinSeqNo() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can wait and see the usage (but document the caveat - other stats are not consistent either). At the moment it's only used for snapshots which is created under lock. I think that the future for recovery/translog trimming will not care about the the writer's values. An alternative is to not expose these but rather the last checkpoint, which is consistent.

@@ -295,7 +331,7 @@ public boolean syncUpTo(long offset) throws IOException {
try {
channel.force(false);
checkpoint =
writeCheckpoint(channelFactory, offsetToSync, opsCounter, globalCheckpoint, path.getParent(), generation);
writeCheckpoint(channelFactory, offsetToSync, opsCounter, minSeqNo, maxSeqNo, globalCheckpoint, path.getParent(), generation);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should capture these under lock.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 7802d6ec2d2f3fdb2dbb1c79c60b3fe6c605291d.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done in 7802d6ec2d2f3fdb2dbb1c79c60b3fe6c605291d.

@@ -168,7 +168,8 @@ protected void execute(Terminal terminal, OptionSet options, Environment env) th

/** Write a checkpoint file to the given location with the given generation */
public static void writeEmptyCheckpoint(Path filename, int translogLength, long translogGeneration) throws IOException {
Checkpoint emptyCheckpoint = new Checkpoint(translogLength, 0, translogGeneration, SequenceNumbersService.UNASSIGNED_SEQ_NO);
Checkpoint emptyCheckpoint =
new Checkpoint(translogLength, 0, translogGeneration, Long.MAX_VALUE, Long.MIN_VALUE, SequenceNumbersService.UNASSIGNED_SEQ_NO);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use a static method on the translog class to make sure this is consistent with what the translog does when it creates an empty translog.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed e6b356d885d22524bc899987525d6e655fffd31a although I left unsettled the issue of the default values for the min and max sequence number (I will address these later).

@@ -1208,12 +1220,12 @@ public void testRecoveryUncommittedCorruptedCheckpoint() throws IOException {
TranslogConfig config = translog.getConfig();
Path ckp = config.getTranslogPath().resolve(Translog.CHECKPOINT_FILE_NAME);
Checkpoint read = Checkpoint.read(ckp);
Checkpoint corrupted = new Checkpoint(0, 0, 0, SequenceNumbersService.UNASSIGNED_SEQ_NO);
Checkpoint corrupted = new Checkpoint(0, 0, 0, SequenceNumbersService.NO_OPS_PERFORMED, SequenceNumbersService.NO_OPS_PERFORMED, SequenceNumbersService.UNASSIGNED_SEQ_NO);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this another example we need the "empty" checkpoint utility - this is different than what the translog does.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed e6b356d885d22524bc899987525d6e655fffd31a although I left unsettled the issue of the default values for the min and max sequence number (I will address these later).

for (int i = 0; i < numOps; i++) {
out.reset(bytes);
out.writeInt(i);
writer.add(new BytesArray(bytes));
long seqNo;
do {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once we add it, can we test that we do the right things with unassigned seq nos due to BWC?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 390ee829db21bae606446a5393c35af055a47c02.

@jasontedor jasontedor force-pushed the introduce-sequence-number-aware-translog branch 2 times, most recently from fd7b4a6 to 4dfa75d Compare February 7, 2017 21:30
@jasontedor
Copy link
Member Author

@nik9000 @bleskes @s1monw I think that this is ready for another round of reviews.

// reads checkpoint from ES < 5.0.0
static Checkpoint readNonChecksummed(DataInput in) throws IOException {
return new Checkpoint(in.readLong(), in.readInt(), in.readLong(), SequenceNumbersService.UNASSIGNED_SEQ_NO);
final long minSeqNo = Translog.INITIAL_MIN_SEQ_NO;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should have the same number as an empty checkpoint. That way we can tell the difference between a checkpoint that didn't have any sequence numbers and one that was empty. I guess we could just look at the size, but still, I think it'd be more clear to reserve some int for this.

out.writeLong(globalCheckpoint);
}

static Checkpoint emptyTranslogCheckpoint(final long offset, final long generation, final long globalCheckpoint) {
return new Checkpoint(offset, 0, generation, Translog.INITIAL_MIN_SEQ_NO, Translog.INITIAL_MAX_SEQ_NO, globalCheckpoint);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm - now that you have it this way and those are the same numbers I think it'd be more clear to use UNASSIGNED_SEQ_NO. I think it is confusing to have three names for -2 because they'll all be unexpectedly equal, you know?

ex.addSuppressed(inner);
}
throw ex;
}
totalOffset += data.length();
operationCounter++;

if (minSeqNo == Translog.INITIAL_MIN_SEQ_NO) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy with Long.MAX_VALUE here so long as we're careful not to let them escape from this class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we assert operation counter is 0? (and move the counter increment?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed e8f11578cade5be8fdbb9c96d3478e07c0532ebf.

minSeqNo = Math.min(minSeqNo, seqNo);
}

if (maxSeqNo == Translog.INITIAL_MAX_SEQ_NO) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed e8f11578cade5be8fdbb9c96d3478e07c0532ebf.

@@ -191,6 +205,14 @@ public int totalOperations() {
return operationCounter;
}

public long getMinSeqNo() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd prefer not to expose these unless we're sure we need them because of the concurrency issues. Or just expose them for testing if we need them there.

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx @jasontedor . I went through it again and I think we're close. Like Nik, I have concerns around have the new constants INITIAL_MIN_SEQ_NO. If we go the extra mile (which I think is good), I would go with setting the initial values to the intuitive NO_OPS_PERFORMED. Any incoming unassigned seq# should set it to UNASSINGED_SEQ_NO and I think that from that point on all the logic works.

Also - did you see my ask for extending the translog stats?

final TranslogWriter writer =
new TranslogWriter(channelFactory, shardId, checkpoint, channel, file, bufferSize, globalCheckpointSupplier);
return writer;
writeCheckpoint(channelFactory, headerLength, 0, Translog.INITIAL_MIN_SEQ_NO, Translog.INITIAL_MAX_SEQ_NO, globalCheckpointSupplier.getAsLong(), file.getParent(), fileGeneration);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we sue the createEmptyCheckpoint method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed fc40b8f8e77b8cab1dd0544334e9d01f57b5f837.

ex.addSuppressed(inner);
}
throw ex;
}
totalOffset += data.length();
operationCounter++;

if (minSeqNo == Translog.INITIAL_MIN_SEQ_NO) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we assert operation counter is 0? (and move the counter increment?)

minSeqNo = seqNo;
}
} else {
assert seqNo != SequenceNumbersService.UNASSIGNED_SEQ_NO;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is confusing (I'm leaving my original comment below, to show how I got confused). IMO we should always set the minSeqNo / maxSeqNo upon the first value we see. If it is UNASSIGNED_SEQ_NO, then so be it. It tells us that the translog contains operations without seq#, which is currently impossible to know.

hmm... is this correct? what want to check here is that if the incoming value is UNASSIGNED_SEQ_NO then the current value of minSeqNo must be UNASSIGNED_SEQ_NO . This is to say - once seq# are added to the mix, we never go back (we may need to change some logic somewhere else to do so, but I think it's the right move).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed e8f11578cade5be8fdbb9c96d3478e07c0532ebf.

maxSeqNo = seqNo;
}
} else {
assert seqNo != SequenceNumbersService.UNASSIGNED_SEQ_NO;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments rely to here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed e8f11578cade5be8fdbb9c96d3478e07c0532ebf.

@@ -191,6 +205,14 @@ public int totalOperations() {
return operationCounter;
}

public long getMinSeqNo() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasontedor it seems the consensus leans towards exposing/use the last checkpoint. I presume you had a reason to leave as is. Can you elaborate?

@@ -1008,42 +1032,6 @@ public void testTranslogWriter() throws IOException {
IOUtils.close(writer);
}

public void testFailWriterWhileClosing() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good that we have one thing less to worry about, but I wonder if it's a good idea to remove the test. I mean these transitions are always tricky and it's good to have it well tested?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was testing a specific scenario: constructing the new translog reader could fail because we did a file operation (reading the channel position) in that construction. That file operation was removed, so that failure scenario can not occur anymore. Are you saying that you want the test to remain, just without testing for the failure scenario?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I get why you removed it but It think it's indeed good to keep around - it seems we don't have any other test for the closeIntoReader method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed introduce-sequence-number-aware-translog.

long minSeqNo = Translog.INITIAL_MIN_SEQ_NO;
long maxSeqNo = Translog.INITIAL_MAX_SEQ_NO;
final Set<Long> seenSeqNos = new HashSet<>();
boolean opsHaveValidSequenceNumbers = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we randomize the start to simulate "no BWC mode"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed e8f11578cade5be8fdbb9c96d3478e07c0532ebf.

} while (seenSeqNos.contains(seqNo));
if (seqNo != SequenceNumbersService.UNASSIGNED_SEQ_NO) {
seenSeqNos.add(seqNo);
if (minSeqNo == Translog.INITIAL_MIN_SEQ_NO) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be simpler to calculate those in the end based on seenSeqNos? we can also wrap this whole thing in a opsHaveValidSequenceNumbers chcek and then the inner loop won't need to change for UNASSIGNED_SEQ_NO

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed e8f11578cade5be8fdbb9c96d3478e07c0532ebf.

@jasontedor jasontedor force-pushed the introduce-sequence-number-aware-translog branch from 10cc2a8 to 7c50088 Compare February 16, 2017 15:12
@jasontedor
Copy link
Member Author

Also - did you see my ask for extending the translog stats?

Certainly this makes sense at the shard level, but it does not make sense when we report node level or index level stats. Thus, I think they should be left out?

@jasontedor
Copy link
Member Author

Thanks @nik9000 and @bleskes; I think I've addressed your feedback.

@bleskes
Copy link
Contributor

bleskes commented Feb 17, 2017

test this please

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx @jasontedor for the hard work. This LGTM. I left some minor comments but there is no need for an extra review on my end. @s1monw do you want to do another pass?

Re the stats - I see what you're saying. I'm not happy with it as I think this will give us a way to debug this and it will be good to expose, but I agree there is no clear cut easy solution that's worth pushing for in this PR. Potentially we should just remove translog stats from common stats, I'm not sure how much value the aggregation per index/node/cluster adds here.

final long maxSeqNo = Translog.INITIAL_MAX_SEQ_NO;
return new Checkpoint(in.readLong(), in.readInt(), in.readLong(), minSeqNo, maxSeqNo, SequenceNumbersService.UNASSIGNED_SEQ_NO);
static Checkpoint readChecksummedV1(final DataInput in) throws IOException {
final long minSeqNo = SequenceNumbersService.NO_OPS_PERFORMED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be UNASSIGNED_SEQ_NO no? (unless maybe if total count is 0, but I'm not sure it's worth the code song and dance it will require)

writer.add(new BytesArray(bytes), randomNonNegativeLong());
}
writer.sync();
try (TranslogReader reader = writer.closeIntoReader()) {
Copy link
Contributor

@bleskes bleskes Feb 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to check the transfer of the checkpoint as well, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 48f02c2bedbc08e10099605030b92b8ca1bd34cd.

protected final long length;
private final int totalOperations;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's worth having these extra fields now that we have a checkpoint? code that needs it can either call the getter methods or access the checkpoint directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about removing them and decided the code read easier with them than with checkpoint.numOps and checkpoint.offset everywhere so reverted it. I can remove them if you feel strongly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to remove the fields but I'm fine with doing it in a dead simple followup. I like not doing it in this PR to keep it smaller.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'm fine with it being a follow-up.

return new Checkpoint(in.readLong(), in.readInt(), in.readLong(), SequenceNumbersService.UNASSIGNED_SEQ_NO);
// reads a checksummed checkpoint introduced in ES 5.0.0
static Checkpoint readChecksummedV1(final DataInput in) throws IOException {
final long minSeqNo = SequenceNumbersService.NO_OPS_PERFORMED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copying the commit comment as the code moved around:

these should be UNASSIGNED_SEQ_NO no? (unless maybe if total count is 0, but I'm not sure it's worth the code song and dance it will require)

@bleskes bleskes requested a review from s1monw February 17, 2017 16:28
return getLastSyncedCheckpoint();
}

@Override
public long sizeInBytes() {
return totalOffset;
}

/**
* closes this writer and transfers it's underlying file channel to a new immutable reader
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed ca0a0adf1820c68f096d9167b9f6f1efdac2c51d.

@@ -80,6 +85,8 @@ public TranslogWriter(
this.outputStream = new BufferedChannelOutputStream(java.nio.channels.Channels.newOutputStream(channel), bufferSize.bytesAsInt());
this.lastSyncedCheckpoint = initialCheckpoint;
this.totalOffset = initialCheckpoint.offset;
this.minSeqNo = initialCheckpoint.minSeqNo;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is right but it is confusing me some. Maybe you can explain it or maybe add a comment?

So my problem is that on figure reading it looks like you are keeping the minimum sequence number as made by some previous writer. So if you make more than one of these writers then this minimum value will stay the minimum value from the first writer. Would it make more sense to use the NO_OPS_PERFORMED constant here? Or maybe add a comment about what it when why we want this "dragging backwards" behavior sometimes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only ever called with a new empty checkpoint. We can assert, would that make it better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please. That way I don't start thinking about crazy stuff that we're not doing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 52f68e1979b80f9ae4a3c7e462546bd779110963.

@jasontedor
Copy link
Member Author

retest this please

@bleskes
Copy link
Contributor

bleskes commented Feb 20, 2017

test this please

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments nothing major

// reads a checksummed checkpoint introduced in ES 5.0.0
static Checkpoint readChecksummedV1(DataInput in) throws IOException {
return new Checkpoint(in.readLong(), in.readInt(), in.readLong(), SequenceNumbersService.UNASSIGNED_SEQ_NO);
static Checkpoint readChecksummedV2(final DataInput in) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a comment when this version was introduced. Maybe instead of V1 and V2 we use the version in the method name like Pre6_0_0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a commit that does this.

CodecUtil.checksumEntireFile(indexInput);
final int fileVersion = CodecUtil.checkHeader(indexInput, CHECKPOINT_CODEC, INITIAL_VERSION, CURRENT_VERSION);
if (fileVersion == INITIAL_VERSION) {
assert indexInput.length() == V1_FILE_SIZE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have the actual length in the message? this would be aweful if we didn't have it. same below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a commit that does this.

assert indexInput.length() == FILE_SIZE;
return Checkpoint.readChecksummedV2(indexInput);
}
assert fileVersion == CURRENT_VERSION;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we get the version in the message

*/
public static TranslogReader open(FileChannel channel, Path path, Checkpoint checkpoint, String translogUUID) throws IOException {
public static TranslogReader open(
final FileChannel channel,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh can we have one line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a commit just for you and only for you that does this.

@@ -138,6 +158,11 @@ public int totalOperations() {
return totalOperations;
}

@Override
Checkpoint getCheckpoint() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this method be final?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I pushed a commit that does this.

@@ -56,6 +55,11 @@ public int totalOperations() {
}

@Override
Checkpoint getCheckpoint() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this method be final?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A final modifier here would be redundant because the class is final.

@@ -80,6 +85,10 @@ public TranslogWriter(
this.outputStream = new BufferedChannelOutputStream(java.nio.channels.Channels.newOutputStream(channel), bufferSize.bytesAsInt());
this.lastSyncedCheckpoint = initialCheckpoint;
this.totalOffset = initialCheckpoint.offset;
assert initialCheckpoint.minSeqNo == SequenceNumbersService.NO_OPS_PERFORMED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we get the actual value in the message?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I pushed a commit that does this.

ex.addSuppressed(inner);
}
throw ex;
}
totalOffset += data.length();

if (minSeqNo == SequenceNumbersService.NO_OPS_PERFORMED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any chance we can get this logic as SequenceNumbersService#max and SequenceNumbersService#minwhere we can document and test it separately?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I pushed a commit that does this (and added tests).

Today, the relationship between Lucene and the translog is rather
simple: every document not in Lucene is guaranteed to be in the
translog. We need a stronger guarantee from the translog though, namely
that it can replay all operations after a certain sequence number. For
this to be possible, the translog has to made sequence-number aware. As
a first step, we introduce the min and max sequence numbers into the
translog so that each generation knows the possible range of operations
contained in the generation. This will enable future work to keep around
all generations containing operations after a certain sequence number
(e.g., the global checkpoint).
@jasontedor jasontedor force-pushed the introduce-sequence-number-aware-translog branch from 48f02c2 to 772a513 Compare February 20, 2017 19:08
@jasontedor
Copy link
Member Author

I left some comments nothing major

Thanks @s1monw. I pushed 772a513.

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@s1monw
Copy link
Contributor

s1monw commented Feb 20, 2017

@jasontedor looks awesome

* master:
  Mark IP range aggregator test as awaits fix
  Add note and link to 'tune for disk usage' (elastic#23252)
@jasontedor jasontedor merged commit 4c2bd5f into elastic:master Feb 20, 2017
@jasontedor jasontedor deleted the introduce-sequence-number-aware-translog branch February 20, 2017 20:05
@jasontedor
Copy link
Member Author

Thanks @nik9000, @bleskes, and @s1monw.

@clintongormley clintongormley added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Translog :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v6.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants