Prune only gc deletes below the local checkpoint #28790
Conversation
Thx @dnhatn. I left some questions on the tests. Otherwise looks good. Like #28787, let's wait for @DaveCTurner.
Also - I'm not comfortable with the fact that we skip the local checkpoint check (i.e., not adding to Lucene if seq# <= local checkpoint) for optimized operations. It's technically correct, but I'd rather have that as a hard rule that's visible in the code. Can you do a follow-up for that?
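For illustration, a minimal fragment sketching the kind of explicit guard being suggested (the names `operation`, `localCheckpointTracker` are assumed here; the real engine code differs):

```java
// Hypothetical hard rule (names assumed): make the local-checkpoint check
// explicit even on the optimized append-only indexing path.
if (operation.seqNo() <= localCheckpointTracker.getCheckpoint()) {
    return; // already processed; never add to Lucene at or below the local checkpoint
}
```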
        equalTo(Sets.union(trimmedDeletes, rememberedDeletes)));
    engine.refresh("test");
    // Only prune deletes below the local checkpoint.
    engine.maybePruneDeletes();
How does this relate to the clock? Also, refresh already maybe-prunes deletes.
@bleskes You're correct. We don't need a manual clock here. I've removed the clock and also the prune call. I will make a follow-up as you suggested.
# Conflicts:
#	server/src/main/java/org/elasticsearch/index/engine/LiveVersionMap.java
@bleskes I've mixed time-based and sequence-based conditions into …
Barring discussions with @DaveCTurner about the TLA model, this LGTM. I left some nits that don't need another review.
@@ -4572,4 +4577,67 @@ public void testStressUpdateSameDocWhileGettingIt() throws IOException, Interrup
        }
    }
}

public void testPruneOnlyDeletesAtMostLocalCheckpoint() throws Exception {
    IOUtils.close(engine, store);
Can we try not to mess with the class's engine? I'd rather have a local one enclosed in a try-with-resources.
Done
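A sketch of the suggested pattern (assuming the engine-test helpers `createStore()`, `createEngine(...)`, and `createTempDir()`; a simplified illustration, not the exact test code):

```java
// Suggested pattern: a locally scoped engine instead of the class's shared
// 'engine' field, so the test cannot leave shared state behind.
try (Store store = createStore();
     InternalEngine engine = createEngine(store, createTempDir())) {
    // exercise tombstone pruning against the local engine only
}
```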
clock.set(randomLongBetween(gcInterval, deleteBatch + gcInterval));
engine.refresh("test");
tombstones.removeIf(v -> v.seqNo < gapSeqNo && v.time < clock.get() - gcInterval);
assertThat(engine.getDeletedTombstones(), containsInAnyOrder(tombstones.toArray()));
don't we need to check the size too?
We don't have to, as `containsInAnyOrder` also does a size check.
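As a quick illustration (a standalone sketch with made-up values, not taken from the PR), `containsInAnyOrder` only matches when the collection has exactly the expected elements, so a separate size assertion would be redundant:

```java
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.containsInAnyOrder;

import java.util.List;

public class ContainsInAnyOrderDemo {
    public static void main(String[] args) {
        // Passes: same elements, order ignored.
        assertThat(List.of(1, 2, 3), containsInAnyOrder(3, 1, 2));

        // Would throw AssertionError: the extra element 4 violates the
        // implicit size check performed by containsInAnyOrder.
        // assertThat(List.of(1, 2, 3, 4), containsInAnyOrder(3, 1, 2));
    }
}
```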
shards.startAll();
final IndexShard primary = shards.getPrimary();
final IndexShard replica = shards.getReplicas().get(0);
final TimeValue gcInterval = TimeValue.timeValueMillis(scaledRandomIntBetween(1, 1000));
I think we can just set this to something very small (10ms?) and also set ThreadPool#ESTIMATED_TIME_INTERVAL_SETTING to 0?
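A minimal sketch of what that suggestion could look like in test setup (the setting constants are from the Elasticsearch codebase; the surrounding test wiring is assumed):

```java
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.threadpool.ThreadPool;

// Shrink the GC-deletes interval and disable cached-clock granularity so
// time-based pruning becomes observable in a fast test.
Settings settings = Settings.builder()
    .put(IndexSettings.INDEX_GC_DELETES_SETTING.getKey(), "10ms")
    .put(ThreadPool.ESTIMATED_TIME_INTERVAL_SETTING.getKey(), "0ms")
    .build();
```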
done
-    final boolean isTooOld = currentTime - versionValue.time > pruneInterval;
+    private boolean canRemoveTombstone(long maxTimestampToPrune, long maxSeqNoToPrune, DeleteVersionValue versionValue) {
+        // check if the value is old enough and safe to be removed
+        final boolean isTooOld = versionValue.time < maxTimestampToPrune;
Why not `<=`?
Because we would like to keep delete tombstones for at least one GC cycle.
- final boolean isTooOld = currentTime - versionValue.time > pruneInterval;
+ final boolean isTooOld = versionValue.time < maxTimestampToPrune;
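Putting the two conditions together, here is a sketch of the pruning check consistent with the diff above (the seq#-based clause is paraphrased from the PR description, not copied from the source):

```java
private boolean canRemoveTombstone(long maxTimestampToPrune, long maxSeqNoToPrune, DeleteVersionValue versionValue) {
    // strict '<' (not '<=') keeps every tombstone for at least one full GC cycle
    final boolean isTooOld = versionValue.time < maxTimestampToPrune;
    // only prune deletes at or below the local checkpoint: out-of-order
    // operations above it may still need the tombstone for conflict checks
    final boolean isSafeToPrune = versionValue.seqNo <= maxSeqNoToPrune;
    return isTooOld && isSafeToPrune;
}
```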
LGTM too!
Models the fix implemented in elastic/elasticsearch#28790
LGTM.
# Conflicts:
#	server/src/test/java/org/elasticsearch/index/engine/LiveVersionMapTests.java
#	server/src/test/java/org/elasticsearch/index/replication/ESIndexLevelReplicationTestCase.java
Thanks @bleskes, @s1monw and @DaveCTurner for reviewing.
* master:
  Do not optimize append-only if seen normal op with higher seqno (elastic#28787)
  [test] packaging: gradle tasks for groovy tests (elastic#29046)
  Prune only gc deletes below local checkpoint (elastic#28790)
This models how indexing and deletion operations are handled on the replica, including the optimisations for append-only operations and the interaction with Lucene commits and the version map. It incorporates:
- elastic/elasticsearch#28787
- elastic/elasticsearch#28790
- elastic/elasticsearch#29276
- a proposal to always prune tombstones
Once a document is deleted and Lucene is refreshed, we will not be able to look up the `version/seq#` associated with that delete in Lucene. As conflicting operations can still be indexed, we need another mechanism to remember these deletes. Therefore deletes should still be stored in the version map, even after Lucene is refreshed. Obviously, we can't remember all deletes forever, so a trimming mechanism is needed. Currently, we remember deletes for at least 1 minute (the default GC deletes cycle) and clean them periodically. This is, at the moment, the best we can do on the primary for user-facing APIs, but this arbitrary time limit is problematic for replicas. Furthermore, we can't rely on the primary and replicas doing the trimming in a synchronized manner; failing to do so results in the replica and primary making different decisions.

The following scenario can cause inconsistency between primary and replica:

1. Primary indexes doc (index, id=1, v2)
2. A network packet issue causes the index operation to back off and wait
3. Primary deletes doc (delete, id=1, v3)
4. Replica processes delete (delete, id=1, v3)
5. 1+ minute passes (GC deletes runs on the replica)
6. The indexing op is finally sent to the replica, which now processes it because it has forgotten about the delete.

We can rely on sequence numbers to prevent this issue. If we prune only deletes whose seq# is at most the local checkpoint, a replica will correctly remember what it needs. The correctness is explained as follows:

Suppose o1 and o2 are two operations on the same document with seq#(o1) < seq#(o2), and o2 arrives before o1 on the replica. o2 is processed normally since it arrives first; when o1 arrives it should be discarded:

1. If seq#(o1) <= LCP, then it will not be added to Lucene, as it was already previously added.
2. If seq#(o1) > LCP, then it depends on the nature of o2:
   - If o2 is a delete, then its seq# is recorded in the VersionMap, since seq#(o2) > seq#(o1) > LCP, so a lookup can find it and determine that o1 is stale.
   - If o2 is an indexing operation, then its seq# is either in Lucene (if refreshed) or in the VersionMap (if not refreshed yet), so a real-time lookup can find it and determine that o1 is stale.

In this PR, we prefer to deploy a single trimming strategy, which satisfies both requirements, on primary and replicas because:

- It's simpler: there is no need to distinguish whether an engine is running in primary mode, in replica mode, or being promoted.
- If a replica is subsequently promoted, the user experience is fully maintained, as that replica remembers deletes for the last GC cycle.

However, the version map may consume less memory if we deploy two different trimming strategies for primary and replicas.
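As a rough illustration of the trimming rule described above, here is a self-contained sketch (a simplified model with assumed names, not the actual `LiveVersionMap` implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the trimming rule: prune a delete tombstone only
// when it is older than the GC-deletes interval AND its seq# is at or
// below the local checkpoint.
class SimpleTombstoneMap {
    static final class Tombstone {
        final long seqNo;
        final long timeMillis;
        Tombstone(long seqNo, long timeMillis) {
            this.seqNo = seqNo;
            this.timeMillis = timeMillis;
        }
    }

    private final Map<String, Tombstone> tombstones = new ConcurrentHashMap<>();

    void putDelete(String id, long seqNo, long timeMillis) {
        tombstones.put(id, new Tombstone(seqNo, timeMillis));
    }

    void prune(long nowMillis, long gcDeletesMillis, long localCheckpoint) {
        tombstones.values().removeIf(t ->
            t.timeMillis < nowMillis - gcDeletesMillis  // old enough: kept for at least one GC cycle
            && t.seqNo <= localCheckpoint);             // safe: no out-of-order op can still need it
    }
}
```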