-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix false positive query timeouts due to using cached time #3454
Fix false positive query timeouts due to using cached time #3454
Conversation
Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com>
/* for startTime, relative non-cached time must be used to prevent false positive timeouts. | ||
* Using cached time for startTime will fail and produce false positive timeouts when maxTime = (startTime + timeout) falls in | ||
* next time cache slot(s) AND time caching lifespan > passed timeout */ | ||
final long startTime = TimeValue.nsecToMSec(System.nanoTime()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aabukhalil I think we are focusing only on one aspect of the problem rather than whole:
- if
thread_pool.estimated_time_interval
is set to0
, no changes needed - the original implementation will work right away - if
thread_pool.estimated_time_interval
is not set to0
, we may have an issues here but only if thetimeout
value is significantly smaller thanthread_pool.estimated_time_interval
Instead of changing QueryPhase
, I would like to suggest introducing new method getRelativeTimeInMillisForTimeout(final long timeout)
(or even modifying the existing method getRelativeTimeInMillis
to accept timeout
argument, seems like it is used only in QueryPhase
). Inside, depending on the thread_pool.estimated_time_interval
and timeout
value the decision could be made to either use cached time provider or fallback to System.nanoTime()
. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reta thank you for reviewing this and providing feedback :)
Can you please elaborate on your concern on the reason of introducing more branching logic to compute initial time (which depend on cache lifespan and timeout value) as you suggested instead of using one logic which will always work with no branching ?
And yeah I have concern of calling System.nanoTime
directly from QueryPhase
but I think adding additional method getRelativeTimeInMillisForTimeout
or override to getRelativeTimeInMillis(long timeout)
will be more confusing to which one to use. (It will be harder to decide which method/override to use in here for example or any caller and eventually need to look at implementation details to decide which one). My suggestion is to introduce method like getNonCachedRelativeTimeInMillis
or override like getRelativeTimeInMillis(boolean useCache)
where it is clear for caller what is the difference between them and let caller decide on branching logic. What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @aabukhalil
Can you please elaborate on your concern on the reason of introducing more branching logic to compute initial time (which depend on cache lifespan and timeout value) as you suggested instead of using one logic which will always work with no branching ?
Sure, the decision what is getRelativeTime()
is made in one place and in one place only. Taking into account possible set of values, we should not do the work someone else (the time provider in this case) should be doing.
And yeah I have concern of calling System.nanoTime directly from QueryPhase but I think adding additional method getRelativeTimeInMillisForTimeout or override to getRelativeTimeInMillis(long timeout) will be more confusing to which one to use.
If I am not mistaken, this is the only place when this method is being used, so fixing it in a way that prevents wrong usage makes sense to me. Keeping this method unchanged leaves opportunities for future misuse. The getRelativeTimeInMillisForTimeout
is not a good name, but may be getRelativeTimeInMillisWithPrecision(long precision)
would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reta yeah I agree that time provider should be the only responsible module for time but I'm asking about your concern of adding new branch (if timeout < cache lifespan then return non cached time else fallback to default implementation) instead of always returning non cached time and without promising that we have actual control of time precision because I think we have cached time and non cached time actually. WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aabukhalil sorry missed that, I don't know the full story behind caching, but may be we could measure the overhead of calling System.nanoTime
vs query execution time? (like [1] fe) Calling System.nanoTime
directly and dropping all the caching impl would make things simpler for sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reta System.nanoTime
is known to be a slow operation and that's why it is cached. But the caching advantage is still applicable here because getting time here is still using the cached time and which will be called almost per matched document. So the overhead I'm adding is one System.nanoSecond
per query when timeout is set and after merging this I need to look on the performance impact of this single call. For now I will refactor this to delegate calling System.nanoSecond
to SearchContext.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reta, what sort of testing would give you confidence that we could just remove the caching all together?
@aabukhalil @andrross there is this excellent analysis [1] from OpenJDK developer(s) regarding the System.nanoTime()
usage, the takeaway (one of) is: "nanoTime is not dirt cheap; at best, you can hope for 15-30 ns per call" (is highly depends on operating system as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not refresh the cache at a faster interval, I would like to understand the overall overhead of caching System#nanoTime. We should look into optimizing/correcting the clock used for the query and its timeout rather than adding more code for the timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reta @andrross Yes nanoTime
is not cheap and seems nothing has changed since these analysis were made. According to a micro benchmark I've just made (will include it's code in next revision), calling System.nanoTime
has latency of ~30 nanoseconds while accessing volatile long variable and incrementing it has latency of ~3 nanoseconds (which is upper bound of what is happening in ThreadPool time cache ).
The benchmark was run on Linux on EC2 c6a.12xlarge (AMD processor), Linux on EC2 m5.12xlarge (Intel processor), Macbook locally (Intel(R) Core(TM) i7-9750H) and Windows locally (Intel(R) Core(TM) 7-10750H) and all shows the consistent behavior of 30ns latency when calling nanoTime
.
My call here is that the advantage of cached time optimization is still applicable as getting time here is still using the cached time which will be called almost per matched document in query result which even the 30ns latency MIGHT be amplified depending on match set size because of how checkCancelled is called as currently there are 21 caller of that method and some of them with no sampling (e.g. 100K matching documents might add up to 30 millisecond to query's took
when timeout is set with no time caching).
My change is called once only per query where timeout is set which will add ~30ns to overall search latency per query which I don't think will impact overall system latency or throughput and still I need to follow up on performance test after merging this to confirm that there is no impact made. What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Bukhtawar since current supported timeout granularity is millisecond so solve this issue completely we need to speed up time cache eviction to be each 1ms, no ?
I've run this program locally
public class MainTestTimeSleep {
volatile static long time = 0L;
public static void main(String[] args) throws Exception{
while (true) {
time = System.nanoTime();
Thread.sleep(1);
}
}
}
which mimics updating time each 1ms and this program resulted in utilizing ~6% of one CPU by itself due switching process state from running to sleeping so frequently I guess.
Sleeping each 10ms reduced CPU consumption to ~1%.
My concern is if want to support timeout granularity in milliseconds, we might run into same situation consuming CPU resources almost busy waiting. What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pinging @Bukhtawar @andrross and @reta for any followup to avoid this becoming stale.
I think there are 3 approaches we could take:
@andrross @Bukhtawar please feel free to add or correct me, thank you. |
I think calling System.nanoTime() with no caching is probably a non-starter because the check happens too frequently, if my understanding is correct. However, what if reduced the cached time interval from 200ms down to 10ms or 20ms? A 30ns cost every 10000000ns (please check my math :) ) seems trivial. Would the inaccuracy be small enough as to not matter? I think this is what @Bukhtawar was hinting at above, and it would be a super simple fix with no extra logic added. Also, this wasn't super clear to me on the linked issue (#2000), but I think the only thing at issue here is the accuracy of the search timeout and not the |
We have this setting |
I agree that it's non-obvious and confusing that the For completeness, I'll add a couple more options:
|
Thanks @andrross
I think this is the best we can do in this case.
The part which I am not comfortable with - we do use 2 different approaches for the same thing:
👍 |
✅ Gradle Check success 8955d64d74546e2d4c70fc31f582cc0764e04efb |
8955d64
to
f604902
Compare
Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com>
d1bd21d
to
ddbfa15
Compare
❌ Gradle Check failure d1bd21d2054a95294dd64dc220e932a88dcb6127 |
@reta @andrross can you review this again as I pushed a change to keep current approach as is (use non cached time for startTime for once and then use the cached time per document ) but I've changed the time API to be more accurate and less confusing. This approach does not have performance compromise and still fixes the problem and is not discarding any optimizations made. |
@aabukhalil My apologies but I did not get it: before we just called May be we should just flip |
Yes the new change is just a wrapper of the old procedure but I think this time it is better abstracted as one method might be cached and the other is not cached and let caller decide which to use depending on usage. I think abstracting this behavior as For flipping By shipping this, false positive timeouts should be eliminated by default without affecting anything else and it is a one step toward a better product. |
This is the most straightforward and concise way to address the problem (or reducing it to smaller value fe 10ms). The few nanonseconds per call sounds like not a dramatic performance hit (if we account for caching implementation - even less so) for calling
This is a patch for the particular flow - it does not address the problem that |
Yeah definitely couple of nanoseconds per call is not my concern. The concern and which is why this optimization was made (most probably) is due the fact that total latency per whole search call will be impacted by extra couple milliseconds because
Yes but now it's well documented that |
As on option, we could update |
you mean what was proposed in here as |
Exactly, yes, sorry I missed that, I think out of all options it is the best one. |
Does this mean passing |
@@ -393,12 +393,16 @@ public final boolean hasOnlySuggest() { | |||
public abstract long getRelativeTimeInMillis(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we deprecate this one and point to getRelativeTimeInMillis(boolean useCache)
? We definitely don't need 2 methods for same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reta if want to deprecate getRelativeTimeInMillis
which is an abstract method then getRelativeTimeInMillis(boolean useCache)
should be abstract too to give child classes ability to implement it and I think that not all child classes really should care about useCache
because they might not have such concept so this will cause even more abstraction leak. while by keeping both overrides then child class has to implement one method/logic. so I'm not sure what is best here as both options is causing abstraction leak anyway. do you mean by deprecation just annotating the old method with @Deprecated
and leave it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree with @aabukhalil. The fact that getRelativeTimeInMillis()
is being overloaded and eventually comes from some other time thread class made it really hard to follow and understand. I'm happy we're solving the SearchContext problem here and would defer trying to solve the more general problem to a radical simplification of this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean by deprecation just annotating the old method with @deprecated and leave it ?
@aabukhalil yeah, exactly, mark as @deprecated
, we cannot remove it right now
@dblock no breaking changes, everything would be working as-is, just hinting that new method is preferred for use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reta marking getRelativeTimeInMillis()
as deprecated while still abstract and forcing child classes to implement it because there is no path forward to implement, does not make sense to me. Deprecation is not inherited in java so if you deprecate getRelativeTimeInMillis()
in SearchContext
and you have subclass MySearchContext
which extends SearchContext
and you call getRelativeTimeInMillis()
on MySearchContext
it won't be marked as deprecated which somehow breaks Liskov substitution principle because the behavior of compiler will change if you have your variable defined as SearchContext
vs MySearchContext
, what do you think ? I don't see strong reason to deprecate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh this is true, @Deprecated
indeed has an issue with inheritance, you are right @aabukhalil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you @reta, appreciate your help and dedication :)
❌ Gradle Check failure 75bcb5ed569a34daf046bc43327a3fb7c2915d0c |
Needs a fix for this:
|
…ched time Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com>
75bcb5e
to
7f73aa7
Compare
@dblock thank you for catching this. it is updated now. |
* Fix false positive query timeouts due to using cached time Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * delegate nanoTime call to SearchContext Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * add override to SearchContext getRelativeTimeInMillis to force non cached time Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> (cherry picked from commit b5f137b)
…3624) * Fix false positive query timeouts due to using cached time Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * delegate nanoTime call to SearchContext Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * add override to SearchContext getRelativeTimeInMillis to force non cached time Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> (cherry picked from commit b5f137b) Co-authored-by: Ahmad AbuKhalil <105249973+aabukhalil@users.noreply.github.com>
* Fix false positive query timeouts due to using cached time Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * delegate nanoTime call to SearchContext Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * add override to SearchContext getRelativeTimeInMillis to force non cached time Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com>
…h-project#3454) * Fix false positive query timeouts due to using cached time Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * delegate nanoTime call to SearchContext Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * add override to SearchContext getRelativeTimeInMillis to force non cached time Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com>
* Bump reactor-netty-core from 1.0.16 to 1.0.19 in /plugins/repository-azure (#3360) * Bump reactor-netty-core in /plugins/repository-azure Bumps [reactor-netty-core](https://github.com/reactor/reactor-netty) from 1.0.16 to 1.0.19. - [Release notes](https://github.com/reactor/reactor-netty/releases) - [Commits](reactor/reactor-netty@v1.0.16...v1.0.19) --- updated-dependencies: - dependency-name: io.projectreactor.netty:reactor-netty-core dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * Updating SHAs Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> * [Type removal] _type removal from mocked responses of scroll hit tests (#3377) Signed-off-by: Suraj Singh <surajrider@gmail.com> * [Type removal] Remove _type deprecation from script and conditional processor (#3239) * [Type removal] Remove _type deprecation from script and conditional processor Signed-off-by: Suraj Singh <surajrider@gmail.com> * Spotless check apply Signed-off-by: Suraj Singh <surajrider@gmail.com> * [Type removal] Remove _type from _bulk yaml test, scripts, unused constants (#3372) * [Type removal] Remove redundant _type deprecation checks in bulk request Signed-off-by: Suraj Singh <surajrider@gmail.com> * [Type removal] bulk yaml tests validating deprecation on _type and removal from scripts Signed-off-by: Suraj Singh <surajrider@gmail.com> * Fix Lucene-snapshots repo for jdk 17. (#3396) Signed-off-by: Marc Handalian <handalm@amazon.com> * Replace internal usages of 'master' term in 'server/src/internalClusterTest' directory (#2521) Signed-off-by: Tianli Feng <ftianli@amazon.com> * [REMOVE] Cleanup deprecated thread pool types (FIXED_AUTO_QUEUE_SIZE) (#3369) Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * [Type removal] _type removal from tests of yaml tests (#3406) * [Type removal] _type removal from tests of yaml tests Signed-off-by: Suraj Singh <surajrider@gmail.com> * Fix spotless failures Signed-off-by: Suraj Singh <surajrider@gmail.com> * Fix assertion failures Signed-off-by: Suraj Singh <surajrider@gmail.com> * Fix assertion failures in DoSectionTests Signed-off-by: Suraj Singh <surajrider@gmail.com> * Add release notes for version 2.0.0 (#3410) Signed-off-by: Rabi Panda <adnapibar@gmail.com> * [Upgrade] Lucene-9.2.0-snapshot-ba8c3a8 (#3416) Upgrades to latest snapshot of lucene 9.2.0 in preparation for GA release. Signed-off-by: Nicholas Walter Knize <nknize@apache.org> * Fix release notes for 2.0.0-rc1 version (#3418) This change removes some old commits from the 2.0.0-rc1 release notes. These commits were already released as part of 1.x releases. Add back some missing type removal commits to the 2.0.0 release notes Signed-off-by: Rabi Panda <adnapibar@gmail.com> * Bump version 2.1 to Lucene 9.2 after upgrade (#3424) Bumps Version.V_2_1_0 lucene version to 9.2 after backporting upgrage. Signed-off-by: Nicholas Walter Knize <nknize@apache.org> * Bump com.gradle.enterprise from 3.10 to 3.10.1 (#3425) Bumps com.gradle.enterprise from 3.10 to 3.10.1. --- updated-dependencies: - dependency-name: com.gradle.enterprise dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump reactor-core from 3.4.17 to 3.4.18 in /plugins/repository-azure (#3427) Bumps [reactor-core](https://github.com/reactor/reactor-core) from 3.4.17 to 3.4.18. - [Release notes](https://github.com/reactor/reactor-core/releases) - [Commits](reactor/reactor-core@v3.4.17...v3.4.18) --- updated-dependencies: - dependency-name: io.projectreactor:reactor-core dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> * Bump gax-httpjson from 0.101.0 to 0.103.1 in /plugins/repository-gcs (#3426) Bumps [gax-httpjson](https://github.com/googleapis/gax-java) from 0.101.0 to 0.103.1. - [Release notes](https://github.com/googleapis/gax-java/releases) - [Changelog](https://github.com/googleapis/gax-java/blob/main/CHANGELOG.md) - [Commits](https://github.com/googleapis/gax-java/commits) --- updated-dependencies: - dependency-name: com.google.api:gax-httpjson dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> * [segment replication]Introducing common Replication interfaces for segment replication and recovery code paths (#3234) * RecoveryState inherits from ReplicationState + RecoveryTarget inherits from ReplicationTarget Signed-off-by: Poojita Raj <poojiraj@amazon.com> * Refactoring: mixedClusterVersion error fix + move Stage to ReplicationState Signed-off-by: Poojita Raj <poojiraj@amazon.com> * pull ReplicationListener into a top level class + add javadocs + address review comments Signed-off-by: Poojita Raj <poojiraj@amazon.com> * fix javadoc Signed-off-by: Poojita Raj <poojiraj@amazon.com> * review changes Signed-off-by: Poojita Raj <poojiraj@amazon.com> * Refactoring the hierarchy relationship between repl and recovery Signed-off-by: Poojita Raj <poojiraj@amazon.com> * style fix Signed-off-by: Poojita Raj <poojiraj@amazon.com> * move package common under replication Signed-off-by: Poojita Raj <poojiraj@amazon.com> * rename to replication Signed-off-by: Poojita Raj <poojiraj@amazon.com> * rename and doc changes Signed-off-by: Poojita Raj <poojiraj@amazon.com> * [Type removal] Remove type from BulkRequestParser (#3423) * [Type removal] Remove type handling in bulk request parser Signed-off-by: Suraj Singh <surajrider@gmail.com> * [Type removal] Remove testTypesStillParsedForBulkMonitoring as it is no longer present in codebase Signed-off-by: Suraj Singh <surajrider@gmail.com> * Adding CheckpointRefreshListener to trigger when Segment replication is turned on and Primary shard refreshes (#3108) * Intial PR adding classes and tests related to checkpoint publishing Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Putting a Draft PR with all changes in classes. Testing is still not included in this commit. Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Wiring up index shard to new engine, spotless apply and removing unnecessary tests and logs Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Adding Unit test for checkpointRefreshListener Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Applying spotless check Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Fixing import statements * Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * removing unused constructor in index shard Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Addressing comments from last commit Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Adding package-info.java files for two new packages Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Adding test for null checkpoint publisher and addreesing PR comments Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Add docs for indexshardtests and remove shard.refresh Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Add a new Engine implementation for replicas with segment replication enabled. (#3240) * Change fastForwardProcessedSeqNo method in LocalCheckpointTracker to persisted checkpoint. This change inverts fastForwardProcessedSeqNo to fastForwardPersistedSeqNo for use in Segment Replication. This is so that a Segrep Engine can match the logic of InternalEngine where the seqNo is incremented with each operation, but only persisted in the tracker on a flush. With Segment Replication we bump the processed number with each operation received index/delete/noOp, and invoke this method when we receive a new set of segments to bump the persisted seqNo. Signed-off-by: Marc Handalian <handalm@amazon.com> * Extract Translog specific engine methods into an abstract class. This change extracts translog specific methods to an abstract engine class so that other engine implementations can reuse translog logic. Signed-off-by: Marc Handalian <handalm@amazon.com> * Add a separate Engine implementation for replicas with segment replication enabled. This change adds a new engine intended to be used on replicas with segment replication enabled. This engine does not wire up an IndexWriter, but still writes all operations to a translog. The engine uses a new ReaderManager that refreshes from an externally provided SegmentInfos. Signed-off-by: Marc Handalian <handalm@amazon.com> * Fix spotless checks. Signed-off-by: Marc Handalian <handalm@amazon.com> * Fix :server:compileInternalClusterTestJava compilation. Signed-off-by: Marc Handalian <handalm@amazon.com> * Fix failing test naming convention check. Signed-off-by: Marc Handalian <handalm@amazon.com> * PR feedback. - Removed isReadOnlyReplica from overloaded constructor and added feature flag checks. - Updated log msg in NRTReplicationReaderManager - cleaned up store ref counting in NRTReplicationEngine. Signed-off-by: Marc Handalian <handalm@amazon.com> * Fix spotless check. Signed-off-by: Marc Handalian <handalm@amazon.com> * Remove TranslogAwareEngine and build translog in NRTReplicationEngine. Signed-off-by: Marc Handalian <handalm@amazon.com> * Fix formatting Signed-off-by: Marc Handalian <handalm@amazon.com> * Add missing translog methods to NRTEngine. Signed-off-by: Marc Handalian <handalm@amazon.com> * Remove persistent seqNo check from fastForwardProcessedSeqNo. Signed-off-by: Marc Handalian <handalm@amazon.com> * PR feedback. Signed-off-by: Marc Handalian <handalm@amazon.com> * Add test specific to translog trimming. Signed-off-by: Marc Handalian <handalm@amazon.com> * Javadoc check. Signed-off-by: Marc Handalian <handalm@amazon.com> * Add failEngine calls to translog methods in NRTReplicationEngine. Roll xlog generation on replica when a new commit point is received. Signed-off-by: Marc Handalian <handalm@amazon.com> * Rename master to cluster_manager in the XContent Parser of ClusterHealthResponse (#3432) Signed-off-by: Tianli Feng <ftianli@amazon.com> * Bump hadoop-minicluster in /test/fixtures/hdfs-fixture (#3359) Bumps hadoop-minicluster from 3.3.2 to 3.3.3. --- updated-dependencies: - dependency-name: org.apache.hadoop:hadoop-minicluster dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump avro from 1.10.2 to 1.11.0 in /plugins/repository-hdfs (#3358) * Bump avro from 1.10.2 to 1.11.0 in /plugins/repository-hdfs Bumps avro from 1.10.2 to 1.11.0. --- updated-dependencies: - dependency-name: org.apache.avro:avro dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Updating SHAs Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> * Fix testSetAdditionalRolesCanAddDeprecatedMasterRole() by removing the initial assertion (#3441) Signed-off-by: Tianli Feng <ftianli@amazon.com> * Replace internal usages of 'master' term in 'server/src/test' directory (#2520) * Replace the non-inclusive terminology "master" with "cluster manager" in code comments, internal variable/method/class names, in `server/src/test` directory. * Backwards compatibility is not impacted. * Add a new unit test `testDeprecatedMasterNodeFilter()` to validate using `master:true` or `master:false` can filter the node in [Cluster Stats](https://opensearch.org/docs/latest/opensearch/rest-api/cluster-stats/) API, after the `master` role is deprecated in PR #2424 Signed-off-by: Tianli Feng <ftianli@amazon.com> * Removing unused method from TransportSearchAction (#3437) * Removing unused method from TransportSearchAction Signed-off-by: Ankit Jain <jain.ankitk@gmail.com> * Set term vector flags to false for ._index_prefix field (#1901). (#3119) * Set term vector flags to false for ._index_prefix field (#1901). Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com> * Replaced the FieldType copy ctor with ctor for the prefix field and replaced setting the field type parameters with setIndexOptions(). (#1901) Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com> * Added tests for term vectors. (#1901) Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com> * Fixed code formatting error. Signed-off-by: Vesa Pehkonen <vesa.pehkonen@intel.com> Co-authored-by: sdp <sdp@9049fa06826d.jf.intel.com> * [BUG] Fixing org.opensearch.monitor.os.OsProbeTests > testLogWarnCpuMessageOnlyOnes when cgroups are available but cgroup stats is not (#3448) Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * [Segment Replication] Add SegmentReplicationTargetService to orchestrate replication events. (#3439) * Add SegmentReplicationTargetService to orchestrate replication events. This change introduces boilerplate classes for Segment Replication and a target service to orchestrate replication events. It also includes two refactors of peer recovery components for reuse. 1. Rename RecoveryFileChunkRequest to FileChunkRequest and extract code to handle throttling into ReplicationTarget. 2. Extracts a component to execute retryable requests over the transport layer. Signed-off-by: Marc Handalian <handalm@amazon.com> * Code cleanup. Signed-off-by: Marc Handalian <handalm@amazon.com> * Make SegmentReplicationTargetService component final so that it can not be extended by plugins. Signed-off-by: Marc Handalian <handalm@amazon.com> * Bump azure-core-http-netty from 1.11.9 to 1.12.0 in /plugins/repository-azure (#3474) Bumps [azure-core-http-netty](https://github.com/Azure/azure-sdk-for-java) from 1.11.9 to 1.12.0. - [Release notes](https://github.com/Azure/azure-sdk-for-java/releases) - [Commits](Azure/azure-sdk-for-java@azure-core-http-netty_1.11.9...azure-core_1.12.0) --- updated-dependencies: - dependency-name: com.azure:azure-core-http-netty dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Update to Apache Lucene 9.2 (#3477) Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Bump protobuf-java from 3.20.1 to 3.21.1 in /plugins/repository-hdfs (#3472) Signed-off-by: dependabot[bot] <support@github.com> * [Upgrade] Lucene-9.3.0-snapshot-823df23 (#3478) Upgrades to latest snapshot of lucene 9.3.0. Signed-off-by: Nicholas Walter Knize <nknize@apache.org> * Filter out invalid URI and HTTP method in the error message of no handler found for a REST request (#3459) Filter out invalid URI and HTTP method of a error message, which shown when there is no handler found for a REST request sent by user, so that HTML special characters <>&"' will not shown in the error message. The error message is return as mine-type `application/json`, which can't contain active (script) content, so it's not a vulnerability. Besides, no browsers are going to render as html when the mine-type is that. While the common security scanners will raise a false-positive alarm for having HTML tags in the response without escaping the HTML special characters, so the solution only aims to satisfy the code security scanners. Signed-off-by: Tianli Feng <ftianli@amazon.com> * Support use of IRSA for repository-s3 plugin credentials (#3475) * Support use of IRSA for repository-s3 plugin credentials Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Address code review comments Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Address code review comments Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Bump google-auth-library-oauth2-http from 0.20.0 to 1.7.0 in /plugins/repository-gcs (#3473) * Bump google-auth-library-oauth2-http in /plugins/repository-gcs Bumps google-auth-library-oauth2-http from 0.20.0 to 1.7.0. --- updated-dependencies: - dependency-name: com.google.auth:google-auth-library-oauth2-http dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> * Updating SHAs Signed-off-by: dependabot[bot] <support@github.com> * Use variable to define the version of dependency google-auth-library-java Signed-off-by: Tianli Feng <ftianli@amazon.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> Co-authored-by: Tianli Feng <ftianli@amazon.com> * [Segment Replication] Added source-side classes for orchestrating replication events (#3470) This change expands on the existing SegmentReplicationSource interface and its corresponding Factory class by introducing an implementation where the replication source is a primary shard (PrimaryShardReplicationSource). These code paths execute on the target. The primary shard implementation creates the requests to be send to the source/primary shard. Correspondingly, this change also defines two request classes for the GET_CHECKPOINT_INFO and GET_SEGMENT_FILES requests as well as an abstract superclass. A CopyState class has been introduced that captures point-in-time, file-level details from an IndexShard. This implementation mirrors Lucene's NRT CopyState implementation. Finally, a service class has been introduce for segment replication that runs on the source side (SegmentReplicationSourceService) which handles these two types of incoming requests. This includes private handler classes that house the logic to respond to these requests, with some functionality stubbed for now. The service class also uses a simple map to cache CopyState objects that would be needed by replication targets. Unit tests have been added/updated for all new functionality. Signed-off-by: Kartik Ganesh <gkart@amazon.com> * [Dependency upgrade] google-oauth-client to 1.33.3 (#3500) Signed-off-by: Suraj Singh <surajrider@gmail.com> * move bash flag to set statement (#3494) Passing bash with flags to the first argument of /usr/bin/env requires its own flag to interpret it correctly. Rather than use `env -S` to split the argument, have the script `set -e` to enable the same behavior explicitly in preinst and postinst scripts. Also set `-o pipefail` for consistency. Closes: #3492 Signed-off-by: Cole White <cwhite@wikimedia.org> * Support use of IRSA for repository-s3 plugin credentials: added YAML Rest test case (#3499) Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Bump azure-storage-common from 12.15.0 to 12.16.0 in /plugins/repository-azure (#3517) * Bump azure-storage-common in /plugins/repository-azure Bumps [azure-storage-common](https://github.com/Azure/azure-sdk-for-java) from 12.15.0 to 12.16.0. - [Release notes](https://github.com/Azure/azure-sdk-for-java/releases) - [Commits](Azure/azure-sdk-for-java@azure-storage-blob_12.15.0...azure-storage-blob_12.16.0) --- updated-dependencies: - dependency-name: com.azure:azure-storage-common dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Updating SHAs Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> * Bump google-oauth-client from 1.33.3 to 1.34.0 in /plugins/discovery-gce (#3516) * Bump google-oauth-client from 1.33.3 to 1.34.0 in /plugins/discovery-gce Bumps [google-oauth-client](https://github.com/googleapis/google-oauth-java-client) from 1.33.3 to 1.34.0. - [Release notes](https://github.com/googleapis/google-oauth-java-client/releases) - [Changelog](https://github.com/googleapis/google-oauth-java-client/blob/main/CHANGELOG.md) - [Commits](googleapis/google-oauth-java-client@v1.33.3...v1.34.0) --- updated-dependencies: - dependency-name: com.google.oauth-client:google-oauth-client dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Updating SHAs Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> * Fix the support of RestClient Node Sniffer for version 2.x and update tests (#3487) Fix the support of RestClient Node Sniffer for OpenSearch 2.x, and update unit tests for OpenSearch. The current code contains the logic to be compatible with Elasticsearch 2.x version, which is conflict with OpenSearch 2.x, so removed that part of legacy code. * Update the script create_test_nodes_info.bash to dump the response of Nodes Info API GET _nodes/http for OpenSearch 1.0 and 2.0 version, which used for unit test. * Remove the support of Elasticsearch version 2.x for the Sniffer * Update unit test to validate the Sniffer compatible with OpenSearch 1.x and 2.x * Update the API response parser to meet the array notation (in ES 6.1 and above) for the node attributes setting. It will result the value of `node.attr` setting will not be parsed as array in the Sniffer, when using the Sniffer on cluster in Elasticsearch 6.0 and above. * Replace "master" node role with "cluster_manager" in unit test Signed-off-by: Tianli Feng <ftianli@amazon.com> * Bump com.diffplug.spotless from 6.6.1 to 6.7.0 (#3513) Bumps com.diffplug.spotless from 6.6.1 to 6.7.0. --- updated-dependencies: - dependency-name: com.diffplug.spotless dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump guava from 18.0 to 23.0 in /plugins/ingest-attachment (#3357) * Bump guava from 18.0 to 23.0 in /plugins/ingest-attachment Bumps [guava](https://github.com/google/guava) from 18.0 to 23.0. - [Release notes](https://github.com/google/guava/releases) - [Commits](google/guava@v18.0...v23.0) --- updated-dependencies: - dependency-name: com.google.guava:guava dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> * Updating SHAs Signed-off-by: dependabot[bot] <support@github.com> * Add more ingorance of using internal java API sun.misc.Unsafe Signed-off-by: Tianli Feng <ftianli@amazon.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> Co-authored-by: Tianli Feng <ftianli@amazon.com> * Added bwc version 2.0.1 (#3452) Signed-off-by: Kunal Kotwani <kkotwani@amazon.com> Co-authored-by: opensearch-ci-bot <opensearch-ci-bot@users.noreply.github.com> * Add release notes for 1.3.3 (#3549) Signed-off-by: Xue Zhou <xuezhou@amazon.com> * [Upgrade] Lucene-9.3.0-snapshot-b7231bb (#3537) Upgrades to latest snapshot of lucene 9.3; including reducing maxFullFlushMergeWaitMillis in LuceneTest.testWrapLiveDocsNotExposeAbortedDocuments to 0 ms to ensure aborted docs are not merged away in the test with the new mergeOnRefresh default policy. Signed-off-by: Nicholas Walter Knize <nknize@apache.org> * [Remote Store] Upload segments to remote store post refresh (#3460) * Add RemoteDirectory interface to copy segment files to/from remote store Signed-off-by: Sachin Kale <kalsac@amazon.com> Co-authored-by: Sachin Kale <kalsac@amazon.com> * Add index level setting for remote store Signed-off-by: Sachin Kale <kalsac@amazon.com> Co-authored-by: Sachin Kale <kalsac@amazon.com> * Add RemoteDirectoryFactory and use RemoteDirectory instance in RefreshListener Co-authored-by: Sachin Kale <kalsac@amazon.com> Signed-off-by: Sachin Kale <kalsac@amazon.com> * Upload segment to remote store post refresh Signed-off-by: Sachin Kale <kalsac@amazon.com> Co-authored-by: Sachin Kale <kalsac@amazon.com> * Fixing VerifyVersionConstantsIT test failure (#3574) Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Bump jettison from 1.4.1 to 1.5.0 in /plugins/discovery-azure-classic (#3571) * Bump jettison from 1.4.1 to 1.5.0 in /plugins/discovery-azure-classic Bumps [jettison](https://github.com/jettison-json/jettison) from 1.4.1 to 1.5.0. - [Release notes](https://github.com/jettison-json/jettison/releases) - [Commits](jettison-json/jettison@jettison-1.4.1...jettison-1.5.0) --- updated-dependencies: - dependency-name: org.codehaus.jettison:jettison dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Updating SHAs Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> * Bump google-api-services-storage from v1-rev20200814-1.30.10 to v1-rev20220608-1.32.1 in /plugins/repository-gcs (#3573) * Bump google-api-services-storage in /plugins/repository-gcs Bumps google-api-services-storage from v1-rev20200814-1.30.10 to v1-rev20220608-1.32.1. --- updated-dependencies: - dependency-name: com.google.apis:google-api-services-storage dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * Updating SHAs Signed-off-by: dependabot[bot] <support@github.com> * Upgrade Google HTTP Client to 1.42.0 Signed-off-by: Xue Zhou <xuezhou@amazon.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> Co-authored-by: Xue Zhou <xuezhou@amazon.com> * Add flat_skew setting to node overload decider (#3563) * Add flat_skew setting to node overload decider Signed-off-by: Rishab Nahata <rnnahata@amazon.com> * Bump xmlbeans from 5.0.3 to 5.1.0 in /plugins/ingest-attachment (#3572) * Bump xmlbeans from 5.0.3 to 5.1.0 in /plugins/ingest-attachment Bumps xmlbeans from 5.0.3 to 5.1.0. --- updated-dependencies: - dependency-name: org.apache.xmlbeans:xmlbeans dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Updating SHAs Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> * Bump google-oauth-client from 1.34.0 to 1.34.1 in /plugins/discovery-gce (#3570) * Bump google-oauth-client from 1.34.0 to 1.34.1 in /plugins/discovery-gce Bumps [google-oauth-client](https://github.com/googleapis/google-oauth-java-client) from 1.34.0 to 1.34.1. - [Release notes](https://github.com/googleapis/google-oauth-java-client/releases) - [Changelog](https://github.com/googleapis/google-oauth-java-client/blob/main/CHANGELOG.md) - [Commits](googleapis/google-oauth-java-client@v1.34.0...v1.34.1) --- updated-dependencies: - dependency-name: com.google.oauth-client:google-oauth-client dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * Updating SHAs Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> * Fix for bug showing incorrect awareness attributes count in AwarenessAllocationDecider (#3428) * Fix for bug showing incorrect awareness attributes count in AwarenessAllocationDecider Signed-off-by: Anshu Agarwal <anshukag@amazon.com> * Added bwc version 1.3.4 (#3552) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: opensearch-ci-bot <opensearch-ci-bot@users.noreply.github.com> * Support dynamic node role (#3436) * Support unknown node role Currently OpenSearch only supports several built-in nodes like data node role. If specify unknown node role, OpenSearch node will fail to start. This limit how to extend OpenSearch to support some extension function. For example, user may prefer to run ML tasks on some dedicated node which doesn't serve as any built-in node roles. So the ML tasks won't impact OpenSearch core function. This PR removed the limitation and user can specify any node role and OpenSearch will start node correctly with that unknown role. This opens the door for plugin developer to run specific tasks on dedicated nodes. Issue: #2877 Signed-off-by: Yaliang Wu <ylwu@amazon.com> * fix cat nodes rest API spec Signed-off-by: Yaliang Wu <ylwu@amazon.com> * fix mixed cluster IT failure Signed-off-by: Yaliang Wu <ylwu@amazon.com> * add DynamicRole Signed-off-by: Yaliang Wu <ylwu@amazon.com> * change generator method name Signed-off-by: Yaliang Wu <ylwu@amazon.com> * fix failed docker test Signed-off-by: Yaliang Wu <ylwu@amazon.com> * transform role name to lower case to avoid confusion Signed-off-by: Yaliang Wu <ylwu@amazon.com> * transform the node role abbreviation to lower case Signed-off-by: Yaliang Wu <ylwu@amazon.com> * fix checkstyle Signed-off-by: Yaliang Wu <ylwu@amazon.com> * add test for case-insensitive role name change Signed-off-by: Yaliang Wu <ylwu@amazon.com> * Rename package 'o.o.action.support.master' to 'o.o.action.support.clustermanager' (#3556) * Rename package org.opensearch.action.support.master to org.opensearch.action.support.clustermanager Signed-off-by: Tianli Feng <ftianli@amazon.com> * Rename classes with master term in the package org.opensearch.action.support.master Signed-off-by: Tianli Feng <ftianli@amazon.com> * Deprecate classes in org.opensearch.action.support.master Signed-off-by: Tianli Feng <ftianli@amazon.com> * Remove pakcage o.o.action.support.master Signed-off-by: Tianli Feng <ftianli@amazon.com> * Move package-info back Signed-off-by: Tianli Feng <ftianli@amazon.com> * Move package-info to new folder Signed-off-by: Tianli Feng <ftianli@amazon.com> * Correct the package-info Signed-off-by: Tianli Feng <ftianli@amazon.com> * Fixing flakiness of ShuffleForcedMergePolicyTests (#3591) Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Deprecate classes in org.opensearch.action.support.master (#3593) Signed-off-by: Tianli Feng <ftianli@amazon.com> * Add release notes for version 2.0.1 (#3595) Signed-off-by: Kunal Kotwani <kkotwani@amazon.com> * Fix NPE when minBound/maxBound is not set before being called. (#3605) Signed-off-by: George Apaaboah <george.apaaboah@gmail.com> * Added bwc version 2.0.2 (#3613) Co-authored-by: opensearch-ci-bot <opensearch-ci-bot@users.noreply.github.com> * Fix false positive query timeouts due to using cached time (#3454) * Fix false positive query timeouts due to using cached time Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * delegate nanoTime call to SearchContext Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * add override to SearchContext getRelativeTimeInMillis to force non cached time Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * Fix random gradle check failure issue 3584. (#3627) * [Segment Replication] Add components for segment replication to perform file copy. (#3525) * Add components for segment replication to perform file copy. This change adds the required components to SegmentReplicationSourceService to initiate copy and react to lifecycle events. Along with new components it refactors common file copy code from RecoverySourceHandler into reusable pieces. Signed-off-by: Marc Handalian <handalm@amazon.com> * Deprecate public methods and variables with master term in package 'org.opensearch.action.support.master' (#3617) Signed-off-by: Tianli Feng <ftianli@amazon.com> * Add replication orchestration for a single shard (#3533) * implement segment replication target Signed-off-by: Poojita Raj <poojiraj@amazon.com> * test added Signed-off-by: Poojita Raj <poojiraj@amazon.com> * changes to tests + finalizeReplication Signed-off-by: Poojita Raj <poojiraj@amazon.com> * fix style check Signed-off-by: Poojita Raj <poojiraj@amazon.com> * addressing comments + fix gradle check Signed-off-by: Poojita Raj <poojiraj@amazon.com> * added test + addressed review comments Signed-off-by: Poojita Raj <poojiraj@amazon.com> * [BUG] opensearch crashes on closed client connection before search reply (#3626) * [BUG] opensearch crashes on closed client connection before search reply Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Addressing code review comments Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Add all deprecated method in the package with new name 'org.opensearch.action.support.clustermanager' (#3644) Signed-off-by: Tianli Feng <ftianli@amazon.com> * Introduce TranslogManager implementations decoupled from the Engine (#3638) * Introduce decoupled translog manager interfaces Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com> * Adding onNewCheckpoint to Start Replication on Replica Shard when Segment Replication is turned on (#3540) * Adding onNewCheckpoint and it's test to start replication. SCheck for latestcheckpoint and replaying logic is removed from this commit and will be added in a different PR Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Changing binding/inject logic and addressing comments from PR Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Applying spotless check Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Moving shouldProcessCheckpoint() to IndexShard, and removing some trace logs Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * applying spotlessApply Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Adding more info to log statement in targetservice class Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * applying spotlessApply Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Addressing comments on PR Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Adding teardown() in SegmentReplicationTargetServiceTests. Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * fixing testShouldProcessCheckpoint() in SegmentReplicationTargetServiceTests Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Removing CheckpointPublisherProvider in IndicesModule Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * spotless check apply Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Remove class org.opensearch.action.support.master.AcknowledgedResponse (#3662) * Remove class org.opensearch.action.support.master.AcknowledgedResponse Signed-off-by: Tianli Feng <ftianli@amazon.com> * Remove class org.opensearch.action.support.master.AcknowledgedRequest RequestBuilder ShardsAcknowledgedResponse Signed-off-by: Tianli Feng <ftianli@amazon.com> * Restore AcknowledgedResponse and AcknowledgedRequest to package org.opensearch.action.support.master (#3669) Signed-off-by: Tianli Feng <ftianli@amazon.com> * [BUG] Custom POM configuration for ZIP publication produces duplicit tags (url, scm) (#3656) * [BUG] Custom POM configuration for ZIP publication produces duplicit tags (url, scm) Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Added test case for pluginZip with POM Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Support both Gradle 6.8.x and Gradle 7.4.x Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Adding 2.2.0 Bwc version to main (#3673) * Upgraded to t-digest 3.3. (#3634) * Revert renaming method onMaster() and offMaster() in interface LocalNodeMasterListener (#3686) Signed-off-by: Tianli Feng <ftianli@amazon.com> * Upgrading AWS SDK dependency for native plugins (#3694) * Merge branch 'feature/point_in_time' of https://github.com/opensearch-project/OpenSearch into fb Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> Co-authored-by: Suraj Singh <surajrider@gmail.com> Co-authored-by: Marc Handalian <handalm@amazon.com> Co-authored-by: Tianli Feng <ftianli@amazon.com> Co-authored-by: Andriy Redko <andriy.redko@aiven.io> Co-authored-by: Rabi Panda <adnapibar@gmail.com> Co-authored-by: Nick Knize <nknize@apache.org> Co-authored-by: Poojita Raj <poojiraj@amazon.com> Co-authored-by: Rishikesh Pasham <62345295+Rishikesh1159@users.noreply.github.com> Co-authored-by: Ankit Jain <jain.ankitk@gmail.com> Co-authored-by: vpehkone <101240162+vpehkone@users.noreply.github.com> Co-authored-by: sdp <sdp@9049fa06826d.jf.intel.com> Co-authored-by: Kartik Ganesh <gkart@amazon.com> Co-authored-by: Cole White <42356806+shdubsh@users.noreply.github.com> Co-authored-by: opensearch-trigger-bot[bot] <98922864+opensearch-trigger-bot[bot]@users.noreply.github.com> Co-authored-by: opensearch-ci-bot <opensearch-ci-bot@users.noreply.github.com> Co-authored-by: Xue Zhou <85715413+xuezhou25@users.noreply.github.com> Co-authored-by: Sachin Kale <sachinpkale@gmail.com> Co-authored-by: Sachin Kale <kalsac@amazon.com> Co-authored-by: Xue Zhou <xuezhou@amazon.com> Co-authored-by: Rishab Nahata <rishabnahata07@gmail.com> Co-authored-by: Anshu Agarwal <anshuagarwal11@gmail.com> Co-authored-by: Yaliang Wu <ylwu@amazon.com> Co-authored-by: Kunal Kotwani <kkotwani@amazon.com> Co-authored-by: George Apaaboah <35894485+GeorgeAp@users.noreply.github.com> Co-authored-by: Ahmad AbuKhalil <105249973+aabukhalil@users.noreply.github.com> Co-authored-by: Bukhtawar Khan <bukhtawa@amazon.com> Co-authored-by: Sarat Vemulapalli <vemulapallisarat@gmail.com> Co-authored-by: Daniel (dB.) Doubrovkine <dblock@dblock.org>
Signed-off-by: Ahmad AbuKhalil abukhali@amazon.com
Description
System.nanoTime()
forstartTime
when creating query timeout cancellation runnable to prevent premature false positive query timeouts. More info on how this fix will resolve the issue.Issues Resolved
fixes #2000
Testing
The new units tests has dependency on actual system time which is not the recommended method but these tests has passed
./gradlew test -Dtests.class=org.opensearch.search.query.QueryPhaseTests -Dtests.iters=200 -Dtests.method=testQueryTimeoutChecker*
so I would assume they are reliable enough.For performance, now we will have additional nanoTime called when timeout is set per query which might have slight performance impact so I will follow up on performance test after merging this.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.