[BUG] org.opensearch.search.sort.FieldSortIT.testSimpleSorts {p0={"search.concurrent_segment_search.enabled":"true"}} is flaky #11875

Closed
reta opened this issue Jan 12, 2024 · 8 comments · Fixed by #12089
Labels: bug, flaky-test, Search:Relevance, Search, v2.12.0, v3.0.0

Comments

@reta
Collaborator

reta commented Jan 12, 2024

Describe the bug

The test case org.opensearch.search.sort.FieldSortIT.testSimpleSorts {p0={"search.concurrent_segment_search.enabled":"true"}} is flaky:

Failed to execute phase [query], all shards failed; shardFailures {[Di7AFLIFTB20nfl7X5doBg][test][0]: RemoteTransportException[[node_s0][127.0.0.1:42601][indices:data/read/search[phase/query]]]; nested: QueryPhaseExecutionException[Query Failed [Failed to execute main query]]; nested: UnsupportedOperationException; }

Failed to execute phase [query], all shards failed; shardFailures {[Di7AFLIFTB20nfl7X5doBg][test][0]: RemoteTransportException[[node_s0][127.0.0.1:42601][indices:data/read/search[phase/query]]]; nested: QueryPhaseExecutionException[Query Failed [Failed to execute main query]]; nested: UnsupportedOperationException; }
	at __randomizedtesting.SeedInfo.seed([90490F3B9D1AFC66:D232EAF3A97C8078]:0)
	at app//org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:718)
	at app//org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:379)
	at app//org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:757)
	at app//org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:511)
	at app//org.opensearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:301)
	at app//org.opensearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:104)
	at app//org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:75)
	at app//org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:755)
	at app//org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleException(TraceableTransportResponseHandler.java:81)
	at app//org.opensearch.transport.TransportService$9.handleException(TransportService.java:1690)
	at app//org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1476)
	at app//org.opensearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1590)
	at app//org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1564)
	at app//org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:81)
	at app//org.opensearch.transport.TransportChannel.sendErrorResponse(TransportChannel.java:73)
	at app//org.opensearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:70)
	at app//org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:104)
	at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54)
	at app//org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78)
	at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at app//org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59)
	at app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913)
	at app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583)
Caused by: OpenSearchException; nested: UnsupportedOperationException;
	at app//org.opensearch.OpenSearchException.guessRootCauses(OpenSearchException.java:708)
	at app//org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:377)
	... 24 more
Caused by: java.lang.UnsupportedOperationException
	at org.opensearch.index.fielddata.AbstractNumericDocValues.advance(AbstractNumericDocValues.java:57)
	at org.apache.lucene.search.comparators.NumericComparator$NumericLeafComparator$2.advance(NumericComparator.java:407)
	at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:286)
	at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:236)
	at org.opensearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:71)
	at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38)
	at org.opensearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:327)
	at org.opensearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:283)
	at org.apache.lucene.search.IndexSearcher.lambda$search$2(IndexSearcher.java:721)
	at org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$createTask$0(TaskExecutor.java:118)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	... 8 more

Related component

Search

To Reproduce

./gradlew ':server:internalClusterTest' --tests "org.opensearch.search.sort.FieldSortIT" -Dtests.method="testSimpleSorts {p0={"search.concurrent_segment_search.enabled":"true"}}" -Dtests.seed=90490F3B9D1AFC66

Expected behavior

The test must always pass

Additional Details

Plugins
Standard


Host/Environment (please complete the following information):

  • CI


@reta
Collaborator Author

reta commented Jan 12, 2024

@mch2 @msfroh I am not sure whether this is related to Apache Lucene 9.9.1, I just want to bring it to your attention, folks

@peternied
Member

[Triage - attendees 1 2 3 4]
Thanks for filing

@jed326
Collaborator

jed326 commented Jan 29, 2024

Able to reproduce this 100% with the provided test seed:

 ./gradlew ':server:internalClusterTest' --tests "org.opensearch.search.sort.FieldSortIT" -Dtests.method="testSimpleSorts {p0={"search.concurrent_segment_search.enabled":"true"}}" -Dtests.seed=90490F3B9D1AFC66 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sv -Dtests.timezone=America/Virgin -Druntime.java=21

Exception is coming from here:

@Override
public int advance(int target) throws IOException {
    throw new UnsupportedOperationException();
}

This comes from the searchLeaf path, which should look the same for both concurrent and non-concurrent search, so the next step is to figure out why it only fails with concurrent search enabled.
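For illustration, here is a minimal, self-contained sketch of the failure mode visible in the stack trace: a caller that steps through documents with nextDoc() works, while a caller that tries to skip ahead with advance() hits the UnsupportedOperationException. The names DocIterator, AdvanceSketch, sequential, countByStepping and skipTo are hypothetical stand-ins, not Lucene or OpenSearch classes.

```java
// Hypothetical stand-in for a doc-values iterator; not the Lucene/OpenSearch type.
abstract class DocIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    abstract int nextDoc();

    // Mirrors the snippet above: skipping ahead is not supported.
    int advance(int target) {
        throw new UnsupportedOperationException();
    }
}

final class AdvanceSketch {
    // Iterates doc ids 0..maxDoc-1 and only implements nextDoc().
    static DocIterator sequential(int maxDoc) {
        return new DocIterator() {
            private int doc = -1;

            @Override
            int nextDoc() {
                if (doc != NO_MORE_DOCS) {
                    doc = (doc + 1 < maxDoc) ? doc + 1 : NO_MORE_DOCS;
                }
                return doc;
            }
        };
    }

    // A caller that visits every document never calls advance(), so it succeeds.
    static int countByStepping(DocIterator it) {
        int count = 0;
        while (it.nextDoc() != DocIterator.NO_MORE_DOCS) {
            count++;
        }
        return count;
    }

    // A caller that tries to skip to a target document triggers the exception.
    static String skipTo(DocIterator it, int target) {
        try {
            return "advanced to " + it.advance(target);
        } catch (UnsupportedOperationException e) {
            return "UnsupportedOperationException";
        }
    }

    public static void main(String[] args) {
        System.out.println(countByStepping(sequential(5))); // prints 5
        System.out.println(skipTo(sequential(5), 3));       // prints UnsupportedOperationException
    }
}
```

The sketch only shows that the exception depends on which code path the caller takes; it does not explain why the concurrent path takes it.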

@jed326
Collaborator

jed326 commented Jan 29, 2024

This is (one of) the problematic queries:

searchResponse = client().prepareSearch().setQuery(matchAllQuery()).setSize(size).addSort("half_float_value", SortOrder.DESC).get();

@reta
Collaborator Author

reta commented Jan 29, 2024

> This is the problematic query:

This is interesting, I think we handle HALF_FLOAT / UNSIGNED_LONG differently but I am wondering how come the test fails from time to time?

@jed326
Collaborator

jed326 commented Jan 29, 2024

> This is interesting, I think we handle HALF_FLOAT / UNSIGNED_LONG differently but I am wondering how come the test fails from time to time?

It looks like we encounter the same problem when sorting on double_value and unsigned_long_value as well.

The problem seems to be related to this change: apache/lucene#12405
Specifically: apache/lucene@d910990#diff-79c6a57519ecd1ef504629e62e13d17859a4ffedc58f4602e583ce758a15adc8R291-R295

The comparator is getting set to NumericComparator::NumericLeafComparator. In the non-concurrent search case we don't go into the if statement, so the competitiveIterator is not updated.

The if statement seems to be related to the number of documents on the segment, though, so it seems like we could still see this failure in the non-concurrent search case.

@jed326
Collaborator

jed326 commented Jan 30, 2024

It seems to me the naive solution here is to change this:

return new HalfFloatComparator(numHits, fieldname, fMissingValue, reversed, filterPruning(pruning)) {

to use Pruning.NONE, which I think means we won't update the competitiveIterator to prune results. That seems likely to cause a performance regression though, since it basically reverts the changes from #8168.
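As a rough illustration of that trade-off, here is a toy model, assuming a collector that only calls advance() when pruning is enabled. PruningSketch, its two-valued Pruning enum, Values and top are hypothetical names for this sketch and do not reproduce the Lucene or OpenSearch comparator code.

```java
final class PruningSketch {
    // Hypothetical switch; only "off" versus "on" matters for this sketch.
    enum Pruning { NONE, ENABLED }

    // Hypothetical per-document values that can be read but cannot skip ahead.
    static final class Values {
        private final double[] byDoc;

        Values(double... byDoc) {
            this.byDoc = byDoc;
        }

        int maxDoc() {
            return byDoc.length;
        }

        double get(int doc) {
            return byDoc[doc];
        }

        int advance(int target) {
            throw new UnsupportedOperationException();
        }
    }

    // Finds the largest value. With pruning off, every document is visited;
    // with pruning on, the loop asks the values to skip ahead via advance().
    static double top(Values values, Pruning pruning) {
        double best = Double.NEGATIVE_INFINITY;
        int doc = 0;
        while (doc < values.maxDoc()) {
            best = Math.max(best, values.get(doc));
            doc = (pruning == Pruning.NONE) ? doc + 1 : values.advance(doc + 1);
        }
        return best;
    }

    public static void main(String[] args) {
        Values values = new Values(1.5, 7.5, 3.0);
        System.out.println(top(values, Pruning.NONE)); // prints 7.5
        try {
            top(values, Pruning.ENABLED);
        } catch (UnsupportedOperationException e) {
            System.out.println("pruning path hit advance()");
        }
    }
}
```

In this model, turning pruning off avoids the unsupported call at the cost of reading every document, which is the performance concern raised above.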

@reta @mch2 as you are more familiar with the Lucene 9.9 upgrade would you mind giving a second set of eyes here?

@reta
Collaborator Author

reta commented Jan 30, 2024

> @reta @mch2 as you are more familiar with the Lucene 9.9 upgrade would you mind giving a second set of eyes here?

Sure, I will take a look shortly, thank you @jed326
