
[BUG] Flaky integ test testBooleanQuery_withNeuralAndBM25Queries, testBasicQuery #384

Closed

martin-gaievski opened this issue Oct 3, 2023 · 8 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers)

@martin-gaievski (Member)
What is the bug?

Flaky integration tests: testBooleanQuery_withNeuralAndBM25Queries and testBasicQuery fail intermittently in CI.

How can one reproduce the bug?

In the GitHub CI for the plugin, the Windows test runs are not stable and fail at random.

Example of such failed run:

https://github.com/opensearch-project/neural-search/actions/runs/6397635644/job/17366751285?pr=359

Tests with failures:
 - org.opensearch.neuralsearch.query.NeuralQueryIT.testBooleanQuery_withNeuralAndBM25Queries
 - org.opensearch.neuralsearch.query.NeuralQueryIT.testBasicQuery


=== Standard output of node `node{::integTest-0}` ===
[errors and warnings from D:\a\neural-search\neural-search\build\testclusters\integTest-0\logs\opensearch.stdout.log]
[WARN ][o.o.g.DanglingIndicesState] [integTest-0] gateway.auto_import_dangling_indices is disabled, dangling indices will not be automatically detected or imported and must be managed manually
[WARN ][o.o.d.FileBasedSeedHostsProvider] [integTest-0] expected, but did not find, a dynamic hosts list at [D:\a\neural-search\neural-search\build\testclusters\integTest-0\config\unicast_hosts.txt]
[WARN ][r.suppressed             ] [integTest-0] path: /_plugins/_ml/models/3XAr94oB_1pNJlKRXFYm, params: {model_id=3XAr94oB_1pNJlKRXFYm}
java.lang.Exception: Model cannot be deleted in deploying or deployed state. Try undeploy model first then delete
	at org.opensearch.ml.action.models.DeleteModelTransportAction.lambda$doExecute$2(DeleteModelTransportAction.java:128) [opensearch-ml-2.11.0.0-SNAPSHOT.jar:2.11.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.ml.helper.ModelAccessControlHelper.validateModelGroupAccess(ModelAccessControlHelper.java:79) [opensearch-ml-2.11.0.0-SNAPSHOT.jar:2.11.0.0-SNAPSHOT]
	at org.opensearch.ml.action.models.DeleteModelTransportAction.lambda$doExecute$4(DeleteModelTransportAction.java:114) [opensearch-ml-2.11.0.0-SNAPSHOT.jar:2.11.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:113) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:107) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:298) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:284) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1516) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1599) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1579) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:71) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:62) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:45) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
[last 40 non error or warning messages from D:\a\neural-search\neural-search\build\testclusters\integTest-0\logs\opensearch.stdout.log]

What is the expected behavior?

Test results should be stable: the integration tests should pass consistently on every platform, including Windows.

Do you have any additional context?

Tests on Linux are much more stable. A recent PR fixed some flaky tests; perhaps some parameters can be tweaked there as well.

@martin-gaievski added the "bug" and "untriaged" labels on Oct 3, 2023
@navneet1v changed the title from "[BUG]" to "[BUG] Flaky integ test testBooleanQuery_withNeuralAndBM25Queries, testBasicQuery" on Oct 3, 2023
@navneet1v (Collaborator)

@heemin32 this can be a problem for the 2.11 release. Can we fix it?

@martin-gaievski added the "good first issue" label on Oct 10, 2023
@tanqiuliu mentioned this issue on Nov 7, 2023

@tanqiuliu
I ran into the same issue when trying to run `./gradlew build` on the latest commit, and raised a PR to fix it: #487

@navneet1v (Collaborator)

@tanqiuliu are you still working on the PR? We have to fix these flaky tests for 2.12. Please respond if you are still working on it.

@vibrantvarun (Member) commented on Jan 11, 2024

Hey @tanqiuliu

I don’t think the PR you raised will fix the issue. The reason:

In every integ test case, @Before calls prepareModel, which loads a model into the cluster, and @After finds the deployed models and deletes them.

The problem is that we are not deleting a specific model ID.

Consider this scenario (Screenshot 2024-01-11 at 3 35 50 PM, timeline reconstructed as text):

Time  | Thread 1 (running HybridQueryIT)         | Thread 2 (running NeuralQueryIT)
T+1   | execution starts; prepareModel is called |
T+5   | run test                                 |
T+9   |                                          | execution starts; prepareModel creates a model ID in DEPLOYING state
T+10  | @After deletes all deployed model IDs    |

Therefore, at time T+10, Thread 1 tries to delete Thread 2's model, which is still deploying, and hits the error mentioned in the issue description:
java.lang.Exception: Model cannot be deleted in deploying or deployed state. Try undeploy model first then delete

There are 2 major errors we face:

1. We sometimes get two model IDs, and the CI check fails.
2. As described above, we try to delete a model that is still in DEPLOYING state.

What your solution does is just extend the time of an individual execution by adding a wait in the load-model step; the scenario above can still occur.

Therefore, I would propose a different solution: declare a field at the top of the test class and store the model ID in it when prepareModel is executed.

Then, in @After, instead of finding all deployed models and deleting them, we delete the specific model ID generated before the test ran.

The same has been done in the BWC tests, and we didn't face any issues there.

The core issue behind these flaky tests is integ test execution in a multithreaded environment.

Open for your suggestions.
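To make the proposal concrete, here is a minimal, self-contained sketch of the race and the per-class cleanup. This is not the actual neural-search test code: `prepareModel`, `deleteAllDeployedModels`, and `deleteModel` are stand-ins for the real test helpers and ml-commons APIs, and only the DEPLOYING case of the deletion error is modeled.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the flaky-test race and the proposed fix (stand-in code,
// not the real neural-search / ml-commons implementation).
public class ModelCleanupSketch {

    // Stand-in for the cluster's model registry: model ID -> state.
    static final Map<String, String> models = new ConcurrentHashMap<>();
    static final AtomicInteger idCounter = new AtomicInteger();

    // Stand-in for prepareModel(): registers a model and returns its ID.
    // A freshly created model starts out in DEPLOYING state.
    static String prepareModel() {
        String id = "model-" + idCounter.incrementAndGet();
        models.put(id, "DEPLOYING");
        return id;
    }

    static void markDeployed(String id) {
        models.put(id, "DEPLOYED");
    }

    // The OLD cleanup: find and delete every model in the cluster. If
    // another test class is mid-deployment on a parallel thread, this
    // fails like the error in the logs (simplified: only the DEPLOYING
    // case is modeled here).
    static void deleteAllDeployedModels() {
        for (Map.Entry<String, String> e : models.entrySet()) {
            if ("DEPLOYING".equals(e.getValue())) {
                throw new IllegalStateException(
                    "Model cannot be deleted in deploying or deployed state. "
                        + "Try undeploy model first then delete");
            }
            models.remove(e.getKey());
        }
    }

    // The PROPOSED cleanup: delete only the model ID that this test
    // class stored in @Before, so parallel test classes never touch
    // each other's models.
    static void deleteModel(String id) {
        models.remove(id);
    }
}
```

In the real test class the stored ID would live in a field set by the @Before method and passed to the @After cleanup, as the comment above describes.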

@navneet1v (Collaborator)

> Hey @tanqiuliu
>
> [@vibrantvarun's comment above, quoted in full]

+1 on this. We should do this rather than waiting for the model to be deployed and deleted.

@martin-gaievski (Member, Author)

Last time I was checking these flaky tests, that approach faced a major issue: models were redeployed in the background with a different model ID, so storing the model ID did not give any benefit. I traced it to a feature in ml-commons: opensearch-project/ml-commons#852. I'm not sure how the feature works nowadays; it seems there were some changes to the default behavior: opensearch-project/ml-commons#1808.
We can spend some time checking whether the same concern is still valid.

@navneet1v (Collaborator)

The redeploy happens if the cluster died or a node restarted before the model was deleted; hence you would be seeing those issues.

@vibrantvarun (Member)

The bug is resolved now.

@vibrantvarun moved this from ✅ Done to 2.12.0 in Vector Search RoadMap on Feb 5, 2024