
[BUG] Flaky integ test testBooleanQuery_withNeuralAndBM25Queries, testBasicQuery #384

Closed

martin-gaievski opened this issue Oct 3, 2023 · 8 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers)

@martin-gaievski (Member)
What is the bug?

Flaky integration tests: testBooleanQuery_withNeuralAndBM25Queries and testBasicQuery fail intermittently in CI.

How can one reproduce the bug?

In the GitHub CI for the plugin, the Windows test runs are not stable and fail at random.

Example of such failed run:

https://github.com/opensearch-project/neural-search/actions/runs/6397635644/job/17366751285?pr=359

Tests with failures:
 - org.opensearch.neuralsearch.query.NeuralQueryIT.testBooleanQuery_withNeuralAndBM25Queries
 - org.opensearch.neuralsearch.query.NeuralQueryIT.testBasicQuery


=== Standard output of node `node{::integTest-0}` ===
[errors and warnings from D:\a\neural-search\neural-search\build\testclusters\integTest-0\logs\opensearch.stdout.log]
[WARN ][o.o.g.DanglingIndicesState] [integTest-0] gateway.auto_import_dangling_indices is disabled, dangling indices will not be automatically detected or imported and must be managed manually
[WARN ][o.o.d.FileBasedSeedHostsProvider] [integTest-0] expected, but did not find, a dynamic hosts list at [D:\a\neural-search\neural-search\build\testclusters\integTest-0\config\unicast_hosts.txt]
[WARN ][r.suppressed             ] [integTest-0] path: /_plugins/_ml/models/3XAr94oB_1pNJlKRXFYm, params: {model_id=3XAr94oB_1pNJlKRXFYm}
java.lang.Exception: Model cannot be deleted in deploying or deployed state. Try undeploy model first then delete
	at org.opensearch.ml.action.models.DeleteModelTransportAction.lambda$doExecute$2(DeleteModelTransportAction.java:128) [opensearch-ml-2.11.0.0-SNAPSHOT.jar:2.11.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.ml.helper.ModelAccessControlHelper.validateModelGroupAccess(ModelAccessControlHelper.java:79) [opensearch-ml-2.11.0.0-SNAPSHOT.jar:2.11.0.0-SNAPSHOT]
	at org.opensearch.ml.action.models.DeleteModelTransportAction.lambda$doExecute$4(DeleteModelTransportAction.java:114) [opensearch-ml-2.11.0.0-SNAPSHOT.jar:2.11.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:113) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:107) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:298) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:284) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1516) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1599) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1579) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:71) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:62) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:45) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.11.0-SNAPSHOT.jar:2.11.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
[last 40 non error or warning messages from D:\a\neural-search\neural-search\build\testclusters\integTest-0\logs\opensearch.stdout.log]

What is the expected behavior?

Test results should be stable: the integration tests should pass consistently on every platform, including Windows.

Do you have any additional context?

Tests on Linux are much more stable. A recent PR fixed some flaky tests; perhaps some parameters can be tweaked there as well.

@martin-gaievski added the "bug" and "untriaged" labels on Oct 3, 2023
@navneet1v changed the title from "[BUG]" to "[BUG] Flaky integ test testBooleanQuery_withNeuralAndBM25Queries, testBasicQuery" on Oct 3, 2023
@navneet1v (Collaborator)

@heemin32 this can be a problem for the 2.11 release. Can we fix it?

@martin-gaievski added the "good first issue" label on Oct 10, 2023
@tanqiuliu mentioned this issue on Nov 7, 2023

@tanqiuliu
I ran into the same issue when trying to run `./gradlew build` on the latest commit, and raised a PR to fix it: #487

@navneet1v (Collaborator)

@tanqiuliu are you still working on the PR? We have to fix these flaky tests for 2.12. Please respond if you are still working on it.

@vibrantvarun (Member) commented on Jan 11, 2024

Hey @tanqiuliu

I don’t think the PR you raised will fix the issue. The reason:

In every integ test case, @Before calls prepareModel, which loads a model into the cluster, and @After finds the deployed models and deletes them.

The problem is that we are not deleting a specific model ID.

Consider this scenario (Screenshot 2024-01-11 at 3 35 50 PM, timeline reconstructed as text):

Time  | Thread 1 (running HybridQueryIT)         | Thread 2 (running NeuralQueryIT)
T+1   | execution starts; prepareModel is called |
T+5   | run test                                 |
T+9   |                                          | execution starts; prepareModel creates a model ID in DEPLOYING state
T+10  | @After deletes all deployed model IDs    |

Therefore, at time T+10, Thread 1 tries to delete Thread 2's model, which is still deploying, and hits the error mentioned in the issue description:
java.lang.Exception: Model cannot be deleted in deploying or deployed state. Try undeploy model first then delete

There are 2 major errors we face:

1. We sometimes get two model IDs, and the CI check fails.
2. As described above, we try to delete a model that is still in DEPLOYING state.

What your solution does is just extend the time of an individual execution by adding a wait in the load-model step; the scenario above can still occur.

Therefore, I would propose a different solution: declare a field at the top of the test class and store the model ID in it when prepareModel is executed.

Then, in @After, instead of finding all deployed models and deleting them, we delete the specific model ID generated before the test ran.

The same has been done in the BWC tests, and we didn't face any issues there.

The core issue behind these flaky tests is integ test execution in a multithreaded environment.

Open for your suggestions.
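To make the proposal concrete, here is a minimal, self-contained sketch of the race and the per-class cleanup. This is not the actual neural-search test code: `prepareModel`, `deleteAllDeployedModels`, and `deleteModel` are stand-ins for the real test helpers and ml-commons APIs, and only the DEPLOYING case of the deletion error is modeled.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the flaky-test race and the proposed fix (stand-in code,
// not the real neural-search / ml-commons implementation).
public class ModelCleanupSketch {

    // Stand-in for the cluster's model registry: model ID -> state.
    static final Map<String, String> models = new ConcurrentHashMap<>();
    static final AtomicInteger idCounter = new AtomicInteger();

    // Stand-in for prepareModel(): registers a model and returns its ID.
    // A freshly created model starts out in DEPLOYING state.
    static String prepareModel() {
        String id = "model-" + idCounter.incrementAndGet();
        models.put(id, "DEPLOYING");
        return id;
    }

    static void markDeployed(String id) {
        models.put(id, "DEPLOYED");
    }

    // The OLD cleanup: find and delete every model in the cluster. If
    // another test class is mid-deployment on a parallel thread, this
    // fails like the error in the logs (simplified: only the DEPLOYING
    // case is modeled here).
    static void deleteAllDeployedModels() {
        for (Map.Entry<String, String> e : models.entrySet()) {
            if ("DEPLOYING".equals(e.getValue())) {
                throw new IllegalStateException(
                    "Model cannot be deleted in deploying or deployed state. "
                        + "Try undeploy model first then delete");
            }
            models.remove(e.getKey());
        }
    }

    // The PROPOSED cleanup: delete only the model ID that this test
    // class stored in @Before, so parallel test classes never touch
    // each other's models.
    static void deleteModel(String id) {
        models.remove(id);
    }
}
```

In the real test class the stored ID would live in a field set by the @Before method and passed to the @After cleanup, as the comment above describes.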

@navneet1v (Collaborator)

> Hey @tanqiuliu
>
> [@vibrantvarun's comment above, quoted in full]

+1 on this. We should do this rather than waiting for the model to be deployed and deleted.

@martin-gaievski (Member, Author)

Last time I was checking these flaky tests, that approach faced a major issue: models were redeployed in the background with a different model ID, so storing the model ID did not give any benefit. I traced it to a feature in ml-commons: opensearch-project/ml-commons#852. I'm not sure how the feature works nowadays; it seems there were some changes to the default behavior: opensearch-project/ml-commons#1808.
We can spend some time checking whether the same concern is still valid.

@navneet1v (Collaborator)

The redeploy happens if the cluster died or a node restarted before the model was deleted; hence you would be seeing those issues.

@vibrantvarun (Member)

The bug is resolved now.

@vibrantvarun moved this from ✅ Done to 2.12.0 in Vector Search RoadMap on Feb 5, 2024