[BUG] Flaky integ test testBooleanQuery_withNeuralAndBM25Queries, testBasicQuery #384
Comments
@heemin32 this can be a problem for the 2.11 release. Can we fix it?
I ran into the same issue when trying to run
@tanqiuliu are you still working on the PR? We have to fix these flaky tests for 2.12. Please respond if you are still working on it.
Hey @tanqiuliu, I don't think the PR you raised will fix the issue. The reason is that every integ test case deploys a model, and the problem is that we are not deleting a specific model ID. Consider the scenario [timeline image not preserved]: because tests run concurrently, by time T+10 more than one model is deployed, and the cleanup throws the error mentioned in the description of this issue. There are 2 major errors we face (see neural-search/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java, line 788 in 91a1202): we sometimes get 2 model IDs, and the CI check fails.
What your solution is doing is just extending the time of an individual execution by adding a wait to the load-model step. Instead, I would propose a solution where we declare a field at the top of the class and store the model ID in it when prepare model is executed. Then, in @After, instead of finding all deployed models and deleting them, we delete the specific model ID generated before running the test. The same has been done in the BWC tests, and we didn't face any issues there. The core issue behind these flaky tests is integ test execution in a multithreaded environment. Open to your suggestions.
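The proposal above can be sketched with a minimal, self-contained stand-in. All names here (ModelCleanupSketch, prepareModel, deleteModel, the in-memory deployedModels map) are hypothetical substitutes for the real BaseNeuralSearchIT fixture helpers and the cluster state; the sketch only contrasts per-test-ID deletion with "find and delete all deployed models" under concurrent tests.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class ModelCleanupSketch {

    // Stand-in for the cluster's set of deployed models (id -> owning test).
    static final Map<String, String> deployedModels = new ConcurrentHashMap<>();

    // Each test stores the ID of the model it deployed, e.g. in a field
    // assigned when the prepare-model step runs.
    static String prepareModel(String testName) {
        String modelId = UUID.randomUUID().toString();
        deployedModels.put(modelId, testName);
        return modelId;
    }

    // Proposed @After cleanup: delete only the model this test created,
    // leaving models belonging to concurrently running tests untouched.
    static void deleteModel(String modelId) {
        deployedModels.remove(modelId);
    }

    public static void main(String[] args) {
        // Two tests overlap, as they can in a multithreaded integ run.
        String modelA = prepareModel("testBasicQuery");
        String modelB = prepareModel("testBooleanQuery");

        // The old cleanup ("find all deployed models") would now see two
        // IDs and trip the single-model assertion; per-ID deletion only
        // removes this test's own model.
        deleteModel(modelA);

        System.out.println(deployedModels.containsKey(modelA)); // false
        System.out.println(deployedModels.containsKey(modelB)); // true
    }
}
```

Note this only addresses leakage between overlapping tests; as discussed below, it does not help if a model is redeployed in the background under a different ID.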
+1 on this. We should do this rather than waiting for the model to be deployed and deleted.
Last time I was checking flaky tests, that approach faced a major issue: models were redeployed in the background with a different model ID, so storing the model ID did not give any benefit. I traced it to a feature in ml-commons: opensearch-project/ml-commons#852. I'm not sure how the feature works nowadays; it seems there were some changes to the default behavior: opensearch-project/ml-commons#1808.
The redeploy happens if the cluster died or a node restarted before the model was deleted, hence the issues you were seeing.
The bug is resolved now.
What is the bug?
Flaky integ test
How can one reproduce the bug?
In the GitHub CI for the plugin, you can see that the tests for Windows are not stable and fail at random.
Example of such failed run:
https://github.com/opensearch-project/neural-search/actions/runs/6397635644/job/17366751285?pr=359
What is the expected behavior?
Test results are stable
Do you have any additional context?
Tests on Linux are much more stable. There was a recent PR that fixes some flaky tests; maybe some params can be tweaked there.