Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BasicDistributedJobsIT#testDedicatedMlNode fails #63980

Closed
javanna opened this issue Oct 21, 2020 · 4 comments · Fixed by #69198
Closed

BasicDistributedJobsIT#testDedicatedMlNode fails #63980

javanna opened this issue Oct 21, 2020 · 4 comments · Fixed by #69198
Assignees
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@javanna
Copy link
Member

javanna commented Oct 21, 2020

Build scan: https://gradle-enterprise.elastic.co/s/jji6ljcth352u/console-log?task=:x-pack:plugin:ml:internalClusterTest

Repro line: ./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT.testDedicatedMlNode" -Dtests.seed=B52557EBFF0CB0F6 -Dtests.security.manager=true -Dtests.locale=ar-SA -Dtests.timezone=Africa/Cairo -Druntime.java=8

Reproduces locally?: no

Applicable branches: failed on 7.x and 7.10 so far

Failure history: It started failing only a few days ago in this way and failed only three times in total with this specific error.

Failure excerpt:


org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT > testDedicatedMlNode FAILED |  
-- | --
  | java.lang.RuntimeException: Had to resort to force-closing job, something went wrong? |  
  | at __randomizedtesting.SeedInfo.seed([B52557EBFF0CB0F6:C064E94DE8C3429]:0) |  
  | at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllJobs(BaseMlIntegTestCase.java:379) |  
  | at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.cleanupWorkaround(BaseMlIntegTestCase.java:245) |  
  | at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) |  
  | at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) |  
  | at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) |  
  | at java.lang.reflect.Method.invoke(Method.java:498) |  
  | at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750) |  
  | at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996) |  
  | at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) |  
  | at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49) |  
  | at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) |  
  | at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48) |  
  | at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64) |  
  | at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47) |  
  | at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) |  
  | at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368) |  
  | at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817) |  
  | at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468) |  
  | at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947) |  
  | at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832) |  
  | at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883) |  
  | at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894) |  
  | at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) |  
  | at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) |  
  | at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41) |  
  | at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) |  
  | at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) |  
  | at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) |  
  | at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) |  
  | at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) |  
  | at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53) |  
  | at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47) |  
  | at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64) |  
  | at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54) |  
  | at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) |  
  | at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368) |  
  | at java.lang.Thread.run(Thread.java:748) |  
  |   |  
  | Caused by: |  
  | java.util.concurrent.ExecutionException: RemoteTransportException[[node_t1][127.0.0.1:60399][cluster:admin/xpack/ml/job/close]]; nested: IllegalStateException[Timed out when waiting for persistent tasks after 1.5m]; |  
  | at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:273) |  
  | at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:260) |  
  | at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87) |  
  | at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllJobs(BaseMlIntegTestCase.java:367) |  
  | ... 36 more |  
  |   |  
  | Caused by: |  
  | RemoteTransportException[[node_t1][127.0.0.1:60399][cluster:admin/xpack/ml/job/close]]; nested: IllegalStateException[Timed out when waiting for persistent tasks after 1.5m]; |  
  |   |  
  | Caused by: |  
  | java.lang.IllegalStateException: Timed out when waiting for persistent tasks after 1.5m |  
  | at org.elasticsearch.persistent.PersistentTasksService$2.onTimeout(PersistentTasksService.java:209) |  
  | at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:335) |  
  | at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) |  
  | at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:601) |  
  | at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) |  
  | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) |  
  | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) |  
  | at java.lang.Thread.run(Thread.java:748)

I found in the logs errors like, not sure if related:


1> [2020-10-21T11:15:00,537][ERROR][o.e.x.m.e.l.LocalExporter] [node_t0] failed to set monitoring pipeline [xpack_monitoring_6] |  
  | 1> org.elasticsearch.ElasticsearchParseException: No processor type exists with name [script] |  
  | 1> 	at org.elasticsearch.ingest.ConfigurationUtils.newConfigurationException(ConfigurationUtils.java:316) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.ingest.ConfigurationUtils.readProcessor(ConfigurationUtils.java:446) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.ingest.ConfigurationUtils.readProcessor(ConfigurationUtils.java:399) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.ingest.ConfigurationUtils.readProcessorConfigs(ConfigurationUtils.java:337) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.ingest.Pipeline.create(Pipeline.java:74) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.ingest.IngestService.validatePipeline(IngestService.java:435) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.ingest.IngestService.putPipeline(IngestService.java:340) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.ingest.PutPipelineTransportAction.lambda$masterOperation$0(PutPipelineTransportAction.java:88) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:89) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:83) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:224) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.finishHim(TransportNodesAction.java:263) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onOperation(TransportNodesAction.java:248) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$000(TransportNodesAction.java:177) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleResponse(TransportNodesAction.java:226) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleResponse(TransportNodesAction.java:218) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.transport.TransportService$6.handleResponse(TransportService.java:634) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1171) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1249) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1229) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:52) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:271) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:267) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:72) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.transport.TransportService$8.doRun(TransportService.java:809) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_241] |  
  | 1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_241] |  
  | 1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_241] |  
  | 1> 	Suppressed: org.elasticsearch.ElasticsearchParseException: No processor type exists with name [gsub] |  
  | 1> 		at org.elasticsearch.ingest.ConfigurationUtils.newConfigurationException(ConfigurationUtils.java:316) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.ingest.ConfigurationUtils.readProcessor(ConfigurationUtils.java:446) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.ingest.ConfigurationUtils.readProcessor(ConfigurationUtils.java:399) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.ingest.ConfigurationUtils.readProcessorConfigs(ConfigurationUtils.java:337) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.ingest.Pipeline.create(Pipeline.java:74) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.ingest.IngestService.validatePipeline(IngestService.java:435) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.ingest.IngestService.putPipeline(IngestService.java:340) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.ingest.PutPipelineTransportAction.lambda$masterOperation$0(PutPipelineTransportAction.java:88) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:89) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:83) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:224) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.finishHim(TransportNodesAction.java:263) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onOperation(TransportNodesAction.java:248) [elasticsearch-7.10.0-SNA  1> PSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$000(TransportNodesAction.java:177) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleResponse(TransportNodesAction.java:226) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleResponse(TransportNodesAction.java:218) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.transport.TransportService$6.handleResponse(TransportService.java:634) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1171) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1249) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1229) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:52) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:271) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:267) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:72) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.transport.TransportService$8.doRun(TransportService.java:809) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT] |  
  | 1> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_241] |  
  | 1> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_241] |  
  | 1> 		at java.lang.Thread.run(Thread.java:748) [?:1.8.0_241]


@javanna javanna added >test-failure Triaged test failures from CI :ml Machine learning labels Oct 21, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Copy link
Contributor

droberts195 commented Oct 21, 2020

I think the interesting bit of the log is:

  1> [2020-10-21T11:15:48,795][INFO ][o.e.x.m.i.BasicDistributedJobsIT] [testDedicatedMlNode] [BasicDistributedJobsIT#testDedicatedMlNode]: Cleaning up datafeeds and jobs after test
  1> [2020-10-21T11:15:48,797][INFO ][o.e.x.m.i.BasicDistributedJobsIT] [testDedicatedMlNode] Closing all datafeeds (using _all)
  1> [2020-10-21T11:15:48,804][INFO ][o.e.x.m.i.BasicDistributedJobsIT] [testDedicatedMlNode] Closing jobs using [_all]
  1> [2020-10-21T11:15:48,915][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node_t4] Opening job [dedicated-ml-node-job]
  1> [2020-10-21T11:15:48,922][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node_t4] [dedicated-ml-node-job] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
  1> [2020-10-21T11:15:48,986][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node_t4] Successfully set job state to [opened] for job [dedicated-ml-node-job]

The test ends when the job's persistent task is assigned to a node. However, after this happens (and hence after the end of the test code and after the cleanup code has started running) the autodetect process manager will be running the autodetect process (the "blackhole" one rather than the native one in this case since it's an internal cluster test). So it seems that there's a window of opportunity where if a job is closed in between being assigned to a node but before the autodetect process is running then that close request is ignored. (This may possibly only affect close requests that use _all, possibly because of the way _all gets expanded to a list of concrete jobs that need closing.)

@pgomulka
Copy link
Contributor

pgomulka commented Dec 2, 2020

another failure on master (PR)
./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT.testDedicatedMlNode" -Dtests.seed=D9F6C8D54269C031 -Dtests.security.manager=true -Dtests.locale=cs -Dtests.timezone=America/Kentucky/Louisville -Druntime.java=11

https://gradle-enterprise.elastic.co/s/obdm3spo7telc

does not reproduce locally

@gwbrown
Copy link
Contributor

gwbrown commented Dec 18, 2020

This has been failing more recently on 7.11 and master-based PRs.
Scan from 7.11: https://gradle-enterprise.elastic.co/s/pd27ckj3einns
Scan from master-based PR: https://gradle-enterprise.elastic.co/s/6rzjho54npjuy

As this has started failing more in the past couple days, I'm going to mute this test.

gwbrown added a commit to gwbrown/elasticsearch that referenced this issue Dec 18, 2020
gwbrown added a commit that referenced this issue Dec 18, 2020
gwbrown added a commit to gwbrown/elasticsearch that referenced this issue Dec 18, 2020
gwbrown added a commit to gwbrown/elasticsearch that referenced this issue Dec 18, 2020
gwbrown added a commit that referenced this issue Dec 18, 2020
@benwtrent benwtrent self-assigned this Feb 17, 2021
benwtrent added a commit that referenced this issue Feb 18, 2021
previous work fixing a hard to detect `_close` race condition
: #69136

May have fixed the test failures indicated in: #63980

adding some logging and unmuting the previously flaky test to see if the issue has been addressed.

closes: #63980
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Feb 18, 2021
…c#69198)

previous work fixing a hard to detect `_close` race condition
: elastic#69136

May have fixed the test failures indicated in: elastic#63980

adding some logging and unmuting the previously flaky test to see if the issue has been addressed.

closes: elastic#63980
benwtrent added a commit that referenced this issue Feb 18, 2021
#69206)

previous work fixing a hard to detect `_close` race condition
: #69136

May have fixed the test failures indicated in: #63980

adding some logging and unmuting the previously flaky test to see if the issue has been addressed.

closes: #63980
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 19, 2021
There has been trace logging in BasicDistributedJobsIT for elastic#63980
since February. This was added in elastic#69198, on the basis that elastic#69136
had probably fixed a problem but if not then the trace logging
would be useful. Since the tests have not been failing since they
were reenabled this logging can now be removed.
droberts195 added a commit that referenced this issue Nov 19, 2021
There has been trace logging in BasicDistributedJobsIT for #63980
since February. This was added in #69198, on the basis that #69136
had probably fixed a problem but if not then the trace logging
would be useful. Since the tests have not been failing since they
were reenabled this logging can now be removed.
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 19, 2021
There has been trace logging in BasicDistributedJobsIT for elastic#63980
since February. This was added in elastic#69198, on the basis that elastic#69136
had probably fixed a problem but if not then the trace logging
would be useful. Since the tests have not been failing since they
were reenabled this logging can now be removed.
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 19, 2021
There has been trace logging in BasicDistributedJobsIT for elastic#63980
since February. This was added in elastic#69198, on the basis that elastic#69136
had probably fixed a problem but if not then the trace logging
would be useful. Since the tests have not been failing since they
were reenabled this logging can now be removed.
elasticsearchmachine pushed a commit that referenced this issue Nov 19, 2021
There has been trace logging in BasicDistributedJobsIT for #63980
since February. This was added in #69198, on the basis that #69136
had probably fixed a problem but if not then the trace logging
would be useful. Since the tests have not been failing since they
were reenabled this logging can now be removed.
elasticsearchmachine pushed a commit that referenced this issue Nov 19, 2021
There has been trace logging in BasicDistributedJobsIT for #63980
since February. This was added in #69198, on the basis that #69136
had probably fixed a problem but if not then the trace logging
would be useful. Since the tests have not been failing since they
were reenabled this logging can now be removed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants