[ML] Persist data counts and datafeed timing stats asynchronously #93000
Conversation
When an anomaly detection job runs, the majority of results originate from the C++ autodetect process, so can be persisted in bulk. However, there are two types of results, namely data counts and datafeed timing stats, that are generated wholly within the ML Java code and where there are serious downsides to batching them up with the output of the C++ process. (If we batched them and the C++ process stopped generating results then the input side stats would also stall, so it is better that the input side stats are written independently.)

The approach used in this PR is to write data counts and datafeed timing stats asynchronously _except_ at certain key points, like job flush and close, and datafeed stop. At these key points the latest stats _are_ persisted synchronously, as before.

When large amounts of data are being processed the code will generate updated stats documents faster than they can be indexed. The approach taken here is to skip persistence of the newer document if persistence of the previous document is still in progress. This can lead to the stats being slightly out of date while a job is running. However, at key points like flush and close the data counts will be up-to-date, and the datafeed timing stats will get written at least once per datafeed `frequency`, so should not be more out-of-date than that.
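The "skip if a persist is already in flight" idea can be sketched as follows. This is a minimal illustration, not the actual Elasticsearch code; the class and method names are invented for the example, and a `CountDownLatch` stands in for tracking the in-progress write:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the pattern described above: a latch tracks the in-progress
// write, and a newer stats document is skipped if the previous one has
// not finished indexing yet.
public class AsyncStatsPersister {
    private CountDownLatch persistInProgressLatch;
    final AtomicInteger persistedDocs = new AtomicInteger();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    /** Returns true if a persist was started, false if it was skipped. */
    public synchronized boolean persistAsync(String statsDoc) {
        if (persistInProgressLatch != null && persistInProgressLatch.getCount() > 0) {
            return false; // previous persist still running: drop this document
        }
        CountDownLatch latch = new CountDownLatch(1);
        persistInProgressLatch = latch;
        executor.execute(() -> {
            try {
                indexDocument(statsDoc);
            } finally {
                latch.countDown();
            }
        });
        return true;
    }

    // Simulated slow indexing call standing in for a real bulk request.
    private void indexDocument(String doc) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        persistedDocs.incrementAndGet();
    }

    public void close() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

A second document submitted while the first is still indexing is simply dropped, which is acceptable here because each stats document supersedes the previous one.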
Pinging @elastic/ml-core (Team:ML)

Hi @droberts195, I've created a changelog YAML for you.
LGTM
```java
logger.trace("[{}] not persisting datafeed timing stats as persistence is disallowed", jobId);
    return;
}
if (persistInProgressLatch != null && persistInProgressLatch.await(1, TimeUnit.NANOSECONDS) == false) {
```
`persistInProgressLatch.await(1, TimeUnit.NANOSECONDS)`

This looks like you want to test whether the latch is still waitable, i.e. the count is greater than zero. In that case it could be replaced with `persistInProgressLatch.getCount() > 0`.
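The two checks are equivalent for a non-blocking "is work still in progress?" test, but `getCount() > 0` states the intent directly. A small illustration (the helper names are invented for the example):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Compares the two ways of asking "has the latch not yet reached zero?".
// await(timeout) returns false if the count did not reach zero within the
// timeout, so with a 1 ns timeout it is effectively a non-blocking check.
public class LatchCheck {
    public static boolean inProgressViaAwait(CountDownLatch latch) throws InterruptedException {
        return latch.await(1, TimeUnit.NANOSECONDS) == false;
    }

    public static boolean inProgressViaCount(CountDownLatch latch) {
        return latch.getCount() > 0;
    }
}
```

Besides readability, `getCount()` cannot throw `InterruptedException`, so the calling code does not need to handle interruption for what is conceptually a simple state check.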
When a datafeed starts up it looks at the job's data counts to decide where to resume from after a previous invocation finished. Previously the datafeed always read the data counts from the index. In the case where a datafeed has been stopped and restarted while its corresponding job remained open this is incorrect: the data counts need to be obtained from the running job. (In the case where the job was closed and reopened while the datafeed was stopped this does not matter, as closing the job persists the up-to-date data counts.) This bug has always existed, but the changes made in #93000 make it much more likely to cause a noticeable discrepancy. Fixes #93298
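The fix described above amounts to preferring the in-memory counts of a still-open job over the possibly stale indexed document. A hypothetical sketch, with invented names that do not correspond to the actual Elasticsearch classes:

```java
import java.util.Optional;

// Illustrative sketch: when a datafeed restarts, use the running job's
// in-memory latest record time if the job is open; fall back to the
// persisted data counts document only when no process is running.
public class DataCountsResolver {

    /** View of a running autodetect process; empty if the job is closed. */
    public interface RunningJobView {
        Optional<Long> latestRecordTimeMs();
    }

    public static long resolveResumeTime(RunningJobView runningJob, long indexedLatestRecordTimeMs) {
        // The indexed document may lag behind the running job now that
        // persistence is asynchronous, so the in-memory value wins.
        return runningJob.latestRecordTimeMs().orElse(indexedLatestRecordTimeMs);
    }
}
```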