Concurrent thread access to shared doc values #99007
Conversation
Pinging @elastic/es-analytics-geo (Team:Analytics)
Hi @salvatore-campagna, I've created a changelog YAML for you.
By the way, I am seeing other errors in the test
I was able to reproduce it on 8.9, so it is something that has probably always been there. Note that to silence the errors on this issue you can add the following annotation at the top of the class:
I agree that this is probably something that has always been there.
I think there's an issue with
public void testDateHistogramByTsid() {
    final TimeSeriesAggregationBuilder timeSeries = new TimeSeriesAggregationBuilder("ts").subAggregation(
        new DateHistogramAggregationBuilder("date_histogram").field("@timestamp").calendarInterval(DateHistogramInterval.MINUTE)
Change the interval to HOUR to avoid exceeding bucket limit?
This makes sense to me...indeed sometimes the test was failing with
which actually confirms that the doc values are being shared and used by multiple threads. Thanks @kkrik-es for looking at this.
I pushed the changes suggested by Kostas but I still see some failures.
The fix from Kostas does not address the issue of accessing the doc values from different threads. That's a different beast. Are you getting a different exception?
I think I know how to fix the issue. My proposal is that you add
Or you can fix it here:
@@ -105,16 +105,23 @@ public ScoreMode scoreMode() {
private class CompetitiveIterator extends DocIdSetIterator {

    private final BitArray visitedOrds;
    private final SortedSetDocValues values;
Use a different name here to avoid confusion?
I removed this...it is not needed (see following commit).
@@ -211,6 +218,7 @@ public LeafBucketCollector getLeafCollector(AggregationExecutionContext aggCtx,
if (maxOrd <= MAX_FIELD_CARDINALITY_FOR_DYNAMIC_PRUNING || numNonVisitedOrds <= MAX_TERMS_FOR_DYNAMIC_PRUNING) {
    dynamicPruningAttempts++;
    return new LeafBucketCollector() {
        final SortedSetDocValues docValues = valuesSource.globalOrdinalsValues(aggCtx.getLeafReaderContext());
Does this work:
final SortedSetDocValues docValues = values;
Agreed, it makes no sense if you are calling the same method above.
I wanted to remove `values` completely...but it is used by `postCollection`, which is why I was doing that. I will restore `docValues = values` to avoid calling methods unnecessarily.
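A minimal, self-contained sketch of the pattern being discussed (the types and factory method here are hypothetical stand-ins, not Lucene code): inside the per-leaf collector, capture the already-fetched reference (`docValues = values`) instead of calling the values source again.

```java
// Illustrative sketch only: reuse the fetched reference rather than
// calling the (possibly expensive) factory a second time per collector.
public class ReuseFetchedReference {
    interface Values { long get(int doc); }   // stand-in for doc values

    static int factoryCalls = 0;

    // Stand-in for something like valuesSource.globalOrdinalsValues(ctx)
    static Values fetchValues() {
        factoryCalls++;
        return doc -> doc * 2L;
    }

    public static void main(String[] args) {
        Values values = fetchValues();   // fetched once for the leaf
        Values docValues = values;       // collector reuses the reference
        System.out.println("calls=" + factoryCalls + " v=" + docValues.get(21));
    }
}
```

Capturing the existing reference keeps a single fetch per leaf, which is exactly why `docValues = values` was restored instead of invoking the method again.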
@@ -267,6 +275,8 @@ public CompetitiveIterator competitiveIterator() {

bruteForce++;
return new LeafBucketCollector() {
    final SortedSetDocValues docValues = valuesSource.globalOrdinalsValues(aggCtx.getLeafReaderContext());
Ditto.
I am running the test (testCardinalityByTsid) until failure and after more than 1000 runs I don't see any issue.
Thanks @iverase ...if I understand correctly the result of doing this is that
So
I think this change should be backported to 8.10.x if possible, as there are lingering issues in that line.
This PR should not have the backport label; it is the backport PR that should have it.
Hi @salvatore-campagna, I've created a changelog YAML for you.
The error is legit: we are calling postCollection twice for downsampling. We need to remove the following line: Line 160 in 996a90b
And there are other cases in
So the empty value I see is a result of the doc values "iterator" reaching the end of the stream and being used again by the second invocation?
@iverase if I remove it from
You should only remove it when using the time series searcher.
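The "empty value on second invocation" behavior described above can be shown with a tiny self-contained sketch (plain `java.util` code, not Lucene): a forward-only iterator consumed by a first pass yields nothing on a second pass, which is analogous to a second `postCollection` call seeing exhausted doc values.

```java
import java.util.Iterator;
import java.util.List;

// Illustrative sketch: once a forward-only iterator is exhausted,
// a second consumer sees an empty stream.
public class DoublePostCollection {
    public static void main(String[] args) {
        Iterator<Integer> docs = List.of(1, 2, 3).iterator();

        int firstPass = 0;                       // first "postCollection"
        while (docs.hasNext()) { docs.next(); firstPass++; }

        int secondPass = 0;                      // second call sees nothing
        while (docs.hasNext()) { docs.next(); secondPass++; }

        System.out.println("first=" + firstPass + " second=" + secondPass);
    }
}
```

This is why the duplicate call has to be removed on the path where the time series searcher already invokes `postCollection`.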
💚 Backport successful
The doc values in the `GlobalOrdCardinalityAggregator` are shared among multiple search threads, `search` and `search_worker`. The search thread also runs the aggregation phase. When an executor is used, the `search` thread runs `postCollection`, which uses doc values, while other methods are executed by the `search_worker` thread, which uses doc values too. As a result, doc values are accessed concurrently by different threads. Using doc values concurrently from multiple threads is not correct, since multiple threads end up updating the doc values state. This breaks access to doc values, resulting in different issues depending on how threads end up being scheduled (prematurely exhausting doc values, accessing incorrect documents as a result of trying to access docIds not in the thread-owned leaf/segment, ...). The solution here is to:
1. make sure we execute `postCollection` in the same thread as the other methods, which is `search` or `search_worker`;
2. make sure we do not call `postCollection` when the `TimeSeriesIndexSearcher` is used, since in that case `postCollection` is already called by `TimeSeriesIndexSearcher`.
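The hazard described above can be sketched with a minimal, deterministic stand-in (the `FakeDocValues` class and its failure mode are illustrative assumptions, not Lucene behavior, which would just misbehave unpredictably): doc-values iterators are forward-only, so two consumers sharing one instance corrupt its position state, while giving each consumer its own instance is safe.

```java
// Illustrative sketch: a forward-only iterator mimicking the
// advanceExact contract. Sharing one instance breaks the second consumer.
public class SharedDocValuesHazard {
    static class FakeDocValues {
        private int current = -1;
        boolean advanceExact(int docId) {
            if (docId < current) {
                // Real doc values would behave unpredictably here;
                // this sketch fails loudly instead.
                throw new IllegalStateException("backwards advance to " + docId);
            }
            current = docId;
            return true;
        }
    }

    public static void main(String[] args) {
        // One shared instance: the first consumer advances the state...
        FakeDocValues shared = new FakeDocValues();
        shared.advanceExact(500);
        boolean broken = false;
        try {
            shared.advanceExact(0);  // ...so the second consumer breaks.
        } catch (IllegalStateException e) {
            broken = true;
        }

        // Pattern mirroring the fix: each consumer gets its own instance.
        FakeDocValues a = new FakeDocValues();
        FakeDocValues b = new FakeDocValues();
        boolean ok = a.advanceExact(500) && b.advanceExact(0);

        System.out.println("shared-broken=" + broken + " per-consumer-ok=" + ok);
    }
}
```

The same state corruption happens when `search` and `search_worker` interleave calls on one shared doc values instance, just nondeterministically.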
When trying to run a cardinality aggregation nested inside a time series aggregation, a test called testCardinalityByTsid (sometimes) fails with the following stack traces (plural here is not a mistake; the test appears to fail with different issues). It looks like something is wrong when accessing dimension fields doc values. My idea is that something is wrong with ordinals, but I can't figure out if that is the case. Usually I see one of the following two assertions failing:
which means in GlobalOrdCardinalityAggregator we try to fetch an incorrect target document when calling advanceExact.
Note also that this branch is exercised only if the cardinality aggregation is not a top-level aggregation. I tried to reproduce the issue with the cardinality aggregation nested inside a terms aggregation but didn't see any issue. For this reason I believe something might be wrong when using parent (time series aggregator) ordinals.
Also worth noting is that sometimes the test fails with other issues. I executed the test a number of times to see it failing; usually it takes fewer than 10 executions to see a failure. This is another, different stack trace: