Concurrent thread access to shared doc values #99007

salvatore-campagna
Contributor

@salvatore-campagna salvatore-campagna commented Aug 29, 2023

The test testCardinalityByTsid, which runs a cardinality aggregation nested
inside a time series aggregation, fails intermittently with the stack traces
below (the plural is intended: the test fails in different ways).
Something appears to go wrong when accessing the doc values of dimension
fields.

My guess is that the problem is related to ordinals, but I have not been able
to confirm it.

Usually one of the following two assertions fails:

assert target >= in.docID();
assert target < maxDoc;

which means that GlobalOrdCardinalityAggregator passes an invalid
target document when calling advanceExact:

if (values.advanceExact(doc)) {
    for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
        bits.set((int) ord);
    }
}
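
To make the two assertions concrete, here is a minimal sketch of a forward-only iterator that enforces the same two checks. It is a toy model, not Lucene code: `ForwardOnlyDocValues` and `ForwardOnlyDemo` are hypothetical names, and the scenario (a doc ID from another segment reaching the wrong iterator) is only one assumed way the checks could be violated.

```java
// Toy model (not Lucene code) of a forward-only, per-segment doc values iterator.
final class ForwardOnlyDocValues {
    private final int maxDoc;
    private int docID = -1;

    ForwardOnlyDocValues(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    /** Mirrors the two checks quoted above: never move backwards, never leave the segment. */
    boolean advanceExact(int target) {
        if (target < docID) {
            throw new IllegalStateException("target " + target + " is behind current doc " + docID);
        }
        if (target >= maxDoc) {
            throw new IllegalStateException("target " + target + " is outside segment of size " + maxDoc);
        }
        docID = target;
        return true; // every document has a value in this toy model
    }
}

final class ForwardOnlyDemo {
    /** Returns the failure message raised when one iterator receives doc IDs meant for two segments. */
    static String shareAcrossSegments() {
        ForwardOnlyDocValues segmentA = new ForwardOnlyDocValues(10);
        segmentA.advanceExact(7); // legitimate call for segment A
        try {
            segmentA.advanceExact(2); // doc ID meant for another segment, wrongly sent to A's iterator
            return "no failure";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(shareAcrossSegments());
    }
}
```

In this model either check can trip depending on the order in which the doc IDs arrive, which would be consistent with seeing both assertions fail across runs.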

Note also that this branch is exercised only if the cardinality aggregation is
not a top-level aggregation. I tried to reproduce the issue with the cardinality
aggregation nested inside a terms aggregation but could not.
For this reason I believe something might be wrong when using the ordinals of
the parent (the time series aggregator).

It is also worth noting that the test sometimes fails in other ways. The failure is
intermittent: I had to run the test repeatedly, and it usually takes fewer than
10 executions to see a failure.

Aug 29, 2023 12:11:31 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[elasticsearch[node_s3][search_worker][T#3],5,TGRP-TimeSeriesNestedAggregationsIT]
java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([E92F1D22512328AC]:0)
	at org.apache.lucene.tests.index.AssertingLeafReader$AssertingSortedDocValues.advanceExact(AssertingLeafReader.java:881)
	at org.apache.lucene.index.SingletonSortedSetDocValues.advanceExact(SingletonSortedSetDocValues.java:85)
	at org.elasticsearch.search.aggregations.metrics.GlobalOrdCardinalityAggregator$2.collect(GlobalOrdCardinalityAggregator.java:278)
	at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectExistingBucket(BucketsAggregator.java:96)
	at org.elasticsearch.aggregations.bucket.timeseries.TimeSeriesAggregator$1.collect(TimeSeriesAggregator.java:121)
	at org.elasticsearch.search.aggregations.LeafBucketCollector.collect(LeafBucketCollector.java:86)
	at org.elasticsearch.search.aggregations.support.TimeSeriesIndexSearcher$LeafWalker.collectCurrent(TimeSeriesIndexSearcher.java:262)
	at org.elasticsearch.search.aggregations.support.TimeSeriesIndexSearcher.search(TimeSeriesIndexSearcher.java:167)
	at org.elasticsearch.search.aggregations.support.TimeSeriesIndexSearcher.lambda$search$0(TimeSeriesIndexSearcher.java:102)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

Here is a different stack trace:

Aug 29, 2023 11:20:38 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[elasticsearch[node_s0][search][T#2],5,TGRP-TimeSeriesNestedAggregationsIT]
java.lang.AssertionError: Sorted doc values are only supposed to be consumed in the thread in which they have been acquired. But was acquired in Thread[elasticsearch[node_s0][search_worker][T#1],5,TGRP-TimeSeriesNestedAggregationsIT] and consumed in Thread[elasticsearch[node_s0][search][T#2],5,TGRP-TimeSeriesNestedAggregationsIT].
	at __randomizedtesting.SeedInfo.seed([B0787CC8BC021E74]:0)
	at org.apache.lucene.tests.index.AssertingLeafReader.assertThread(AssertingLeafReader.java:67)
	at org.apache.lucene.tests.index.AssertingLeafReader$AssertingSortedDocValues.lookupOrd(AssertingLeafReader.java:908)
	at org.apache.lucene.index.SingletonSortedSetDocValues.lookupOrd(SingletonSortedSetDocValues.java:95)
	at org.elasticsearch.search.aggregations.metrics.GlobalOrdCardinalityAggregator.doPostCollection(GlobalOrdCardinalityAggregator.java:302)
	at org.elasticsearch.search.aggregations.AggregatorBase.postCollection(AggregatorBase.java:294)
	at org.elasticsearch.search.aggregations.MultiBucketCollector$1.postCollection(MultiBucketCollector.java:86)
	at org.elasticsearch.search.aggregations.AggregatorBase.postCollection(AggregatorBase.java:295)
	at org.elasticsearch.search.aggregations.MultiBucketCollector$1.postCollection(MultiBucketCollector.java:86)
	at org.elasticsearch.search.aggregations.AggregationPhase.executeInSortOrder(AggregationPhase.java:75)
	at org.elasticsearch.search.aggregations.AggregationPhase.preProcess(AggregationPhase.java:36)
	at org.elasticsearch.search.query.QueryPhase.executeQuery(QueryPhase.java:132)
	at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:63)
	at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:515)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:667)
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$2(SearchService.java:540)
	at org.elasticsearch.action.ActionRunnable$2.accept(ActionRunnable.java:51)
	at org.elasticsearch.action.ActionRunnable$2.accept(ActionRunnable.java:48)
	at org.elasticsearch.action.ActionRunnable$3.doRun(ActionRunnable.java:73)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
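
The second trace comes from a thread-ownership check in the Lucene test framework. A minimal sketch of that kind of check, using only the JDK, might look as follows. `ThreadBoundValues` and `ThreadBoundDemo` are hypothetical names and the constant returned by `lookupOrd` is a placeholder; this illustrates the idea and is not the actual `AssertingLeafReader` code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for doc values that remember the thread that acquired them.
final class ThreadBoundValues {
    private final Thread owner = Thread.currentThread();

    long lookupOrd() {
        if (Thread.currentThread() != owner) {
            throw new IllegalStateException(
                "acquired in " + owner.getName() + " but consumed in " + Thread.currentThread().getName()
            );
        }
        return 42L; // placeholder value
    }
}

final class ThreadBoundDemo {
    /** Acquires the values on a worker thread, then consumes them on the calling thread. */
    static boolean consumingOnAnotherThreadFails() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Callable<ThreadBoundValues> acquire = ThreadBoundValues::new;
            ThreadBoundValues values = worker.submit(acquire).get(); // constructed on the worker thread
            try {
                values.lookupOrd(); // consumed on the calling thread
                return false;
            } catch (IllegalStateException e) {
                return true;
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(consumingOnAnotherThreadFails());
    }
}
```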

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Aug 29, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@elasticsearchmachine
Collaborator

Hi @salvatore-campagna, I've created a changelog YAML for you.

@iverase
Contributor

iverase commented Aug 30, 2023

This was introduced by #98204. We now offload the execution of the query to a different thread (a worker thread) regardless of concurrency, but global ordinals are still created on another thread (the coordinating thread).

cc: @javanna

@iverase
Contributor

iverase commented Aug 30, 2023

By the way, I am seeing other errors in the test testCardinalityByTsid. One in particular looks worrying:

Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: runtime_exception: java.io.EOFException: read past EOF (pos=7): MemorySegmentIndexInput(path="/Users/ivera/forks/elasticsearch/modules/aggregations/build/testrun/internalClusterTest/temp/org.elasticsearch.aggregations.bucket.TimeSeriesNestedAggregationsIT_FF7E6A58FA7E2ECF-001/tempDir-002/node_s3/d0/indices/xF5qgLtpSquCLks3UYxU2w/0/index/_0.cfs") [slice=_0_Lucene90_0.dvd] [slice=randomaccess]
	at org.apache.lucene.util.packed.DirectReader$DirectPackedReader1.get(DirectReader.java:204)
	at org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$20.ordValue(Lucene90DocValuesProducer.java:853)
	at org.apache.lucene.index.SingletonSortedSetDocValues.advanceExact(SingletonSortedSetDocValues.java:86)
	at org.elasticsearch.search.aggregations.metrics.GlobalOrdCardinalityAggregator$2.collect(GlobalOrdCardinalityAggregator.java:278)
	at org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectExistingBucket(BucketsAggregator.java:96)
	at org.elasticsearch.aggregations.bucket.timeseries.TimeSeriesAggregator$1.collect(TimeSeriesAggregator.java:114)

I was able to reproduce it on 8.9, so it has probably always been there.

Note that to silence the errors in this issue you can add the following annotation at the top of the class:

@LuceneTestCase.SuppressCodecs("*")

@salvatore-campagna
Contributor Author

I agree that is probably something that has always been there.

@kkrik-es
Contributor

kkrik-es commented Aug 30, 2023

I think there's an issue with GlobalOrdCardinalityAggregator::getLeafCollector. It seems like we're reusing the same aggregator for different global ordinal values. Keeping a separate reference to SortedSetDocValues inside the constructed LeafBucketCollector seems to do the trick:

Index: server/src/main/java/org/elasticsearch/search/aggregations/metrics/GlobalOrdCardinalityAggregator.java
         bruteForce++;
         return new LeafBucketCollector() {
+
+            SortedSetDocValues docValues = values;
+
             @Override
             public void collect(int doc, long bucketOrd) throws IOException {
                 visitedOrds = bigArrays.grow(visitedOrds, bucketOrd + 1);
@@ -275,8 +278,8 @@
                     bits = new BitArray(maxOrd, bigArrays);
                     visitedOrds.set(bucketOrd, bits);
                 }
-                if (values.advanceExact(doc)) {
-                    for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
+                if (docValues.advanceExact(doc)) {
+                    for (long ord = docValues.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = docValues.nextOrd()) {
                         bits.set((int) ord);
                     }
                 }
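
One way to read the suggestion above is as the difference between reading a mutable field at call time and capturing a snapshot when the collector is created. The sketch below is a toy model under that assumption: `PerLeafAggregator`, its methods, and the string "doc values" are hypothetical, and it assumes the aggregator's `values` field is reassigned each time a leaf collector is requested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Toy aggregator whose `values` field is reassigned for every leaf (an assumption of this sketch).
final class PerLeafAggregator {
    private String values;

    /** Collector that reads the outer field when it runs, so it sees the latest leaf's values. */
    IntFunction<String> leafCollectorReadingField(String leafValues) {
        values = leafValues;
        return doc -> values + ":" + doc;
    }

    /** Collector that keeps its own reference, as in the suggested change. */
    IntFunction<String> leafCollectorWithSnapshot(String leafValues) {
        values = leafValues;
        final String docValues = values;
        return doc -> docValues + ":" + doc;
    }
}

final class SnapshotDemo {
    /** Creates collectors for two leaves up front, then uses them, as an interleaving searcher might. */
    static List<String> run(boolean snapshot) {
        PerLeafAggregator agg = new PerLeafAggregator();
        IntFunction<String> leaf0 = snapshot ? agg.leafCollectorWithSnapshot("leaf0") : agg.leafCollectorReadingField("leaf0");
        IntFunction<String> leaf1 = snapshot ? agg.leafCollectorWithSnapshot("leaf1") : agg.leafCollectorReadingField("leaf1");
        List<String> seen = new ArrayList<>();
        seen.add(leaf0.apply(3));
        seen.add(leaf1.apply(5));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // the leaf0 collector ends up using leaf1's values
        System.out.println(run(true));  // each collector uses its own leaf's values
    }
}
```

The snapshot only isolates collectors from each other. It does not change which thread touches the doc values, so it would not by itself address the thread-ownership failure.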


public void testDateHistogramByTsid() {
final TimeSeriesAggregationBuilder timeSeries = new TimeSeriesAggregationBuilder("ts").subAggregation(
new DateHistogramAggregationBuilder("date_histogram").field("@timestamp").calendarInterval(DateHistogramInterval.MINUTE)
Contributor

Change the interval to HOUR to avoid exceeding bucket limit?

@salvatore-campagna
Contributor Author

This makes sense to me. Indeed, the test was sometimes failing with

Sorted doc values are only supposed to be consumed in the thread in which they have been acquired. But was acquired in Thread[elasticsearch[node_s4][search_worker][T#4],5,TGRP-NestedTimeSeriesAggregationsIT] and consumed in Thread[elasticsearch[node_s4][search][T#3],5,TGRP-NestedTimeSeriesAggregationsIT].

which confirms that the doc values are shared and used by multiple threads.

Thanks @kkrik-es for looking at this.

@salvatore-campagna
Contributor Author

I pushed the changes suggested by Kostas but I still see some failures.

@iverase
Contributor

iverase commented Aug 31, 2023

The fix from Kostas does not address the issue of accessing the doc values from different threads. That's a different beast. Are you getting a different exception?

@iverase
Contributor

iverase commented Aug 31, 2023

I think I know how to fix the issue. My proposal is that you add @LuceneTestCase.SuppressCodecs("*") to the test and open an issue for addressing it. I can then work on the fix.

@iverase
Contributor

iverase commented Aug 31, 2023

Or you can fix it here:

Subject: [PATCH] Remove calls to LuceneTestCase#newSearcher from FiltersAggregatorTests
---
Index: server/src/main/java/org/elasticsearch/search/aggregations/AggregationPhase.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/server/src/main/java/org/elasticsearch/search/aggregations/AggregationPhase.java b/server/src/main/java/org/elasticsearch/search/aggregations/AggregationPhase.java
--- a/server/src/main/java/org/elasticsearch/search/aggregations/AggregationPhase.java	(revision 596a56e8a9975f0def38a41aba327f73fe5ed478)
+++ b/server/src/main/java/org/elasticsearch/search/aggregations/AggregationPhase.java	(date 1693479797631)
@@ -72,7 +72,6 @@
         searcher.setProfiler(context);
         try {
             searcher.search(context.rewrittenQuery(), collector);
-            collector.postCollection();
         } catch (IOException e) {
             throw new AggregationExecutionException("Could not perform time series aggregation", e);
         }
Index: server/src/main/java/org/elasticsearch/search/aggregations/support/TimeSeriesIndexSearcher.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/server/src/main/java/org/elasticsearch/search/aggregations/support/TimeSeriesIndexSearcher.java b/server/src/main/java/org/elasticsearch/search/aggregations/support/TimeSeriesIndexSearcher.java
--- a/server/src/main/java/org/elasticsearch/search/aggregations/support/TimeSeriesIndexSearcher.java	(revision 596a56e8a9975f0def38a41aba327f73fe5ed478)
+++ b/server/src/main/java/org/elasticsearch/search/aggregations/support/TimeSeriesIndexSearcher.java	(date 1693479797629)
@@ -95,11 +95,13 @@
         Weight weight = searcher.createWeight(query, bucketCollector.scoreMode(), 1);
         if (searcher.getExecutor() == null) {
             search(bucketCollector, weight);
+            bucketCollector.postCollection();
             return;
         }
         // offload to the search worker thread pool whenever possible. It will be null only when search.worker_threads_enabled is false
         RunnableFuture<Void> task = new FutureTask<>(() -> {
             search(bucketCollector, weight);
+            bucketCollector.postCollection();
             return null;
         });
         searcher.getExecutor().execute(task);

@@ -105,16 +105,23 @@ public ScoreMode scoreMode() {
private class CompetitiveIterator extends DocIdSetIterator {

private final BitArray visitedOrds;
private final SortedSetDocValues values;
Contributor

Use a different name here to avoid confusion?

Contributor Author

I removed this; it is not needed (see the following commit).

@@ -211,6 +218,7 @@ public LeafBucketCollector getLeafCollector(AggregationExecutionContext aggCtx,
if (maxOrd <= MAX_FIELD_CARDINALITY_FOR_DYNAMIC_PRUNING || numNonVisitedOrds <= MAX_TERMS_FOR_DYNAMIC_PRUNING) {
dynamicPruningAttempts++;
return new LeafBucketCollector() {
final SortedSetDocValues docValues = valuesSource.globalOrdinalsValues(aggCtx.getLeafReaderContext());
Contributor

Does this work:

final SortedSetDocValues docValues = values;

Contributor

Agreed, there is no point in calling it again if the same call is already made above.

Contributor Author

I wanted to remove values completely, but it is used by postCollection, which is why I was doing that.
I will restore

docValues = values

to avoid calling methods unnecessarily.

@@ -267,6 +275,8 @@ public CompetitiveIterator competitiveIterator() {

bruteForce++;
return new LeafBucketCollector() {
final SortedSetDocValues docValues = valuesSource.globalOrdinalsValues(aggCtx.getLeafReaderContext());
Contributor

Ditto.

@salvatore-campagna
Contributor Author

I am running the test (testCardinalityByTsid) until failure and after more than 1000 runs I don't see any issue.

@salvatore-campagna
Contributor Author

salvatore-campagna commented Aug 31, 2023

Thanks @iverase. If I understand correctly, with this change postCollection is executed by the same thread that runs the search (an executor thread or the main thread), whereas before it was always executed by the main thread.
That caused the issue: when the executor is not null, postCollection was executed by the main thread while the other methods accessing doc values were executed by thread pool threads:

  • searcher.getExecutor() == null => everything is executed in the main thread (search), so we don't see the issue
  • searcher.getExecutor() != null => postCollection is executed in the main thread (accessing doc values from the search thread) while all other methods are executed by executor threads (accessing doc values from a search_worker thread)

So values was shared between the search and search_worker threads.
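
The two cases above can be sketched with plain JDK classes. This is a simplified model, not the Elasticsearch code: `RecordingCollector` and `SameThreadPostCollection` are hypothetical names, and the collector only records which thread called each method.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

// Hypothetical collector that remembers which thread ran each phase.
final class RecordingCollector {
    Thread collectThread;
    Thread postCollectionThread;

    void collect() {
        collectThread = Thread.currentThread();
    }

    void postCollection() {
        postCollectionThread = Thread.currentThread();
    }
}

final class SameThreadPostCollection {
    /** Returns true when collect() and postCollection() ran on the same thread. */
    static boolean sameThread(boolean postCollectInsideTask) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        RecordingCollector collector = new RecordingCollector();
        try {
            FutureTask<Void> task = new FutureTask<>(() -> {
                collector.collect();
                if (postCollectInsideTask) {
                    collector.postCollection(); // after the change: same thread as collect()
                }
                return null;
            });
            worker.execute(task);
            task.get(); // wait for the offloaded work to finish
            if (postCollectInsideTask == false) {
                collector.postCollection(); // before the change: back on the calling thread
            }
            return collector.collectThread == collector.postCollectionThread;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sameThread(false)); // false: two different threads touch the collector state
        System.out.println(sameThread(true));  // true: one thread owns the whole lifecycle
    }
}
```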

@iverase
Contributor

iverase commented Aug 31, 2023

I think this change should be backported to 8.10.x if possible as there are lingering issues in that line.

@salvatore-campagna salvatore-campagna added backport auto-backport Automatically create backport pull requests when merged v8.10.0 labels Aug 31, 2023
@iverase
Contributor

iverase commented Aug 31, 2023

This PR should not have the backport label; it is the backport PR that should have it.

@elasticsearchmachine
Collaborator

Hi @salvatore-campagna, I've created a changelog YAML for you.

@iverase
Contributor

iverase commented Aug 31, 2023

The error is legit: we are calling postCollection twice for downsampling. We need to remove the following line:

And there are other cases in AggregatorTestCase, could you remove them there too?

@salvatore-campagna
Contributor Author

salvatore-campagna commented Sep 1, 2023

So the empty value I see is a result of the doc values "iterator" reaching the end of the stream and being used again by the second invocation?
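
The hypothesis in this question can be illustrated with any forward-only iterator. The sketch below uses a plain `java.util.Iterator` as a stand-in for doc values; `OneShotPostCollection` is a hypothetical class, and the thread does not confirm that this is exactly what happens in the downsampling case.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical aggregator whose postCollection drains a forward-only iterator.
final class OneShotPostCollection {
    private final Iterator<String> ords;

    OneShotPostCollection(List<String> terms) {
        this.ords = terms.iterator();
    }

    /** Consumes whatever is left in the iterator and returns how many entries it saw. */
    int postCollection() {
        int count = 0;
        while (ords.hasNext()) {
            ords.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        OneShotPostCollection agg = new OneShotPostCollection(List.of("a", "b", "c"));
        System.out.println(agg.postCollection()); // first call sees all three entries
        System.out.println(agg.postCollection()); // second call finds the iterator exhausted
    }
}
```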

@salvatore-campagna salvatore-campagna changed the title Cardinality nested in time series doc values bug Concurrent thread access to shared doc values Sep 1, 2023
@salvatore-campagna
Contributor Author

@iverase if I remove it from AggregatorTestCase there are a few tests failing because postCollection is not called.

@iverase
Contributor

iverase commented Sep 1, 2023

You should only remove it when using the time series searcher

@salvatore-campagna salvatore-campagna added auto-backport-and-merge and removed auto-backport Automatically create backport pull requests when merged labels Sep 1, 2023
@salvatore-campagna salvatore-campagna merged commit 06d8fa0 into elastic:main Sep 1, 2023
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 8.10

salvatore-campagna added a commit to salvatore-campagna/elasticsearch that referenced this pull request Sep 1, 2023
The doc values in the `GlobalOrdCardinalityAggregator` are shared
among multiple search threads, `search` and `search_worker`.
The `search` thread also runs the aggregation phase. When an
executor is used, the `search` thread runs `postCollection`, which
uses doc values, while the other methods are executed by the `search_worker`
thread, which uses doc values too. As a result, doc values are concurrently
accessed by different threads. Using doc values concurrently from multiple
threads is not correct, since multiple threads end up updating the doc values
state. This breaks access to doc values, resulting in different issues depending
on how the threads are scheduled (prematurely exhausting doc values,
accessing incorrect documents as a result of trying to access docIds not
in the leaf/segment owned by the thread, ...).

The solution here is to:
1. make sure we execute `postCollection` in the same thread as the other
methods, which is `search` or `search_worker`.
2. make sure we do not call `postCollection` when the `TimeSeriesIndexSearcher`
is used. In that case `postCollection` is called by `TimeSeriesIndexSearcher`.
elasticsearchmachine pushed a commit that referenced this pull request Sep 1, 2023
Labels
:Analytics/Aggregations Aggregations >bug :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.10.0 v8.11.0

4 participants