
Increase search.max_buckets to 65,535 #57042

Merged: 13 commits merged into elastic:master from issue-51731-increase-max-buckets on Jun 3, 2020

Conversation

imotov (Contributor) commented on May 21, 2020

Increases the default search.max_buckets limit to 65,535, and only counts
buckets during reduce phase.

Closes #51731
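
For reference, a minimal sketch of what the new default looks like as a cluster setting declaration. This is not the actual Elasticsearch source: only the setting key and the 65,535 default come from this PR; the class name, minimum value, and NodeScope/Dynamic properties are assumptions.

import org.elasticsearch.common.settings.Setting;

class MaxBucketsSettingSketch {
    // Hypothetical stand-alone declaration; the real constant lives in the aggregation
    // framework. The key and the new default are from this PR, the rest is assumed.
    static final Setting<Integer> MAX_BUCKET_SETTING =
        Setting.intSetting("search.max_buckets", 65535, 0,
            Setting.Property.NodeScope, Setting.Property.Dynamic);
}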

elasticmachine (Collaborator) commented:

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

elasticmachine added the Team:Analytics label on May 21, 2020
imotov (Contributor, Author) commented on May 21, 2020

@elasticmachine run elasticsearch-ci/bwc

polyfractal (Contributor) commented:

Changes look good (yay deletions). I think there are a few more things we could clean up too, e.g. some (all?) of the Internal* bucketing agg classes have extra logic to add/remove buckets to/from the breaker while reducing, DateHisto being an example. We could probably purge all of these types of accounting and just do a single call at the end of reduce() (sketched below).

I don't think it functionally matters, since Nik's recent change means we only count buckets on the final reduce (and not the partial reductions). But it would tidy up some of the Internal* classes.

Haven't looked too closely though, so I might be misremembering :)
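
As a rough illustration of that cleanup, here is a self-contained sketch. The types below are hypothetical stand-ins, not the real Internal* classes or reduce context; they only show the shape of "merge first, account once at the end".

import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

class ReduceSketch<B> {
    // Stand-in for an Internal* agg's reduce(): merge the buckets from every shard first,
    // then make a single accounting call with the final bucket count instead of
    // adding/removing buckets from the breaker while merging.
    List<B> reduce(List<List<B>> shardBuckets, IntConsumer bucketConsumer) {
        List<B> reduced = new ArrayList<>();
        for (List<B> buckets : shardBuckets) {
            reduced.addAll(buckets);           // a real reduce would merge buckets by key
        }
        bucketConsumer.accept(reduced.size()); // single call at the end of reduce()
        return reduced;
    }
}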

imotov added the WIP label on May 28, 2020
imotov removed the WIP label on May 29, 2020
polyfractal (Contributor) left a comment:

Skimmed the new deletions and they look good (will look closer before merge), but left a comment about BucketsAggregator... think we need to tweak it a bit more

@@ -91,6 +87,9 @@ public final void collectBucket(LeafBucketCollector subCollector, int doc, long
* Same as {@link #collectBucket(LeafBucketCollector, int, long)}, but doesn't check if the docCounts needs to be re-sized.
*/
public final void collectExistingBucket(LeafBucketCollector subCollector, int doc, long bucketOrd) throws IOException {
if (doc == 1) {
polyfractal (Contributor) commented:

Hm, I think this isn't quite right. doc is the doc ID, so that could be all over the place, and also all going into the same bucket. I think there are two options here:

  1. Down below, we do if (docCounts.increment(bucketOrd, 1) == 1) { <breaker stuff> }, which I think will work because the increment method returns the count after incrementing. So if we have a doc count of 1, it's the first doc and a new bucket, so we can account for it (see the sketch after this comment).

  2. Alternatively, we could just account for it up in collectBucket without a conditional, since theoretically that should only be called on new buckets. It's not guaranteed by the API but in practice that's how aggs use it.

There are two other issues we need to address though:

  1. The old breaker logic only checked every 1024 buckets, since checking the real-memory breaker has a certain amount of overhead. So we should re-implement that somehow
  2. Trickier situation which I didn't think about when suggesting BucketsAggregator... if we add the 1024 threshold back, it's only a local count so aggs with 1023 buckets will never trigger the breaker even if the overall query has millions of buckets.

Perhaps we continue to use the MultiBucketConsumer service thing, but move the breaker accounting to a different method? That way it could maintain the global count and BucketsAggregator just calls a method on it or something? Not sure, we can discuss more offline
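
A minimal, self-contained sketch of option 1 above. The types are hypothetical stand-ins (a plain long[] instead of the real resizable docCounts, a Runnable in place of the "<breaker stuff>"); it only illustrates why the post-increment return value identifies a newly created bucket.

class CollectSketch {
    private final long[] docCounts = new long[1 << 16]; // stand-in for the real per-ordinal doc counts
    private final Runnable breakerAccounting;           // stand-in for "<breaker stuff>"

    CollectSketch(Runnable breakerAccounting) {
        this.breakerAccounting = breakerAccounting;
    }

    // increment() returns the count *after* adding, so a return value of 1 means this is
    // the first document collected into that bucket ordinal, i.e. a freshly created bucket.
    long increment(int bucketOrd, long inc) {
        docCounts[bucketOrd] += inc;
        return docCounts[bucketOrd];
    }

    void collectExistingBucket(int doc, int bucketOrd) {
        if (increment(bucketOrd, 1) == 1) {
            breakerAccounting.run(); // account for the new bucket exactly once
        }
    }
}

As the second concern above points out, anything counted only per-aggregator like this misses the query-wide total, which is why the discussion moves toward keeping a shared, global consumer in the loop.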

polyfractal (Contributor) left a comment:

Left a comment about comments :), but otherwise I'm happy with it. Re-using the MultiBucketConsumer feels like an OK tradeoff in this case, since we'll need something global to track usage. And this is simpler/less invasive than trying to bolt on a new thing.

👍

@@ -91,7 +91,9 @@ public final void collectBucket(LeafBucketCollector subCollector, int doc, long
* Same as {@link #collectBucket(LeafBucketCollector, int, long)}, but doesn't check if the docCounts needs to be re-sized.
*/
public final void collectExistingBucket(LeafBucketCollector subCollector, int doc, long bucketOrd) throws IOException {
docCounts.increment(bucketOrd, 1);
if (docCounts.increment(bucketOrd, 1) == 1) {
multiBucketConsumer.accept(0);
polyfractal (Contributor) commented:

Let's add a comment here why we're using 0, just in case a lost soul stumbles over this and is confused :)

if (value > 0 && (count & 0x3FF) == 0) {
// check parent circuit breaker every 1024 calls
callCount++;
if ((callCount & 0x3FF) == 0) {
polyfractal (Contributor) commented:

👍 for fixing this behavior
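
To tie the two threads together, here is a self-contained sketch of the consumer-side accounting being discussed. The names are made up (this is not the real MultiBucketConsumer); it only illustrates the intended behavior: reduce-phase callers report real bucket counts against the global search.max_buckets limit, collect-phase callers pass 0 so the count is untouched, and the parent (real-memory) breaker check runs on every 1024th call regardless of the value, which is what the callCount change above restores for zero-value calls.

import java.util.function.IntConsumer;

class BucketConsumerSketch implements IntConsumer {
    private final int limit;                   // e.g. the new 65,535 search.max_buckets default
    private final Runnable parentBreakerCheck; // stand-in for the real-memory breaker call
    private int count;                         // global bucket count, grows only during reduce
    private int callCount;                     // counts every call, including accept(0)

    BucketConsumerSketch(int limit, Runnable parentBreakerCheck) {
        this.limit = limit;
        this.parentBreakerCheck = parentBreakerCheck;
    }

    @Override
    public void accept(int value) {
        if (value != 0) {
            count += value;
            if (count > limit) {
                throw new IllegalStateException("too many buckets: " + count + " > " + limit);
            }
        }
        callCount++;
        if ((callCount & 0x3FF) == 0) { // every 1024th call, even when value is 0
            parentBreakerCheck.run();
        }
    }
}

With a scheme like this, an aggregator that only ever creates a handful of local buckets still contributes to both the global count and the periodic breaker check, which addresses the "1023 local buckets never trigger the breaker" concern raised earlier.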

imotov merged commit 29b5643 into elastic:master on Jun 3, 2020
imotov added a commit to imotov/elasticsearch that referenced this pull request Jun 3, 2020
Increases the default search.max_buckets limit to 65,535, and only counts
buckets during reduce phase.

Closes elastic#51731
imotov added a commit that referenced this pull request Jun 3, 2020
Increases the default search.max_buckets limit to 65,535, and only counts
buckets during reduce phase.

Closes #51731
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jun 12, 2020
Before elastic#57042 the max_buckets test would consistently pass because the
request would consistently fail. In particular, the request would fail on
the data node. After elastic#57042 it only fails on the coordinating node. When
the max_buckets test is run in a mixed version cluster it consistently
fails on *either* the data node or the coordinating node. Except when
the coordinating node is missing elastic#43095. In that case, if one data
node has elastic#57042 and one does not, *and* the one that doesn't gets the
request first and fails it as expected, then the coordinating node
retries the request on the node with elastic#57042. When that happens the
request fails mysteriously with "partial shard failures" as the error
message but no partial failures reported. This is *exactly* the bug
fixed in elastic#43095.

This updates the test to be skipped in mixed version clusters without
 elastic#43095 because they *sometimes* fail the test spuriously. The request
fails in those cases, just like we expect, but with a mysterious error
message.

Closes elastic#57657
imotov deleted the issue-51731-increase-max-buckets branch on June 12, 2020 at 15:36
nik9000 added a commit that referenced this pull request Jun 12, 2020
Labels: :Analytics/Aggregations, >enhancement, Team:Analytics, v7.9.0, v8.0.0-alpha1

Successfully merging this pull request may close these issues.

Deprecate search.max_buckets?