-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Histogram aggs: add empty buckets only in the final reduce step #35921
Histogram aggs: add empty buckets only in the final reduce step #35921
Conversation
Pinging @elastic/es-analytics-geo |
if (minDocCount == 0) { | ||
addEmptyBuckets(reducedBuckets, reduceContext); | ||
} | ||
if (InternalOrder.isKeyDesc(order)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a few minutes to convince myself that this works. :)
On first glance the behavior changes because sorting is only done on the final reduction now (instead of every reduction), which I thought might break reduceBuckets()
as that relies on consistent ordering. But histos/date_histos sort their shard results by key:asc
, so this isn't actually a problem.
Could we add a test to DateHistogramAggregatorTests
/HistogramAggregatorTests
that randomly chooses a sort order + min_doc_count: 0
to ensure this internal contract never changes in the future? It looks like none of those tests set an order so we might not notice otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I changed anything around sorting. That if was hard to read, I rewrote it but the only actual change is that empty buckets are only added in the final reduction phase. I am working on tests, I will add what you suggest, I had other additions as well in mind but I hit some roadblocks (see #36004). Basically empty buckets were not tested in our unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We chatted about this in slack, and I'm just bad at boolean logic. All good here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a check to the base class which verifies that no buckets are added in non final reduce phases. Now that we test adding empty buckets for histogram and date histogram aggs, this check makes some sense I think. These are the only aggs that add buckets as part of reduce right? I wonder if I need to check whether there is some other missing test coverage somewhere.
Also, I worked a bit on increasing test coverage in DateHistogramAggregatorTests like you suggested, and I think I prefer doing it in a follow-up if still necessary.
In this test we were randomizing different values but minDocCount was hardcoded to 1. It's important to test other values, especially `0` as it's the default. The test needed some adapting in the way buckets are randomly generated: all aggs need to share the same interval, minDocCount and emptyBucketInfo. Also assertions need to take into account that more (or less) buckets are expected depending on minDocCount. This was originated by elastic#35921 and its need to test adding empty buckets as part of the reduce phase. Also relates to elastic#26856 as one more key comparison needed to use `Double.compare` to properly handle `NaN` values, this was triggered by the increased test coverage.
In `InternalHistogramTests` we were randomizing different values but `minDocCount` was hardcoded to `1`. It's important to test other values, especially `0` as it's the default. To make this possible, the test needed some adapting in the way buckets are randomly generated: all aggs need to share the same `interval`, `minDocCount` and `emptyBucketInfo`. Also assertions need to take into account that more (or less) buckets are expected depending on `minDocCount`. This was originated by #35921 and its need to test adding empty buckets as part of the reduce phase. Also relates to #26856 as one more key comparison needed to use `Double.compare` to properly handle `NaN` values, which was triggered by the increased test coverage.
Empty buckets don't need to be added when performing an incremental reduction step, they can be added later in the final reduction step. This should allow us to later remove the max buckets limit when performing non final reduction.
fe11866
to
9f55e45
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @javanna
initialBucketCount += countInnerBucket(internalAggregation); | ||
} | ||
int reducedBucketCount = countInnerBucket(reduced); | ||
//check that non final reduction never adds buckets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
Given that we check the max buckets limit on each shard when collecting the buckets, and that non final reduction cannot add buckets (see elastic#35921), there is no point in counting and checking the number of buckets as part of non final reduction phases. Such check is still needed though in the final reduction phases to make sure that the number of returned buckets is not above the allowed threshold. Relates somehow to elastic#32125 as we will make use of non final reduction phases in CCS alternate execution mode and that increases the chance that this check trips for nothing when reducing aggs in each remote cluster.
Given that we check the max buckets limit on each shard when collecting the buckets, and that non final reduction cannot add buckets (see #35921), there is no point in counting and checking the number of buckets as part of non final reduction phases. Such check is still needed though in the final reduction phases to make sure that the number of returned buckets is not above the allowed threshold. Relates somehow to #32125 as we will make use of non final reduction phases in CCS alternate execution mode and that increases the chance that this check trips for nothing when reducing aggs in each remote cluster.
In `InternalHistogramTests` we were randomizing different values but `minDocCount` was hardcoded to `1`. It's important to test other values, especially `0` as it's the default. To make this possible, the test needed some adapting in the way buckets are randomly generated: all aggs need to share the same `interval`, `minDocCount` and `emptyBucketInfo`. Also assertions need to take into account that more (or less) buckets are expected depending on `minDocCount`. This was originated by #35921 and its need to test adding empty buckets as part of the reduce phase. Also relates to #26856 as one more key comparison needed to use `Double.compare` to properly handle `NaN` values, which was triggered by the increased test coverage.
Empty buckets don't need to be added when performing an incremental reduction step, they can be added later in the final reduction step. This will allow us to later remove the max buckets limit when performing non final reduction.
Given that we check the max buckets limit on each shard when collecting the buckets, and that non final reduction cannot add buckets (see #35921), there is no point in counting and checking the number of buckets as part of non final reduction phases. Such check is still needed though in the final reduction phases to make sure that the number of returned buckets is not above the allowed threshold. Relates somehow to #32125 as we will make use of non final reduction phases in CCS alternate execution mode and that increases the chance that this check trips for nothing when reducing aggs in each remote cluster.
Empty buckets don't need to be added when performing an incremental reduction step, they can be added later in the final reduction step. This should allow us to later remove the max buckets limit when performing non final reduction.