
Fix auto date histogram rounding assertion bug #17023

Open
wants to merge 11 commits into base: main

Conversation

@finnegancarroll (Contributor) commented Jan 14, 2025

Description

The auto date histogram agg contains an optimization which skips traditional doc collection and instead examines the "pre-aggregated" doc counts contained in the BKD tree of each segment. Several conditions need to be true before this optimization can execute for a given segment. One such condition is that we must be able to determine a set of ranges (or rounding) for the segment under consideration before optimizing.
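For a rough picture of where those pre-aggregated stats come from, Lucene exposes segment-level min/max values and doc counts through its PointValues API. A sketch (the "timestamp" field name is an assumption):

import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PointValues;

// Sketch: the kind of pre-aggregated, segment-level stats the optimization
// can read from the BKD tree instead of collecting docs one by one.
static void inspectSegmentStats(LeafReader reader) throws IOException {
    PointValues points = reader.getPointValues("timestamp");
    if (points != null) {
        byte[] minPacked = points.getMinPackedValue(); // segment-wide min
        byte[] maxPacked = points.getMaxPackedValue(); // segment-wide max
        int docCount = points.getDocCount();           // docs with this field
    }
}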

Normally, ranges for the auto date histogram aggregation are updated as needed over the course of collecting documents. The filter rewrite optimization, however, skips regular doc collection, so it updates the preparedRounding of the agg ahead of time, in accordance with the min & max values of the segment under consideration.

As a result, it is possible for our preparedRounding to shrink, as the rounding built from the segment could easily be smaller than the rounding previously used by our shard-level aggregator.

This usually does not pose a problem, as the preparedRounding will be updated accordingly when we collect our next document or reduce our shard-level aggs into a single top-level agg.

The specific case where this becomes problematic is when our preparedRounding is delegating to a "bounded" structure. When we prepare a rounding, we do so for the min & max epoch time of our shard, since this allows us to optimize the structure we delegate rounding to.

For some ranges of epoch time and some time zones, rounding is little more than a modulo operation. However, if our min & max epoch time crosses "transitions" such as daylight savings, we may want to delegate rounding to a linked list or array structure to quickly look up these transitions. This is why the specific occurrence of this bug linked in the initial issue only appears when "time_zone":"America/New_York".
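For context, the distinction is visible in standard java.time (an illustration only, not the PR's code): a fixed-offset zone can be rounded with pure arithmetic, while a zone with DST transitions cannot.

import java.time.Instant;
import java.time.ZoneId;

// Illustration with standard java.time (not the PR's code): UTC has a
// fixed offset, while America/New_York has DST transitions that a pure
// modulo rounding cannot account for.
public class DstTransitionDemo {
    public static void main(String[] args) {
        System.out.println(ZoneId.of("UTC").getRules().isFixedOffset());              // true
        System.out.println(ZoneId.of("America/New_York").getRules().isFixedOffset()); // false
        // The next DST switch after now, i.e. one of the "transitions" above:
        System.out.println(ZoneId.of("America/New_York").getRules().nextTransition(Instant.now()));
    }
}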

The combination of delegating rounding to these strictly bounded structures and the filter rewrite optimization "replaying" our previous bucket keys fails an assertion within our preparedRounding as our previous bucket keys are not guaranteed to fit within the strict bounds of the rounding prepared for the current segment being collected.

The changes in this PR resolve this by ensuring the filter rewrite optimization only ever increases the granularity of our preparedRounding.
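For illustration, a minimal sketch of the intended behavior (all names here are hypothetical, not the actual diff):

// Sketch of the fix's intent (names hypothetical, not the PR diff): the
// filter rewrite path may only move the aggregator's rounding forward
// (coarser), never backward (finer).
void updateRoundingForSegment(long segmentMin, long segmentMax) {
    int segmentIdx = roundingIndexFor(segmentMin, segmentMax);
    if (segmentIdx > currentRoundingIdx) {
        currentRoundingIdx = segmentIdx;
        preparedRounding = prepareRounding(currentRoundingIdx);
    }
    // Otherwise keep the existing preparedRounding, so bucket keys already
    // produced by earlier segments stay within its bounds.
}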

Related Issues

Resolves #16932

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions bot added labels bug (Something isn't working) and Search (Search query, autocomplete ...etc) on Jan 14, 2025
Contributor

❕ Gradle check result for 0ecdf31: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


codecov bot commented Jan 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.16%. Comparing base (e7e19f7) to head (0ecdf31).
Report is 16 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17023      +/-   ##
============================================
- Coverage     72.20%   72.16%   -0.05%     
+ Complexity    65289    65211      -78     
============================================
  Files          5299     5299              
  Lines        303536   303538       +2     
  Branches      43941    43941              
============================================
- Hits         219180   219055     -125     
- Misses        66441    66518      +77     
- Partials      17915    17965      +50     


@finnegancarroll (Contributor, Author)

@bowenlan-amzn can you take a look when you have a chance? Thank you!

@bowenlan-amzn (Member)

I remember the optimization is not applied if the aggregation defines a timezone, so this bug is kind of a surprise.
But the real story is that the timezone block happens right after the bug 😔

final Rounding rounding = getRounding(bounds[0], bounds[1]);
final OptionalLong intervalOpt = Rounding.getInterval(rounding);

The timezone block is inside getInterval:

if (!isUTCTimeZone(((TimeUnitRounding) rounding).timeZone)) {
// Fast filter aggregation cannot be used if it needs time zone rounding
return OptionalLong.empty();
}

The timezone is part of the Rounding object, so we need to have the Rounding first before we can check the timezone. However, in auto date histogram, getting the rounding also updates the prepared rounding, which leads to this bug.

I recommend we add a simple timezone check right before

final Rounding rounding = getRounding(bounds[0], bounds[1]);
final OptionalLong intervalOpt = Rounding.getInterval(rounding);

It purely takes in a Rounding (for auto date histogram, this can just be the first Rounding in RoundingInfos, since every Rounding has the same timezone information) and does the check:
if (!isUTCTimeZone(((TimeUnitRounding) rounding).timeZone)) {
// Fast filter aggregation cannot be used if it needs time zone rounding
return OptionalLong.empty();
}

If the check doesn't pass, we don't even bother to go inside the getRounding method.
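A sketch of what that standalone check could look like (a sketch only, reusing isUTCTimeZone and TimeUnitRounding from the snippet above; not an exact diff):

// Sketch (not an exact diff): takes any one Rounding, e.g. the first in
// RoundingInfos since they all share the same time zone, and bails out
// before getRounding() is ever called.
static boolean timeZoneAllowsOptimization(Rounding rounding) {
    if (rounding instanceof TimeUnitRounding) {
        return isUTCTimeZone(((TimeUnitRounding) rounding).timeZone);
    }
    return false; // conservative default for other Rounding types
}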

The check you added here (not shrinking the rounding) is still meaningful. Agreed, it should be ever increasing and not shrink depending on the next segment processed.

@bowenlan-amzn (Member)

The combination of delegating rounding to these strictly bounded structures and the filter rewrite optimization "replaying" our previous bucket keys fails an assertion within our preparedRounding as our previous bucket keys are not guaranteed to fit within the strict bounds of the rounding prepared for the current segment being collected.

A little more explanation on the root cause:

long minLookup = minUtcMillis - unit.extraLocalOffsetLookup();

For different date units, the boundary of the prepared rounding is different. For example, for a date unit of month, the min lookup subtracts a month's worth of millis, but for second, the min lookup only needs to subtract 1000.
So if we shrink the prepared rounding from month to second, the min lookup actually increases and can no longer cover previously rounded values, and this fails an assertion.
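To make that concrete with illustrative numbers (not the exact extraLocalOffsetLookup() constants):

// Illustrative values only; the real extraLocalOffsetLookup() constants
// live in the date unit implementations.
long minUtcMillis = 1_700_000_000_000L;             // segment min timestamp
long monthLookup = 31L * 24 * 60 * 60 * 1000;       // ~ one month of millis
long secondLookup = 1_000L;                         // one second of millis

long minLookupMonth = minUtcMillis - monthLookup;   // lower bound for month
long minLookupSecond = minUtcMillis - secondLookup; // lower bound for second

// Shrinking from month to second raises the lower bound by ~31 days, so a
// bucket key rounded under the old month rounding can fall below it and
// trip the assertion.
assert minLookupSecond > minLookupMonth;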

@finnegancarroll (Contributor, Author)

The timezone block is inside getInterval:

if (!isUTCTimeZone(((TimeUnitRounding) rounding).timeZone)) {
// Fast filter aggregation cannot be used if it needs time zone rounding
return OptionalLong.empty();
}

Ah, I didn't realize we already disallow non-UTC timezones. It makes sense that we may not want to support this optimization for non-UTC in general, as it could produce intervals which are not of fixed size.

I recommend we add a simple timezone check right before

final Rounding rounding = getRounding(bounds[0], bounds[1]);
final OptionalLong intervalOpt = Rounding.getInterval(rounding);

I've added this check to canOptimize(). I was thinking this would be slightly more convenient since we already have the ValuesSourceConfig to access the timezone and it's the earliest we can disable the optimization path. Please lmk what you think of this.
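Roughly the following shape (a sketch; assuming the timezone is reachable via a config.timezone() accessor and reusing the isUTCTimeZone helper quoted above, not the exact diff):

// Sketch of the canOptimize() guard (accessor name assumed; not the exact
// diff): disable the filter rewrite path for non-UTC zones before any
// rounding is prepared.
if (config.timezone() != null && !isUTCTimeZone(config.timezone())) {
    return false;
}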

@finnegancarroll (Contributor, Author) commented Jan 16, 2025

The timezone block is inside getInterval:

if (!isUTCTimeZone(((TimeUnitRounding) rounding).timeZone)) {
// Fast filter aggregation cannot be used if it needs time zone rounding
return OptionalLong.empty();
}

Now that we disallow non-UTC timezones from the start for this optimization, I've added an additional change to remove this check from getInterval(). My thought is the check is now redundant, and Rounding.java should not be aware of specific conditions for the filter rewrite optimization.

Contributor

❌ Gradle check result for d6927fd: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for d6927fd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for 3a6dcb1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Labels
bug (Something isn't working), Search (Search query, autocomplete ...etc)
Development
Successfully merging this pull request may close these issues:
[BUG] auto_date_histogram query with time_zone can throw AssertionError