Speed up date_histogram without children #63643
I'm running rally against this now but playing with it by hand seems pretty good.
    return weights;
}

private Query filterMatchingBoth(Query lhs, Query rhs) {
This method and everything in it is kind of shameful but it gives a 2x speed improvement.
so, this merges two filter queries so they can be performed in one pass? I know it's a private method, but I still think a bit of documentation for what it does and why that's important would be good.
👍
Performance numbers look pretty good:
I ran some more quick and dirty performance tests:
I think the pattern you see here comes from being able to use the filter cache. Still, even with the filter cache filled with things we don't want, the agg runs significantly faster than before.
By the way, this is basically just a revival of @polyfractal's #47712, but reworked so that we can use it for
Working on the profile output for this:
This allows `date_histogram`s with `hard_bounds` and `extended_bounds` to use the "as range" style optimizations introduced in #63643. There isn't any work to do for `extended_bounds` besides adding a test. For `hard_bounds` we have to be careful when constructing the ranges to filter.
This allows us to run the optimization introduced in elastic#63643 when the `date_histogram` has children. It isn't a revolutionary performance improvement though because children tend to be a lot heavier than the `date_histogram`. It is faster, but only by a couple of percentage points.
Fixes a bug where nested documents that match a filter in the `filters` agg will be counted as matching the filter. Usually nested documents only match if you explicitly ask to match them. Worse, we only match them in the "filter by filter" mode that we wrote to speed up `date_histogram`. The `filters` agg is fairly rare, but with elastic#63643 we run `date_histogram` and `range` aggregations using `filters`.
This allows many of the optimizations added in elastic#63643 and elastic#68871 to run on aggregations with sub-aggregations. This should:
* Speed up `terms` aggregations on fields with fewer than 1000 values that also have sub-aggregations. Locally I see 2 second searches run in 1.2 seconds.
* Apply that same speedup to `range` and `date_histogram` aggregations, but it feels less impressive because the point range queries are a little slower to get up and go.
* Massively speed up `filters` aggregations with sub-aggregations that don't have a `parent` aggregation or collect "other" buckets. Also save a ton of memory while collecting them.
This speeds up `date_histogram` aggregations without a parent or children.
This is quite common - it's the aggregation that Kibana's Discover
uses all over the place. Also, we hope to be able to use the same
mechanism to speed aggs with children one day, but that day isn't today.

The kind of speedup we're seeing is fairly substantial in many cases:
This uses the work we did in #61467 to precompute the rounding points for
a `date_histogram`. Now, when we know the rounding points we execute the
`date_histogram` as a `range` aggregation. This is nice for two reasons:
`range` aggregation (see below)
to ordinals.
Points 2 and 3 above are nice, but most of the speed difference comes from
point 1. Specifically, we now look into executing `range` aggregations as
a `filters` aggregation. Normally the `filters` aggregation is quite slow
but when it doesn't have a parent or any children then we can execute it
"filter by filter" which is significantly faster. So fast, in fact, that
it is faster than the original `date_histogram`.

The `range` aggregation is fairly careful in how it rewrites, giving up
on the `filters` aggregation if it won't collect "filter by filter" and
falling back to its original execution mechanism.
So an aggregation like this:
is executed like:
Which in turn is executed like this:
And that is faster because we can execute it "filter by filter".
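
To make the chain above concrete, here is a minimal sketch of the idea in plain Java. It is not the code in this PR: the class `DateHistogramAsFilters`, its methods, and the use of a sorted `long[]` as a stand-in for the index are all assumptions made for illustration. It turns precomputed rounding points into one half-open range per bucket, then counts each range on its own ("filter by filter") instead of visiting every document and looking up its bucket.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from Elasticsearch.
final class DateHistogramAsFilters {

    /** Turn sorted rounding points into half-open ranges [from, to), one per bucket. */
    public static long[][] toRanges(long[] roundingPoints, long maxExclusive) {
        long[][] ranges = new long[roundingPoints.length][];
        for (int i = 0; i < roundingPoints.length; i++) {
            // Each bucket ends where the next one starts; the last ends at maxExclusive.
            long to = i + 1 < roundingPoints.length ? roundingPoints[i + 1] : maxExclusive;
            ranges[i] = new long[] {roundingPoints[i], to};
        }
        return ranges;
    }

    /** Index of the first element >= key in a sorted array. */
    static int lowerBound(long[] sorted, long key) {
        int lo = 0;
        int hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** "Filter by filter": count the matches of each range independently. */
    public static long[] countFilterByFilter(long[] sortedValues, long[][] ranges) {
        long[] counts = new long[ranges.length];
        for (int i = 0; i < ranges.length; i++) {
            // The sorted array plays the role of the index: two lookups per filter,
            // no per-document work.
            counts[i] = lowerBound(sortedValues, ranges[i][1]) - lowerBound(sortedValues, ranges[i][0]);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] timestamps = {5, 10, 15, 20, 25, 30};
        long[][] ranges = toRanges(new long[] {0, 10, 20}, 30);
        System.out.println(Arrays.toString(countFilterByFilter(timestamps, ranges))); // prints [1, 2, 2]
    }
}
```

The cost of this style grows with the number of filters rather than the number of documents, which is one plausible reading of why it only pays off when there is no parent and there are no children to collect per document.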
Finally, notice the `range` query filtering the data. That is required for
the data set that I'm using for testing. The "filter by filter" collection
mechanism for the `filters` agg needs special case handling when the query
is a `range` query and the filter is a `range` query and they are both on
the same field. That special case handling "merges" the range query.
Without it "filter by filter" collection is substantially slower. It's still
quite a bit quicker than the standard `filter` collection, but not nearly
as fast as it could be.