
Speed up date_histogram without children #63643

Merged (60 commits) on Nov 9, 2020

Conversation

nik9000
Member

@nik9000 nik9000 commented Oct 13, 2020

This speeds up date_histogram aggregations without a parent or
children. This is quite common - it's the aggregation that Kibana's Discover
uses all over the place. Also, we hope to be able to use the same
mechanism to speed up aggs with children one day, but that day isn't today.

The kind of speedup we're seeing is fairly substantial in many cases:

|                              |                                            |  before |   after |    |
|------------------------------|--------------------------------------------|---------|---------|----|
| 90th percentile service time |           date_histogram_calendar_interval | 9266.07 | 1376.13 | ms |
| 90th percentile service time |   date_histogram_calendar_interval_with_tz | 9217.21 | 1372.67 | ms |
| 90th percentile service time |              date_histogram_fixed_interval | 8817.36 | 1312.67 | ms |
| 90th percentile service time |      date_histogram_fixed_interval_with_tz | 8801.71 | 1311.69 | ms | <-- discover's agg
| 90th percentile service time | date_histogram_fixed_interval_with_metrics | 44660.2 | 43789.5 | ms |

This uses the work we did in #61467 to precompute the rounding points for
a date_histogram. Now, when we know the rounding points we execute the
date_histogram as a range aggregation. This is nice for three reasons:

  1. We can further rewrite the range aggregation (see below)
  2. We don't need to allocate a hash to convert rounding points
    to ordinals.
  3. We can send precise cardinality estimates to sub-aggs.
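
To illustrate the idea: once the rounding points (epoch millis) are precomputed, the bucket boundaries translate directly into half-open ranges like the ones in the rewritten request below. This is a hedged Python sketch, not the actual Elasticsearch code; `rounding_points_to_ranges` is a hypothetical name.

```python
def rounding_points_to_ranges(points):
    """Turn sorted rounding points (epoch millis) into the half-open
    ranges a `range` aggregation would use. Illustrative only."""
    ranges = []
    for i, start in enumerate(points):
        if i + 1 < len(points):
            ranges.append({"from": start, "to": points[i + 1]})
        else:
            # The last bucket is unbounded above, like the final
            # {"from": ...} entry in the rewritten request below.
            ranges.append({"from": start})
    return ranges
```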

Points 2 and 3 above are nice, but most of the speed difference comes from
point 1. Specifically, we now look into executing range aggregations as
a filters aggregation. Normally the filters aggregation is quite slow
but when it doesn't have a parent or any children then we can execute it
"filter by filter" which is significantly faster. So fast, in fact, that
it is faster than the original date_histogram.
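
A hedged sketch of why "filter by filter" wins (not the real Lucene code; `matches` and `count` here are hypothetical stand-ins for per-document matching and Weight#count-style bulk counting):

```python
def collect_standard(docs, filters, matches):
    """Standard collection: one pass over every document, testing
    each filter against each doc."""
    counts = [0] * len(filters)
    for doc in docs:
        for i, f in enumerate(filters):
            if matches(f, doc):
                counts[i] += 1
    return counts

def collect_filter_by_filter(filters, count):
    """Filter-by-filter collection: one cheap count per filter,
    with no per-document loop at all."""
    return [count(f) for f in filters]
```

When `count(f)` can be answered from index structures directly, the second form never visits individual documents, which is where the speedup comes from.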

The range aggregation is fairly careful in how it rewrites, giving up
on the filters aggregation if it won't collect "filter by filter" and
falling back to its original execution mechanism.

So an aggregation like this:

POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "fixed_interval": "60d",
        "time_zone": "America/New_York"
      }
    }
  }
}

is executed like:

POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "range": {
        "field": "dropoff_datetime",
        "ranges": [
          {"from": 1415250000000, "to": 1420434000000},
          {"from": 1420434000000, "to": 1425618000000},
          {"from": 1425618000000, "to": 1430798400000},
          {"from": 1430798400000, "to": 1435982400000},
          {"from": 1435982400000, "to": 1441166400000},
          {"from": 1441166400000, "to": 1446350400000},
          {"from": 1446350400000, "to": 1451538000000},
          {"from": 1451538000000}
        ]
      }
    }
  }
}

Which in turn is executed like this:

POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "filters": {
        "filters": {
          "1": {"range": {"dropoff_datetime": {"gte": "2014-12-30 00:00:00", "lt": "2015-01-05 05:00:00"}}},
          "2": {"range": {"dropoff_datetime": {"gte": "2015-01-05 05:00:00", "lt": "2015-03-06 05:00:00"}}},
          "3": {"range": {"dropoff_datetime": {"gte": "2015-03-06 00:00:00", "lt": "2015-05-05 00:00:00"}}},
          "4": {"range": {"dropoff_datetime": {"gte": "2015-05-05 00:00:00", "lt": "2015-07-04 00:00:00"}}},
          "5": {"range": {"dropoff_datetime": {"gte": "2015-07-04 00:00:00", "lt": "2015-09-02 00:00:00"}}},
          "6": {"range": {"dropoff_datetime": {"gte": "2015-09-02 00:00:00", "lt": "2015-11-01 00:00:00"}}},
          "7": {"range": {"dropoff_datetime": {"gte": "2015-11-01 00:00:00", "lt": "2015-12-31 00:00:00"}}},
          "8": {"range": {"dropoff_datetime": {"gte": "2015-12-31 00:00:00"}}}
        }
      }
    }
  }
}

And that is faster because we can execute it "filter by filter".

Finally, notice the range query filtering the data. That is required for
the data set that I'm using for testing. The "filter by filter" collection
mechanism for the filters agg needs special case handling when the query
is a range query and the filter is a range query and they are both on
the same field. That special case handling "merges" the range query.
Without it, "filter by filter" collection is substantially slower. It's still
quite a bit quicker than the standard filter collection, but not nearly
as fast as it could be.
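
A minimal sketch of that merge, assuming both ranges are half-open `[gte, lt)` intervals on the same field; the names are illustrative, not the actual `filterMatchingBoth` implementation from the diff.

```python
def merge_ranges(query, flt):
    """Intersect two ranges (dicts with optional 'gte'/'lt' bounds)
    into a single range, or return None when they are disjoint."""
    gte = max((r["gte"] for r in (query, flt) if "gte" in r), default=None)
    lt = min((r["lt"] for r in (query, flt) if "lt" in r), default=None)
    if gte is not None and lt is not None and gte >= lt:
        return None  # disjoint: the merged filter matches nothing
    merged = {}
    if gte is not None:
        merged["gte"] = gte
    if lt is not None:
        merged["lt"] = lt
    return merged
```

The point of the merge is that each bucket's filter then runs as a single range query instead of the conjunction of the top-level query and the bucket's filter.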

@nik9000
Member Author

nik9000 commented Oct 13, 2020

I'm running rally against this now but playing with it by hand seems pretty good.

return weights;
}

private Query filterMatchingBoth(Query lhs, Query rhs) {
Member Author

This method and everything in it is kind of shameful but it gives a 2x speed improvement.

Member

so, this merges two filter queries so they can be performed in one pass? I know it's a private method, but I still think a bit of documentation for what it does and why that's important would be good.

Member Author

👍

@nik9000
Member Author

nik9000 commented Oct 13, 2020

Performance numbers look pretty good:

|                              |                                            |  before |   after |    |
|------------------------------|--------------------------------------------|---------|---------|----|
| 90th percentile service time |           date_histogram_calendar_interval | 9266.07 | 1823.89 | ms |
| 90th percentile service time |   date_histogram_calendar_interval_with_tz | 9217.21 | 1810.09 | ms |
| 90th percentile service time |              date_histogram_fixed_interval | 8817.36 | 1780.56 | ms |
| 90th percentile service time |      date_histogram_fixed_interval_with_tz | 8801.71 | 1765.17 | ms |
| 90th percentile service time | date_histogram_fixed_interval_with_metrics | 44660.2 | 41587.3 | ms |

@nik9000
Member Author

nik9000 commented Oct 14, 2020

I ran some more quick and dirty performance tests:

$ for d in 180 120 60 40 30 20 10 9 8 7 6 5 4 3 2 1; do
>   for i in $(seq 1 3); do
>     echo -n $d:
>     curl -s -XPOST -HContent-Type:application/json -uelastic:password 'localhost:9200/_search?pretty&error_trace' -d'{
>       "size": 0,
>       "query": {
>         "range": {
>           "dropoff_datetime": {
>             "gte": "2015-01-01 00:00:00",
>             "lt": "2016-01-01 00:00:00"
>           }
>         }
>       },
>       "aggs": {
>         "dropoffs_over_time": {
>           "date_histogram": {
>             "field": "dropoff_datetime",
>             "fixed_interval": "'$d'd",
>             "time_zone": "America/New_York"
>           }
>         }
>       }
>     }' | grep took
>   done
> done
180:  "took" : 1720,
180:  "took" : 1684,
180:  "took" : 1684,
120:  "took" : 3093,
120:  "took" : 4504,
120:  "took" : 1702,
60:  "took" : 1788,
60:  "took" : 1789,
60:  "took" : 1853,
40:  "took" : 3268,
40:  "took" : 4992,
40:  "took" : 1869,
30:  "took" : 3338,
30:  "took" : 4868,
30:  "took" : 1864,
20:  "took" : 3386,
20:  "took" : 4967,
20:  "took" : 1921,
10:  "took" : 3560,
10:  "took" : 5203,
10:  "took" : 2077,
9:  "took" : 3563,
9:  "took" : 5323,
9:  "took" : 2083,
8:  "took" : 3599,
8:  "took" : 5329,
8:  "took" : 2124,
7:  "took" : 3682,
7:  "took" : 5364,
7:  "took" : 2146,
6:  "took" : 3706,
6:  "took" : 5533,
6:  "took" : 2288,
5:  "took" : 3773,
5:  "took" : 5824,
5:  "took" : 2433,
4:  "took" : 3889,
4:  "took" : 6038,
4:  "took" : 2585,
3:  "took" : 4136,
3:  "took" : 6458,
3:  "took" : 2826,
2:  "took" : 10773, <--- the optimization turns off here because there are more than 128 buckets. That 128 limit seems like maybe it is bad!
2:  "took" : 12661,
2:  "took" : 11599,
1:  "took" : 11511,
1:  "took" : 11859,
1:  "took" : 11864,

I think the pattern you see here comes from being able to use the filter cache. Still, even with the filter cache filled with things we don't want, the agg runs significantly faster than before.

@nik9000
Member Author

nik9000 commented Oct 14, 2020

By the way, this is basically just a revival of @polyfractal's #47712, but reworked so that we can use it for date_histogram which is very very common.

@nik9000
Member Author

nik9000 commented Oct 14, 2020

Working on the profile output for this:

{
  "type" : "DateHistogramAggregator.FromDateRange",
  "description" : "histo",
  "time_in_nanos" : 2801503,
  "breakdown" : {
    "reduce" : 0,
    "post_collection_count" : 1,         <---- all of these numbers come from the delegate
    "build_leaf_collector" : 2423271,    <---- the "FilterByFilter" aggregator does everything here
    "build_aggregation" : 371877,
    "build_aggregation_count" : 1,
    "build_leaf_collector_count" : 1,
    "post_collection" : 2352,
    "initialize" : 4003,
    "initialize_count" : 1,
    "reduce_count" : 0,
    "collect" : 0,                       <---- in fact, we never call "collect" on it
    "collect_count" : 0
  },
  "debug" : {
    "delegate" : "RangeAggregator.FromFilters",
    "delegate_debug" : {
      "delegate" : "FiltersAggregator.FilterByFilter",
      "delegate_debug" : {
        "segments_with_deleted_docs" : 0  <---- mostly added this to prove that you could get debug information from the delegate
      }
    }
  }
}

nik9000 added a commit that referenced this pull request Dec 8, 2020
This allows `date_histogram`s with `hard_bounds` and `extended_bounds`
to use the "as range" style optimizations introduced in #63643. There
isn't any work to do for `extended_bounds` besides adding a test. For
`hard_bounds` we have to be careful when constructing the ranges to
filter on.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 4, 2021
nik9000 added a commit that referenced this pull request Jan 4, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 5, 2021
This allows us to run the optimization introduced in elastic#63643 when the
`date_histogram` has children. It isn't a revolutionary performance
improvement though because children tend to be a lot heavier than the
`date_histogram`. It is faster, but only by a couple of percentage
points.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 5, 2021
Fixes a bug where nested documents that match a filter in the `filters`
agg were counted as matching the filter. Usually nested documents
only match if you explicitly ask to match them. Worse, we only matched them
in the "filter by filter" mode that we wrote to speed up date_histogram.
The `filters` agg is fairly rare, but with elastic#63643 we run
`date_histogram` and `range` aggregations using `filters`.
nik9000 added a commit that referenced this pull request Jan 7, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 7, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 7, 2021
nik9000 added a commit that referenced this pull request Jan 7, 2021
nik9000 added a commit that referenced this pull request Jan 7, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Mar 2, 2021
This allows many of the optimizations added in elastic#63643 and elastic#68871 to run
on aggregations with sub-aggregations. This should:
* Speed up `terms` aggregations on fields with fewer than 1000 values that
  also have sub-aggregations. Locally I see 2 second searches run in 1.2
  seconds.
* Apply that same speedup to `range` and `date_histogram` aggregations,
  though it feels less impressive because the point range queries are a
  little slower to get up and go.
* Massively speed up `filters` aggregations with sub-aggregations that
  don't have a `parent` aggregation or collect "other" buckets. Also
  save a ton of memory while collecting them.
nik9000 added a commit that referenced this pull request Mar 3, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Mar 3, 2021
nik9000 added a commit that referenced this pull request Mar 5, 2021
6 participants