
Speed up date_histogram without children #63643

Merged (60 commits) on Nov 9, 2020

Conversation

nik9000
Member

@nik9000 nik9000 commented Oct 13, 2020

This speeds up date_histogram aggregations without a parent or
children. This is quite common - it's the aggregation that Kibana's Discover
uses all over the place. Also, we hope to be able to use the same
mechanism to speed up aggs with children one day, but that day isn't today.

The kind of speedup we're seeing is fairly substantial in many cases:

|                              |                                            |  before |   after |    |
|------------------------------|--------------------------------------------|---------|---------|----|
| 90th percentile service time |           date_histogram_calendar_interval | 9266.07 | 1376.13 | ms |
| 90th percentile service time |   date_histogram_calendar_interval_with_tz | 9217.21 | 1372.67 | ms |
| 90th percentile service time |              date_histogram_fixed_interval | 8817.36 | 1312.67 | ms |
| 90th percentile service time |      date_histogram_fixed_interval_with_tz | 8801.71 | 1311.69 | ms | <-- discover's agg
| 90th percentile service time | date_histogram_fixed_interval_with_metrics | 44660.2 | 43789.5 | ms |

This uses the work we did in #61467 to precompute the rounding points for
a date_histogram. Now, when we know the rounding points we execute the
date_histogram as a range aggregation. This is nice for three reasons:

  1. We can further rewrite the range aggregation (see below)
  2. We don't need to allocate a hash to convert rounding points
    to ordinals.
  3. We can send precise cardinality estimates to sub-aggs.
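
To illustrate the idea: once the rounding points (epoch millis) are precomputed, the bucket boundaries translate directly into half-open ranges like the ones in the rewritten request below. This is a hedged Python sketch, not the actual Elasticsearch code; `rounding_points_to_ranges` is a hypothetical name.

```python
def rounding_points_to_ranges(points):
    """Turn sorted rounding points (epoch millis) into the half-open
    ranges a `range` aggregation would use. Illustrative only."""
    ranges = []
    for i, start in enumerate(points):
        if i + 1 < len(points):
            ranges.append({"from": start, "to": points[i + 1]})
        else:
            # The last bucket is unbounded above, like the final
            # {"from": ...} entry in the rewritten request below.
            ranges.append({"from": start})
    return ranges
```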

Points 2 and 3 above are nice, but most of the speed difference comes from
point 1. Specifically, we now look into executing range aggregations as
a filters aggregation. Normally the filters aggregation is quite slow
but when it doesn't have a parent or any children then we can execute it
"filter by filter" which is significantly faster. So fast, in fact, that
it is faster than the original date_histogram.
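
A hedged sketch of why "filter by filter" wins (not the real Lucene code; `matches` and `count` here are hypothetical stand-ins for per-document matching and Weight#count-style bulk counting):

```python
def collect_standard(docs, filters, matches):
    """Standard collection: one pass over every document, testing
    each filter against each doc."""
    counts = [0] * len(filters)
    for doc in docs:
        for i, f in enumerate(filters):
            if matches(f, doc):
                counts[i] += 1
    return counts

def collect_filter_by_filter(filters, count):
    """Filter-by-filter collection: one cheap count per filter,
    with no per-document loop at all."""
    return [count(f) for f in filters]
```

When `count(f)` can be answered from index structures directly, the second form never visits individual documents, which is where the speedup comes from.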

The range aggregation is fairly careful in how it rewrites, giving up
on the filters aggregation if it won't collect "filter by filter" and
falling back to its original execution mechanism.

So an aggregation like this:

POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "fixed_interval": "60d",
        "time_zone": "America/New_York"
      }
    }
  }
}

is executed like:

POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "range": {
        "field": "dropoff_datetime",
        "ranges": [
          {"from": 1415250000000, "to": 1420434000000},
          {"from": 1420434000000, "to": 1425618000000},
          {"from": 1425618000000, "to": 1430798400000},
          {"from": 1430798400000, "to": 1435982400000},
          {"from": 1435982400000, "to": 1441166400000},
          {"from": 1441166400000, "to": 1446350400000},
          {"from": 1446350400000, "to": 1451538000000},
          {"from": 1451538000000}
        ]
      }
    }
  }
}

Which in turn is executed like this:

POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "filters": {
        "filters": {
          "1": {"range": {"dropoff_datetime": {"gte": "2014-12-30 00:00:00", "lt": "2015-01-05 05:00:00"}}},
          "2": {"range": {"dropoff_datetime": {"gte": "2015-01-05 05:00:00", "lt": "2015-03-06 05:00:00"}}},
          "3": {"range": {"dropoff_datetime": {"gte": "2015-03-06 00:00:00", "lt": "2015-05-05 00:00:00"}}},
          "4": {"range": {"dropoff_datetime": {"gte": "2015-05-05 00:00:00", "lt": "2015-07-04 00:00:00"}}},
          "5": {"range": {"dropoff_datetime": {"gte": "2015-07-04 00:00:00", "lt": "2015-09-02 00:00:00"}}},
          "6": {"range": {"dropoff_datetime": {"gte": "2015-09-02 00:00:00", "lt": "2015-11-01 00:00:00"}}},
          "7": {"range": {"dropoff_datetime": {"gte": "2015-11-01 00:00:00", "lt": "2015-12-31 00:00:00"}}},
          "8": {"range": {"dropoff_datetime": {"gte": "2015-12-31 00:00:00"}}}
        }
      }
    }
  }
}

And that is faster because we can execute it "filter by filter".

Finally, notice the range query filtering the data. That is required for
the data set that I'm using for testing. The "filter by filter" collection
mechanism for the filters agg needs special case handling when the query
is a range query and the filter is a range query and they are both on
the same field. That special case handling "merges" the range query.
Without it, "filter by filter" collection is substantially slower. It's still
quite a bit quicker than the standard filter collection, but not nearly
as fast as it could be.
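
A minimal sketch of that merge, assuming both ranges are half-open `[gte, lt)` intervals on the same field; the names are illustrative, not the actual `filterMatchingBoth` implementation from the diff.

```python
def merge_ranges(query, flt):
    """Intersect two ranges (dicts with optional 'gte'/'lt' bounds)
    into a single range, or return None when they are disjoint."""
    gte = max((r["gte"] for r in (query, flt) if "gte" in r), default=None)
    lt = min((r["lt"] for r in (query, flt) if "lt" in r), default=None)
    if gte is not None and lt is not None and gte >= lt:
        return None  # disjoint: the merged filter matches nothing
    merged = {}
    if gte is not None:
        merged["gte"] = gte
    if lt is not None:
        merged["lt"] = lt
    return merged
```

The point of the merge is that each bucket's filter then runs as a single range query instead of the conjunction of the top-level query and the bucket's filter.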

@nik9000
Member Author

nik9000 commented Oct 13, 2020

I'm running rally against this now but playing with it by hand seems pretty good.

return weights;
}

private Query filterMatchingBoth(Query lhs, Query rhs) {
Member Author

This method and everything in it is kind of shameful but it gives a 2x speed improvement.

Member

so, this merges two filter queries so they can be performed in one pass? I know it's a private method, but I still think a bit of documentation for what it does and why that's important would be good.

Member Author

👍

@nik9000
Member Author

nik9000 commented Oct 13, 2020

Performance numbers look pretty good:

|                              |                                            |  before |   after |    |
|------------------------------|--------------------------------------------|---------|---------|----|
| 90th percentile service time |           date_histogram_calendar_interval | 9266.07 | 1823.89 | ms |
| 90th percentile service time |   date_histogram_calendar_interval_with_tz | 9217.21 | 1810.09 | ms |
| 90th percentile service time |              date_histogram_fixed_interval | 8817.36 | 1780.56 | ms |
| 90th percentile service time |      date_histogram_fixed_interval_with_tz | 8801.71 | 1765.17 | ms |
| 90th percentile service time | date_histogram_fixed_interval_with_metrics | 44660.2 | 41587.3 | ms |

@nik9000
Member Author

nik9000 commented Oct 14, 2020

I ran some more quick and dirty performance tests:

$ for d in 180 120 60 40 30 20 10 9 8 7 6 5 4 3 2 1; do
>   for i in $(seq 1 3); do
>     echo -n $d:
>     curl -s -XPOST -HContent-Type:application/json -uelastic:password 'localhost:9200/_search?pretty&error_trace' -d'{
>       "size": 0,
>       "query": {
>         "range": {
>           "dropoff_datetime": {
>             "gte": "2015-01-01 00:00:00",
>             "lt": "2016-01-01 00:00:00"
>           }
>         }
>       },
>       "aggs": {
>         "dropoffs_over_time": {
>           "date_histogram": {
>             "field": "dropoff_datetime",
>             "fixed_interval": "'$d'd",
>             "time_zone": "America/New_York"
>           }
>         }
>       }
>     }' | grep took
>   done
> done
180:  "took" : 1720,
180:  "took" : 1684,
180:  "took" : 1684,
120:  "took" : 3093,
120:  "took" : 4504,
120:  "took" : 1702,
60:  "took" : 1788,
60:  "took" : 1789,
60:  "took" : 1853,
40:  "took" : 3268,
40:  "took" : 4992,
40:  "took" : 1869,
30:  "took" : 3338,
30:  "took" : 4868,
30:  "took" : 1864,
20:  "took" : 3386,
20:  "took" : 4967,
20:  "took" : 1921,
10:  "took" : 3560,
10:  "took" : 5203,
10:  "took" : 2077,
9:  "took" : 3563,
9:  "took" : 5323,
9:  "took" : 2083,
8:  "took" : 3599,
8:  "took" : 5329,
8:  "took" : 2124,
7:  "took" : 3682,
7:  "took" : 5364,
7:  "took" : 2146,
6:  "took" : 3706,
6:  "took" : 5533,
6:  "took" : 2288,
5:  "took" : 3773,
5:  "took" : 5824,
5:  "took" : 2433,
4:  "took" : 3889,
4:  "took" : 6038,
4:  "took" : 2585,
3:  "took" : 4136,
3:  "took" : 6458,
3:  "took" : 2826,
2:  "took" : 10773, <--- the optimization turns off here because there are more than 128 buckets. That 128 limit seems like maybe it is bad!
2:  "took" : 12661,
2:  "took" : 11599,
1:  "took" : 11511,
1:  "took" : 11859,
1:  "took" : 11864,

I think the pattern you see here comes from being able to use the filter cache. Still, even with the filter cache filled with things we don't want, the agg runs significantly faster than before.

@nik9000
Member Author

nik9000 commented Oct 14, 2020

By the way, this is basically just a revival of @polyfractal's #47712, but reworked so that we can use it for date_histogram which is very very common.

@nik9000
Member Author

nik9000 commented Oct 14, 2020

Working on the profile output for this:

{
  "type" : "DateHistogramAggregator.FromDateRange",
  "description" : "histo",
  "time_in_nanos" : 2801503,
  "breakdown" : {
    "reduce" : 0,
    "post_collection_count" : 1,         <---- all of these numbers come from the delegate
    "build_leaf_collector" : 2423271,    <---- the "FilterByFilter" aggregator does everything here
    "build_aggregation" : 371877,
    "build_aggregation_count" : 1,
    "build_leaf_collector_count" : 1,
    "post_collection" : 2352,
    "initialize" : 4003,
    "initialize_count" : 1,
    "reduce_count" : 0,
    "collect" : 0,                       <---- in fact, we never call "collect" on it
    "collect_count" : 0
  },
  "debug" : {
    "delegate" : "RangeAggregator.FromFilters",
    "delegate_debug" : {
      "delegate" : "FiltersAggregator.FilterByFilter",
      "delegate_debug" : {
        "segments_with_deleted_docs" : 0  <---- mostly added this to prove that you could get debug information from the delegate
      }
    }
  }
}

nik9000 added a commit that referenced this pull request Dec 8, 2020
This allows `date_histogram`s with `hard_bounds` and `extended_bounds`
to use the "as range" style optimizations introduced in #63643. There
isn't any work to do for `extended_bounds` besides adding a test. For
`hard_bounds` we have to be careful when constructing the ranges to
filter on.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 4, 2021
nik9000 added a commit that referenced this pull request Jan 4, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 5, 2021
This allows us to run the optimization introduced in elastic#63643 when the
`date_histogram` has children. It isn't a revolutionary performance
improvement though because children tend to be a lot heavier than the
`date_histogram`. It is faster, but only by a couple of percentage
points.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 5, 2021
Fixes a bug where nested documents that match a filter in the `filters`
agg were counted as matching the filter. Usually nested documents
only match if you explicitly ask to match them. Worse, we only matched them
in the "filter by filter" mode that we wrote to speed up date_histogram.
The `filters` agg is fairly rare, but with elastic#63643 we run
`date_histogram` and `range` aggregations using `filters`.
nik9000 added a commit that referenced this pull request Jan 7, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 7, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 7, 2021
nik9000 added a commit that referenced this pull request Jan 7, 2021
nik9000 added a commit that referenced this pull request Jan 7, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Mar 2, 2021
This allows many of the optimizations added in elastic#63643 and elastic#68871 to run
on aggregations with sub-aggregations. This should:
* Speed up `terms` aggregations on fields with fewer than 1000 values that
  also have sub-aggregations. Locally I see 2 second searches run in 1.2
  seconds.
* Apply that same speedup to `range` and `date_histogram` aggregations,
  though it feels less impressive because the point range queries are a
  little slower to get up and go.
* Massively speed up `filters` aggregations with sub-aggregations that
  don't have a `parent` aggregation or collect "other" buckets. Also
  save a ton of memory while collecting them.
nik9000 added a commit that referenced this pull request Mar 3, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Mar 3, 2021
nik9000 added a commit that referenced this pull request Mar 5, 2021
6 participants