Skip to content

Latest commit

 

History

History
317 lines (248 loc) · 11.9 KB

search.asciidoc

File metadata and controls

317 lines (248 loc) · 11.9 KB

Search and Query DSL changes

Off-heap terms index

The terms dictionary is the part of the inverted index that records all terms that occur within a segment in sorted order. In order to provide fast retrieval, terms dictionaries come with a small terms index that allows for efficient random access by term. Until now this terms index had always been loaded on-heap.

As of 7.0, the terms index is loaded on-heap for fields that only have unique values such as _id fields, and off-heap otherwise - likely most other fields. This is expected to reduce memory requirements but might slow down search requests if both below conditions are met:

  • The size of the data directory on each node is significantly larger than the amount of memory that is available to the filesystem cache.

  • The number of matches of the query is not several orders of magnitude greater than the number of terms that the query tries to match, either explicitly via term or terms queries, or implicitly via multi-term queries such as prefix, wildcard or fuzzy queries.

This change affects both existing indices created with Elasticsearch 6.x and new indices created with Elasticsearch 7.x.

Changes to queries

  • The default value for transpositions parameter of fuzzy query has been changed to true.

  • The query_string options use_dismax, split_on_whitespace, all_fields, locale, auto_generate_phrase_query and lowercase_expanded_terms deprecated in 6.x have been removed.

  • Purely negative queries (only MUST_NOT clauses) now return a score of 0 rather than 1.

  • The boundary specified using geohashes in the geo_bounding_box query now include entire geohash cell, instead of just geohash center.

  • Attempts to generate multi-term phrase queries against non-text fields with a custom analyzer will now throw an exception.

  • An envelope crossing the dateline in a `geo_shape `query is now processed correctly when specified using REST API instead of having its left and right corners flipped.

  • Attempts to set boost on inner span queries will now throw a parsing exception.

Adaptive replica selection enabled by default

Adaptive replica selection has been enabled by default. If you wish to return to the older round robin of search requests, you can use the cluster.routing.use_adaptive_replica_selection setting:

PUT /_cluster/settings
{
    "transient": {
        "cluster.routing.use_adaptive_replica_selection": false
    }
}

Search API returns 400 for invalid requests

The Search API returns 400 - Bad request while it would previously return 500 - Internal Server Error in the following cases of invalid request:

  • the result window is too large

  • sort is used in combination with rescore

  • the rescore window is too large

  • the number of slices is too large

  • keep alive for scroll is too large

  • number of filters in the adjacency matrix aggregation is too large

  • script compilation errors

Scroll queries cannot use the request_cache anymore

Setting request_cache:true on a query that creates a scroll (scroll=1m) has been deprecated in 6 and will now return a 400 - Bad request. Scroll queries are not meant to be cached.

Scroll queries cannot use rescore anymore

Including a rescore clause on a query that creates a scroll (scroll=1m) has been deprecated in 6.5 and will now return a 400 - Bad request. Allowing rescore on scroll queries would break the scroll sort. In the 6.x line, the rescore clause was silently ignored (for scroll queries), and it was allowed in the 5.x line.

Term Suggesters supported distance algorithms

The following string distance algorithms were given additional names in 6.2 and their existing names were deprecated. The deprecated names have now been removed.

  • levenstein - replaced by levenshtein

  • jarowinkler - replaced by jaro_winkler

The popular mode for Suggesters (term and phrase) now uses the doc frequency (instead of the sum of the doc frequency) of the input terms to compute the frequency threshold for candidate suggestions.

Limiting the number of terms that can be used in a Terms Query request

Executing a Terms Query with a lot of terms may degrade the cluster performance, as each additional term demands extra processing and memory. To safeguard against this, the maximum number of terms that can be used in a Terms Query request has been limited to 65536. This default maximum can be changed for a particular index with the index setting index.max_terms_count.

Limiting the length of regex that can be used in a Regexp Query request

Executing a Regexp Query with a long regex string may degrade search performance. To safeguard against this, the maximum length of regex that can be used in a Regexp Query request has been limited to 1000. This default maximum can be changed for a particular index with the index setting index.max_regex_length.

Limiting the number of auto-expanded fields

Executing queries that use automatic expansion of fields (e.g. query_string, simple_query_string or multi_match) can have performance issues for indices with a large numbers of fields. To safeguard against this, a default limit of 1024 fields has been introduced for queries using the "all fields" mode ("default_field": "") or other fieldname expansions (e.g. "foo"). If needed, you can change this limit using the indices.query.bool.max_clause_count static cluster setting.

Invalid _search request body

Search requests with extra content after the main object will no longer be accepted by the _search endpoint. A parsing exception will be thrown instead.

Doc-value fields default format

The format of doc-value fields is changing to be the same as what could be obtained in 6.x with the special use_field_mapping format. This is mostly a change for date fields, which are now formatted based on the format that is configured in the mappings by default. This behavior can be changed by specifying a format within the doc-value field.

Context Completion Suggester

The ability to query and index context enabled suggestions without context, deprecated in 6.x, has been removed. Context enabled suggestion queries without contexts have to visit every suggestion, which degrades the search performance considerably.

For geo context the value of the path parameter is now validated against the mapping, and the context is only accepted if path points to a field with geo_point type.

Semantics changed for max_concurrent_shard_requests

max_concurrent_shard_requests used to limit the total number of concurrent shard requests a single high level search request can execute. In 7.0 this changed to be the max number of concurrent shard requests per node. The default is now 5.

max_score set to null when scores are not tracked

max_score used to be set to 0 whenever scores are not tracked. null is now used instead which is a more appropriate value for a scenario where scores are not available.

Negative boosts are not allowed

Setting a negative boost for a query or a field, deprecated in 6x, is not allowed in this version. To deboost a specific query or field you can use a boost comprise between 0 and 1.

Negative scores are not allowed in Function Score Query

Negative scores in the Function Score Query are deprecated in 6.x, and are not allowed in this version. If a negative score is produced as a result of computation (e.g. in script_score or field_value_factor functions), an error will be thrown.

The filter context has been removed

The filter context has been removed from Elasticsearch’s query builders, the distinction between queries and filters is now decided in Lucene depending on whether queries need to access score or not. As a result bool queries with should clauses that don’t need to access the score will no longer set their minimum_should_match to 1. This behavior has been deprecated in the previous major version.

hits.total is now an object in the search response

The total hits that match the search request is now returned as an object with a value and a relation. value indicates the number of hits that match and relation indicates whether the value is accurate (eq) or a lower bound (gte):

{
    "hits": {
        "total": {
            "value": 1000,
            "relation": "eq"
        },
        ...
    }
}

The total object in the response indicates that the query matches exactly 1000 documents ("eq"). The value is always accurate ("relation": "eq") when track_total_hits is set to true in the request. You can also retrieve hits.total as a number in the rest response by adding rest_total_hits_as_int=true in the request parameter of the search request. This parameter has been added to ease the transition to the new format and will be removed in the next major version (8.0).

hits.total is omitted in the response if track_total_hits is disabled (false)

If track_total_hits is set to false in the search request the search response will set hits.total to null and the object will not be displayed in the rest layer. You can add rest_total_hits_as_int=true in the search request parameters to get the old format back ("total": -1).

track_total_hits defaults to 10,000

By default search request will count the total hits accurately up to 10,000 documents. If the total number of hits that match the query is greater than this value, the response will indicate that the returned value is a lower bound:

{
     "_shards": ...
     "timed_out": false,
     "took": 100,
     "hits": {
         "max_score": 1.0,
         "total" : {
             "value": 10000,    (1)
             "relation": "gte"  (2)
         },
         "hits": ...
     }
}
  1. There are at least 10000 documents that match the query

  2. This is a lower bound ("gte").

You can force the count to always be accurate by setting "track_total_hits to true explicitly in the search request.

Limitations on Similarities

Lucene 8 introduced more constraints on similarities, in particular:

  • scores must not be negative,

  • scores must not decrease when term freq increases,

  • scores must not increase when norm (interpreted as an unsigned long) increases.

Weights in Function Score must be positive

Negative weight parameters in the function_score are no longer allowed.

Query string and Simple query string limit expansion of fields to 1024

The number of automatically expanded fields for the "all fields" mode ("default_field": "*") for the query_string and simple_query_string queries is now 1024 fields.