The terms dictionary is the part of the inverted index that records all terms that occur within a segment in sorted order. In order to provide fast retrieval, terms dictionaries come with a small terms index that allows for efficient random access by term. Until now this terms index had always been loaded on-heap.
As of 7.0, the terms index is loaded on-heap only for fields that have exclusively
unique values, such as the `_id` field, and off-heap otherwise, which is likely
the case for most other fields. This is expected to reduce memory requirements
but might slow down search requests if both of the following conditions are met:
- The size of the data directory on each node is significantly larger than the
  amount of memory that is available to the filesystem cache.
- The number of matches of the query is not several orders of magnitude greater
  than the number of terms that the query tries to match, either explicitly via
  `term` or `terms` queries, or implicitly via multi-term queries such as
  `prefix`, `wildcard` or `fuzzy` queries.
This change affects both existing indices created with Elasticsearch 6.x and new indices created with Elasticsearch 7.x.
- The default value for the `transpositions` parameter of the `fuzzy` query has
  been changed to `true`.
- The `query_string` options `use_dismax`, `split_on_whitespace`, `all_fields`,
  `locale`, `auto_generate_phrase_query` and `lowercase_expanded_terms`,
  deprecated in 6.x, have been removed.
- Purely negative queries (only MUST_NOT clauses) now return a score of `0`
  rather than `1`.
- The boundary specified using geohashes in the `geo_bounding_box` query now
  includes the entire geohash cell, instead of just the geohash center.
- Attempts to generate multi-term phrase queries against non-text fields with a
  custom analyzer will now throw an exception.
- An `envelope` crossing the dateline in a `geo_shape` query is now processed
  correctly when specified using the REST API, instead of having its left and
  right corners flipped.
- Attempts to set `boost` on inner span queries will now throw a parsing
  exception.
Adaptive replica selection has been enabled by default. If you wish to return to
the older round-robin routing of search requests, you can use the
`cluster.routing.use_adaptive_replica_selection` setting:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.use_adaptive_replica_selection": false
}
}
The Search API now returns `400 - Bad request` where it would previously return
`500 - Internal Server Error` in the following cases of invalid request:
- the result window is too large
- sort is used in combination with rescore
- the rescore window is too large
- the number of slices is too large
- keep alive for scroll is too large
- number of filters in the adjacency matrix aggregation is too large
- script compilation errors
Setting `request_cache:true` on a query that creates a scroll (`scroll=1m`)
has been deprecated in 6 and will now return a `400 - Bad request`.
Scroll queries are not meant to be cached.
Including a rescore clause on a query that creates a scroll (`scroll=1m`) has
been deprecated in 6.5 and will now return a `400 - Bad request`. Allowing
rescore on scroll queries would break the scroll sort. In the 6.x line, the
rescore clause was silently ignored (for scroll queries), and it was allowed in
the 5.x line.
The following string distance algorithms were given additional names in 6.2 and
their existing names were deprecated. The deprecated names have now been removed.
- `levenstein` - replaced by `levenshtein`
- `jarowinkler` - replaced by `jaro_winkler`
The `popular` mode for Suggesters (`term` and `phrase`) now uses the doc frequency
(instead of the sum of the doc frequency) of the input terms to compute the frequency
threshold for candidate suggestions.
Executing a Terms Query with a lot of terms may degrade the cluster performance,
as each additional term demands extra processing and memory.
To safeguard against this, the maximum number of terms that can be used in a
Terms Query request has been limited to 65536. This default maximum can be changed
for a particular index with the index setting `index.max_terms_count`.
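As a sketch of how this limit could be raised, assuming a hypothetical index named
`my-index` (the setting is a dynamic index setting, so it can be updated on a live index):
PUT /my-index/_settings
{
  "index.max_terms_count": 100000
}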
Executing a Regexp Query with a long regex string may degrade search performance.
To safeguard against this, the maximum length of regex that can be used in a
Regexp Query request has been limited to 1000. This default maximum can be changed
for a particular index with the index setting `index.max_regex_length`.
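Analogously to the terms limit above, a minimal sketch of adjusting this limit for a
hypothetical index `my-index`:
PUT /my-index/_settings
{
  "index.max_regex_length": 2000
}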
Executing queries that use automatic expansion of fields (e.g. `query_string`,
`simple_query_string` or `multi_match`) can have performance issues for indices
with a large number of fields.
To safeguard against this, a default limit of 1024 fields has been introduced for
queries using the "all fields" mode (`"default_field": "*"`) or other fieldname
expansions (e.g. `"foo*"`). If needed, you can change this limit using the
`indices.query.bool.max_clause_count` static cluster setting.
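For illustration, a request of the following shape, which expands across all fields of a
hypothetical index `my-index`, is the kind of query this limit applies to:
GET /my-index/_search
{
  "query": {
    "query_string": {
      "query": "quick brown fox",
      "default_field": "*"
    }
  }
}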
Search requests with extra content after the main object will no longer be accepted
by the `_search` endpoint. A parsing exception will be thrown instead.
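As an illustrative sketch (field names are made up), a request body with trailing content
after the main JSON object, such as the following, is now rejected with a parsing exception:
GET /_search
{
  "query": { "match": { "title": "fox" } }
}
{ "ignored": "this trailing object is no longer accepted" }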
The format of doc-value fields is changing to be the same as what could be
obtained in 6.x with the special `use_field_mapping` format. This is mostly a
change for date fields, which are now formatted based on the format that is
configured in the mappings by default. This behavior can be changed by
specifying a `format` within the doc-value field.
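A minimal sketch of overriding the format for a doc-value field, assuming a hypothetical
date field named `created_at` in an index `my-index`:
GET /my-index/_search
{
  "docvalue_fields": [
    { "field": "created_at", "format": "epoch_millis" }
  ],
  "query": { "match_all": {} }
}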
The ability to query and index context enabled suggestions without context, deprecated in 6.x, has been removed. Context enabled suggestion queries without contexts have to visit every suggestion, which degrades the search performance considerably.
For geo context, the value of the `path` parameter is now validated against the mapping,
and the context is only accepted if `path` points to a field with the `geo_point` type.
`max_concurrent_shard_requests` used to limit the total number of concurrent shard
requests a single high-level search request can execute. In 7.0 this changed to be the
maximum number of concurrent shard requests per node. The default is now `5`.
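As a rough sketch, the limit is passed per search request as a query-string parameter
(the index name here is hypothetical):
GET /my-index/_search?max_concurrent_shard_requests=3
{
  "query": { "match_all": {} }
}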
`max_score` used to be set to `0` whenever scores are not tracked. `null` is now used
instead, which is a more appropriate value for a scenario where scores are not available.
Setting a negative `boost` for a query or a field, deprecated in 6.x, is not allowed in this version.
To deboost a specific query or field you can use a `boost` value between 0 and 1.
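For example, a sketch of deboosting a single match clause (the index and field names are
hypothetical):
GET /my-index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "quick brown fox",
        "boost": 0.5
      }
    }
  }
}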
Negative scores in the Function Score Query are deprecated in 6.x, and are
not allowed in this version. If a negative score is produced as a result
of computation (e.g. in `script_score` or `field_value_factor` functions),
an error will be thrown.
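One way to avoid this error is to clamp the computed score at zero inside the script
itself; a minimal sketch, assuming a hypothetical numeric field `popularity`:
GET /my-index/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "functions": [
        {
          "script_score": {
            "script": {
              "source": "Math.max(0, doc['popularity'].value - 10)"
            }
          }
        }
      ]
    }
  }
}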
The `filter` context has been removed from Elasticsearch’s query builders;
the distinction between queries and filters is now decided in Lucene depending
on whether queries need to access the score or not. As a result, `bool` queries with
`should` clauses that don’t need to access the score will no longer set their
`minimum_should_match` to 1. This behavior has been deprecated in the previous
major version.
The total hits that match the search request is now returned as an object
with a `value` and a `relation`. `value` indicates the number of hits that
match and `relation` indicates whether the value is accurate (`eq`) or a lower bound
(`gte`):
{
"hits": {
"total": {
"value": 1000,
"relation": "eq"
},
...
}
}
The `total` object in the response indicates that the query matches exactly 1000
documents (`"eq"`). The `value` is always accurate (`"relation": "eq"`) when
`track_total_hits` is set to `true` in the request.
You can also retrieve `hits.total` as a number in the rest response by adding
`rest_total_hits_as_int=true` in the request parameters of the search request.
This parameter has been added to ease the transition to the new format and
will be removed in the next major version (8.0).
If `track_total_hits` is set to `false` in the search request, the search response
will set `hits.total` to `null` and the object will not be displayed in the rest
layer. You can add `rest_total_hits_as_int=true` in the search request parameters
to get the old format back (`"total": -1`).
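A minimal sketch combining both options, assuming a hypothetical index `my-index`: the body
disables total hit counting, while the query parameter restores the old flat format in the
response:
GET /my-index/_search?rest_total_hits_as_int=true
{
  "track_total_hits": false,
  "query": { "match_all": {} }
}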
By default, search requests will count the total hits accurately up to 10,000
documents. If the total number of hits that match the query is greater than this
value, the response will indicate that the returned value is a lower bound:
{
"_shards": ...
"timed_out": false,
"took": 100,
"hits": {
"max_score": 1.0,
"total" : {
"value": 10000, (1)
"relation": "gte" (2)
},
"hits": ...
}
}
1. There are at least 10000 documents that match the query.
2. This is a lower bound (`"gte"`).
You can force the count to always be accurate by setting `track_total_hits`
to `true` explicitly in the search request.
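For example, a sketch of a request that forces an exact count regardless of how many
documents match (the index name is hypothetical):
GET /my-index/_search
{
  "track_total_hits": true,
  "query": { "match_all": {} }
}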
Lucene 8 introduced more constraints on similarities, in particular:
- scores must not be negative,
- scores must not decrease when term freq increases,
- scores must not increase when norm (interpreted as an unsigned long) increases.
Negative `weight` parameters in the `function_score` are no longer allowed.
The number of automatically expanded fields for the "all fields" mode
(`"default_field": "*"`) for the `query_string` and `simple_query_string`
queries is now 1024 fields.