Alternatives to disabling or filtering the _source field at index time
#11116
Comments
I like the discussion here; thanks for raising it. In general I still think that it should be up to the user whether to disable the _source field or filter it. The problem arose only because a lot of documentation suggested doing this (the same kind of wrong suggestion as those tons of blog posts telling Apache Solr users to commit and optimize the whole index after each document insert...). I would suggest still allowing users to disable or filter the _source field, but then all the services that depend on it, like the document update API or reindexing, automatically throw UnsupportedOperationException. I am perfectly fine with that. I don't rely on Elasticsearch as a primary data store. I just index all documents and can reindex the stuff without the need for _source documents. I just need to store the search result snippet - and I am fine. This is why I want to filter _source. This is just the classical use-case for Lucene: index large documents and store just the snippet in the index. This is the classical full text search engine use-case. If this one is no longer supported, sorry: Elasticsearch is no longer an option for this type of stuff. |
I generally don't like the "protect users from doing the wrong thing" approach. To me these issues should be solved by documentation, but should not limit the options users have unless usage of these provide security/stability issues (like To change Same goes for I understand some of the complications and the desire to limit code paths. My compromise suggestion would be to always keep |
This should be quite easy to implement with a simple check on the mapping, like: if (mapping.isSourceDisabled() || mapping.isSourceFiltered()) throw new UnsupportedOperationException(...). In my opinion this is the easiest thing to do. And of course the documentation should clearly state that those options disable all the services that rely on full _source field availability. |
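To make the suggested check concrete, here is a minimal, self-contained sketch. It is not Elasticsearch code: SourceMappingSketch, SourceGuardSketch and requireCompleteSource are hypothetical names invented for illustration, and only the isSourceDisabled() / isSourceFiltered() condition comes from the comment above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the _source part of a mapping (not Elasticsearch's real class).
final class SourceMappingSketch {
    private final boolean enabled;
    private final List<String> includes;
    private final List<String> excludes;

    SourceMappingSketch(boolean enabled, List<String> includes, List<String> excludes) {
        this.enabled = enabled;
        this.includes = includes;
        this.excludes = excludes;
    }

    boolean isSourceDisabled() {
        return !enabled;
    }

    boolean isSourceFiltered() {
        return !includes.isEmpty() || !excludes.isEmpty();
    }
}

final class SourceGuardSketch {
    // Fail early: refuse any operation that needs the complete original _source.
    static void requireCompleteSource(SourceMappingSketch mapping, String operation) {
        if (mapping.isSourceDisabled() || mapping.isSourceFiltered()) {
            throw new UnsupportedOperationException(
                operation + " requires the complete _source, which this mapping disables or filters");
        }
    }

    public static void main(String[] args) {
        SourceMappingSketch full = new SourceMappingSketch(
            true, Collections.<String>emptyList(), Collections.<String>emptyList());
        requireCompleteSource(full, "update"); // complete _source: passes silently

        SourceMappingSketch filtered = new SourceMappingSketch(
            true, Collections.<String>emptyList(), Arrays.asList("contents"));
        try {
            requireCompleteSource(filtered, "update");
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Each service that depends on the full _source (update, reindexing and so on) would call such a guard before doing any work, so the failure shows up at request time rather than as silently wrong results.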
We haven't hit this, btw. We do 130 million prefix searches a day (edge ngram query, not completion suggester) that just return a handful of small fields. We load them from _source without trouble. I tried it with stored fields a year ago and didn't see any improvement. I'm sure there is a breaking point, but it's a couple of times bigger than the average size of Wikipedia pages. My instinct is that it comes once you start talking about single-digit MB size fields. It'd be nice to keep those large fields out of the working set, but that's a harder thing to measure. |
In the PDF case this is often the issue. I have some documents with like 80 MiB of text (I know this is a problem in itself), but we just have them there to allow actually finding them, and the score of those hits is quite small (forcefully boosted down). I would never ever load those fields from source, so it is for sure an issue to store this completely useless information. As a user I want to have the power to prevent them from going into a stored field as plain text, sorry. I am using Elasticsearch not as a database or data store, just as a search engine. PERIOD. As said on the other issue: one important thing is to store the _source field as CBOR instead of JSON; this improves things a lot if you are scanning the whole index. The other issue: I just don't want to exhaust my I/O cache just because I load 80 MiB of data for nonsense, just to display the title of a document. |
80mb will do that, yeah. In #9034 @jpountz describes smooshing the _source into multiple stored fields so you only load what you need. It's still pretty useless dropping the 80mb document onto the disk, but at least you don't blow out the cache. That feels like it'd be good enough for me. I like the simplicity of that proposal because it feels like you really could hide stored fields behind a _source abstraction. |
For 2.0, we're going to:
|
I agree with @honzakral because I also hate the "protect users from doing the wrong thing" approach. Documentation is the right approach. And failing early! If I disable or filter my source field because I don't ever want to load the data or reindex my stuff, it's my own decision. It is just important to:
There is nothing more from my side. I just want to have this simple possibility to decide on my own and have the full flexibility. The code is self-contained and was not even removed in the "disable source filtering" patch (#10814). The whole source filtering code is still there!!! It was just "forbidden" to use. The only actual code change next to cleanups was a single if-statement that disallowed the use of "include/exclude" in the mapping. I want this one to be reverted. |
@clintongormley looks like a nice proposal! |
I agree with @honzakral that issues like that should be solved by documentation and instructions, not by the implementation. However, if there's a general consensus to move this configuration from index mappings to settings, I'm OK with that. Let's just keep the configuration option available to users. (Regarding filtering |
This adds back the ability to disable _source, as well as to set includes and excludes. However, it also restricts these settings so that they cannot be updated. enabled was actually already not modifiable, but no conflict was previously reported if an attempt was made to change it. This also adds a check that can be made on the source mapper to know whether the source is "complete" and can be used for purposes other than returning it in search or get requests. There is one example use here in highlighting, but more need to be added in a follow-up issue (eg in the update API). closes elastic#11116
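As a rough illustration of the two behaviours described above (non-updatable settings that report a conflict, and a "complete" check), here is a hedged sketch. SourceSettingsSketch, conflictsWith and isComplete are hypothetical names, and the conflict messages are invented; the actual change may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the _source settings in a mapping (not the real source mapper).
final class SourceSettingsSketch {
    final boolean enabled;
    final List<String> includes;
    final List<String> excludes;

    SourceSettingsSketch(boolean enabled, List<String> includes, List<String> excludes) {
        this.enabled = enabled;
        this.includes = includes;
        this.excludes = excludes;
    }

    // True only when the stored _source is the full original document.
    boolean isComplete() {
        return enabled && includes.isEmpty() && excludes.isEmpty();
    }

    // Compare against a proposed mapping update and report every setting that would change.
    // List.equals is order-sensitive, so reordered includes/excludes also count as a change here.
    List<String> conflictsWith(SourceSettingsSketch update) {
        List<String> conflicts = new ArrayList<>();
        if (enabled != update.enabled) {
            conflicts.add("cannot update [enabled] for [_source]");
        }
        if (!includes.equals(update.includes)) {
            conflicts.add("cannot update [includes] for [_source]");
        }
        if (!excludes.equals(update.excludes)) {
            conflicts.add("cannot update [excludes] for [_source]");
        }
        return conflicts;
    }

    public static void main(String[] args) {
        SourceSettingsSketch current = new SourceSettingsSketch(
            true, Collections.<String>emptyList(), Arrays.asList("contents"));
        SourceSettingsSketch update = new SourceSettingsSketch(
            false, Collections.<String>emptyList(), Arrays.asList("contents"));
        System.out.println("complete: " + current.isComplete());
        System.out.println("conflicts: " + current.conflictsWith(update));
    }
}
```

Returning a list of conflicts rather than throwing on the first one lets a mapping update report every rejected change at once.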
It would be very helpful to also add back the ability to disable compression. We have our ES nodes virtualized and stored on a Pure Storage array that does inline dedup and compression; however, the ES compression completely breaks its dedup features. |
The ability to disable compression won't come back. By the way it only applied to a part of the index (stored fields and term vectors), while it has never been possible to disable compression on the terms dictionary, postings, position data, doc values, etc. According to the documentation, PureStorage does deduplication of blocks of 512 bytes and uses LZO for compression so I don't think that disabling compression in elasticsearch would bring any significant gains on realistic data when running on PureStorage. |
In #10915 we removed the ability to disable the _source field, and in #10814 we removed the ability to use includes and excludes to remove selected fields from the _source field that is stored with each document.
The reason for this is that a number of important existing and future features rely on having the complete original _source field available in Elasticsearch, such as the update API.
In our experience, many new users disable _source just to save disk space, or because it seems like a nice optimisation. Almost all of them later regret it, finding themselves unable to move forward because rebuilding the index from the original data store is too costly.
Instead, we have the ability to:
- filter the _source field that is returned to the user (Added source fetching and filtering parameters to search, get, multi-get, get-source and explain requests #3302)
- use the best_compression option for indices (#8863)
- filter the response with the filter_path parameter (API: Add response filtering with filter_path parameter #10980)
The above changes are good for the most common use cases, probably 90% of our user base. However, there are two use cases in particular where controlling how or whether the source is indexed would be beneficial to the more expert user:
No source needed
High volume indexing of documents used almost exclusively for analytics. The source field is not required in the search results, indices can be rebuilt from fast primary data stores, and disk usage and write performance matter. In this case, we can provide an index setting to completely disable the storage of the _source field and all of the benefits that come with having the original source.
Why an index setting?
Previously, users could do this just by setting _source.enabled: false, so why switch this to an index setting? Doing this in the mapping was too convenient, so users who didn't understand the consequences used the option and ended up suffering for it. Making it an index setting (1) invalidates the behaviour that has been recommended in blog posts, making users go back to read the documentation, and (2) allows us to use a scary enough name (with accompanying docs) that will make users think twice.
Reading a large _source is slow and unnecessary
Users who are indexing a large field (like the contents of a PDF) plus several small fields (eg title, creation date, tags, etc) are likely to want to return just the small fields plus highlighted snippets. However, returning just the title field necessitates reading (and then filtering out) the large contents field as well.
Previously, users used the _source.includes and _source.excludes filters to remove these large fields from the _source, but as a consequence, this disables all of the features mentioned above. As an alternative, the user can still disable the _source field and set individual fields to store: true.
It would be nice to do better though: to keep the original _source but make search responses requiring just a few fields faster than they are today. Two proposals:
Add a _response_source field
The original _source would still be stored, but the _response_source would be a second stored field with a filtered list of fields (behaving like the old includes/excludes). The user could choose which field should be returned with their search requests. Compression would minimise the amount of extra storage required because the fields in the _response_source would be a subset of those in the _source.
Store top-level fields as separate stored fields
As suggested in #9034, the _source field would be stored as separate stored fields, one for each top-level field in the JSON document. This would allow Elasticsearch to efficiently skip over filtered out fields to return just the required subset, yet it preserves the original JSON so that values such as [1,null,1] or [] etc can be returned correctly.
An advantage of this solution is that the decision about which fields to return is made at query time, while the _response_source option is set at index time.
This also opens up the possibility to enable more efficient compression techniques for individual fields, depending on the type of data contained in each field.
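The second proposal can be pictured with a small sketch. This is only a toy model under stated assumptions: SplitSourceSketch and its methods are hypothetical names, an in-memory map stands in for Lucene stored fields, each value is assumed to be the already-serialised raw JSON of one top-level field, and field names are assumed to need no JSON escaping.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model: one "stored field" per top-level JSON field, holding that field's raw JSON text.
final class SplitSourceSketch {
    private final Map<String, String> storedFields = new LinkedHashMap<>();
    private int valuesRead = 0;

    // Index time: keep the raw JSON of each top-level field separately, unmodified.
    void store(String field, String rawJson) {
        storedFields.put(field, rawJson);
    }

    // Query time: rebuild a JSON object from only the requested top-level fields.
    String load(Collection<String> wanted) {
        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, String> entry : storedFields.entrySet()) {
            if (!wanted.contains(entry.getKey())) {
                continue; // skipped: the (possibly huge) value is never touched
            }
            valuesRead++;
            if (json.length() > 1) {
                json.append(',');
            }
            json.append('"').append(entry.getKey()).append("\":").append(entry.getValue());
        }
        return json.append('}').toString();
    }

    // Number of field values actually read so far, to show what was skipped.
    int valuesRead() {
        return valuesRead;
    }

    public static void main(String[] args) {
        SplitSourceSketch doc = new SplitSourceSketch();
        doc.store("title", "\"Report\"");
        doc.store("tags", "[1,null,1]");
        doc.store("contents", "\"imagine 80 MiB of extracted PDF text here\"");
        System.out.println(doc.load(Arrays.asList("title", "tags")));
        System.out.println("values read: " + doc.valuesRead());
    }
}
```

Because the raw JSON of each field is kept verbatim, values such as [1,null,1] survive unchanged, and the choice of fields is made per request rather than fixed at index time as with _response_source.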
Thoughts?