Alternatives to disabling or filtering the `_source` field at index time #11116

clintongormley · 2015-05-12T12:58:49Z

In #10915 we removed the ability to disable the _source field, and in #10814 we removed the ability to use includes and excludes to remove selected fields from the _source field that is stored with each document.

The reason for this is that a number of important existing and future features rely on having the complete original _source field available in Elasticsearch, such as:

the update API
on-the-fly highlighting
reindexing (either to change mappings/analysis or to upgrade an index over major versions)
automated repair of index corruption
the ability to debug problems by viewing the original source used for indexing

In our experience, many new users disable _source just to save disk space, or because it seems like a nice optimisation. Almost all of them later regret it and find themselves unable to move forward because rebuilding the index from the original data store is too costly.

Instead, we have the ability to:

filter the contents of the _source field that is returned to the user (Added source fetching and filtering parameters to search, get, multi-get, get-source and explain requests #3302)
enable a higher compression ratio (Add best_compression option for indices #8863)
filter down the entire search response with the path parameter (API: Add response filtering with filter_path parameter #10980)
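The three alternatives above can be sketched as request payloads. This is only an illustration: the index name `docs` and the field names `title`, `tags` and `content` are made up, and the exact spelling of the `_source` include/exclude keys may vary between Elasticsearch versions.

```python
import json
from urllib.parse import urlencode

# Query-time source filtering: ask only for the small fields.
# (Field names are hypothetical.)
search_body = {
    "query": {"match": {"content": "elasticsearch"}},
    "_source": {"includes": ["title", "tags"], "excludes": ["content"]},
}

# Response filtering: trim the response envelope with filter_path.
# (The index name "docs" is hypothetical.)
search_url = "/docs/_search?" + urlencode({"filter_path": "hits.hits._source"})

# Higher compression ratio, chosen when the index is created.
index_settings = {"settings": {"index": {"codec": "best_compression"}}}

print("POST", search_url)
print(json.dumps(search_body, indent=2))
print(json.dumps(index_settings, indent=2))
```

Note that all three act at query time or on storage encoding; none of them changes what is kept in the stored _source.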

The above changes are good for the most common use cases, probably 90% of our user base. However, there are two use cases in particular where controlling how or whether the source is indexed would be beneficial to the more expert user:

No source needed

High-volume indexing of documents used almost exclusively for analytics. The source field is not required in the search results, indices can be rebuilt from fast primary data stores, and minimising disk usage and maximising write performance matter. In this case, we can provide an index setting that completely disables the storage of the _source field, along with all of the benefits that come with having the original source.

Why an index setting?

Previously, users could do this just by setting _source.enabled: false, so why switch this to an index setting? Doing this in the mapping was too convenient, so users who didn't understand the consequences used the option and ended up suffering for it. By making it an index setting, it (1) invalidates the behaviour that has been recommended in blog posts, making users go back to read the documentation and (2) allows us to use a scary enough name (with accompanying docs) that will make users think twice.

Reading a large `_source` is slow and unnecessary

Users who are indexing a large field (like the contents of a PDF) plus several small fields (eg title, creation date, tags, etc) are likely to want to return just the small fields plus highlighted snippets. However, returning just the title field necessitates reading (and then filtering out) the large contents field as well.

Previously, users used the _source.includes and _source.excludes filters to remove these large fields from the _source, but as a consequence, this disables all of the features mentioned above. As an alternative, the user can still disable the _source field and set individual fields to store: true.

It would be nice to do better though: to keep the original _source but make search responses requiring just a few fields faster than they are today. Two proposals:

Add a _response_source field

The original _source would still be stored, but the _response_source would be a second stored field with a filtered list of fields (behaving like the old includes/excludes). The user could choose which field should be returned with their search requests. Compression would minimise the amount of extra storage required because the fields in the `_response_source` would be a subset of those in the `_source`.
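A rough sketch of what this proposal would store, assuming include/exclude patterns that match top-level field names with simple wildcards. `build_response_source` is a hypothetical helper written for illustration, not an Elasticsearch function, and the real includes/excludes also handled nested paths.

```python
from fnmatch import fnmatchcase


def build_response_source(source, includes=(), excludes=()):
    """Return the top-level fields kept by the include/exclude patterns."""
    def keep(name):
        # An empty include list means "everything is included".
        included = not includes or any(fnmatchcase(name, p) for p in includes)
        excluded = any(fnmatchcase(name, p) for p in excludes)
        return included and not excluded

    return {name: value for name, value in source.items() if keep(name)}


doc = {
    "title": "Annual report",
    "created": "2015-05-12",
    "tags": ["pdf"],
    "content": "long extracted text " * 1000,
}

# Both fields are stored: the original and the filtered copy.
stored = {
    "_source": doc,
    "_response_source": build_response_source(doc, excludes=["content"]),
}
print(sorted(stored["_response_source"]))
```

The filter runs once, at index time, which is why the choice of fields cannot vary per query under this proposal.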

Store top-level fields as separate stored fields

As suggested in #9034, the _source field would be stored as separate stored fields, one for each top-level field in the JSON document. This would allow Elasticsearch to efficiently skip over filtered out fields to return just the required subset, yet it preserves the original JSON so that values such as [1,null,1] or [] etc can be returned correctly.

An advantage of this solution is that the decision about which fields to return is made at query time, while the _response_source option is set at index time.

This also opens up the possibility to enable more efficient compression techniques for individual fields, depending on the type of data contained in each field.
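The idea can be illustrated with a toy model in which each top-level field is serialized on its own and only the requested ones are parsed. The helpers `split_source` and `load_fields` are hypothetical names for this sketch; the actual stored-field layout proposed in #9034 is not shown here.

```python
import json


def split_source(raw_json):
    """Store each top-level field as its own serialized JSON value."""
    doc = json.loads(raw_json)
    return {name: json.dumps(value) for name, value in doc.items()}


def load_fields(stored_fields, wanted):
    """Deserialize only the requested fields; the others are never parsed."""
    return {
        name: json.loads(stored_fields[name])
        for name in wanted
        if name in stored_fields
    }


raw = (
    '{"title": "Annual report", "scores": [1, null, 1], '
    '"tags": [], "content": "many megabytes of text"}'
)
stored_fields = split_source(raw)

# Query time: choose the subset per request, skipping the large field.
partial = load_fields(stored_fields, ["title", "scores", "tags"])

# The complete document can still be reassembled when it is needed.
full = load_fields(stored_fields, stored_fields.keys())
print(partial)
```

Because each value keeps its JSON form, arrays containing nulls and empty arrays survive the round trip, which is the property the proposal relies on.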

Thoughts?


uschindler · 2015-05-12T13:30:55Z

I like the discussion here, thanks for raising it.

In general I still think that it should be up to the user whether to disable the _source field or filter it. The problem with this was just that there was a lot of documentation around that suggested doing this (the same kind of wrong suggestion as those tons of blog posts telling Apache Solr users to commit and optimize the whole index after each document insert...).

I would suggest to still allow disabling or filtering the _source field, but then automatically all those services like the document update API or reindexing just throw UnsupportedOperationException.

I am perfectly fine with that. I don't rely on Elasticsearch as a primary data store. I just index all documents and can reindex the stuff without the need for _source documents. I just need to store the search result snippet - and I am fine. This is why I want to filter _source. This is just the classical use-case for Lucene: index large documents and store just the snippet in the index. This is the classical full text search engine use-case. If this one is no longer supported, sorry: Elasticsearch is no longer an option for this type of stuff.

honzakral · 2015-05-12T13:34:01Z

I generally don't like the "protect users from doing the wrong thing" approach. To me these issues should be solved by documentation, but should not limit the options users have unless usage of these causes security/stability issues (like minimum_master_nodes).

By changing _source.enabled into an index setting we take away a lot of flexibility - for example we cannot have parent/child documents with different settings (since those have to be in the same index) or generally have different types with different settings in the same index - something that is very useful in many cases (think users, blog posts and comments where index ).

Same goes for _source filtering - it can be very useful if we have, for example, a large text that we wish to search but not display/highlight. It decreases the space taken on disk and speeds up the search itself (no need to retrieve, parse and discard this blob). Imagine a use case of people indexing documents (as in word/pdf/... a fairly common use case) - they might want to store the metadata in ES, but there is no need to store the text extracted from some binary file. If such a document were to be reindexed, it would most likely be from the binary source again.

I understand some of the complications and the desire to limit code paths. My compromise suggestion would be to always keep _source, but allow for the index-time filtering essentially allowing people to say: "_source": {"exclude": "*"}. It can be clearly documented as to its effects ("no reindex, update, highlighting" in bold, friendly letters) and should still allow us to remove a bunch of functionality while keeping the flexibility people like about elasticsearch.

uschindler · 2015-05-12T13:38:26Z

> I would suggest to still allow disabling or filtering the _source field, but then automatically all those services like the document update API or reindexing just throw UnsupportedOperationException.

This should be possible to implement quite easily by a simple check on the mapping like: `if (mapping.isSourceDisabled() || mapping.isSourceFiltered()) throw new UnsupportedOperationException(...)`

This is in my opinion the easiest to do. And of course in the documentation clearly state that those options disable all those services that rely on full _source field availability.
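A minimal sketch of that fail-early check, written in Python rather than Java for brevity. `SourceMapping`, `SourceUnavailableError`, `require_complete_source` and `update_document` are all hypothetical names for illustration and do not correspond to classes in the Elasticsearch code base.

```python
from dataclasses import dataclass


class SourceUnavailableError(Exception):
    """Raised when a feature needs the complete _source but cannot get it."""


@dataclass(frozen=True)
class SourceMapping:
    enabled: bool = True
    includes: tuple = ()
    excludes: tuple = ()

    def is_complete(self):
        # The stored source is complete only if it is enabled and unfiltered.
        return self.enabled and not self.includes and not self.excludes


def require_complete_source(mapping, feature):
    """Fail early instead of silently working on a partial document."""
    if not mapping.is_complete():
        raise SourceUnavailableError(
            f"{feature} requires the complete _source, "
            "but it is disabled or filtered in the mapping"
        )


def update_document(mapping, doc, changes):
    """Toy stand-in for a feature that needs the full original document."""
    require_complete_source(mapping, "update API")
    return {**doc, **changes}


print(update_document(SourceMapping(), {"title": "a"}, {"tags": ["b"]}))
```

Calling `update_document` with `SourceMapping(enabled=False)` or with any includes/excludes raises the error, so the caller sees the problem at the first attempt rather than after data has been lost.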

nik9000 · 2015-05-12T14:17:09Z

> Reading a large _source is slow and unnecessary

We haven't hit this, btw. We do 130 million prefix searches a day (edge ngram query, not completion suggester) that just return a handful of small fields. We load them from _source without trouble. I tried it with stored fields a year ago and didn't see any improvement. I'm sure there is a breaking point, but it's a couple of times bigger than the average size of Wikipedia pages. My instinct is that it comes once you start talking about single digit MB size fields.

It'd be nice to keep those large fields out of the working set but that's a harder thing to measure.

uschindler · 2015-05-12T15:12:43Z

In the PDF case this is often the issue. I have some documents with like 80 MiB of text (I know this is a problem completely), but we just have them there to allow actually finding them, but the score of those hits is quite small (forcefully boosted down). I would never ever load those fields from source, so it is for sure an issue to store this completely useless information. As a user I want to have the power to prevent them from going as plain text into a field, sorry. I am using Elasticsearch not as a database or data store, just as a search engine. PERIOD.

As said on the other issue: one important thing is to store the _source field as CBOR instead of JSON; this improves things a lot if you are scanning the whole index.

The other issue: I just don't want to exhaust my I/O cache just because I load 80 MiB of data for nonsense, just to display the title of a document.

nik9000 · 2015-05-12T15:19:45Z

> The other issue: I just don't want to exhaust my I/O cache just because I load 80 MiB of data for nonsense, just to display the title of a document.

80mb will do that, yeah.

In #9034 @jpountz describes smooshing the _source into multiple stored fields so you only load what you need. It's still pretty useless dropping the 80mb document onto the disk but at least you don't blow out the cache. That feels like it'd be good enough for me.

I like the simplicity of that proposal because it feels like you really could hide stored fields behind a _source abstraction.

clintongormley · 2015-05-12T15:29:22Z

For 2.0, we're going to:

add back the ability to disable storing source
add back the includes and excludes parameters
allow these settings to be set per-type
only allow these settings to be set when creating a type (currently they are dynamically updatable)
if enabled is false, or includes/excludes is set, throw an unsupported exception when trying to use features which require source.
add better documentation so that the user understands the cost of disabling source
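The "set only when creating a type" rule in this plan could look roughly like the sketch below: each type carries its own `_source` settings, and an update that tries to change them is rejected. The type names, `MappingConflictError` and `check_source_update` are hypothetical and only illustrate the behaviour described in the list.

```python
DEFAULTS = {"enabled": True, "includes": [], "excludes": []}


class MappingConflictError(Exception):
    """Raised when an update tries to change a fixed _source setting."""


def check_source_update(existing, update):
    """Accept an update only if it leaves every _source setting unchanged."""
    current = {**DEFAULTS, **existing}
    for key in DEFAULTS:
        if key in update and update[key] != current[key]:
            raise MappingConflictError(
                f"_source.{key} cannot be changed from "
                f"{current[key]!r} to {update[key]!r}"
            )


# Per-type settings: two types in one index with different _source handling.
mappings = {
    "page": {"_source": {"excludes": ["content"]}},
    "comment": {"_source": {}},
}

# Restating the current value is accepted.
check_source_update(mappings["comment"]["_source"], {"enabled": True})

# Changing it after creation is a conflict.
try:
    check_source_update(mappings["comment"]["_source"], {"enabled": False})
except MappingConflictError as err:
    print("rejected:", err)
```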

uschindler · 2015-05-12T15:37:26Z

I agree with @honzakral because I also hate the "protect users from doing the wrong thing" approach. Documentation is the right approach. And failing early! If I disable or filter my source field because I don't ever want to load the data or reindex my stuff, it's my own decision.

It is just important to:

document this
throw UnsupportedOperationException if you try to do something that requires full _source. It was indeed bad to silently do the wrong thing. Just be clear and fail early if you try to use the update document or reindexing API or whatever API. If you do this, developers will get an error and will think twice before they disable or filter the _source field. But those who really want to do this still have the possibility.

There is nothing more from my side. I just want to have this simple possibility to decide on my own and have the full flexibility. The code is self-contained and was not even removed in the "disable source filtering" patch (#10814). The whole source filtering code is still there!!! It was just "forbidden" to use. The only actual code change next to cleanups was a single if-statement that disallowed the use of "include/exclude" in the mapping. I want this one to be reverted.

uschindler · 2015-05-12T15:38:03Z

@clintongormley looks like a nice proposal!

karmi · 2015-05-12T15:58:02Z

> Doing this in the mapping was too convenient, so users who didn't understand the consequences used the option and ended up suffering for it.

I agree with @honzakral that issues like that should be solved by documentation and instructions, not by the implementation.

However, if there's a general consensus to move this configuration from index mappings to settings, I'm OK with that. Let's just keep the configuration option available to users.

(Regarding filtering _source, the _response_source approach is something which doesn't feel right to me... The second option is better.)

This adds back the ability to disable _source, as well as set includes and excludes. However, it also restricts these settings to not be updateable. enabled was actually already not modifiable, but no conflict was previously given if an attempt was made to change it. This also adds a check that can be made on the source mapper to know if the source is "complete" and can be used for purposes other than returning in search or get requests. There is one example use here in highlighting, but more need to be added in a follow up issue (eg in the update API). closes elastic#11116

blakeparker · 2015-08-18T17:18:12Z

It would be very helpful to also add back the ability to disable compression as well. We have our ES nodes virtualized and stored on a Pure Storage array that does inline dedup and compression, however the ES compression completely breaks its dedup features.

jpountz · 2015-08-19T09:32:58Z

The ability to disable compression won't come back. By the way it only applied to a part of the index (stored fields and term vectors), while it has never been possible to disable compression on the terms dictionary, postings, position data, doc values, etc. According to the documentation, PureStorage does deduplication of blocks of 512 bytes and uses LZO for compression so I don't think that disabling compression in elasticsearch would bring any significant gains on realistic data when running on PureStorage.

clintongormley added the discuss, :Search Foundations/Mapping and Meta labels May 12, 2015

clintongormley mentioned this issue May 12, 2015

Remove includes and excludes from _source #10814

Merged

rjernst mentioned this issue May 14, 2015

Add back support for enabled/includes/excludes in _source field #11171

Merged

rjernst self-assigned this May 14, 2015

rjernst closed this as completed in #11171 May 14, 2015

rjernst removed the discuss label May 14, 2015

clintongormley mentioned this issue May 15, 2015

Partial_fields should not reorder document properties #11160

Closed

ppf2 mentioned this issue Jul 9, 2015

Add back documentation on disable _source #12141

Closed

javanna added the Team:Search Foundations label Jul 16, 2024
