Alternatives to disabling or filtering the _source field at index time #11116

Closed
clintongormley opened this issue May 12, 2015 · 12 comments · Fixed by #11171
Assignees
Labels
Meta :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@clintongormley
Contributor

In #10915 we removed the ability to disable the _source field, and in #10814 we removed the ability to use includes and excludes to remove selected fields from the _source field that is stored with each document.

The reason for this is that a number of important existing and future features rely on having the complete original _source field available in Elasticsearch, such as:

  • the update API
  • on-the-fly highlighting
  • reindexing (either to change mappings/analysis or to upgrade an index across major versions)
  • automated repair of index corruption
  • the ability to debug problems by viewing the original source used for indexing

In our experience, many new users disable _source just to save disk space, or because it seems like a nice optimisation. Almost all of them later regret it, and find themselves unable to move forward because rebuilding the index from the original data store is too costly.

Instead, we have the ability to:

The above changes are good for the most common use cases, probably 90% of our user base. However, there are two use cases in particular where controlling how or whether the source is indexed would be beneficial to the more expert user:

No source needed

High volume indexing of documents used almost exclusively for analytics. The source field is not required in the search results, indices can be rebuilt from fast primary data stores, and minimising disk usage and maximising write performance both matter. In this case, we can provide an index setting that completely disables storage of the _source field, along with all of the benefits that come with having the original source.

Why an index setting?

Previously, users could do this just by setting _source.enabled: false, so why switch this to an index setting? Doing this in the mapping was too convenient, so users who didn't understand the consequences used the option and ended up suffering for it. By making it an index setting, it (1) invalidates the behaviour that has been recommended in blog posts, making users go back to read the documentation and (2) allows us to use a scary enough name (with accompanying docs) that will make users think twice.

Reading a large _source is slow and unnecessary

Users who are indexing a large field (like the contents of a PDF) plus several small fields (eg title, creation date, tags, etc) are likely to want to return just the small fields plus highlighted snippets. However, returning just the title field necessitates reading (and then filtering out) the large contents field as well.

Previously, users used the _source.includes and _source.excludes filters to remove these large fields from the _source, but as a consequence, this disabled all of the features mentioned above. As an alternative, the user can still disable the _source field and set individual fields to store: true.

It would be nice to do better though: to keep the original _source but make search responses requiring just a few fields faster than they are today. Two proposals:

Add a _response_source field

The original _source would still be stored, but the _response_source would be a second stored field with a filtered list of fields (behaving like the old includes/excludes). The user could choose which field should be returned with their search requests. Compression would minimise the amount of extra storage required because the fields in the _response_source would be a subset of those in the _source.
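To make the _response_source idea concrete, here is a minimal sketch of deriving the filtered copy at index time while leaving the original untouched. The ResponseSourceFilter class and its filter method are hypothetical names for illustration only, and the matching is simplified to exact top-level field names rather than the patterns the old includes/excludes accepted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of Elasticsearch: builds a filtered copy of a document. */
final class ResponseSourceFilter {
    /**
     * Returns a new map holding only the top-level fields that pass the filters.
     * An empty includes set means "include everything"; excludes always win.
     * The original source map is not modified, so it can still be stored in full.
     */
    static Map<String, Object> filter(Map<String, Object> source,
                                      Set<String> includes,
                                      Set<String> excludes) {
        Map<String, Object> responseSource = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            String field = entry.getKey();
            boolean included = includes.isEmpty() || includes.contains(field);
            if (included && !excludes.contains(field)) {
                responseSource.put(field, entry.getValue());
            }
        }
        return responseSource;
    }
}
```

Under this sketch both maps would be stored: the full one for update, reindex and highlighting, and the filtered one for cheap search responses.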

Store top-level fields as separate stored fields

As suggested in #9034, the _source field would be stored as separate stored fields, one for each top-level field in the JSON document. This would allow Elasticsearch to efficiently skip over filtered-out fields and return just the required subset, yet it preserves the original JSON so that values such as [1,null,1] or [] etc can be returned correctly.

An advantage of this solution is that the decision about which fields to return is made at query time, while the _response_source option is set at index time.

This also opens up the possibility to enable more efficient compression techniques for individual fields, depending on the type of data contained in each field.
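A rough sketch of the per-field storage idea follows. It assumes the document has already been split into top-level fields by a JSON parser, and that field names need no escaping. SplitSource and its methods are hypothetical names, and the real stored-fields format is not modelled here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical per-document store: one stored entry per top-level JSON field. */
final class SplitSource {
    // Field name -> raw JSON text of its value, kept verbatim so that values
    // such as [1,null,1] or [] are returned exactly as they were indexed.
    private final Map<String, String> rawByField = new LinkedHashMap<>();

    void put(String field, String rawJsonValue) {
        rawByField.put(field, rawJsonValue);
    }

    /** Rebuilds a JSON object from the requested fields only; other entries are never read. */
    String fetch(List<String> fields) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (String field : fields) {
            String raw = rawByField.get(field);
            if (raw != null) {
                json.add("\"" + field + "\":" + raw);
            }
        }
        return json.toString();
    }

    /** Rebuilds the complete original object, in original field order. */
    String fetchAll() {
        return fetch(List.copyOf(rawByField.keySet()));
    }
}
```

Because the choice of fields is an argument to fetch, the decision happens at read time, which mirrors the query-time advantage described above.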

Thoughts?

@clintongormley clintongormley added discuss :Search Foundations/Mapping Index mappings, including merging and defining field types Meta labels May 12, 2015
@uschindler
Contributor

I like the discussion here, thanks for raising it.

In general I still think that it should be up to the user whether to disable the _source field or filter it. The problem was just that there was a lot of documentation around that suggested doing this (the same kind of wrong suggestions as those tons of blog posts telling Apache Solr users to commit and optimize the whole index after each document insert...).

I would suggest to still allow disabling or filtering the _source field, but then automatically all those services like the document update API or reindexing just throw UnsupportedOperationException.

I am perfectly fine with that. I don't rely on Elasticsearch as a primary data store. I just index all documents and can reindex the stuff without the need for _source documents. I just need to store the search result snippet - and I am fine. This is why I want to filter _source. This is just the classical use-case for lucene: Index large documents and store just the snippet in the index. This is the classical full text search engine use-case. If this one is no longer supported, sorry: Elasticsearch is no longer an option for this type of stuff.

@honzakral
Contributor

I generally don't like the "protect users from doing the wrong thing" approach. To me these issues should be solved by documentation, and we should not limit the options users have unless using them creates security/stability issues (like minimum_master_nodes).

By changing _source.enabled into an index setting we take away a lot of flexibility - for example we cannot have parent/child documents with different settings (since those have to be in the same index) or generally have different types with different settings in the same index - something that is very useful in many cases (think users, blog posts and comments where index ).

Same goes for _source filtering - it can be very useful if we have, for example, a large text that we wish to search but not display/highlight. It decreases the space taken on disk and speeds up the search itself (no need to retrieve, parse and discard this blob). Imagine a use case of people indexing documents (as in word/pdf/... a fairly common use case) - they might want to store the metadata in ES, but there is no need to store the text extracted from some binary file. If such a document were to be reindexed, it would most likely be from the binary source again.

I understand some of the complications and the desire to limit code paths. My compromise suggestion would be to always keep _source, but allow for the index-time filtering essentially allowing people to say: "_source": {"exclude": "*"}. It can be clearly documented as to its effects ("no reindex, update, highlighting" in bold, friendly letters) and should still allow us to remove a bunch of functionality while keeping the flexibility people like about elasticsearch.

@uschindler
Contributor

I would suggest to still allow disabling or filtering the _source field, but then automatically all those services like the document update API or reindexing just throw UnsupportedOperationException.

This should be possible to implement quite easily by a simple check on the mapping like: if (mapping.isSourceDisabled() || mapping.isSourceFiltered()) throw new UnsupportedOperationException(...)

This is in my opinion the easiest to do. And of course in the documentation clearly state that those options disable all those services that rely on full _source field availability.
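Spelled out a little further, the check could look like the sketch below. SourceMapping is a hypothetical stand-in class, and isSourceDisabled() / isSourceFiltered() are simply the names used in the comment above, not confirmed Elasticsearch API.

```java
import java.util.List;

/** Hypothetical stand-in for the _source mapping; not Elasticsearch's real class. */
final class SourceMapping {
    private final boolean enabled;
    private final List<String> includes;
    private final List<String> excludes;

    SourceMapping(boolean enabled, List<String> includes, List<String> excludes) {
        this.enabled = enabled;
        this.includes = List.copyOf(includes);
        this.excludes = List.copyOf(excludes);
    }

    boolean isSourceDisabled() {
        return !enabled;
    }

    boolean isSourceFiltered() {
        return !includes.isEmpty() || !excludes.isEmpty();
    }

    /** Fails early in any feature (update, reindex, ...) that needs the full original _source. */
    void requireCompleteSource(String feature) {
        if (isSourceDisabled() || isSourceFiltered()) {
            throw new UnsupportedOperationException(
                feature + " requires the complete _source, but it is disabled or filtered");
        }
    }
}
```

Each feature that depends on the full _source would call requireCompleteSource before doing any work, so the failure is immediate rather than silent.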

@nik9000
Member

nik9000 commented May 12, 2015

Reading a large _source is slow and unnecessary

We haven't hit this, btw. We do 130 million prefix searches a day (edge ngram query, not completion suggester) that just return a handful of small fields. We load them from _source without trouble. I tried it with stored fields a year ago and didn't see any improvement. I'm sure there is a breaking point, but it's a couple of times bigger than the average size of wikipedia pages. My instinct is that it's once you start talking about single-digit MB fields.

It'd be nice to keep those large fields out of the working set but that's a harder thing to measure.

@uschindler
Contributor

In the PDF case this is often the issue. I have some documents with like 80 MiB of text (I know this is a problem in itself), but we just have them there to allow actually finding them, and the score of those hits is quite small (forcefully boosted down). I would never ever load those fields from source, so it is for sure an issue to store this completely useless information. As a user I want to have the power to prevent them from going as plain text into a field, sorry. I am using Elasticsearch not as a database or data store, just as a search engine. PERIOD.

As said on the other issue: One important thing is to store the _source field as CBOR instead of JSON, which improves things a lot if you are scanning the whole index.

The other issue: I just don't want to exhaust my I/O cache just because I load 80 MiB of data for nonsense, just to display the title of a document.

@nik9000
Member

nik9000 commented May 12, 2015

The other issue: I just don't want to exhaust my I/O cache just because I load 80 MiB of data for nonsense, just to display the title of a document.

80mb will do that, yeah.

In #9034 @jpountz describes smooshing the _source into multiple stored fields so you only load what you need. It's still pretty useless dropping the 80mb document onto the disk but at least you don't blow out the cache. That feels like it'd be good enough for me.

I like the simplicity of that proposal because it feels like you really could hide stored fields behind a _source abstraction.

@clintongormley
Contributor Author

For 2.0, we're going to:

  • add back the ability to disable storing source
  • add back the includes and excludes parameters
  • allow these settings to be set per-type
  • only allow these settings to be set when creating a type (currently they are dynamically updatable)
  • if enabled is false, or includes/excludes is set, throw an unsupported exception when trying to use features which require source.
  • add better documentation so that the user understands the cost of disabling source
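As an illustration of the "only when creating a type" point in the list above, the following sketch rejects any later mapping update that would change the _source settings. SourceSettings, conflictsWith and checkMerge are hypothetical names, and the exception type is an assumption rather than what Elasticsearch actually throws.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-type _source settings, fixed when the type is created. */
final class SourceSettings {
    final boolean enabled;
    final List<String> includes;
    final List<String> excludes;

    SourceSettings(boolean enabled, List<String> includes, List<String> excludes) {
        this.enabled = enabled;
        this.includes = List.copyOf(includes);
        this.excludes = List.copyOf(excludes);
    }

    /** Lists every setting that an attempted mapping update would change. */
    List<String> conflictsWith(SourceSettings update) {
        List<String> conflicts = new ArrayList<>();
        if (enabled != update.enabled) {
            conflicts.add("_source.enabled");
        }
        if (!includes.equals(update.includes)) {
            conflicts.add("_source.includes");
        }
        if (!excludes.equals(update.excludes)) {
            conflicts.add("_source.excludes");
        }
        return conflicts;
    }

    /** Rejects the update if it touches any of the fixed settings. */
    void checkMerge(SourceSettings update) {
        List<String> conflicts = conflictsWith(update);
        if (!conflicts.isEmpty()) {
            throw new IllegalArgumentException(
                "cannot update " + conflicts + " after the type has been created");
        }
    }
}
```

An update that repeats the existing settings passes unchanged, so resubmitting the same mapping stays harmless.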

@uschindler
Contributor

I agree with @honzakral because I also hate the "protect users from doing the wrong thing" approach. Documentation is the right approach. And failing early! If I disable or filter my source field because I don't ever want to load the data or reindex my stuff, it's my own decision.

It is just important to:

  • document this
  • throw UnsupportedOperationException if you try to do something that requires the full _source. It was indeed bad to silently do the wrong thing. Just be clear and fail early if you try to use the update document or reindexing API or whatever API. If you do this, developers will get an error and will think twice before they disable or filter the _source field. But those who really want to do this still have the possibility.

There is nothing more from my side. I just want to have this simple possibility to decide on my own and have the full flexibility. The code is self-contained and was not even removed in the "disable source filtering" patch (#10814). The whole source filtering code is still there!!! It was just "forbidden" to use. The only actual code change next to cleanups was a single if-statement that disallowed the use of "include/exclude" in the mapping. I want this one to be reverted.

@uschindler
Contributor

@clintongormley looks like a nice proposal!

@karmi
Contributor

karmi commented May 12, 2015

Doing this in the mapping was too convenient, so users who didn't understand the consequences used the option and ended up suffering for it.

I agree with @honzakral that issues like that should be solved by documentation and instructions, not by the implementation.

However, if there's a general consensus to move this configuration from index mappings to settings, I'm OK with that. Let's just keep the configuration option available to users.

(Regarding filtering _source, the _response_source approach is something which doesn't feel right to me... The second option is better.)

rjernst added a commit to rjernst/elasticsearch that referenced this issue May 14, 2015
This adds back the ability to disable _source, as well as set includes
and excludes. However, it also restricts these settings to not be
updateable. enabled was actually already not modifiable, but no
conflict was previously given if an attempt was made to change it.

This also adds a check that can be made on the source mapper to
know if the source is "complete" and can be used for
purposes other than returning in search or get requests. There is
one example use here in highlighting, but more need to be added
in a follow up issue (eg in the update API).

closes elastic#11116
@rjernst rjernst self-assigned this May 14, 2015
@rjernst rjernst removed the discuss label May 14, 2015
@blakeparker

It would be very helpful to also add back the ability to disable compression as well. We have our ES nodes virtualized and stored on a Pure Storage array that does inline dedup and compression, however the ES compression completely breaks its dedup features.

@jpountz
Contributor

jpountz commented Aug 19, 2015

The ability to disable compression won't come back. By the way it only applied to a part of the index (stored fields and term vectors), while it has never been possible to disable compression on the terms dictionary, postings, position data, doc values, etc. According to the documentation, PureStorage does deduplication of blocks of 512 bytes and uses LZO for compression so I don't think that disabling compression in elasticsearch would bring any significant gains on realistic data when running on PureStorage.

@javanna javanna added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 16, 2024
9 participants