-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove includes and excludes from _source #10814
Conversation
LGTM |
f35f01a
to
d004181
Compare
Regardless of the outcome of elastic#8142, we should at least enforce that when _source is enabled, it is sufficient to reindex. This change removes the excludes and includes settings, since these modify the source, causing us to lose the ability to reindex some fields. closes elastic#10814
Regardless of the outcome of elastic#8142, we should at least enforce that when _source is enabled, it is sufficient to reindex. This change removes the excludes and includes settings, since these modify the source, causing us to lose the ability to reindex some fields. closes elastic#10814
d004181
to
bf09e58
Compare
I also added migration docs and removed docs in master. Will add deprecation in 1.x as well and link here. |
This is a big no-go for me and would cause big headaches to me. I have excluded a lot of data from the original source field (which is still duplicate information, but cannot be easily modelled with copy_to). Reindexing is not an issue to me, I will always be able to regenerate the data from the external source or even _source. So big -1, sorry. This is a desaster. This now slows down searching, because it transfers a large amount of useless data. |
@uschindler we have plans to support disabling |
Hi @clintongormley ! In general I use the _source field and I am happy with it, so I would never disable it completely, so I am just complaining about removing the functionality to filter parts during indexing. In general there are 2 user types:
In our case its the second type of user. So theoretically I could disable the _source field completely, but it is still good for easily fetching the information to show search results. Because of this, it is perfectly fine to only store the JSON information in the index thats needed for displaying search results and some statistics,... Basically the Idea behind Lucene is to have an index where you index documents and only store the information in the index to show the document results. With this change (unable to disable _source or unable to filter _source) this is no longer possible. The performance improvements are fine for stored fields, but if you have a lot of textual data solely there for searching (which never get retrieved) - partly up to several megabytes of text from PDF files, xml documents,... the additional overhead to extract the required fields is immense. By changing to CBOR encoding of the _source field, this got better, but loading the whole stuff is still an overhead, if you just show like 2 kilobytes of thumbnail information. Instead you have to load 3 megabytes of CBOR data from the source field and let Elasticsearch filter it during returning search results. One additional thing that happens on our side (as described before): We sometimes have to redundantly add the same data to the _source document several times, because a simple copy_to is not always easily possible. If copy_to would allow groovy scripts to dynamically copy stuff to other fields, I could live with that. E.g. one thing is to build suggestion fields or combining several fields with a special formatting, e.g. for Facets. In fact the data is redundant and could completely be regenerated also from the "filtered" source. I agree with the problems that happen with disabled _source or applied filtering on _source when people want to do some things like updating or reindexing, but there should be done 2 things:
|
@uschindler thanks for the reply. I'm going to open a separate issue to discuss alternative solutions. Just one comment in the meantime:
Have you seen the See http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-transform.html |
@uschindler ticket opened here: #11116 |
Hi Clinton, thanks for opening this issue!
I know this functionality and I already thought about using it. The common use-case is to prepare the "suggest" field, which needs a different processing than all the other fields. This is for example here currently done in the preprocessing of the XML data using XSL... I could rewrite that using groovy! My general problem is, as said on the new issue: I just want to keep _source, but decide on myself what I really want to store in the index, because I index tons of data, but only want to display very small snippets. |
Regardless of the outcome of #8142, we should at least enforce that
when _source is enabled, it is sufficient to reindex. This change
removes the excludes and includes settings, since these modify
the source, causing us to lose the ability to reindex some fields.
Note: I also renamed one of the UpdateMappingTests so the class names are unique.