-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow reusing indexed binary fields. #12053
Conversation
Today Lucene allows creating indexed binary fields, e.g. via `StringField(String, BytesRef, Field.Store)`, but not reusing them: calling `setBytesValue` on a `StringField` throws. This commit removes the check that prevents reusing fields with binary values. I considered an alternative that consisted of failing if calling `setBytesValue` on a field that is indexed and tokenized, but we currently don't have such checks e.g. on numeric values, so it did not feel consistent. Doing this change would help improve the [nightly benchmarks for the NYC taxis dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html) by doing the String -> UTF-8 conversion only once for keywords, instead of once for the `StringField` and one for the `SortedDocValuesField`, while still reusing fields.
Can we just do this instead? I think an important point here is that you shouldnt be calling setBytesValue if it is tokenized (TokenStream in use). You need Reader/String. |
and yeah, you don't have such checks on numeric values, but numeric values don't have TokenStream tokenization. Being consistent with them makes no sense, that isn't what this is about. otherwise, if we cant agree here, lets just keep the restriction. |
the fact that the tests pass with this change is really upsetting too. we should at least add checks for the type of luser moments we want to prevent, e.g. calling setBytesRef on a fucking TextField, etc. If we dont add these checks then users are going to invoke these methods and... nothing will happen at all... or something that isn't what they want. |
I'm good with adding more validation, I pushed more changes:
The last bit makes the change breaking, so I'm targeting 10.0 only. |
will take another pass, this sounds really good to me. I think the restriction was just a hacky way of preventing some of the issues, its obviously not ideal. .document api is just a PITA |
I pushed a new commit that also disallows term vector offsets on binary fields. |
@rmuir Do you have thoughts on this change? |
@jpountz I will try to review this today. Sorry for the delay. I haven't written java code in years, i'm crazy busy at work, and i try to give more time to |
Thank you, and no worries at all about the delay, I just wanted to check if it was still on your mind since you said you were interested in looking into it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, I think it really simplifies Field. At least I can reason about it better now.
The checks look complete, I spent a good deal of time looking for holes. Thanks for the paranoia!
Thanks @rmuir ! |
Today Lucene allows creating indexed binary fields, e.g. via `StringField(String, BytesRef, Field.Store)`, but not reusing them: calling `setBytesValue` on a `StringField` throws. This commit removes the check that prevents reusing fields with binary values. I considered an alternative that consisted of failing if calling `setBytesValue` on a field that is indexed and tokenized, but we currently don't have such checks e.g. on numeric values, so it did not feel consistent. Doing this change would help improve the [nightly benchmarks for the NYC taxis dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html) by doing the String -> UTF-8 conversion only once for keywords, instead of once for the `StringField` and one for the `SortedDocValuesField`, while still reusing fields.
Today Lucene allows creating indexed binary fields, e.g. via
StringField(String, BytesRef, Field.Store)
, but not reusing them: callingsetBytesValue
on aStringField
throws.This commit removes the check that prevents reusing fields with binary values. I considered an alternative that consisted of failing if calling
setBytesValue
on a field that is indexed and tokenized, but we currently don't have such checks e.g. on numeric values, so it did not feel consistent.Doing this change would help improve the nightly benchmarks for the NYC taxis dataset by doing the String -> UTF-8 conversion only once for keywords, instead of once for the
StringField
and one for theSortedDocValuesField
, while still reusing fields.