Allow reusing indexed binary fields. #12053

jpountz · 2023-01-01T09:21:36Z

Today Lucene allows creating indexed binary fields, e.g. via `StringField(String, BytesRef, Field.Store)`, but not reusing them: calling `setBytesValue` on a `StringField` throws.

This commit removes the check that prevents reusing fields with binary values. I considered an alternative that consisted of failing when `setBytesValue` is called on a field that is indexed and tokenized, but we currently don't have such checks e.g. on numeric values, so it did not feel consistent.

This change would help improve the [nightly benchmarks for the NYC taxis dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html) by doing the String -> UTF-8 conversion only once for keywords, instead of once for the `StringField` and once for the `SortedDocValuesField`, while still reusing fields.
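To make the benchmark motivation concrete, here is a minimal, JDK-only sketch of the "convert once, set on both reused fields" pattern. It does not use Lucene: `ReusableBinaryField`, `toUtf8`, `indexKeywords` and the `vendor_id` field name are hypothetical stand-ins, with the two holders playing the roles of the reused `StringField` and `SortedDocValuesField`, and the shared `byte[]` playing the role of the `BytesRef`.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class SharedUtf8ValueSketch {

  /** Hypothetical stand-in for a reusable field that holds a binary value. */
  static final class ReusableBinaryField {
    final String name;
    byte[] value;

    ReusableBinaryField(String name) {
      this.name = name;
    }

    void setBytesValue(byte[] value) {
      this.value = value;
    }
  }

  /** Counts String -> UTF-8 conversions so the saving is observable. */
  static int conversions = 0;

  static byte[] toUtf8(String s) {
    conversions++;
    return s.getBytes(StandardCharsets.UTF_8);
  }

  /**
   * Sets each keyword on both reused fields and returns how many
   * conversions were performed in total.
   */
  static int indexKeywords(
      List<String> keywords, ReusableBinaryField indexed, ReusableBinaryField docValues) {
    conversions = 0;
    for (String keyword : keywords) {
      byte[] utf8 = toUtf8(keyword); // converted once per keyword
      indexed.setBytesValue(utf8); // the same bytes feed the indexed field...
      docValues.setBytesValue(utf8); // ...and the doc-values field
      // A real indexer would add the document to the index here.
    }
    return conversions;
  }

  public static void main(String[] args) {
    // "vendor_id" is an illustrative field name, not taken from the benchmark.
    ReusableBinaryField indexed = new ReusableBinaryField("vendor_id");
    ReusableBinaryField docValues = new ReusableBinaryField("vendor_id");
    int n = indexKeywords(List.of("CMT", "VTS", "DDS"), indexed, docValues);
    System.out.println("keywords=3 conversions=" + n);
  }
}
```

Without the ability to call `setBytesValue` on the indexed field, the same loop would either convert each keyword twice or allocate a new field per document.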

rmuir · 2023-01-01T16:28:20Z

> I considered an alternative that consisted of failing if calling setBytesValue on a field that is indexed and tokenized

Can we just do this instead?

I think an important point here is that you shouldn't be calling setBytesValue if it is tokenized (TokenStream in use). You need Reader/String.

rmuir · 2023-01-01T16:30:44Z

and yeah, you don't have such checks on numeric values, but numeric values don't have TokenStream tokenization. Being consistent with them makes no sense, that isn't what this is about.

otherwise, if we can't agree here, let's just keep the restriction.

rmuir · 2023-01-01T16:37:17Z

the fact that the tests pass with this change is really upsetting too. we should at least add checks for the type of luser moments we want to prevent, e.g. calling setBytesValue on a fucking TextField, etc. If we don't add these checks then users are going to invoke these methods and... nothing will happen at all... or something that isn't what they want.

jpountz · 2023-01-03T13:41:28Z

I'm good with adding more validation, I pushed more changes:

- The `Field` ctor that takes a `BytesRef` complains if the field is tokenized or if offsets are indexed.
- The `Field` ctor that takes a `BytesRef` complains if the field is neither indexed, nor docvalued, nor stored.
- It is no longer possible to configure a `TokenStream` plus a value on the same field: either the value is a token stream, or it's something else. Otherwise this introduces weird situations, such as only needing to tolerate a tokenized binary value if a token stream is configured too.

The last bit makes the change breaking, so I'm targeting 10.0 only.
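As a rough illustration of the checks described above, here is a JDK-only model. It is a sketch under assumptions, not the code from this PR: `TypeFlags`, `binaryValueViolation`, `valueKindViolation` and all messages are hypothetical, and the flags are a simplified view of what a field type exposes.

```java
import java.util.Optional;

public class BinaryFieldChecksSketch {

  /** Hypothetical, simplified view of the field-type flags involved. */
  static final class TypeFlags {
    final boolean indexed;
    final boolean tokenized;
    final boolean offsetsIndexed;
    final boolean docValued;
    final boolean stored;

    TypeFlags(
        boolean indexed,
        boolean tokenized,
        boolean offsetsIndexed,
        boolean docValued,
        boolean stored) {
      this.indexed = indexed;
      this.tokenized = tokenized;
      this.offsetsIndexed = offsetsIndexed;
      this.docValued = docValued;
      this.stored = stored;
    }
  }

  /** Returns the first rule a binary-valued field would violate, if any. */
  static Optional<String> binaryValueViolation(TypeFlags t) {
    // Messages are illustrative only; they are not Lucene's messages.
    if (t.tokenized) {
      return Optional.of("binary value on a tokenized field");
    }
    if (t.offsetsIndexed) {
      return Optional.of("binary value on a field that indexes offsets");
    }
    if (!t.indexed && !t.docValued && !t.stored) {
      return Optional.of("binary value on a field that is not indexed, doc-valued or stored");
    }
    return Optional.empty();
  }

  /** A field carries either a token stream or some other value, never both. */
  static Optional<String> valueKindViolation(boolean hasTokenStream, boolean hasOtherValue) {
    if (hasTokenStream && hasOtherValue) {
      return Optional.of("token stream configured together with another value");
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    TypeFlags keywordLike = new TypeFlags(true, false, false, false, false);
    TypeFlags textLike = new TypeFlags(true, true, false, false, false);
    System.out.println("keyword-like: " + binaryValueViolation(keywordLike));
    System.out.println("text-like: " + binaryValueViolation(textLike));
    System.out.println("stream + value: " + valueKindViolation(true, true));
  }
}
```

In this model an untokenized indexed field (the `StringField` case) passes, while a tokenized one (the `TextField` case raised earlier in the thread) is rejected.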

rmuir · 2023-01-03T14:04:57Z

will take another pass, this sounds really good to me. I think the restriction was just a hacky way of preventing some of the issues, it's obviously not ideal.

.document api is just a PITA

jpountz · 2023-01-04T17:12:10Z

I pushed a new commit that also disallows term vector offsets on binary fields.

jpountz · 2023-01-11T09:16:57Z

@rmuir Do you have thoughts on this change?

rmuir · 2023-01-11T09:22:04Z

@jpountz I will try to review this today. Sorry for the delay. I haven't written Java code in years, I'm crazy busy at work, and I try to give more time to .document api... all contributing to my slowness. I had in mind that I wanted to check out this branch and "poke" too.

jpountz · 2023-01-11T09:23:45Z

Thank you, and no worries at all about the delay, I just wanted to check if it was still on your mind since you said you were interested in looking into it.

rmuir

This is great, I think it really simplifies Field. At least I can reason about it better now.

The checks look complete, I spent a good deal of time looking for holes. Thanks for the paranoia!

jpountz · 2023-01-12T08:31:57Z

Thanks @rmuir !

jpountz added 2 commits January 3, 2023 14:31

Improve validation.

615b7c3

fix

f50943c

jpountz mentioned this pull request Jan 4, 2023

Introduce a new KeywordField. #12054

Merged

Also disallow term vectors offsets on binary fields.

f2f661a

tidy

ccdd9e0

rmuir approved these changes Jan 11, 2023

jpountz merged commit 8477854 into apache:main Jan 12, 2023

jpountz deleted the field_set_binary_value branch January 12, 2023 08:32

jpountz added this to the 10.0.0 milestone Jan 12, 2023

jpountz added a commit that referenced this pull request Jan 12, 2023

Revert "Allow reusing indexed binary fields. (#12053)"

525a110

This reverts commit 8477854.
