Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow reusing indexed binary fields. #12053

Merged
merged 5 commits into from
Jan 12, 2023

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Jan 1, 2023

Today Lucene allows creating indexed binary fields, e.g. via StringField(String, BytesRef, Field.Store), but not reusing them: calling setBytesValue on a StringField throws.

This commit removes the check that prevents reusing fields with binary values. I considered an alternative that consisted of failing if calling setBytesValue on a field that is indexed and tokenized, but we currently don't have such checks e.g. on numeric values, so it did not feel consistent.

Doing this change would help improve the nightly benchmarks for the NYC taxis dataset by doing the String -> UTF-8 conversion only once for keywords, instead of once for the StringField and one for the SortedDocValuesField, while still reusing fields.

Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.

This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that consisted of failing if calling
`setBytesValue` on a field that is indexed and tokenized, but we currently
don't have such checks e.g. on numeric values, so it did not feel consistent.

Doing this change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and one for the `SortedDocValuesField`, while still
reusing fields.
@rmuir
Copy link
Member

rmuir commented Jan 1, 2023

I considered an alternative that consisted of failing if calling setBytesValue on a field that is indexed and tokenized

Can we just do this instead?

I think an important point here is that you shouldnt be calling setBytesValue if it is tokenized (TokenStream in use). You need Reader/String.

@rmuir
Copy link
Member

rmuir commented Jan 1, 2023

and yeah, you don't have such checks on numeric values, but numeric values don't have TokenStream tokenization. Being consistent with them makes no sense, that isn't what this is about.

otherwise, if we cant agree here, lets just keep the restriction.

@rmuir
Copy link
Member

rmuir commented Jan 1, 2023

the fact that the tests pass with this change is really upsetting too. we should at least add checks for the type of luser moments we want to prevent, e.g. calling setBytesRef on a fucking TextField, etc. If we dont add these checks then users are going to invoke these methods and... nothing will happen at all... or something that isn't what they want.

@jpountz
Copy link
Contributor Author

jpountz commented Jan 3, 2023

I'm good with adding more validation, I pushed more changes:

  • The Field ctor that takes a BytesRef complains if the field is tokenized or if offsets are indexed.
  • The Field ctor that takes a BytesRef complains if the field is neither indexed, nor docvalued, nor stored.
  • It is no longer possible to configure a TokenStream plus a value on the same field: either the value is a token stream, or it's something else. Otherwise this introduces weird situations, like we only need to tolerate a tokenized binary value if a token stream is configured too.

The last bit makes the change breaking, so I'm targeting 10.0 only.

@rmuir
Copy link
Member

rmuir commented Jan 3, 2023

will take another pass, this sounds really good to me. I think the restriction was just a hacky way of preventing some of the issues, its obviously not ideal.

.document api is just a PITA

@jpountz
Copy link
Contributor Author

jpountz commented Jan 4, 2023

I pushed a new commit that also disallows term vector offsets on binary fields.

@jpountz
Copy link
Contributor Author

jpountz commented Jan 11, 2023

@rmuir Do you have thoughts on this change?

@rmuir
Copy link
Member

rmuir commented Jan 11, 2023

@jpountz I will try to review this today. Sorry for the delay. I haven't written java code in years, i'm crazy busy at work, and i try to give more time to .document api.... all contributing to my slowness. I had in mind that i wanted to checkout this branch and "poke" too

@jpountz
Copy link
Contributor Author

jpountz commented Jan 11, 2023

Thank you, and no worries at all about the delay, I just wanted to check if it was still on your mind since you said you were interested in looking into it.

Copy link
Member

@rmuir rmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, I think it really simplifies Field. At least I can reason about it better now.

The checks look complete, I spent a good deal of time looking for holes. Thanks for the paranoia!

@jpountz
Copy link
Contributor Author

jpountz commented Jan 12, 2023

Thanks @rmuir !

@jpountz jpountz merged commit 8477854 into apache:main Jan 12, 2023
@jpountz jpountz deleted the field_set_binary_value branch January 12, 2023 08:32
@jpountz jpountz added this to the 10.0.0 milestone Jan 12, 2023
jpountz added a commit that referenced this pull request Jan 12, 2023
jpountz added a commit to jpountz/lucene that referenced this pull request Jan 12, 2023
Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.

This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that consisted of failing if calling
`setBytesValue` on a field that is indexed and tokenized, but we currently
don't have such checks e.g. on numeric values, so it did not feel consistent.

Doing this change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and one for the `SortedDocValuesField`, while still
reusing fields.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants