Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a new KeywordField. #12054

Merged
merged 8 commits into from
Feb 7, 2023
Merged

Introduce a new KeywordField. #12054

merged 8 commits into from
Feb 7, 2023

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Jan 1, 2023

KeywordField is a combination of StringField and SortedSetDocValuesField, similarly to how LongField is a combination of LongPoint and SortedNumericDocValuesField. This makes it easier for users to create fields that can be used for filtering, sorting and faceting.

@gsmiller
Copy link
Contributor

gsmiller commented Jan 3, 2023

+1 to adding this new field definition. Looks like the new test failed in the precommit checks?

@jpountz
Copy link
Contributor Author

jpountz commented Jan 4, 2023

Thanks for looking @gsmiller. The test that fails is because this PR relies on behavior introduced by #12053. I'll rebase when this other PR is merged.

Field pathField = new StringField("path", file.toString(), Field.Store.YES);
doc.add(pathField);
doc.add(new KeywordField("path", file.toString()));
doc.add(new StoredField("path", file.toString()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you also need to store the value, you should add a separate {@link StoredField} instance.

Let's rethink this for the new fields we are adding. I think storing is quite common and offering Field.Store.YES/NO is the best choice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. The challenge I'm seeing is that to index both points and doc values on numeric fields, we're creating a field that produces a binaryValue() consumed by points, as well as a numeric value consumed by doc values. But stored fields can store both binary and numeric data, so how should they know which value they should look at?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoah... this field is for strings. Kill the BytesRef constructor. It is enough to pass a String. and you can support Field.Store.YES/NO

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as the LongField etc, we should deal with that in another issue. I agree it should support Field.Store.YES/NO. The things are you speak of are not barriers to that. They are self-created problems that we can fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind killing the BytesRef ctor, but I don't think it would be enough. We need this field to implement binaryValue() so that doc values can be indexed. But then stored fields are going to see a field where both stringValue() and binaryValue() return non-null, and it would be a problem for the current stored fields format which checks the binary value first, so this KeywordField would be considered as a binary field by stored fields.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, its a self-created problem though, because we know it should go into storedfields as a string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by self-created i mean, too many java abstractions / java abstractions are the things causing the issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look at current stored fields writer as example. All these damn java abstractions, yet our codec writer is doing TYPE-GUESSING?. Let's add a new method, so the codec knows the type and never guesses. This "guessing" belongs as an impl detail behind a new method in Field.java IMO (because Field.java tries to be a superhero and support all types). For structured types like KeywordField it should just be return STRING.

    Number number = field.numericValue();
    if (number != null) {
      if (number instanceof Byte || number instanceof Short || number instanceof Integer) {
        bits = NUMERIC_INT;
      } else if (number instanceof Long) {
        bits = NUMERIC_LONG;
      } else if (number instanceof Float) {
        bits = NUMERIC_FLOAT;
      } else if (number instanceof Double) {
        bits = NUMERIC_DOUBLE;
      } else {
        throw new IllegalArgumentException("cannot store numeric type " + number.getClass());
      }
      string = null;
      bytes = null;
    } else {
      bytes = field.binaryValue();
      if (bytes != null) {
        bits = BYTE_ARR;
        string = null;
      } else {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and obviously, fixing this can be a followup to this PR. but we should do it before releasing the new APIs. These new fields are supposed to be easy to use, so they should support storing as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to think more about this and opened #12116 as a possible way forward.

@gsmiller
Copy link
Contributor

Somewhat related to this PR, I've been experimenting with the idea of a "self optimizing" TermInSetQuery implementation that toggles between using postings and doc values based on index statistics, etc. I wanted to link that idea here as it's a bit related (requires indexing both postings and dv, which this PR makes easy). This is just an early idea, but I'll link an early draft here in case anyone is curious or has thoughts: #12089

`KeywordField` is a combination of `StringField` and `SortedSetDocValuesField`,
similarly to how `LongField` is a combination of `LongPoint` and
`SortedNumericDocValuesField`. This makes it easier for users to create fields
that can be used for filtering, sorting and faceting.
@jpountz
Copy link
Contributor Author

jpountz commented Feb 6, 2023

I updated this PR to

  • add a Field.Store parameter to the constructor that does not rely on Field's guessing
  • update the demo to pass Field.Store.YES as a value for this parameter instead of adding a separate StoredField
  • added a newSetQuery that creates a TermInSetQuery and hopefully soon benefits from @gsmiller 's optimization

* Field that indexes a per-document String or {@link BytesRef} into an inverted index for fast
* filtering, stores values in a columnar fashion using {@link DocValuesType#SORTED_SET} doc values
* for sorting and faceting, and optionally stores values as stored fields for top-hits retrieval.
* This field does not support scoring: queries produce constant scores. If you also need to store
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can nuke this sentence about "if you also need to store the value" now

* @throws NullPointerException if {@code field} is null.
* @return a query matching documents with this exact value
*/
public static Query newSetQuery(String field, Collection<BytesRef> values) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we expose this as BytesRef... instead of collection, consistent with all the other newSetQuery's?

@rmuir
Copy link
Member

rmuir commented Feb 7, 2023

* added a `newSetQuery` that creates a `TermInSetQuery` and hopefully soon benefits from @gsmiller 's optimization

You can make it new IndexOrDocValuesQuery(new TermInSetQuery, SortedSetDocValuesField.newSlowSetQuery()) right now and it performs better than what is on that PR.

@gsmiller
Copy link
Contributor

gsmiller commented Feb 7, 2023

You can make it new IndexOrDocValuesQuery(new TermInSetQuery, SortedSetDocValuesField.newSlowSetQuery()) right now and it performs better than what is on that PR.

+1 to using IndexOrDocValues for now. Given the feedback in #12089, I'm going to see if I can come up with a way to help IndexOrDocValuesQuery make a better decision between postings/doc values for the case it currently doesn't handle well, as opposed to changing the guts of TermInSetQuery. I'll benchmark a couple different ideas for that and post a separate PR with what I'm able to come up with. Thanks!

@jpountz jpountz merged commit ab074d5 into apache:main Feb 7, 2023
@jpountz jpountz deleted the keyword_field branch February 7, 2023 17:19
jpountz added a commit that referenced this pull request Feb 21, 2023
`KeywordField` is a combination of `StringField` and `SortedSetDocValuesField`,
similarly to how `LongField` is a combination of `LongPoint` and
`SortedNumericDocValuesField`. This makes it easier for users to create fields
that can be used for filtering, sorting and faceting.
@gsmiller gsmiller added this to the 9.6.0 milestone Mar 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants