Introduce a new KeywordField. #12054
+1 to adding this new field definition. Looks like the new test failed in the precommit checks?
Field pathField = new StringField("path", file.toString(), Field.Store.YES);
doc.add(pathField);
doc.add(new KeywordField("path", file.toString()));
doc.add(new StoredField("path", file.toString()));
If you also need to store the value, you should add a separate {@link StoredField} instance.
Let's rethink this for the new fields we are adding. I think storing is quite common and offering Field.Store.YES/NO is the best choice.
Right. The challenge I'm seeing is that to index both points and doc values on numeric fields, we're creating a field that produces a binaryValue() consumed by points, as well as a numeric value consumed by doc values. But stored fields can store both binary and numeric data, so how should they know which value they should look at?
Whoah... this field is for strings. Kill the BytesRef constructor. It is enough to pass a String, and you can support Field.Store.YES/NO.
As far as the LongField etc., we should deal with that in another issue. I agree it should support Field.Store.YES/NO. The things you speak of are not barriers to that. They are self-created problems that we can fix.
I don't mind killing the BytesRef ctor, but I don't think it would be enough. We need this field to implement binaryValue() so that doc values can be indexed. But then stored fields are going to see a field where both stringValue() and binaryValue() return non-null, and it would be a problem for the current stored fields format which checks the binary value first, so this KeywordField would be considered as a binary field by stored fields.
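To make the ambiguity concrete, here is a minimal, self-contained sketch. `SketchField`, `StoredTypeGuessing` and `keywordLike` are hypothetical stand-ins and not Lucene classes; the only thing taken from the comment above is the order of the checks (binary value first, string value second).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a field that can expose several optional values.
interface SketchField {
  String stringValue();

  byte[] binaryValue();
}

final class StoredTypeGuessing {
  // Guesses the stored type by probing values, binary first (assumed order).
  static String guessStoredType(SketchField field) {
    if (field.binaryValue() != null) {
      return "BINARY";
    }
    if (field.stringValue() != null) {
      return "STRING";
    }
    throw new IllegalArgumentException("field has no value to store");
  }

  // A keyword-like field: it exposes bytes for doc values AND the original string.
  static SketchField keywordLike(String value) {
    final byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    return new SketchField() {
      @Override
      public String stringValue() {
        return value;
      }

      @Override
      public byte[] binaryValue() {
        return bytes;
      }
    };
  }
}
```

With this probing order, `StoredTypeGuessing.guessStoredType(StoredTypeGuessing.keywordLike("a/b.txt"))` yields `"BINARY"` even though the caller passed a string, which is the misclassification described above.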
Right, it's a self-created problem though, because we know it should go into stored fields as a string.
By self-created I mean that too many Java abstractions are the things causing the issue.
Look at the current stored fields writer as an example. All these damn Java abstractions, yet our codec writer is doing TYPE-GUESSING? Let's add a new method, so the codec knows the type and never guesses. This "guessing" belongs as an impl detail behind a new method in Field.java IMO (because Field.java tries to be a superhero and support all types). For structured types like KeywordField it should just be return STRING.
Number number = field.numericValue();
if (number != null) {
  if (number instanceof Byte || number instanceof Short || number instanceof Integer) {
    bits = NUMERIC_INT;
  } else if (number instanceof Long) {
    bits = NUMERIC_LONG;
  } else if (number instanceof Float) {
    bits = NUMERIC_FLOAT;
  } else if (number instanceof Double) {
    bits = NUMERIC_DOUBLE;
  } else {
    throw new IllegalArgumentException("cannot store numeric type " + number.getClass());
  }
  string = null;
  bytes = null;
} else {
  bytes = field.binaryValue();
  if (bytes != null) {
    bits = BYTE_ARR;
    string = null;
  } else {
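One way to picture the "new method" idea from this comment is the sketch below, in which the field declares its stored type and the writer only switches on it. Every name here (`SketchStoredType`, `TypedField`, `KeywordLikeField`, `TypedStoredWriter`) is hypothetical and chosen for illustration; this is not the API that Lucene adopted, only a possible shape under those assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Objects;

// Hypothetical enum of storable types; not a Lucene type.
enum SketchStoredType { INT, LONG, FLOAT, DOUBLE, BINARY, STRING }

// Hypothetical contract: the field states its stored type explicitly.
interface TypedField {
  SketchStoredType storedType();

  Object storedValue();
}

final class KeywordLikeField implements TypedField {
  private final String value;

  KeywordLikeField(String value) {
    this.value = Objects.requireNonNull(value, "value must not be null");
  }

  // Bytes are still available for doc values; they no longer affect storing.
  byte[] binaryValue() {
    return value.getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public SketchStoredType storedType() {
    return SketchStoredType.STRING; // a structured type simply returns STRING
  }

  @Override
  public Object storedValue() {
    return value;
  }
}

final class TypedStoredWriter {
  // The writer dispatches on the declared type and never probes values.
  static String describe(TypedField field) {
    switch (field.storedType()) {
      case STRING:
        return "string:" + field.storedValue();
      case BINARY:
        return "binary:" + ((byte[]) field.storedValue()).length + " bytes";
      default:
        return "numeric:" + field.storedValue();
    }
  }
}
```

In this sketch `TypedStoredWriter.describe(new KeywordLikeField("a/b.txt"))` returns `"string:a/b.txt"`, regardless of the fact that the field can also produce bytes.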
And obviously, fixing this can be a follow-up to this PR, but we should do it before releasing the new APIs. These new fields are supposed to be easy to use, so they should support storing as well.
I tried to think more about this and opened #12116 as a possible way forward.
Somewhat related to this PR, I've been experimenting with the idea of a "self optimizing"
`KeywordField` is a combination of `StringField` and `SortedSetDocValuesField`, similarly to how `LongField` is a combination of `LongPoint` and `SortedNumericDocValuesField`. This makes it easier for users to create fields that can be used for filtering, sorting and faceting.
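As a rough illustration of why one field feeding two structures is convenient, the toy model below keeps an inverted map (value to document ids) for filtering and a per-document sorted set of values for sorting and faceting. `KeywordIndexSketch` and its methods are hypothetical and bear no relation to how Lucene implements `StringField` or `SortedSetDocValuesField`; sorting by each document's smallest value is likewise an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: one add call populates both structures.
final class KeywordIndexSketch {
  // value -> ids of documents containing it (used for filtering)
  private final Map<String, List<Integer>> postings = new HashMap<>();
  // doc id -> sorted, de-duplicated values (used for sorting and faceting)
  private final List<SortedSet<String>> docValues = new ArrayList<>();

  int addDocument(String... values) {
    if (values.length == 0) {
      throw new IllegalArgumentException("at least one value is required");
    }
    int docId = docValues.size();
    SortedSet<String> sorted = new TreeSet<>(Arrays.asList(values));
    docValues.add(sorted);
    for (String value : sorted) {
      postings.computeIfAbsent(value, k -> new ArrayList<>()).add(docId);
    }
    return docId;
  }

  // Filtering: exact-match lookup in the inverted map.
  List<Integer> filter(String value) {
    return postings.getOrDefault(value, Collections.emptyList());
  }

  // Faceting: count documents per value by scanning the columnar data.
  Map<String, Integer> facetCounts() {
    Map<String, Integer> counts = new TreeMap<>();
    for (SortedSet<String> values : docValues) {
      for (String value : values) {
        counts.merge(value, 1, Integer::sum);
      }
    }
    return counts;
  }

  // Sorting: order documents by their smallest value (assumed convention).
  List<Integer> docsSortedByMinValue() {
    List<Integer> docs = new ArrayList<>();
    for (int i = 0; i < docValues.size(); i++) {
      docs.add(i);
    }
    docs.sort(Comparator.comparing((Integer doc) -> docValues.get(doc).first()));
    return docs;
  }
}
```

A caller adds each document once and can then filter, facet and sort without registering the value under separate field types.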
I updated this PR to
* Field that indexes a per-document String or {@link BytesRef} into an inverted index for fast
* filtering, stores values in a columnar fashion using {@link DocValuesType#SORTED_SET} doc values
* for sorting and faceting, and optionally stores values as stored fields for top-hits retrieval.
* This field does not support scoring: queries produce constant scores. If you also need to store
We can nuke this sentence about "if you also need to store the value" now
 * @throws NullPointerException if {@code field} is null.
 * @return a query matching documents with this exact value
 */
public static Query newSetQuery(String field, Collection<BytesRef> values) {
Can we expose this as BytesRef... instead of Collection, consistent with all the other newSetQuery's?
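For readers less familiar with the trade-off, the sketch below shows what a varargs entry point looks like from the caller's side. `SetQuerySketch` is hypothetical, uses `String` values instead of `BytesRef`, and returns a plain description string as a stand-in for a query object; none of it is Lucene code.

```java
import java.util.Arrays;
import java.util.Objects;
import java.util.TreeSet;

final class SetQuerySketch {
  // Varargs signature: callers pass values inline, or an existing array.
  static String newSetQuery(String field, String... values) {
    Objects.requireNonNull(field, "field must not be null");
    Objects.requireNonNull(values, "values must not be null");
    // Sort and de-duplicate so the result does not depend on argument order.
    return field + ":" + new TreeSet<>(Arrays.asList(values));
  }
}
```

Values can be listed directly, as in `SetQuerySketch.newSetQuery("path", "b", "a", "b")`, and a caller holding a collection can convert it with `collection.toArray(new String[0])` before the call.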
You can make it
+1 to using