Query syntax: Support wildcard field searches/searching across all dynamic fields from a specific provider #77
Comments
I understand that in your example it is unclear which tokenizer to apply to the search text if the index itself uses a different tokenizer than the field(s) being searched. I never thought about this configuration and don't have an answer. But how does LIFTI decide which tokenizer to use for the search text when searching across all fields with different configured tokenizers? Isn't that a similar question? Or am I missing some important difference?
@h0lg If no field is specified, then currently the default index tokenizer is used to parse and normalize the search text. It's only when a specific field is being searched that LIFTI uses the index tokenizer configured for that field. In that respect, you're right that searching across all fields will be a problem if different tokenization has been used for them, and that's exactly the same problem that needs to be solved here. I'd need to spend a bit more time thinking about this than I have right now, but I'm wondering if when searching for text across multiple fields:
Edge cases to consider:
I think this will require quite a bit of rework in the query parser logic, but it's certainly not impossible...
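To make the current rule concrete, here is a minimal sketch of the selection logic described above: the default index tokenizer when no field is given, otherwise the tokenizer configured for that field. This is not LIFTI's API. `tokenizer_for`, `FIELD_TOKENIZERS`, and both toy tokenizers are hypothetical stand-ins, and the "stemming" is just a trailing-`s` strip for illustration.

```python
from typing import Callable, Dict, List, Optional

Tokenizer = Callable[[str], List[str]]


def default_tokenizer(text: str) -> List[str]:
    # Stand-in for a stemming tokenizer: lowercases and strips a trailing "s".
    return [w[:-1] if w.endswith("s") else w for w in text.lower().split()]


def exact_tokenizer(text: str) -> List[str]:
    # Stand-in for a tokenizer without stemming: lowercases only.
    return text.lower().split()


# Hypothetical per-field overrides; fields not listed use the default tokenizer.
FIELD_TOKENIZERS: Dict[str, Tokenizer] = {"Name": exact_tokenizer}


def tokenizer_for(field: Optional[str]) -> Tokenizer:
    """Pick the tokenizer used to normalize the search text."""
    if field is None:
        # No field specified: the default index tokenizer is used.
        return default_tokenizer
    # A specific field: use its own tokenizer if one was configured.
    return FIELD_TOKENIZERS.get(field, default_tokenizer)


print(tokenizer_for(None)("Cats"))
print(tokenizer_for("Name")("Cats"))
```

With these stand-ins, the same search text normalizes differently depending on whether a field is named, which is the mismatch being discussed.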
I see, thanks for the clarification and sharing your thoughts. Explaining the intricacies of tokenization during the field search process, and what happens in which case, seems daunting to me. Maybe we're overcomplicating it? You could go with some rule that's easy to communicate and doesn't require you to explain the underlying mechanics, even if it has limitations, e.g.
Would that make things easier?
An extension of #76 - I've just realised that wildcard field names are going to be a bit problematic. When parsing text from a query, the `QueryTokenizer` needs to know which index tokenizer to use when processing the search text.
Consider this index:
The default index tokenizer uses stemming, whereas the field `Name` has its own index tokenizer configured without stemming. If we allowed wildcard field names like this `[Na*]=Something` then it's no longer clear which tokenizer to use for the search text `Something` (especially if we ended up with another field starting with `Na`).
So I think as things stand, the options are:
- Support wildcard field searches: `[Tag_*]=foo` would be equivalent to searching for `[Tag_One]=foo | [Tag_Two]=foo | [Tag_Three]=foo`
- Support searching across all dynamic fields from a specific provider: `[?Tags]=foo` (syntax TBD). A single dynamic field provider will only ever have one index tokenizer associated with it, so this should work.
The first option would have a performance impact on the query, and we're probably going to need to build in some search optimisations that cache the results emitted by a query, so the same search predicate isn't evaluated multiple times.
The second option is a bit more limited, but at least solves the issue across a specific dynamic field source.
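A rough sketch of how the first option could behave: expand the wildcard into one clause per matching field, tokenize the search text with each field's own tokenizer, then group fields that end up with identical tokens so each distinct lookup runs only once. Everything here is an assumption for illustration rather than LIFTI's implementation: `expand_wildcard`, `group_by_tokens`, the `FIELDS` table, and the toy tokenizers are hypothetical, and giving `Tag_Three` a different tokenizer is made up purely to show the mismatch.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, List, Tuple

Tokenizer = Callable[[str], List[str]]
Clause = Tuple[str, Tuple[str, ...]]  # (field name, normalized tokens)


def stemming(text: str) -> List[str]:
    # Stand-in for a stemming tokenizer: lowercases and strips a trailing "s".
    return [w[:-1] if w.endswith("s") else w for w in text.lower().split()]


def exact(text: str) -> List[str]:
    # Stand-in for a tokenizer without stemming: lowercases only.
    return text.lower().split()


# Hypothetical field -> tokenizer table (assumed configuration).
FIELDS: Dict[str, Tokenizer] = {
    "Tag_One": stemming,
    "Tag_Two": stemming,
    "Tag_Three": exact,
    "Name": exact,
}


def expand_wildcard(pattern: str, search_text: str) -> List[Clause]:
    """Expand a wildcard field pattern into one clause per matching field.

    The clauses are meant to be OR-ed together, each one using the tokens
    produced by that field's own tokenizer.
    """
    return [
        (field, tuple(tokenize(search_text)))
        for field, tokenize in FIELDS.items()
        if fnmatchcase(field, pattern)
    ]


def group_by_tokens(clauses: List[Clause]) -> Dict[Tuple[str, ...], List[str]]:
    """Group fields sharing identical tokens so each lookup can be done once."""
    groups: Dict[Tuple[str, ...], List[str]] = {}
    for field, tokens in clauses:
        groups.setdefault(tokens, []).append(field)
    return groups


clauses = expand_wildcard("Tag_*", "Foos")
print(clauses)
print(group_by_tokens(clauses))
```

Grouping by token sequence is only one way to avoid repeating the same predicate; a real implementation might instead cache results inside the query executor.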