Multiple tokens on LHS in stemmer_override rules #56113
Comments
Pinging @elastic/es-search (:Search/Analysis)
If you're OK with adding this feature (support for rules with multiple tokens on the LHS in the stemmer override token filter, so that they work similarly to contraction rules in the synonym token filter), then I can prepare a PR for that. This looks like a low-risk and low-effort change to me, although this issue is also probably a low priority 🤷♂️
@telendt Thank you for submitting this issue.
GET test/_analyze
{
  "text": "swimming, swims",
  "tokenizer": "keyword",
  "filter": ["stems"]
}
returns
{
  "tokens": [
    {
      "token": "swim",
      "start_offset": 0,
      "end_offset": 15,
      "type": "word",
      "position": 0
    }
  ]
}
It indeed would be nice for
Yes, but only because you chose to do so. AFAIK Solr's StemmerOverrideFilterFactory accepts a tab-separated dictionary file. There's no confusion there, as the format is different from the synonyms mapping format (comma separated with
Is it common to stem tokens containing commas? If not, I don't see why:
could not result in:
builder.add("swimming", "swim");
builder.add("swims", "swim");
Agreed, the parsing is done in the factory that is defined in Elasticsearch, so the decision is ours. We don't need to change anything in Lucene. I don't have a strong opinion regarding the decision, but we shouldn't accept bad rules silently. We should either validate that the left side is a single term or accept a list of terms. I agree that the current situation is confusing, so I am reopening the issue.
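The "accept a list of terms, but reject bad rules" option could look roughly like the sketch below. It is only an illustration in plain Java: the class `StrictStemRuleParser`, its `parse` method, and the error messages are hypothetical names chosen here, not the Elasticsearch factory code, and the rule syntax (`token, token => stem`) is assumed from the synonym-style example discussed in this thread.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not the Elasticsearch implementation.
final class StrictStemRuleParser {

    // Parses rules such as "swimming, swims => swim" into token -> stem pairs,
    // throwing instead of silently accepting malformed input.
    static Map<String, String> parse(List<String> rules) {
        Map<String, String> overrides = new LinkedHashMap<>();
        for (String rule : rules) {
            // A limit of -1 keeps trailing empty strings, so "a =>" still yields two sides.
            String[] sides = rule.split("=>", -1);
            if (sides.length != 2) {
                throw new IllegalArgumentException("Invalid rule (expected exactly one '=>'): " + rule);
            }
            String stem = sides[1].trim();
            if (stem.isEmpty()) {
                throw new IllegalArgumentException("Invalid rule (empty right side): " + rule);
            }
            // Every comma-separated term on the left side maps to the same stem.
            for (String token : sides[0].split(",", -1)) {
                String key = token.trim();
                if (key.isEmpty()) {
                    throw new IllegalArgumentException("Invalid rule (empty term on the left side): " + rule);
                }
                overrides.put(key, stem);
            }
        }
        return overrides;
    }
}
```

Each resulting map entry corresponds to one `builder.add(token, stem)` call, so a rule with two terms on the left side produces two entries, while a rule with a dangling comma or a missing `=>` fails at parse time rather than at analysis time.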
Cool. It might take me some time to set up my environment and get familiar with the code but I will try to provide a PR in the following days.
This commit adds support for rules with multiple tokens on the LHS, also known as "contraction rules", to the stemmer override token filter. Contraction rules are handy for translating multiple inflected words into the same root form. One side effect of this change is that it brings the stemmer override rules format closer to the synonym rules format, which makes it easier to translate one into the other. This change also makes the stemmer override rules parser stricter, so that it should catch more errors that were previously accepted.
…6484) This commit adds support for rules with multiple tokens on the LHS, also known as "contraction rules", to the stemmer override token filter. Contraction rules are handy for translating multiple inflected words into the same root form. One side effect of this change is that it brings the stemmer override rules format closer to the synonym rules format, which makes it easier to translate one into the other. This change also makes the stemmer override rules parser stricter, so that it should catch more errors that were previously accepted. Closes #56113
Without looking into internals of stemmer_override I assumed it works similarly to synonym token filter (and translates given mapping rules into SynonymMap in the same way), which seems not to be the case:
Simple rules, with single token on LHS, work the same (so both synonyms and stems will output read for reading) but rules with multiple tokens on LHS (also known as "contraction rules") do not:
SYNONYMS output:
STEMS output:
There's of course a simple workaround for my use case (expanding contraction rules into a sequence of single token mapping rules) but the user experience is bad IMO.
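As a rough illustration of that workaround, the sketch below rewrites each contraction rule into single-token rules before the list is handed to the filter. The class `ContractionRuleExpander` and its `expand` method are hypothetical names for a client-side preprocessing step, and the `token, token => stem` syntax is an assumption based on the synonym rules format; the sketch deliberately does no validation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical preprocessing helper; not part of Elasticsearch.
final class ContractionRuleExpander {

    // Rewrites "swimming, swims => swim" as "swimming => swim" and "swims => swim".
    static List<String> expand(List<String> rules) {
        List<String> expanded = new ArrayList<>();
        for (String rule : rules) {
            int arrow = rule.indexOf("=>");
            if (arrow < 0) {
                // Leave lines without an arrow untouched so the filter can report them itself.
                expanded.add(rule);
                continue;
            }
            String stem = rule.substring(arrow + 2).trim();
            // Emit one single-token rule per comma-separated term on the left side.
            for (String token : rule.substring(0, arrow).split(",")) {
                expanded.add(token.trim() + " => " + stem);
            }
        }
        return expanded;
    }
}
```

The expanded list can then be used as the filter's rules in place of the original one; rules that already have a single token on the left side pass through unchanged apart from whitespace normalization.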
Although nothing in the documentation says that "contraction rules" are supported in the stemmer override token filter, I find this behavior confusing. I would prefer a verbose error at filter registration to a "silent failure" at analysis time. But to be honest, I think that ideally stemmer_override should support contraction rules the same way the synonym token filter does.