Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple tokens on LHS in stemmer_override rules #56113

Closed
telendt opened this issue May 4, 2020 · 7 comments · Fixed by #56484
Closed

Multiple tokens on LHS in stemmer_override rules #56113

telendt opened this issue May 4, 2020 · 7 comments · Fixed by #56484
Labels
>enhancement :Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@telendt
Copy link
Contributor

telendt commented May 4, 2020

Without looking into internals of stemmer_override I assumed it works similarly to synonym token filter (and translates given mapping rules into SynonymMap in the same way), which seems not to be the case:

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms": {
          "type": "synonym",
          "synonyms": [
            "reading => read",
            "swimming, swims => swim"
          ]
        },
        "stems": {
          "type": "stemmer_override",
          "rules": [
            "reading => read",
            "swimming, swims => swim"
          ]
        }
      }
    }
  }
}

Simple rules, with single token on LHS, work the same (so both synonyms and stems will output read for reading) but rules with multiple tokens on LHS (also known as "contraction rules") do not:

SYNONYMS

GET test/_analyze
{
  "text": "swimming",
  "tokenizer": "standard", 
  "filter": ["synonyms"]
}

output:

{
  "tokens": [
    {
      "token": "swim",
      "start_offset": 0,
      "end_offset": 8,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}

STEMS

GET test/_analyze
{
  "text": "swimming",
  "tokenizer": "standard", 
  "filter": ["stems"]
}

output

{
  "tokens": [
    {
      "token": "swimming",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

There's of course a simple workaround for my use case (expanding contraction rules into a sequence of single token mapping rules) but the user experience is bad IMO.

Although there is no place in documentation that would mention that "contraction rules" are supported in stemmer override token filter I find this behavior confusing. I would rather prefer a verbose error at filter registration to "silent failure" at analysis time. But to be honest, I think that ideally stemmer_override should support contraction rules the same way as synonym token filter does.

@cbuescher cbuescher added :Search Relevance/Analysis How text is split into tokens >enhancement labels May 4, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label May 4, 2020
@telendt
Copy link
Contributor Author

telendt commented May 5, 2020

If you're ok with adding this feature (support for rules with multiple tokens on LHS into stemmer override token filter so that they work similarly to contraction rules in synonym token filter) then I can prepare a PR for that.

This looks like a low-risk and low-effort change to me, although this issue is also probably a low priority 🤷‍♂️

@mayya-sharipova
Copy link
Contributor

@telendt Thank you for submitting this issue.
What's happening is that the LSH of the rule is saved as is. So if you don't do the tokenization, your rule gets applied:

GET test/_analyze
{
  "text": "swimming, swims",
  "tokenizer": "keyword", 
  "filter": ["stems"]
}

returns

{
    "tokens": [
        {
            "token": "swim",
            "start_offset": 0,
            "end_offset": 15,
            "type": "word",
            "position": 0
        }
    ]
}

It indeed would be nice for stemmer_override filter to support contraction rules, but I think this should be done on Lucene side. elasticsearch passes rules to the underlying Lucene filters, and it is up to Lucene filters to process these rules. So I would suggest to submit an issue in Lucene Jira.
I will be closing this issue on the elasticsearch side.

@telendt
Copy link
Contributor Author

telendt commented May 5, 2020

@mayya-sharipova:

What's happening is that the LSH of the rule is saved as is [...]

Yes, but only because you chose to do so.

AFAIK Solr's StemmerOverrideFilterFactory accepts tab separated dictionary file. There's no confusion there as the format is different than synonyms mapping format (comma separated with =>). You chose similar format (with =>) and thus the confusion.

Is it common to stem tokens containing comas? If not I don't see why:

swimming, swims => swim

could not result in:

builder.add("swimming", "swim");
builder.add("swims", "swim");

@jimczi
Copy link
Contributor

jimczi commented May 5, 2020

Yes, but only because you chose to do so.

Agreed, the parsing is done in the factory that is defined in Elasticsearch so the decision is ours. We don't need to change anything in Lucene. I don't have a strong opinion regarding the decision but we shouldn't accept bad rules silently. We should validate that the left side is a single term or accept a list of terms but I agree that the current situation is confusing so I am reopening the issue.

@jimczi jimczi reopened this May 5, 2020
@mayya-sharipova
Copy link
Contributor

mayya-sharipova commented May 5, 2020

Thanks @jimczi for weighing in.

@telendt ok, great! Your patch to elasticsearch is welcome.

@telendt
Copy link
Contributor Author

telendt commented May 6, 2020

Cool. It might take me some time to set up my environment and get familiar with the code but I will try to provide a PR in the following days.

telendt added a commit to telendt/elasticsearch that referenced this issue May 29, 2020
This commit adds support for rules with multiple tokens on LHS, also
known as "contraction rules", into stemmer override token
filter. Contraction rules are handy into translating multiple
inflected words into the same root form. One side effect of this change is
that it brings stemmer override rules format closer to synonym rules
format so that it makes it easier to translate one into another.

This change also makes stemmer override rules parser more strict so
that it should catch more errors which were previously accepted.
cbuescher pushed a commit that referenced this issue May 29, 2020
…6484)

This commit adds support for rules with multiple tokens on LHS, also
known as "contraction rules", into stemmer override token
filter. Contraction rules are handy into translating multiple
inflected words into the same root form. One side effect of this change is
that it brings stemmer override rules format closer to synonym rules
format so that it makes it easier to translate one into another.

This change also makes stemmer override rules parser more strict so
that it should catch more errors which were previously accepted.

Closes #56113
cbuescher pushed a commit that referenced this issue May 29, 2020
…6484)

This commit adds support for rules with multiple tokens on LHS, also
known as "contraction rules", into stemmer override token
filter. Contraction rules are handy into translating multiple
inflected words into the same root form. One side effect of this change is
that it brings stemmer override rules format closer to synonym rules
format so that it makes it easier to translate one into another.

This change also makes stemmer override rules parser more strict so
that it should catch more errors which were previously accepted.

Closes #56113
@javanna javanna added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants