Multiple tokens on LHS in stemmer_override rules #56113

telendt · 2020-05-04T13:51:36Z

Without looking into the internals of stemmer_override, I assumed it works like the synonym token filter (translating the given mapping rules into a SynonymMap in the same way), which seems not to be the case:

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms": {
          "type": "synonym",
          "synonyms": [
            "reading => read",
            "swimming, swims => swim"
          ]
        },
        "stems": {
          "type": "stemmer_override",
          "rules": [
            "reading => read",
            "swimming, swims => swim"
          ]
        }
      }
    }
  }
}

Simple rules with a single token on the LHS behave the same (both synonyms and stems output read for reading), but rules with multiple tokens on the LHS (also known as "contraction rules") do not:

SYNONYMS

GET test/_analyze
{
  "text": "swimming",
  "tokenizer": "standard", 
  "filter": ["synonyms"]
}

output:

{
  "tokens": [
    {
      "token": "swim",
      "start_offset": 0,
      "end_offset": 8,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}

STEMS

GET test/_analyze
{
  "text": "swimming",
  "tokenizer": "standard", 
  "filter": ["stems"]
}

output:

{
  "tokens": [
    {
      "token": "swimming",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

There is of course a simple workaround for my use case (expanding each contraction rule into a sequence of single-token mapping rules), but the user experience is bad IMO.
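For illustration, the workaround can be scripted before the index settings are built. This is a minimal sketch, assuming rules use the `a, b => c` text format shown above; `expand_rules` is a hypothetical helper, not part of Elasticsearch.

```python
def expand_rules(rules):
    """Expand 'a, b => c' into single-token rules 'a => c' and 'b => c'."""
    expanded = []
    for rule in rules:
        lhs, sep, rhs = rule.partition("=>")
        if not sep:
            raise ValueError(f"rule has no '=>': {rule!r}")
        target = rhs.strip()
        # Each comma-separated token on the LHS gets its own rule.
        for token in lhs.split(","):
            token = token.strip()
            if token:
                expanded.append(f"{token} => {target}")
    return expanded


rules = ["reading => read", "swimming, swims => swim"]
expanded = expand_rules(rules)
```

The resulting list can be used as the `rules` array of the stemmer_override filter in place of the original one.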

Although the documentation nowhere says that "contraction rules" are supported in the stemmer override token filter, I find this behavior confusing. I would prefer a verbose error at filter registration over a "silent failure" at analysis time. But to be honest, I think that ideally stemmer_override should support contraction rules the same way the synonym token filter does.


elasticmachine · 2020-05-04T13:55:25Z

Pinging @elastic/es-search (:Search/Analysis)

telendt · 2020-05-05T07:39:08Z

If you're ok with adding this feature (support for rules with multiple tokens on the LHS in the stemmer override token filter, so that they work like contraction rules in the synonym token filter), then I can prepare a PR for that.

This looks like a low-risk and low-effort change to me, although this issue is also probably a low priority 🤷‍♂️

mayya-sharipova · 2020-05-05T19:05:18Z

@telendt Thank you for submitting this issue.
What's happening is that the LHS of the rule is saved as is. So if you skip tokenization, your rule gets applied:

GET test/_analyze
{
  "text": "swimming, swims",
  "tokenizer": "keyword", 
  "filter": ["stems"]
}

returns

{
    "tokens": [
        {
            "token": "swim",
            "start_offset": 0,
            "end_offset": 15,
            "type": "word",
            "position": 0
        }
    ]
}

It would indeed be nice for the stemmer_override filter to support contraction rules, but I think this should be done on the Lucene side. Elasticsearch passes rules to the underlying Lucene filters, and it is up to the Lucene filters to process them. So I would suggest submitting an issue in the Lucene Jira.
I will be closing this issue on the Elasticsearch side.
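The behavior described in this comment can be modeled in a few lines. This is a simplified sketch, not the actual Elasticsearch or Lucene code: it assumes the whole left-hand side is stored as a single dictionary key, and the names `build_overrides_verbatim` and `apply_overrides` are hypothetical.

```python
def build_overrides_verbatim(rules):
    """Store the entire LHS of each rule as one key, commas included."""
    overrides = {}
    for rule in rules:
        lhs, _, rhs = rule.partition("=>")
        overrides[lhs.strip()] = rhs.strip()
    return overrides


def apply_overrides(tokens, overrides):
    """Replace a token only when it matches a stored key exactly."""
    return [overrides.get(token, token) for token in tokens]


overrides = build_overrides_verbatim(
    ["reading => read", "swimming, swims => swim"]
)

# A standard-like tokenizer emits "swimming" on its own: no key matches.
standard_like = apply_overrides(["swimming"], overrides)

# A keyword-like tokenizer emits the whole text as one token: the key matches.
keyword_like = apply_overrides(["swimming, swims"], overrides)
```

In this model `standard_like` stays `["swimming"]` while `keyword_like` becomes `["swim"]`, which mirrors the two `_analyze` outputs in the thread.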

telendt · 2020-05-05T19:30:48Z

@mayya-sharipova:

What's happening is that the LHS of the rule is saved as is [...]

Yes, but only because you chose to do so.

AFAIK Solr's StemmerOverrideFilterFactory accepts a tab-separated dictionary file. There's no confusion there, as the format differs from the synonym mapping format (comma-separated with =>). You chose a similar format (with =>), hence the confusion.

Is it common to stem tokens containing commas? If not, I don't see why:

swimming, swims => swim

could not result in:

builder.add("swimming", "swim");
builder.add("swims", "swim");

jimczi · 2020-05-05T19:49:06Z

Yes, but only because you chose to do so.

Agreed, the parsing is done in the factory that is defined in Elasticsearch, so the decision is ours. We don't need to change anything in Lucene. I don't have a strong opinion on which way to go, but we shouldn't accept bad rules silently: we should either validate that the left side is a single term or accept a list of terms. I agree that the current situation is confusing, so I am reopening the issue.
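A stricter parser along these lines could look like the sketch below. It is an illustration only: `parse_rule` is a hypothetical name, and the exact checks (exactly one `=>`, no empty terms, a single term on the right) are assumptions rather than what Elasticsearch actually implements.

```python
def parse_rule(rule):
    """Parse 'a, b => c' into [('a', 'c'), ('b', 'c')], rejecting bad input."""
    parts = rule.split("=>")
    if len(parts) != 2:
        raise ValueError(f"expected exactly one '=>' in rule: {rule!r}")
    keys = [key.strip() for key in parts[0].split(",")]
    target = parts[1].strip()
    if any(not key for key in keys):
        raise ValueError(f"empty term on the left side of rule: {rule!r}")
    if not target or "," in target:
        raise ValueError(f"right side must be a single term: {rule!r}")
    # One (key, override) pair per left-side term, like repeated builder.add calls.
    return [(key, target) for key in keys]


pairs = parse_rule("swimming, swims => swim")
```

Failing at parse time means a malformed rule is reported when the filter is created instead of silently never matching at analysis time.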

mayya-sharipova · 2020-05-05T20:15:38Z

Thanks @jimczi for weighing in.

@telendt ok, great! Your patch to elasticsearch is welcome.

telendt · 2020-05-06T10:07:05Z

Cool. It might take me some time to set up my environment and get familiar with the code but I will try to provide a PR in the following days.

This commit adds support for rules with multiple tokens on the LHS, also known as "contraction rules", to the stemmer override token filter. Contraction rules are handy for translating multiple inflected words into the same root form. One side effect of this change is that it brings the stemmer override rules format closer to the synonym rules format, which makes it easier to translate one into the other. This change also makes the stemmer override rules parser stricter, so it should catch errors that were previously accepted.

cbuescher added the :Search Relevance/Analysis and >enhancement labels May 4, 2020

elasticmachine added the Team:Search label May 4, 2020

mayya-sharipova closed this as completed May 5, 2020

jimczi reopened this May 5, 2020

telendt mentioned this issue May 9, 2020

Support multiple tokens on LHS in stemmer_override rules (#56113) #56484

Merged

cbuescher closed this as completed in #56484 May 29, 2020

russcam mentioned this issue Jul 23, 2020

7.9.0 Meta ticket elastic/elasticsearch-net#4872

Closed

29 tasks

javanna added the Team:Search Relevance label and removed the Team:Search label Jul 12, 2024
