Expose term frequency in Painless script score context #7558

Closed
russcam opened this issue May 15, 2023 · 14 comments
Labels: enhancement, Search, v2.10.0

Comments

Contributor

russcam commented May 15, 2023

Is your feature request related to a problem? Please describe.

In our current Solr setup, we make heavy use of Solr functions for implementing query time multiplicative and additive boosting. We are in the process of migrating from Solr to Elasticsearch, and porting our querying logic over. The majority of Solr functions can be implemented with Painless scripting in a function score query script score to provide multiplicative boosting, however there are a few functions that cannot be, such as

The biggest pain point in particular is the lack of termfreq, which we use as part of a boost for calculating popularity, a calculation that incorporates a baseline value so that new content of unknown popularity is factored in.

Describe the solution you'd like

I'm opening this issue to request that Painless scripting support retrieving term frequencies in a script score context.

Describe alternatives you've considered

In order to work around not having access to termfreq in a script score context, we have to write a custom script engine and script plugin that is able to look up term frequencies from the PostingsEnum. This is less than desirable because

  1. We need to maintain a plugin to provide functionality across versions.
  2. A script engine approach is either constrained to a particular usage/application of term frequencies, or its complexity increases dramatically if we want to support all other functions and allow them to be combined in arbitrary ways, i.e. replicate functions that are already exposed in Painless.

Note that AWS OpenSearch service does not support custom script plugins.

@russcam russcam added enhancement Enhancement or improvement to existing feature or request untriaged labels May 15, 2023
@msfroh msfroh added the Search Search query, autocomplete ...etc label May 17, 2023
@msfroh msfroh removed the untriaged label May 17, 2023
Contributor

macohen commented Jun 14, 2023

@msfroh I think you mentioned that you had an idea for implementation. Can you add any more thoughts on your idea please?

Collaborator

msfroh commented Jun 14, 2023

@macohen, we need to get through the "unknown unknowns" first.

I would take a day or two to build a scrappy prototype for one function (probably termfreq since @russcam called it out as the biggest issue). The lessons learned from that should make the overall effort clearer (and should be documented here).

Collaborator

msfroh commented Jun 14, 2023

@macohen macohen moved this from 🆕 New to Next (Next Quarter) in Search Project Board Jun 14, 2023
Contributor

macohen commented Jun 19, 2023

Hi @russcam, can you speak a little more to the use case you have here? It would be great to make sure the community understands the need and can provide input and feedback.

Contributor Author

russcam commented Jun 20, 2023

Being able to incorporate termfreq into Painless script scoring opens up possibilities for complex scoring scenarios. In this particular case, it is used as part of a popularity score calculation in Solr. If .termfreq(<term>) existed as a function on script doc values, it'd be possible to write complex function script scores such as

def multiplier = params.multiplier;
for (int x = 0; x < params.fields.length; x++) {
 if (_doc(params.fields[x]) != null) {
   return multiplier * _doc(params.fields[x]).term_freq(params.term);
 }
}

return params.default_value;

which calculates a score based on multiplying the term frequency in the first field that exists in a list of fields, otherwise returning a default value.
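
The intended behaviour of the pseudocode above can be simulated outside the cluster. The following Python sketch is only an illustration: `term_freq`, `first_field_score`, and the whitespace tokenization are assumptions made for this example and are not part of Painless or of any proposed API.

```python
from typing import Mapping, Optional, Sequence


def term_freq(text: str, term: str) -> int:
    """Count occurrences of `term` in `text`.

    Assumption: naive whitespace tokenization stands in for the real analyzer.
    """
    return text.split().count(term)


def first_field_score(
    doc: Mapping[str, Optional[str]],
    fields: Sequence[str],
    term: str,
    multiplier: float,
    default_value: float,
) -> float:
    """Return multiplier * term frequency for the first field present in `doc`,
    or `default_value` when none of the fields exist."""
    for name in fields:
        value = doc.get(name)
        if value is not None:
            return multiplier * term_freq(value, term)
    return default_value


doc = {"field": "foo bar foo", "field2": "bar foo bar"}
score = first_field_score(doc, ["missing", "field"], "foo", 1.5, 0.1)
```

Here `missing` is skipped, `field` contains `foo` twice, and a document with none of the listed fields would fall back to `default_value`.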

Scripted similarities allow a script to be used to specify how scores should be computed, but scripted similarities are not flexible enough.

Contributor

noCharger commented Jul 5, 2023

Hi @russcam,

While exposing term frequency in the script source is a totally valid use case, I'm wondering if the _termvectors API may be considered a short-term mitigation for the use case above, because it delivers these stats at the doc level, such as term_freq, sum_doc_freq etc.

Here's a working example on my local setup:

  1. Create an index mapping with term_vector enabled
PUT /test_index
{
  "mappings": {
    "properties": {
      "test": {
        "type": "object",
        "properties": {
          "key1": {
            "type": "text",
            "term_vector": "yes"
          },
          "key2": {
            "type": "date"
          },
          "key3": {
            "type": "double"
          }
        }
      }
    } 
  }
}
  2. Add a sample doc
PUT /test_index/_doc/1
{
  "test": [
    {
      "key1": "value1",
      "key2": "2023-04-13T10:00:00.000",
      "key3": 0
    },
    {
      "key1": "value1",
      "key2": "2023-04-13T10:00:00.000",
      "key3": 1
    }
  ]
}
  3. Search on the index, get the doc id, and get the stats using _termvectors API
GET /test_index/_termvectors/1
{
  "fields": ["test.key1"],
  "field_statistics": true
}

Response:

{
  "_index": "test_index",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "test.key1": {
      "field_statistics": {
        "sum_doc_freq": 1,
        "doc_count": 1,
        "sum_ttf": 2
      },
      "terms": {
        "value1": {
          "term_freq": 2
        }
      }
    }
  }
}
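
As a client-side illustration, the per-document frequency can be pulled out of a response shaped like the one above. This is a minimal sketch; the helper name `doc_term_freq` is hypothetical, and the embedded JSON is a trimmed copy of the response shown.

```python
import json

# Trimmed copy of the _termvectors response shown above.
response = json.loads("""
{
  "_index": "test_index",
  "_id": "1",
  "found": true,
  "term_vectors": {
    "test.key1": {
      "field_statistics": {"sum_doc_freq": 1, "doc_count": 1, "sum_ttf": 2},
      "terms": {"value1": {"term_freq": 2}}
    }
  }
}
""")


def doc_term_freq(termvectors_response: dict, field: str, term: str) -> int:
    """Return term_freq for `term` in `field`, or 0 if either is absent."""
    field_entry = termvectors_response.get("term_vectors", {}).get(field, {})
    return field_entry.get("terms", {}).get(term, {}).get("term_freq", 0)


freq = doc_term_freq(response, "test.key1", "value1")
```

Returning 0 for an absent field or term is a design choice of this sketch, so that callers can treat "not present" and "zero occurrences" the same way.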

We should also create a doc issue for how to use the _termvectors API. Meanwhile, add a yaml test for this use case explicitly.

cc: @macohen @nknize

Contributor Author

russcam commented Jul 6, 2023

@noCharger term vectors and multi term vectors APIs are useful, but you need to know the ids of documents whose term statistics you are interested in, which you don't know ahead of time for a given search query. As such, I don't believe it to be a viable interim solution.
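
To make the limitation concrete, here is a rough sketch of the two round trips a client would need: run the search first, then build a multi term vectors request from the returned ids. The helper name `build_mtermvectors_body` and the sample search response are hypothetical, and the body uses the `docs` form of the multi term vectors API as I understand it, so it should be checked against the API documentation.

```python
from typing import Sequence


def build_mtermvectors_body(search_response: dict, fields: Sequence[str]) -> dict:
    """Build a multi term vectors request body from the hits of a prior search.

    The ids are only known after the search has already been scored, which is
    why this cannot feed term frequencies back into the ranking.
    """
    hits = search_response.get("hits", {}).get("hits", [])
    return {
        "docs": [
            {"_index": hit["_index"], "_id": hit["_id"], "fields": list(fields)}
            for hit in hits
        ]
    }


# Hypothetical, heavily trimmed search response used only for illustration.
search_response = {
    "hits": {
        "hits": [
            {"_index": "test_index", "_id": "1", "_score": 1.0},
            {"_index": "test_index", "_id": "7", "_score": 0.4},
        ]
    }
}

body = build_mtermvectors_body(search_response, ["test.key1"])
```

Any boost derived from these frequencies would have to be applied by the client after the fact, and only to the page of hits it already fetched.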

Collaborator

nknize commented Jul 6, 2023

> you need to know the ids of documents whose term statistics you are interested in...

💯 for the term vector API the client would be responsible.. 😞

> ....it'd be possible to write complex function script scores such as...

_doc(params.fields[x]) fetches the docvalue which doesn't have the term frequencies. I think we still have to get from the Postings...

I haven't dug deep into this part of the painless code for quite some time so I'd have to look closer but are you suggesting a "simple" solution like adding a new ScriptContext that exposes the term vector.termFreq for use in rescoring at query time?

Contributor Author

russcam commented Jul 10, 2023

> _doc(params.fields[x]) fetches the docvalue which doesn't have the term frequencies. I think we still have to get from the Postings...

This is definitely pseudocode 🙂 You're right, they would need to come from the PostingsEnum or similar.

> are you suggesting a "simple" solution like adding a new ScriptContext that exposes the term vector.termFreq for use in rescoring at query time?

A new ScriptContext, or perhaps some way to expose on the existing ScoreScript context used by function score script functions?

@macohen macohen added v2.10.0 and removed v2.10.0 labels Jul 11, 2023
@macohen macohen moved this from Next (Next Quarter) to Now(This Quarter) in Search Project Board Jul 13, 2023
Contributor

noCharger commented Jul 14, 2023

@russcam I would like to know whether scripted similarity could be a potential solution for your use case, because it already exposes tf / total_tf / sum_tf in its script context for calculating the score in the fetch phase. Here's a working example:

PUT /index2
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_term_freq": {
        "type": "scripted",
        "script": {
          "source": "double tf = doc.freq; return tf * 2;"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field": {
        "type": "text",
        "similarity": "scripted_term_freq"
      }
    }
  }
}

PUT /index2/_doc/1
{
  "field": "foo bar foo",
  "field2": "bar foo bar"
}

PUT /index2/_doc/2
{
  "field2": "bar foo bar"
}

POST /index2/_refresh

GET /index2/_search?explain=true
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "foo"
        }
      },
      "filter": {
        "exists": {
          "field": "field"
        }
      }
    }
  }
}

Response:

{
  "took": 47,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 4,
    "hits": [
      {
        "_shard": "[index2][0]",
        "_node": "mXxiFXK8S3SuXkY2KZp7NQ",
        "_index": "index2",
        "_id": "1",
        "_score": 4,
        "_source": {
          "field": "foo bar foo",
          "field2": "bar foo bar"
        },
        "_explanation": {
          "value": 4,
          "description": "sum of:",
          "details": [
            {
              "value": 4,
              "description": "max of:",
              "details": [
                {
                  "value": 0.18232156,
                  "description": "weight(field2:foo in 0) [PerFieldSimilarity], result of:",
                  "details": [
                    {
                      "value": 0.18232156,
                      "description": "score(freq=1.0), computed as boost * idf * tf from:",
                      "details": [
                        {
                          "value": 2.2,
                          "description": "boost",
                          "details": []
                        },
                        {
                          "value": 0.18232156,
                          "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details": [
                            {
                              "value": 2,
                              "description": "n, number of documents containing term",
                              "details": []
                            },
                            {
                              "value": 2,
                              "description": "N, total number of documents with field",
                              "details": []
                            }
                          ]
                        },
                        {
                          "value": 0.45454544,
                          "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details": [
                            {
                              "value": 1,
                              "description": "freq, occurrences of term within document",
                              "details": []
                            },
                            {
                              "value": 1.2,
                              "description": "k1, term saturation parameter",
                              "details": []
                            },
                            {
                              "value": 0.75,
                              "description": "b, length normalization parameter",
                              "details": []
                            },
                            {
                              "value": 3,
                              "description": "dl, length of field",
                              "details": []
                            },
                            {
                              "value": 3,
                              "description": "avgdl, average length of field",
                              "details": []
                            }
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "value": 4,
                  "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
                  "details": [
                    {
                      "value": 4,
                      "description": "score from ScriptedSimilarity(weightScript=[null], script=[Script{type=inline, lang='painless', idOrCode='double tf = doc.freq; return tf * 2;', options={}, params={}}]) computed from:",
                      "details": [
                        {
                          "value": 1,
                          "description": "weight",
                          "details": []
                        },
                        {
                          "value": 1,
                          "description": "query.boost",
                          "details": []
                        },
                        {
                          "value": 3,
                          "description": "field.docCount",
                          "details": []
                        },
                        {
                          "value": 6,
                          "description": "field.sumDocFreq",
                          "details": []
                        },
                        {
                          "value": 8,
                          "description": "field.sumTotalTermFreq",
                          "details": []
                        },
                        {
                          "value": 2,
                          "description": "term.docFreq",
                          "details": []
                        },
                        {
                          "value": 4,
                          "description": "term.totalTermFreq",
                          "details": []
                        },
                        {
                          "value": 2,
                          "description": "doc.freq",
                          "details": []
                        },
                        {
                          "value": 3,
                          "description": "doc.length",
                          "details": []
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value": 0,
              "description": "match on required clause, product of:",
              "details": [
                {
                  "value": 0,
                  "description": "# clause",
                  "details": []
                },
                {
                  "value": 1,
                  "description": "FieldExistsQuery [field=field]",
                  "details": []
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
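
The numbers in the explanation above can be checked by hand. The Python sketch below simply re-computes them from the values reported in the response; the variable names are mine, and the formulas are the ones quoted in the `description` strings.

```python
import math

# BM25 branch for field2 (default similarity), values copied from the explanation.
n, N = 2, 2                      # docs containing the term, docs with the field
idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
freq, k1, b, dl, avgdl = 1, 1.2, 0.75, 3, 3
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
boost = 2.2
bm25_score = boost * idf * tf    # "boost * idf * tf"

# Scripted similarity branch for field: "double tf = doc.freq; return tf * 2;"
doc_freq_in_field = 2            # "doc.freq" in the explanation
scripted_score = doc_freq_in_field * 2.0

# "max of:" over the two field scores, then "sum of:" with the filter clause (0).
filter_clause_score = 0.0
total = max(bm25_score, scripted_score) + filter_clause_score
```

The scripted branch dominates the "max of" step, which is why the final `_score` equals `doc.freq * 2`.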

cc: @nknize @macohen @msfroh @jainankitk

Contributor Author

russcam commented Jul 14, 2023

@noCharger I don't believe scripted similarity is flexible enough, because it doesn't allow parameters to be incorporated into the similarity score on a per-query basis, which is needed e.g. in the example in #7558 (comment): params.multiplier, params.fields, params.term and params.default_value

Contributor

noCharger commented Jul 14, 2023

> @noCharger I don't believe script similarity is flexible enough because it doesn't allow parameters to be included into the similarity score on a per query basis, which is needed e.g. in the example in #7558 (comment), params.multiplier, params.fields, params.term and params.default_value

Correct. While the multiplier and default_value can be injected by a function_score query, the target term must be in the query context, which is not configurable. My understanding of this limitation is that accessing the stats of other terms during the query phase is very memory-intensive for large queries or indices, because those stats are not loaded in memory.
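
As a rough illustration of injecting the multiplier per query, the sketch below builds a function_score request body in which a script_score multiplies `_score` by a parameter. It assumes the `index2` setup from the earlier example, where `field` uses the scripted similarity; the helper name `build_boosted_query` is hypothetical, and the scored term is still whatever term appears in the query, which is exactly the limitation described.

```python
import json


def build_boosted_query(term: str, multiplier: float) -> dict:
    """Build a function_score body that scales the similarity score per query.

    Assumption: `field` is mapped with the scripted similarity, so `_score`
    for a single-term match already reflects the term frequency.
    """
    return {
        "query": {
            "function_score": {
                "query": {"match": {"field": term}},
                "script_score": {
                    "script": {
                        "source": "_score * params.multiplier",
                        "params": {"multiplier": multiplier},
                    }
                },
                # Use the script result as the final score.
                "boost_mode": "replace",
            }
        }
    }


body = build_boosted_query("foo", 3)
request_json = json.dumps(body, indent=2)
```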

@macohen macohen moved this from Now(This Quarter) to 🏗 In progress in Search Project Board Jul 14, 2023
@jainankitk
Collaborator

@russcam - Feel free to review the linked RFC and provide your feedback.

@noCharger noCharger moved this from 👀 In review to ✅ Done in Search Project Board Aug 23, 2023
@noCharger
Contributor

Closing this issue since the PR is merged and backported.

6 participants