Expose term frequency in Painless script score context #7558

russcam · 2023-05-15T06:54:59Z

Is your feature request related to a problem? Please describe.

In our current Solr setup, we make heavy use of Solr functions to implement query-time multiplicative and additive boosting. We are in the process of migrating from Solr to Elasticsearch and porting our querying logic over. The majority of Solr functions can be implemented with Painless scripting in a function score query's script score to provide multiplicative boosting. However, there are a few functions that cannot be, such as

The biggest pain in particular is the lack of termfreq, which we use as part of a boost for calculating popularity, a calculation that incorporates a baseline value so that new content of unknown popularity is factored in.

Describe the solution you'd like

I'm opening this issue to request that Painless scripting support retrieving term frequencies in a script score context.

Describe alternatives you've considered

To work around not having access to termfreq in a script score context, we have to write a custom script engine and script plugin that looks up term frequencies from the PostingsEnum. This is less than desirable because:

We need to maintain a plugin to provide this functionality across versions.
A script engine approach is either constrained to a particular usage/application of term frequencies, or its complexity increases dramatically if we want to support all other functions and allow them to be combined in arbitrary ways, i.e. replicate functions that are already exposed in Painless.

Note that AWS OpenSearch service does not support custom script plugins.

macohen · 2023-06-14T03:47:24Z

@msfroh I think you mentioned that you had an idea for implementation. Can you add any more thoughts on your idea please?

msfroh · 2023-06-14T17:48:41Z

@macohen, we need to get through the "unknown unknowns" first.

I would take a day or two to build a scrappy prototype for one function (probably termfreq since @russcam called it out as the biggest issue). The lessons learned from that should make the overall effort clearer (and should be documented here).

msfroh · 2023-06-14T18:02:39Z

The very basic idea is that script scoring has access to field values (from doc values). Those doc values are retrieved from Lucene via some use of the ValueSource interface.

Solr implements the referenced scoring functions via ValueSource implementations that are already in Lucene:

payload is a little bit more complicated: https://github.com/apache/solr/blob/6e10c05cb7e2953c07e7d2a7a8e0b18f9ba65938/solr/core/src/java/org/apache/solr/search/ValueSourceParser.java#L907

We "just" need to expose the above ValueSource implementations the same way that we expose doc values.

macohen · 2023-06-19T17:04:17Z

Hi @russcam, can you speak a little more to the use case you have here? It would be great to make sure the community understands the need and can provide input and feedback.

russcam · 2023-06-20T08:19:26Z

Being able to use termfreq in Painless script scoring opens up possibilities for complex scoring scenarios. In this particular case, it is used as part of a popularity score calculation in Solr. Imagine that .term_freq(<term>) existed as a function on script doc values; it would then be possible to write complex script score functions such as

def multiplier = params.multiplier;
for (int x = 0; x < params.fields.length; x++) {
 if (_doc(params.fields[x]) != null) {
   return multiplier * _doc(params.fields[x]).term_freq(params.term);
 }
}

return params.default_value;

which calculates a score based on multiplying the term frequency in the first field that exists in a list of fields, otherwise returning a default value.
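
To make the intended behaviour concrete outside of Painless, here is a minimal Python sketch of the same logic. It is only an illustration: the in-memory `Doc` mapping, the `first_field_score` name and the sample values are all hypothetical, and a real implementation would read frequencies from the postings rather than from a dict.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical stand-in for a document: field name -> {term: frequency}.
# A real implementation would read these frequencies from Lucene postings.
Doc = Mapping[str, Mapping[str, int]]


def first_field_score(
    doc: Doc,
    fields: Sequence[str],
    term: str,
    multiplier: float,
    default_value: float,
) -> float:
    """Score by term frequency in the first listed field the document has."""
    for field in fields:
        term_freqs: Optional[Mapping[str, int]] = doc.get(field)
        if term_freqs is not None:
            # A field that exists but lacks the term contributes frequency 0.
            return multiplier * term_freqs.get(term, 0)
    # None of the listed fields exist on this document.
    return default_value


doc = {"body": {"foo": 2, "bar": 1}}
score = first_field_score(doc, ["title", "body"], "foo", 1.5, 0.1)
missing = first_field_score(doc, ["title"], "foo", 1.5, 0.1)
```

Here `score` uses the frequency of "foo" in "body" because "title" is absent, while `missing` falls back to the default value.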

Scripted similarities allow a script to be used to specify how scores should be computed, but scripted similarities are not flexible enough.

noCharger · 2023-07-05T21:37:15Z

Hi @russcam,

While exposing term frequency in the script source is a totally valid use case, I'm wondering whether the _termvectors API could serve as a short-term mitigation for the use case above, because it delivers document-level stats such as term_freq, sum_doc_freq, etc.

Here's a working example on my local setup:

Create an index mapping with term_vector enabled

PUT /test_index
{
  "mappings": {
    "properties": {
      "test": {
        "type": "object",
        "properties": {
          "key1": {
            "type": "text",
            "term_vector": "yes"
          },
          "key2": {
            "type": "date"
          },
          "key3": {
            "type": "double"
          }
        }
      }
    } 
  }
}

Add a sample doc

PUT /test_index/_doc/1
{
  "test": [
    {
      "key1": "value1",
      "key2": "2023-04-13T10:00:00.000",
      "key3": 0
    },
    {
      "key1": "value1",
      "key2": "2023-04-13T10:00:00.000",
      "key3": 1
    }
  ]
}

Search on the index, get the doc id, and get the stats using _termvectors API

GET /test_index/_termvectors/1
{
  "fields": ["test.key1"],
  "field_statistics": true
}

Response:

{
  "_index": "test_index",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "test.key1": {
      "field_statistics": {
        "sum_doc_freq": 1,
        "doc_count": 1,
        "sum_ttf": 2
      },
      "terms": {
        "value1": {
          "term_freq": 2
        }
      }
    }
  }
}
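
On the client side, pulling the frequency out of a response shaped like the one above could look roughly like this. It is a sketch only: `term_freq_from_termvectors` is a hypothetical helper, not part of any client library, and the response is a trimmed copy of the JSON above.

```python
from typing import Any, Mapping


def term_freq_from_termvectors(
    response: Mapping[str, Any], field: str, term: str
) -> int:
    """Return term_freq for `term` in `field`, or 0 if either is absent."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return terms.get(term, {}).get("term_freq", 0)


# Trimmed copy of the _termvectors response shown above.
response = {
    "_index": "test_index",
    "_id": "1",
    "found": True,
    "term_vectors": {
        "test.key1": {
            "field_statistics": {"sum_doc_freq": 1, "doc_count": 1, "sum_ttf": 2},
            "terms": {"value1": {"term_freq": 2}},
        }
    },
}

value1_freq = term_freq_from_termvectors(response, "test.key1", "value1")
absent_freq = term_freq_from_termvectors(response, "test.key1", "value2")
```

Note that this only works once the document id is already known, since the request is addressed to a single document.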

We should also create a doc issue for how to use the _termvectors API. Meanwhile, we should add a YAML test that covers this use case explicitly.

cc: @macohen @nknize

russcam · 2023-07-06T08:04:21Z

@noCharger term vectors and multi term vectors APIs are useful, but you need to know the ids of documents whose term statistics you are interested in, which you don't know ahead of time for a given search query. As such, I don't believe it to be a viable interim solution.

nknize · 2023-07-06T17:40:42Z

you need to know the ids of documents whose term statistics you are interested in...

💯 for the term vector API the client would be responsible.. 😞

....it'd be possible to write complex function script scores such as...

_doc(params.fields[x]) fetches the docvalue which doesn't have the term frequencies. I think we still have to get from the Postings...

I haven't dug deep into this part of the Painless code for quite some time, so I'd have to look closer, but are you suggesting a "simple" solution like adding a new ScriptContext that exposes the term vector.termFreq for use in rescoring at query time?

russcam · 2023-07-10T12:44:03Z

_doc(params.fields[x]) fetches the docvalue which doesn't have the term frequencies. I think we still have to get from the Postings...

This is definitely pseudocode 🙂 You're right, they would need to come from the PostingsEnum or similar.

are you suggesting a "simple" solution like adding a new ScriptContext that exposes the term vector.termFreq for use in rescoring at query time?

A new ScriptContext, or perhaps some way to expose it on the existing ScoreScript context used by function score script functions?

noCharger · 2023-07-14T01:46:23Z

@russcam I would like to know whether scripted similarity could be a potential solution for your use case, because it already exposes tf / total_tf / sum_tf in its script context for calculating the score on fetch phase. Here's a working example:

PUT /index2
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_term_freq": {
        "type": "scripted",
        "script": {
          "source": "double tf = doc.freq; return tf * 2;"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field": {
        "type": "text",
        "similarity": "scripted_term_freq"
      }
    }
  }
}

PUT /index2/_doc/1
{
  "field": "foo bar foo",
  "field2": "bar foo bar"
}

PUT /index2/_doc/2
{
  "field2": "bar foo bar"
}

POST /index2/_refresh

GET /index2/_search?explain=true
{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "foo"
        }
      },
      "filter": {
        "exists": {
          "field": "field"
        }
      }
    }
  }
}

Response:

{
  "took": 47,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 4,
    "hits": [
      {
        "_shard": "[index2][0]",
        "_node": "mXxiFXK8S3SuXkY2KZp7NQ",
        "_index": "index2",
        "_id": "1",
        "_score": 4,
        "_source": {
          "field": "foo bar foo",
          "field2": "bar foo bar"
        },
        "_explanation": {
          "value": 4,
          "description": "sum of:",
          "details": [
            {
              "value": 4,
              "description": "max of:",
              "details": [
                {
                  "value": 0.18232156,
                  "description": "weight(field2:foo in 0) [PerFieldSimilarity], result of:",
                  "details": [
                    {
                      "value": 0.18232156,
                      "description": "score(freq=1.0), computed as boost * idf * tf from:",
                      "details": [
                        {
                          "value": 2.2,
                          "description": "boost",
                          "details": []
                        },
                        {
                          "value": 0.18232156,
                          "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details": [
                            {
                              "value": 2,
                              "description": "n, number of documents containing term",
                              "details": []
                            },
                            {
                              "value": 2,
                              "description": "N, total number of documents with field",
                              "details": []
                            }
                          ]
                        },
                        {
                          "value": 0.45454544,
                          "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details": [
                            {
                              "value": 1,
                              "description": "freq, occurrences of term within document",
                              "details": []
                            },
                            {
                              "value": 1.2,
                              "description": "k1, term saturation parameter",
                              "details": []
                            },
                            {
                              "value": 0.75,
                              "description": "b, length normalization parameter",
                              "details": []
                            },
                            {
                              "value": 3,
                              "description": "dl, length of field",
                              "details": []
                            },
                            {
                              "value": 3,
                              "description": "avgdl, average length of field",
                              "details": []
                            }
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "value": 4,
                  "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
                  "details": [
                    {
                      "value": 4,
                      "description": "score from ScriptedSimilarity(weightScript=[null], script=[Script{type=inline, lang='painless', idOrCode='double tf = doc.freq; return tf * 2;', options={}, params={}}]) computed from:",
                      "details": [
                        {
                          "value": 1,
                          "description": "weight",
                          "details": []
                        },
                        {
                          "value": 1,
                          "description": "query.boost",
                          "details": []
                        },
                        {
                          "value": 3,
                          "description": "field.docCount",
                          "details": []
                        },
                        {
                          "value": 6,
                          "description": "field.sumDocFreq",
                          "details": []
                        },
                        {
                          "value": 8,
                          "description": "field.sumTotalTermFreq",
                          "details": []
                        },
                        {
                          "value": 2,
                          "description": "term.docFreq",
                          "details": []
                        },
                        {
                          "value": 4,
                          "description": "term.totalTermFreq",
                          "details": []
                        },
                        {
                          "value": 2,
                          "description": "doc.freq",
                          "details": []
                        },
                        {
                          "value": 3,
                          "description": "doc.length",
                          "details": []
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value": 0,
              "description": "match on required clause, product of:",
              "details": [
                {
                  "value": 0,
                  "description": "# clause",
                  "details": []
                },
                {
                  "value": 1,
                  "description": "FieldExistsQuery [field=field]",
                  "details": []
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
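
As a sanity check on the explanation above, the numbers can be recomputed by hand. The Python sketch below plugs the values from the explain output into the formulas quoted in its descriptions; the function names are mine, chosen for illustration.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf as quoted in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, k1: float, b: float, dl: float, avgdl: float) -> float:
    """tf as quoted: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# field2 uses the default similarity: boost * idf * tf.
idf = bm25_idf(n=2, N=2)
tf = bm25_tf(freq=1, k1=1.2, b=0.75, dl=3, avgdl=3)
field2_score = 2.2 * idf * tf

# field uses the scripted similarity: "double tf = doc.freq; return tf * 2;"
doc_freq = 2
field_score = doc_freq * 2.0

# The explanation takes the max over both fields, then adds the filter clause (0).
total = max(field2_score, field_score) + 0
```

With these inputs, `field2_score` comes out at roughly 0.1823 and `total` at 4, matching the `_score` in the response.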

cc: @nknize @macohen @msfroh @jainankitk

russcam · 2023-07-14T02:11:25Z

@noCharger I don't believe script similarity is flexible enough because it doesn't allow parameters to be included into the similarity score on a per query basis, which is needed e.g. in the example in #7558 (comment), params.multiplier, params.fields, params.term and params.default_value

noCharger · 2023-07-14T02:29:31Z

Correct. While the multiplier and default_value can be injected by a function_score query, the target term must be in the query context, which is not configurable. My understanding of this limitation is that accessing the stats of other terms during the query phase is very memory-intensive for large queries or indices, because those stats are not loaded in memory.

jainankitk · 2023-07-19T01:17:06Z

@russcam - Feel free to review the linked RFC and provide your feedback.

noCharger · 2023-08-29T18:10:01Z

Closing this issue since the PR is merged and backported.
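
For anyone landing here later, a request using the new functions might be built along these lines. This is a hedged sketch, not taken from the PR: it assumes the score script context exposes a function named `termFreq(field, term)`, and the field name "popularity", the term "trending" and the multiplier are hypothetical values. Please check the documentation for the exact function names and signatures.

```python
import json


def build_term_freq_query(field: str, term: str, multiplier: float) -> dict:
    """Build a function_score request body that scores by term frequency."""
    return {
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "script_score": {
                    "script": {
                        # Assumption: termFreq(field, term) is the function
                        # name available in Painless score scripts.
                        "source": "params.multiplier * termFreq(params.field, params.term)",
                        "params": {
                            "field": field,
                            "term": term,
                            "multiplier": multiplier,
                        },
                    }
                },
            }
        }
    }


# Hypothetical field and term, used only for illustration.
body = build_term_freq_query("popularity", "trending", 2.0)
request_json = json.dumps(body, indent=2)
```

Passing the field, term and multiplier through params keeps them configurable per query, which was the limitation raised against scripted similarity earlier in this thread.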

russcam added enhancement Enhancement or improvement to existing feature or request untriaged labels May 15, 2023

msfroh added the Search Search query, autocomplete ...etc label May 17, 2023

github-project-automation bot added this to Search Project Board May 17, 2023

github-project-automation bot moved this to 🆕 New in Search Project Board May 17, 2023

msfroh removed the untriaged label May 17, 2023

macohen moved this from 🆕 New to Next (Next Quarter) in Search Project Board Jun 14, 2023

macohen added v2.10.0 and removed v2.10.0 labels Jul 11, 2023

macohen moved this from Next (Next Quarter) to Now(This Quarter) in Search Project Board Jul 13, 2023

macohen added the v2.10.0 label Jul 13, 2023

macohen moved this from Now(This Quarter) to 🏗 In progress in Search Project Board Jul 14, 2023

noCharger mentioned this issue Jul 14, 2023

[RFC] Enhanced Access to Term-Level Statistics in OpenSearch #8702

Closed

macohen assigned noCharger Jul 26, 2023

noCharger mentioned this issue Aug 8, 2023

[Feature] Expose term frequency in Painless script score context #9081

Merged

5 tasks

mingshl moved this from 🏗 In progress to 👀 In review in Search Project Board Aug 14, 2023

msfroh mentioned this issue Aug 23, 2023

[DOC] Document new doc/term frequency functions in Painless score scripts opensearch-project/documentation-website#4858

Closed

4 tasks

noCharger moved this from 👀 In review to ✅ Done in Search Project Board Aug 23, 2023

noCharger closed this as completed Aug 29, 2023

msfroh mentioned this issue Sep 11, 2023

[BUG] Exception when using tf Painless method suggests using the deprecated classic similarity #9958

Closed
