[RFC] Lower bound for min-max normalization technique in Hybrid query #1189
Comments
We have found that an upper bound would work well for our project too, if it could be applied to one or more of the queries. Or alternatively, a way to indicate that one set of scores is already normalised. Currently we use a hybrid query, combining a radial KNN query with a min_score of 0.75, and a keyword query using BM25. The KNN query always returns results with a score in the range [0.75, 1.0], where we know that 0.75 corresponds to a poor match and that 1.0 is a very good match. However, we find that after normalisation poor matches are ranked higher than expected. For example, if we get a string of low scoring results from the KNN query such as [0.77, 0.77, 0.76, 0.75, 0.75], then these are currently normalised to [1.0, 1.0, 0.5, 0.0, 0.0], which indicates the result with a cosine similarity of 0.77 is a strong match - when it is not. This results in weak KNN results ranking higher than strong keyword results. If we could specify both a lower and upper bound, then we could set them to the known range coming back from the KNN search (i.e. lower=0.75, upper=1.0). In the previous example, the scores would then be normalised to [0.08, 0.08, 0.04, 0.0, 0.0] - preserving the fact that the results are weak matches and should be ranked lower. Open to other ideas on how one might solve this of course!
@marcus-bcl thanks for your feedback, I see how raw min-max normalization results are not that great for your scenario. This is the fundamental limitation of the min-max normalization; it's challenging to improve it without knowing the potential upper bound. I believe our other normalization techniques, L2 and incoming z-score, will face similar issues. I've created a feature request for adding upper_bound parameter for min-max #1210, please +1 if that request is something you're looking for.
Introduction
This document describes the design details for Explainability in Hybrid Query. This feature has been requested through GitHub issues #150 and #299.
Overview
Hybrid search combines multiple query types, such as keyword and neural search, to improve search relevance. In version 2.11 the team released the hybrid query as part of the neural-search plugin. The main responsibility of the hybrid query is to return combined scores from multiple queries. In a common scenario those queries represent different search types, such as lexical and semantic.
The hybrid query uses multiple techniques for preparing the final list of matched documents; the two main types are score-based normalization and rank-based combination. For score-based normalization, the most effective technique is min-max normalization. In the scope of this proposal we want to improve the search relevance of min-max normalization by allowing a lower bound to be set.
Problem Statement
The min-max normalization technique is based on the maximum and minimum scores from all matched documents, using the following formula.
normalizedScore = (score - minScore) / (maxScore - minScore);
In the context of OpenSearch, finding the minimum score relies on an assumption that may not be the most effective one. While handling a search request, the system retrieves a limited number of matching documents from each shard; this limit is defined by the query parameter size. The minimum score is identified as the minimum over all scores of all collected documents. When the overall number of matching documents is much higher than the number of retrieved documents, the delta between the real and the retrieved minimum score can be significant, which negatively influences the final normalized score.
The following graphs illustrate the described scenario: in shard 1 the retrieved min score is 4.0, while the actual lower bound is 0.0. Similarly, for shard 2 the retrieved min score and the actual lower bound are 2.0 and 1.0.
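The effect can be reproduced with a small Python sketch (illustrative only, not plugin code; the concrete score values are assumed, chosen so the retrieved minimum is 4.0 and the actual minimum is 0.0, as in the shard 1 example):

```python
def min_max_normalize(scores):
    """Plain min-max normalization, as the processor applies it today."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Assumed shard scores: the query actually matches documents scoring
# down to 0.0, but only the top `size` (here 3) documents are retrieved.
all_scores = [9.0, 7.0, 4.0, 2.5, 1.0, 0.0]
retrieved = all_scores[:3]  # observed minimum is 4.0, actual minimum is 0.0

print(min_max_normalize(retrieved))   # [1.0, 0.6, 0.0]
print(min_max_normalize(all_scores))  # the same three docs now score 1.0, ~0.78, ~0.44
```

The weakest retrieved document is pushed down to 0.0 even though, relative to the full score range, it is a mid-strength match.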
Requirements
Functional Requirements
We want to introduce a lower limit that reflects the actual minimum score for matching document.
Non-Functional Requirements
Current state
Today, normalization in hybrid query is performed by the normalization processor, which is a phase results processor. A search pipeline with this processor can be defined with the following request; see more details on how the normalization processor can be configured here.
The following formula is used in the min-max technique
normalizedScore = (score - minScore) / (maxScore - minScore);
where min score is min(scores[0 ... size)) and max score is max(scores[0 ... size))
The min-max technique uses all matched documents from all shards to find the maximum and minimum scores. OpenSearch retrieves up to size documents from each shard; in most scenarios they are sorted in descending order, and documents with lower scores are dropped. While the maximum score used by the processor always belongs to one of the retrieved documents, the real minimum score can lie outside the retrieved subset of documents.
The following table shows one example of such a scenario, when size for the query is set to 3 and there are more than size matching documents in each shard.
The two major issues, detailed as challenges below, are retrieving the actual lower bound score for the query and handling documents whose scores fall below that bound.
Challenges
Retrieve actual lower bound score for the query
This is a problem when the number of matched documents at shard level is greater than size. In that case the actual min score will not be part of the document set, and if the number of matched documents exceeds the max size limit it will not even be collected. An example is the knn query, where for a typical configuration any document will have some positive score.
We could perform an exhaustive search and retrieve all matching documents at shard level. The problem with this approach is performance degradation: for big datasets latency can degrade drastically, and memory consumption can be high. Based on these considerations we recommend avoiding exhaustive retrieval.
How to deal with documents that have a score lower than the lower bound
If we use a lower bound value that is not based on actual data, documents may end up with scores that are lower than the lower bound.
There are multiple ways to address this; we can:
Solution Overview
Implementing the lower bound score in the calculation is straightforward; with our change the calculation will look like this:
float normalizedScore = (score - customMinScore) / (maxScore - customMinScore);
Essentially we replace the actual minScore with a user-provided number. The value for the minimum score can be set as a new parameter of the normalization technique. We can use a format based on the position (index) of the sub-query, similar to the existing weights parameter.
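A minimal Python sketch of the proposed calculation (illustrative; the actual implementation is Java inside the neural-search plugin, and the function name here is invented):

```python
def min_max_with_lower_bound(scores, lower_bound):
    """Min-max normalization where a user-provided lower bound replaces
    the observed minimum score of the retrieved documents."""
    hi = max(scores)
    if hi == lower_bound:
        return [1.0 for _ in scores]
    return [(s - lower_bound) / (hi - lower_bound) for s in scores]

# With lower_bound=0.0 the truncated shard result set no longer forces the
# weakest retrieved document down to a normalized score of 0.0:
print(min_max_with_lower_bound([9.0, 7.0, 4.0], 0.0))  # [1.0, ~0.78, ~0.44]
```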
These changes will be implemented at the phase results processor level. This component is responsible for running computations when the min-max technique is set up by the user. The following diagram shows the high-level components and the specific location where the change needs to be made.
To avoid confusion with the existing OpenSearch feature of the same name, min_score, I suggest we pick a different name.
The recommended name is lower_bound.
Expert-level configuration for lower bound
The configuration of the lower bound for the min-max normalization technique is considered an expert-level task. This feature is part of the search pipeline setup and should be used with caution, as improper configuration can lead to less relevant hybrid scores. Determining the optimal lower bound value requires a deep understanding of the data distribution and the specific search use case. It involves analyzing the score distribution and experimenting with different values to find the most effective lower bound. Incorrect configuration can significantly impact the relevance of the search results, and the computation of the lower bound may introduce additional latency and resource consumption. Users should be aware of these potential impacts and monitor their systems accordingly.
To give users maximum flexibility, we are going to allow configuring lower_bounds at the sub-query level, and to skip it / not apply it where needed.
Option 1: Configurable score retention or clipping [Recommended]
Pros:
Cons:
It is possible that actual shard-level scores are lower than the lower bound score we defined. In this case we can take one of the following actions:
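Both behaviors can be sketched in a few lines of Python (illustrative; the function and parameter names are assumptions rather than the plugin's API, and the retention branch reflects one possible reading of "use the actual score"):

```python
def normalize_single_score(score, max_score, lower_bound, clip=False):
    """Normalize one score against a configured lower bound.

    clip=True:  clipping - any score below the bound maps to the bound
                itself, i.e. a normalized score of 0.0.
    clip=False: score retention - the actual score stays in the formula,
                so values below the bound normalize below 0.0 and keep
                their relative order.
    """
    normalized = (score - lower_bound) / (max_score - lower_bound)
    if clip:
        return max(normalized, 0.0)
    return normalized
```

For example, with max_score=10.0 and lower_bound=2.0, a raw score of 1.0 normalizes to -0.125 under retention and to 0.0 under clipping.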
The following graphs illustrate how the lower bound works when:
I have done a POC and collected NDCG metric values for several datasets; the following table shows these results.
Summary of experiments
See appendix for detailed dataset statistics.
Based on the data from the POC, we recommend the solution that uses the lower bound for scores greater than the min_score, and uses the actual score when it is lower than the lower bound. The recommendation is to make this approach the default, with clipping mode as an option.
The solution with a decay function based on IQR gives similar results, but is more computationally intensive, so it is not recommended.
API changes
The new feature needs to be configurable by the user. That can be done via technique parameters for the processor as part of the search pipeline definition.
Parameter details
For completeness, the following request shows the processor with all defaults, meaning we apply the lower bound with a min_score of 0.0 for all sub-queries.
Option 2: Clipping
We can simply clip the low scores, meaning we return the lower bound score if the actual score is less than the lower bound.
The request will look like the following:
Pros:
Cons:
Option 3: Clipping with IQR based decay
We can address low scores by keeping them, while also applying a penalty to scores that are lower than the lower bound.
There are many options for applying a penalty to the low scores; some of the most popular and promising are:
I have done a POC and collected NDCG metric values for several datasets; the following table shows these results, see the table for exact numbers. The most promising approach is the one based on IQR.
Pros:
Cons:
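The document does not spell out the exact decay formula; as one hypothetical interpretation (the function names and the exponential form are assumptions, not the POC's actual code), scores below the bound could be damped at a rate scaled by the interquartile range of the retrieved scores:

```python
import math
import statistics

def iqr(scores):
    """Interquartile range (Q3 - Q1) of the retrieved scores."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return q3 - q1

def decay_below_bound(score, lower_bound, scores):
    """Hypothetical decay: scores at or above the bound pass through
    unchanged; scores below it are damped exponentially, with the rate
    scaled by the spread (IQR) of the score distribution."""
    if score >= lower_bound:
        return score
    spread = iqr(scores) or 1.0  # guard against a flat distribution
    return score * math.exp(-(lower_bound - score) / spread)
```

The point of tying the rate to the IQR is that a tight score distribution penalizes out-of-range scores more aggressively; the extra statistics pass is also where the additional computational cost comes from.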
User scenarios
The following data shows how each solution option affects the final score. Initial score values are taken from this table showing how scores are calculated today.
Lower bound [0.0, 0.0]
Lower bound [30.0, 2.0]
Lower bound [30.0, 2.0]
decay rate based on standard deviation
Low Level Design
How to setup
We need minor adjustments in the factory class, ScoreNormalizationFactory, to read and parse the new parameters.
How to compute normalized scores
All logic related changes will be done in MinMaxScoreNormalizationTechnique class.
First we compute the minimum score depending on the mode flag and the min_score limit value. Changes for single-score normalization should be done in the normalizeSingleScore method.
Potential Issues
Knowing the lower bound that gives the most relevant results can be challenging for a user. The existing logic provides decent results in general, so this parameter should be an expert-level setting rather than a default recommendation. We should think about a heuristic to derive the most effective lower bound from the indexed data.
Metrics
Adding a specific metric is not possible at the moment; we should add one once the stats API for neural search is ready. It is in the design phase (#1104 and #1146). As per early reviews of the stats API (draft design), adding a new metric will be straightforward, as simple as making one call to the static method.
Backward Compatibility
The new solution is backward compatible with today's approach: if no details are specified for lower bounds, the actual shard-level min score will be used.
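This compatibility property can be expressed as a small sketch (illustrative parameter names, not the plugin's API):

```python
def normalize(score, max_score, observed_min, lower_bound=None):
    # With no lower bound configured, fall back to the observed shard-level
    # minimum, which reproduces today's min-max behavior exactly.
    floor = observed_min if lower_bound is None else lower_bound
    return (score - floor) / (max_score - floor)

# Identical result to the current formula when lower_bound is unset:
assert normalize(7.0, 9.0, 4.0) == (7.0 - 4.0) / (9.0 - 4.0)
```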
Testability
The new functionality should be covered by unit tests and integration tests. Unit tests will take care of the computation logic and edge cases in the input data. Integration tests will exercise the end-to-end flow; one test should be enough for a sanity check.
Full-scale benchmarking is needed to measure how this feature affects relevancy and resource utilization. Some benchmarks were done as part of the POC, using 4 datasets; the average improvement in NDCG is 3.5%.
Appendix A
Dataset statistics
References
Feedback Required
We greatly value feedback from the community to ensure that this proposal addresses real-world use cases effectively. Here are a few specific points where your input would be particularly helpful:
Defaults for lower bounds
We plan to use defaults for the lower_bound feature, applying the lower bound score without a penalty and setting the default min_score to 0.0.
Are these defaults suitable for all query types?
Do you have any suggestions for alternative defaults?
Need for extra features
Should we consider adding extra features such as an upper_bound score?
What other features do you think would be beneficial?
Benefit of other techniques
Currently, we are adding the lower_bound feature to min-max normalization but not to L2 normalization.
Do you think it would be beneficial to add the lower_bound feature to L2 normalization as well?
Your insights will help us refine the proposal to better meet the needs of our users. Thank you for your valuable feedback!