Adds ZScore Normalization Technique #1224

Open
wants to merge 4 commits into
base: main

Conversation

owaiskazi19
Member

Description

Adds ZScore Normalization Technique

Related Issues

Resolves #376 and #1209

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Owais <owaiskazi19@gmail.com>

codecov bot commented Mar 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.87%. Comparing base (5f25d6c) to head (bc43a35).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1224      +/-   ##
============================================
+ Coverage     81.80%   81.87%   +0.06%     
+ Complexity     2606     1337    -1269     
============================================
  Files           190       96      -94     
  Lines          8922     4568    -4354     
  Branches       1520      787     -733     
============================================
- Hits           7299     3740    -3559     
+ Misses         1032      525     -507     
+ Partials        591      303     -288     

// case when technique is z score and combination is not arithmetic mean
if (normalizationTechniqueName.equals(ZScoreNormalizationTechnique.TECHNIQUE_NAME)
&& !combinationTechnique.equals(ArithmeticMeanScoreCombinationTechnique.TECHNIQUE_NAME)) {
throw new IllegalArgumentException("Z Score supports only arithmetic_mean combination technique");
Collaborator

Any reason why Z-Score doesn't support other combination techniques?

Comment on lines +79 to +80
if (normalizationTechniqueName.equals(ZScoreNormalizationTechnique.TECHNIQUE_NAME)
&& !combinationTechnique.equals(ArithmeticMeanScoreCombinationTechnique.TECHNIQUE_NAME)) {
Collaborator

[Suggestion]
Which combination techniques each normalization technique supports should be abstracted into the NormalizationTechnique class. For example, there could be a function such as validateCombinationTechnique() that checks whether a given combination technique is valid for this NormalizationTechnique.

Otherwise these if/else conditions will just bloat the code. It's a small refactoring, but it will go a long way toward keeping the code maintainable in the long term.
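
A minimal sketch of what this suggestion could look like, for illustration only. The interface and class names below (NormalizationTechniqueSketch, MinMaxTechniqueSketch, ZScoreTechniqueSketch) are hypothetical and are not the plugin's actual types; only the technique names "z_score", "min_max", and "arithmetic_mean" and the error message come from this PR.

```java
import java.util.Set;

// Hypothetical interface: each normalization technique declares which
// combination techniques it accepts, so the factory needs no if/else chain.
interface NormalizationTechniqueSketch {
    String techniqueName();

    // Default behavior: every combination technique is accepted.
    default void validateCombinationTechnique(String combinationTechniqueName) {
        // no restriction by default
    }
}

// Hypothetical min_max stand-in: inherits the permissive default.
final class MinMaxTechniqueSketch implements NormalizationTechniqueSketch {
    @Override
    public String techniqueName() {
        return "min_max";
    }
}

// Hypothetical z_score stand-in: restricts itself to arithmetic_mean.
final class ZScoreTechniqueSketch implements NormalizationTechniqueSketch {
    private static final Set<String> SUPPORTED_COMBINATIONS = Set.of("arithmetic_mean");

    @Override
    public String techniqueName() {
        return "z_score";
    }

    @Override
    public void validateCombinationTechnique(String combinationTechniqueName) {
        if (!SUPPORTED_COMBINATIONS.contains(combinationTechniqueName)) {
            throw new IllegalArgumentException(
                "Z Score supports only arithmetic_mean combination technique");
        }
    }
}
```

With this shape, the factory would call validateCombinationTechnique(...) on whichever technique was selected, and adding a new restriction would only touch the technique that owns it.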

Member

+1

.size();
}

static private float[] findScoreSumPerSubQuery(final List<CompoundTopDocs> queryTopDocs, final int numOfScores) {
Collaborator

Suggested change
- static private float[] findScoreSumPerSubQuery(final List<CompoundTopDocs> queryTopDocs, final int numOfScores) {
+ private static float[] findScoreSumPerSubQuery(final List<CompoundTopDocs> queryTopDocs, final int numOfScores) {

The same applies to all the functions below.

*/
@AllArgsConstructor
@Getter
private class ZScores {
Collaborator

With JDK 21 you can use the record keyword for classes like this. Try it out. Ref: https://www.baeldung.com/java-record-keyword
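
A rough sketch of the record form, under stated assumptions: the excerpt does not show the fields of ZScores, so the component names meanPerSubQuery and standardDeviationPerSubQuery are hypothetical, as is the record name ZScoresSketch.

```java
// Hypothetical components; the real ZScores fields are not visible in this excerpt.
// A record generates the canonical constructor and accessors, which is what
// @AllArgsConstructor and @Getter provide on the current class.
record ZScoresSketch(float[] meanPerSubQuery, float[] standardDeviationPerSubQuery) {

    // Compact canonical constructor: records can still validate their components.
    ZScoresSketch {
        if (meanPerSubQuery.length != standardDeviationPerSubQuery.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
    }
}
```

Two things to keep in mind with this form: accessors are named meanPerSubQuery() rather than getMeanPerSubQuery(), and the generated equals() compares array components by reference, not by content.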

Member

vibrantvarun left a comment

Completed 1st round of review

@@ -50,8 +51,9 @@ public SearchPhaseResultsProcessor create(
) throws Exception {
Map<String, Object> normalizationClause = readOptionalMap(NormalizationProcessor.TYPE, tag, config, NORMALIZATION_CLAUSE);
ScoreNormalizationTechnique normalizationTechnique = ScoreNormalizationFactory.DEFAULT_METHOD;
String normalizationTechniqueName = MinMaxScoreNormalizationTechnique.TECHNIQUE_NAME;
Member

Why do we take a default value here?

@@ -73,6 +75,11 @@ public SearchPhaseResultsProcessor create(
TECHNIQUE,
ArithmeticMeanScoreCombinationTechnique.TECHNIQUE_NAME
);
// case when technique is z score and combination is not arithmetic mean
if (normalizationTechniqueName.equals(ZScoreNormalizationTechnique.TECHNIQUE_NAME)
Member

Can you extract this validation from this method into a separate one? Also, can you add a note on why we are blocking these techniques?

Comment on lines +79 to +80
if (normalizationTechniqueName.equals(ZScoreNormalizationTechnique.TECHNIQUE_NAME)
&& !combinationTechnique.equals(ArithmeticMeanScoreCombinationTechnique.TECHNIQUE_NAME)) {
Member

+1

import static org.opensearch.neuralsearch.processor.explain.ExplanationUtils.getDocIdAtQueryForNormalization;

@ToString(onlyExplicitlyIncluded = true)
public class ZScoreNormalizationTechnique implements ScoreNormalizationTechnique, ExplainableTechnique {
Member

Please add Javadoc for the class.

return sum;
}

private ZScores getZScoreResults(final List<CompoundTopDocs> queryTopDocs) {
Member

Could you describe the logic for determining the z-score in comments?
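
For reference, the standard z-score computation is z = (score - mean) / standard deviation, applied per sub-query. The sketch below illustrates that for a single sub-query's scores. It is not the PR's implementation: the class and method names are hypothetical, it assumes the population standard deviation (dividing by n), and it assumes a zero standard deviation maps every score to 0.0, which may differ from how the PR handles that case.

```java
// Hypothetical helper illustrating the z-score logic for one sub-query.
final class ZScoreMathSketch {
    private ZScoreMathSketch() {}

    // Arithmetic mean; the caller guarantees a non-empty array.
    static double mean(float[] scores) {
        double sum = 0.0;
        for (float score : scores) {
            sum += score;
        }
        return sum / scores.length;
    }

    // Population standard deviation (divides by n). Assumption: the excerpt
    // does not show whether the PR divides by n or by n - 1.
    static double standardDeviation(float[] scores, double mean) {
        double squaredDeltaSum = 0.0;
        for (float score : scores) {
            double delta = score - mean;
            squaredDeltaSum += delta * delta;
        }
        return Math.sqrt(squaredDeltaSum / scores.length);
    }

    // z = (score - mean) / standardDeviation for every score.
    static float[] zScores(float[] scores) {
        float[] result = new float[scores.length];
        if (scores.length == 0) {
            return result;
        }
        double mean = mean(scores);
        double std = standardDeviation(scores, mean);
        for (int i = 0; i < scores.length; i++) {
            // Assumption: identical scores (std == 0) all map to 0.0.
            result[i] = std == 0.0 ? 0.0f : (float) ((scores[i] - mean) / std);
        }
        return result;
    }
}
```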

return meanPerSubQuery;
}

static private float[] findStdPerSubquery(
Member

Does std stand for standard deviation? If so, could you rename this method to something more appropriate?

return numberOfElementsPerSubQuery;
}

static private float[] findMeanPerSubquery(final float[] sumPerSubquery, final long[] elementsPerSubquery) {
Member

Suggested change
- static private float[] findMeanPerSubquery(final float[] sumPerSubquery, final long[] elementsPerSubquery) {
+ static private float[] calculateMeanPerSubquery(final float[] sumPerSubquery, final long[] elementsPerSubquery) {

public class ZScoreNormalizationTechnique implements ScoreNormalizationTechnique, ExplainableTechnique {
@ToString.Include
public static final String TECHNIQUE_NAME = "z_score";
private static final float SINGLE_RESULT_SCORE = 1.0f;
Member

Is 1.0 the maximum possible score for z-score? We use 1.0 for min_max because its scores are in the [0.0, 1.0] interval.

@ToString.Include
public static final String TECHNIQUE_NAME = "z_score";
private static final float SINGLE_RESULT_SCORE = 1.0f;
private static final float MIN_SCORE = 0.001f;
Member

Similar to the single result score: min_score is what we return in place of the lowest possible score. For example, if raw scores 2.0 and 5.0 are normalized to 0.0 and 1.0, we return min_score instead of 0.0. Because z-scores are not in the [0.0, 1.0] interval, we need to compute the min score somehow.
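
To make the range concern concrete, here is the z-score arithmetic for the raw scores 2.0 and 5.0 from the example, assuming the population standard deviation (dividing by n). The excerpt does not show which variant the implementation uses, so treat the numbers as illustrative.

```latex
\mu = \frac{2.0 + 5.0}{2} = 3.5, \qquad
\sigma = \sqrt{\frac{(2.0 - 3.5)^2 + (5.0 - 3.5)^2}{2}} = 1.5
\\
z_1 = \frac{2.0 - 3.5}{1.5} = -1.0, \qquad
z_2 = \frac{5.0 - 3.5}{1.5} = 1.0
```

Under that assumption the lowest score maps to -1.0 rather than 0.0, and in general the z-score range depends on the data, so constants chosen for the [0.0, 1.0] range of min_max do not carry over directly.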

final long[] elementsPerSubquery,
final int numOfScores
) {
final double[] deltaSumPerSubquery = new double[numOfScores];
Member

Any reason why we're doing this manually instead of using a function from a library, e.g. DescriptiveStatistics from Apache Commons Math? If it's a well-known library, I'd have more confidence in it for edge-case scenarios.

Labels
Features (introduces a new unit of functionality that satisfies a requirement), hybrid search, v3.0.0
Development

Successfully merging this pull request may close these issues.

[FEATURE] Add z-score for the normalization processor
4 participants