-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds ZScore Normalization Technique #1224
base: main
Are you sure you want to change the base?
Conversation
a245d6a
to
c94279d
Compare
Signed-off-by: Owais <owaiskazi19@gmail.com>
Signed-off-by: Owais <owaiskazi19@gmail.com>
Signed-off-by: Owais <owaiskazi19@gmail.com>
Signed-off-by: Owais <owaiskazi19@gmail.com>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1224 +/- ##
============================================
+ Coverage 81.80% 81.87% +0.06%
+ Complexity 2606 1337 -1269
============================================
Files 190 96 -94
Lines 8922 4568 -4354
Branches 1520 787 -733
============================================
- Hits 7299 3740 -3559
+ Misses 1032 525 -507
+ Partials 591 303 -288 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
// case when technique is z score and combination is not arithmetic mean | ||
if (normalizationTechniqueName.equals(ZScoreNormalizationTechnique.TECHNIQUE_NAME) | ||
&& !combinationTechnique.equals(ArithmeticMeanScoreCombinationTechnique.TECHNIQUE_NAME)) { | ||
throw new IllegalArgumentException("Z Score supports only arithmetic_mean combination technique"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason why Z-Score doesn't support other combination techniques?
if (normalizationTechniqueName.equals(ZScoreNormalizationTechnique.TECHNIQUE_NAME) | ||
&& !combinationTechnique.equals(ArithmeticMeanScoreCombinationTechnique.TECHNIQUE_NAME)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Suggestion]
What combination techniques are supported with what Normalization technique should get abstracted in NormalizationTechnique class. Example there can be a function which says validateCombinationTechnique()
, where you can validate what combination technique is valid for this NormalizationTechnique.
Otherwise these if else condition will just bloat up the whole code. Its a small refactoring but it will go long way in maintaining the code for long term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
.size(); | ||
} | ||
|
||
static private float[] findScoreSumPerSubQuery(final List<CompoundTopDocs> queryTopDocs, final int numOfScores) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static private float[] findScoreSumPerSubQuery(final List<CompoundTopDocs> queryTopDocs, final int numOfScores) { | |
private static float[] findScoreSumPerSubQuery(final List<CompoundTopDocs> queryTopDocs, final int numOfScores) { |
and same for all the below functions
*/ | ||
@AllArgsConstructor | ||
@Getter | ||
private class ZScores { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with JDK 21 you can now use record key word with classes like this. Try it out. Ref: https://www.baeldung.com/java-record-keyword
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completed 1st round of review
@@ -50,8 +51,9 @@ public SearchPhaseResultsProcessor create( | |||
) throws Exception { | |||
Map<String, Object> normalizationClause = readOptionalMap(NormalizationProcessor.TYPE, tag, config, NORMALIZATION_CLAUSE); | |||
ScoreNormalizationTechnique normalizationTechnique = ScoreNormalizationFactory.DEFAULT_METHOD; | |||
String normalizationTechniqueName = MinMaxScoreNormalizationTechnique.TECHNIQUE_NAME; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we take a default value here?
@@ -73,6 +75,11 @@ public SearchPhaseResultsProcessor create( | |||
TECHNIQUE, | |||
ArithmeticMeanScoreCombinationTechnique.TECHNIQUE_NAME | |||
); | |||
// case when technique is z score and combination is not arithmetic mean | |||
if (normalizationTechniqueName.equals(ZScoreNormalizationTechnique.TECHNIQUE_NAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you extract this validation out of this method to a separate one? Also add why we are blocking these techniques?
if (normalizationTechniqueName.equals(ZScoreNormalizationTechnique.TECHNIQUE_NAME) | ||
&& !combinationTechnique.equals(ArithmeticMeanScoreCombinationTechnique.TECHNIQUE_NAME)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
import static org.opensearch.neuralsearch.processor.explain.ExplanationUtils.getDocIdAtQueryForNormalization; | ||
|
||
@ToString(onlyExplicitlyIncluded = true) | ||
public class ZScoreNormalizationTechnique implements ScoreNormalizationTechnique, ExplainableTechnique { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please Add javadoc of the class.
return sum; | ||
} | ||
|
||
private ZScores getZScoreResults(final List<CompoundTopDocs> queryTopDocs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a logic of determining Zscore in comments?
return meanPerSubQuery; | ||
} | ||
|
||
static private float[] findStdPerSubquery( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does std stands for standard deviation? If yes could you rename this method more appropriate ?
return numberOfElementsPerSubQuery; | ||
} | ||
|
||
static private float[] findMeanPerSubquery(final float[] sumPerSubquery, final long[] elementsPerSubquery) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static private float[] findMeanPerSubquery(final float[] sumPerSubquery, final long[] elementsPerSubquery) { | |
static private float[] calculateMeanPerSubquery(final float[] sumPerSubquery, final long[] elementsPerSubquery) { |
public class ZScoreNormalizationTechnique implements ScoreNormalizationTechnique, ExplainableTechnique { | ||
@ToString.Include | ||
public static final String TECHNIQUE_NAME = "z_score"; | ||
private static final float SINGLE_RESULT_SCORE = 1.0f; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is 1.0 a max possible score for z-score? we do use 1.0 for min_max score because scores are in [0.0...1.0] interval
@ToString.Include | ||
public static final String TECHNIQUE_NAME = "z_score"; | ||
private static final float SINGLE_RESULT_SCORE = 1.0f; | ||
private static final float MIN_SCORE = 0.001f; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar to single result score, min_score is something we return for the lowest possible score, e.g. if raw scores 2.0 and 5.0 normalized to 0.0 and 1.0 then instead of 0.0 we return that score. Because scores are not in [0.0...1.0] interval we need to compute min score somehow
final long[] elementsPerSubquery, | ||
final int numOfScores | ||
) { | ||
final double[] deltaSumPerSubquery = new double[numOfScores]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason why we're doing this manually instead of using function from some library, e.g. DescriptiveStatistics from apache-commons math? Reason being - if that's a well-known library I do have more confidence in it for the edge case scenarios.
Description
Adds ZScore Normalization Technique
Related Issues
Resolves #376 and #1209
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.