
Commit b754232
vibrantvarun authored and github-actions[bot] committed
[Feature] Enable sorting and search_after features in hybrid search (#827)
* Fix jdk version for CI test secure cluster action (#801) (#806) Signed-off-by: Martin Gaievski <gaievski@amazon.com> Co-authored-by: Martin Gaievski <gaievski@amazon.com>
* [Part 1] Collector for Sorting Results (#797)
* [Part 2] Normalization Phase for Sorting (#802)
* Normalization Phase for Sorting Signed-off-by: Varun Jain <varunudr@amazon.com>
* Fixing compile test issue Signed-off-by: Varun Jain <varunudr@amazon.com>
* Optimize code Signed-off-by: Varun Jain <varunudr@amazon.com>
* Add method description Signed-off-by: Varun Jain <varunudr@amazon.com>
* [Part 1] Collector for Sorting Results (#797)
* HybridSearchSortUtil class Signed-off-by: Varun Jain <varunudr@amazon.com>
* Add Integ Tests Signed-off-by: Varun Jain <varunudr@amazon.com>
* Add Sorting Integ tests Signed-off-by: Varun Jain <varunudr@amazon.com>
* Add integ test for Sorting Signed-off-by: Varun Jain <varunudr@amazon.com>
* Refactoring normalization processor workflow Signed-off-by: Varun Jain <varunudr@amazon.com>
* Fix Unit Tests Signed-off-by: Varun Jain <varunudr@amazon.com>
* Refactoring Signed-off-by: Varun Jain <varunudr@amazon.com>
* Refactoring Signed-off-by: Varun Jain <varunudr@amazon.com>
* Address Martin Comments Signed-off-by: Varun Jain <varunudr@amazon.com>
* Optimising Normalization Signed-off-by: Varun Jain <varunudr@amazon.com>
* Address Martin Comments Signed-off-by: Varun Jain <varunudr@amazon.com>
* Address Martin Comments Signed-off-by: Varun Jain <varunudr@amazon.com>
* Addressing Martin Comments Signed-off-by: Varun Jain <varunudr@amazon.com>
* Addressing Vijay comments Signed-off-by: Varun Jain <varunudr@amazon.com>
* Address Vijay Comments Signed-off-by: Varun Jain <varunudr@amazon.com>
---------
Signed-off-by: Varun Jain <varunudr@amazon.com>
* Update bwc workflow to include 2.16.0-SNAPSHOT (#809) (#810)
* Increment BWC version
* Append 2.16.0-SNAPSHOT in restart upgrade tests
---------
Signed-off-by: Varun Jain <varunudr@amazon.com>
* [Part 3] Concurrent segment search bug in Sorting (#808)
* Cherry picking Concurrent Segment Search Bug Commit Signed-off-by: Varun Jain <varunudr@amazon.com>
* Fix Concurrent Segment Search Bug in Sorting Signed-off-by: Varun Jain <varunudr@amazon.com>
* Functional Interface Signed-off-by: Varun Jain <varunudr@amazon.com>
* Addressing Martin Comments Signed-off-by: Varun Jain <varunudr@amazon.com>
* Removing comments Signed-off-by: Varun Jain <varunudr@amazon.com>
* Addressing Martin Comments Signed-off-by: Varun Jain <varunudr@amazon.com>
* Addressing Martin Comments Signed-off-by: Varun Jain <varunudr@amazon.com>
* Addressing Martin comments Signed-off-by: Varun Jain <varunudr@amazon.com>
* Address Martin Comments Signed-off-by: Varun Jain <varunudr@amazon.com>
* Address Martin Comments Signed-off-by: Varun Jain <varunudr@amazon.com>
---------
Signed-off-by: Varun Jain <varunudr@amazon.com>
Co-authored-by: Martin Gaievski <gaievski@amazon.com>
* Rebasing with main (#826)
* Adds method_parameters in neural search query to support ef_search (#787) (#814) Signed-off-by: Tejas Shah <shatejas@amazon.com>
* Add BWC for batch ingestion (#769)
* Add BWC for batch ingestion Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
* Update Changelog Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
* Fix spotlessLicenseCheck Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
* Fix comments Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
* Reuse the same code Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
* Rename some functions Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
* Rename a function Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
* Minor change to trigger rebuild Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
---------
Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
* Neural sparse query two-phase search processor's bwc test (#777)
* Poc of pipeline Signed-off-by: conggguan <congguan@amazon.com>
* Complete some settings for two phase pipeline. Signed-off-by: conggguan <congguan@amazon.com>
* Change the implementation of two-phase from QueryBuilderVistor to a custom process function. Signed-off-by: conggguan <congguan@amazon.com>
* Add IT and fix some bugs on the state of multiple identical NeuralSparseQueryBuilders. Signed-off-by: conggguan <congguan@amazon.com>
* Simplify some logic, and correct some format. Signed-off-by: conggguan <congguan@amazon.com>
* Optimize some format. Signed-off-by: conggguan <congguan@amazon.com>
* Add some test case. Signed-off-by: conggguan <congguan@amazon.com>
* Optimize some logic for zhichao-aws's comments. Signed-off-by: conggguan <congguan@amazon.com>
* Optimize a line without application. Signed-off-by: conggguan <congguan@amazon.com>
* Add some comments, remove some redundant lines, fix some format. Signed-off-by: conggguan <congguan@amazon.com>
* Remove a redundant null check, fix an if format. Signed-off-by: conggguan <congguan@amazon.com>
* Fix a typo for a comment, camelcase format for some variable. Signed-off-by: conggguan <congguan@amazon.com>
* Add some comments to illustrate the influence of the modification of the 2-phase search pipeline on the neural sparse query builder. Signed-off-by: conggguan <congguan@amazon.com>
* Add restart and rolling upgrade bwc test for neural sparse two phase processor. Signed-off-by: conggguan <congguan@amazon.com>
* Spotless on qa. Signed-off-by: conggguan <congguan@amazon.com>
* Update change log for two-phase BWC test. Signed-off-by: conggguan <congguan@amazon.com>
* Remove redundant lines of two-phase BWC test. Signed-off-by: conggguan <congguan@amazon.com>
* Add changelog. Signed-off-by: conggguan <congguan@amazon.com>
* Add the PR link and number for the CHANGELOG.md. Signed-off-by: conggguan <congguan@amazon.com>
* [Fix] NeuralSparseTwoPhaseProcessorIT created wrong ingest pipeline, fix it to correct API. Signed-off-by: conggguan <congguan@amazon.com>
---------
Signed-off-by: conggguan <congguan@amazon.com>
Signed-off-by: conggguan <157357330+conggguan@users.noreply.github.com>
* Enable '.' for nested field in text embedding processor (#811)
* Added nested structure for text embed processor mapping Signed-off-by: Martin Gaievski <gaievski@amazon.com>
* Fix linux build CI error due to action runner env upgrade node 20 (#821)
* Fix linux build CI error due to action runner env upgrade node 20 Signed-off-by: Varun Jain <varunudr@amazon.com>
* Fix linux build on additional integ tests Signed-off-by: Varun Jain <varunudr@amazon.com>
---------
Signed-off-by: Varun Jain <varunudr@amazon.com>
---------
Signed-off-by: Tejas Shah <shatejas@amazon.com>
Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
Signed-off-by: conggguan <congguan@amazon.com>
Signed-off-by: conggguan <157357330+conggguan@users.noreply.github.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Signed-off-by: Varun Jain <varunudr@amazon.com>
Co-authored-by: Tejas Shah <shatejas@amazon.com>
Co-authored-by: Liyun Xiu <chishui2@gmail.com>
Co-authored-by: conggguan <157357330+conggguan@users.noreply.github.com>
Co-authored-by: Martin Gaievski <gaievski@amazon.com>
* Add changelog Signed-off-by: Varun Jain <varunudr@amazon.com>
---------
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Signed-off-by: Varun Jain <varunudr@amazon.com>
Signed-off-by: Tejas Shah <shatejas@amazon.com>
Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
Signed-off-by: conggguan <congguan@amazon.com>
Signed-off-by: conggguan <157357330+conggguan@users.noreply.github.com>
Co-authored-by: Martin Gaievski <gaievski@amazon.com>
Co-authored-by: Tejas Shah <shatejas@amazon.com>
Co-authored-by: Liyun Xiu <chishui2@gmail.com>
Co-authored-by: conggguan <157357330+conggguan@users.noreply.github.com>
(cherry picked from commit d22e1b8)
1 parent 1acdc67 commit b754232
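For background, the "sorting" and "search_after" in the feature name correspond to Lucene's sorted search and searchAfter pagination, which the shard-level changes in this commit produce and consume. The sketch below is illustrative only and is not plugin code: the in-memory index, the stock field, and all values are invented to show the two primitives in isolation.

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.FieldDoc;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;

public class SortSearchAfterDemo {
    public static void main(String[] args) throws IOException {
        ByteBuffersDirectory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
            for (long stock : new long[] { 30, 10, 20 }) {
                Document doc = new Document();
                doc.add(new NumericDocValuesField("stock", stock)); // doc values are what the sort reads
                writer.addDocument(doc);
            }
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Sort sort = new Sort(new SortField("stock", SortField.Type.LONG));

            // "sort": hits come back ordered by the sort field instead of by relevance score.
            TopDocs page1 = searcher.search(new MatchAllDocsQuery(), 2, sort);
            // "search_after": resume after the last hit of the previous page using its sort values.
            FieldDoc after = (FieldDoc) page1.scoreDocs[page1.scoreDocs.length - 1];
            TopDocs page2 = searcher.searchAfter(after, new MatchAllDocsQuery(), 2, sort);

            System.out.println(((FieldDoc) page1.scoreDocs[0]).fields[0]); // 10
            System.out.println(((FieldDoc) page2.scoreDocs[0]).fields[0]); // 30
        }
    }
}
```

With a sort in place every hit is a FieldDoc whose fields array holds the sort values, and that array is what a search_after request echoes back to resume from; the diffs below are about preserving that information through hybrid-score normalization and combination.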

36 files changed: +3521 -211 lines changed

CHANGELOG.md (+1)

@@ -14,6 +14,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 
 ## [Unreleased 2.x](https://github.com/opensearch-project/neural-search/compare/2.15...2.x)
 ### Features
+- Enable sorting and search_after features in Hybrid Search [#827](https://github.com/opensearch-project/neural-search/pull/827)
 ### Enhancements
 - Adds dynamic knn query parameters efsearch and nprobes [#814](https://github.com/opensearch-project/neural-search/pull/814/)
 - Enable '.' for nested field in text embedding processor ([#811](https://github.com/opensearch-project/neural-search/pull/811))

src/main/java/org/opensearch/neuralsearch/processor/CompoundTopDocs.java (+36 -15)

@@ -4,18 +4,17 @@
  */
 package org.opensearch.neuralsearch.processor;
 
+import org.apache.lucene.search.FieldDoc;
+import org.apache.lucene.search.ScoreDoc;
+import org.apache.lucene.search.TotalHits;
+import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.search.TopFieldDocs;
 import static org.opensearch.neuralsearch.search.util.HybridSearchResultFormatUtil.isHybridQueryDelimiterElement;
 import static org.opensearch.neuralsearch.search.util.HybridSearchResultFormatUtil.isHybridQueryStartStopElement;
 
 import java.util.ArrayList;
-import java.util.Arrays;
 import java.util.List;
 import java.util.Objects;
-import java.util.stream.Collectors;
-
-import org.apache.lucene.search.ScoreDoc;
-import org.apache.lucene.search.TopDocs;
-import org.apache.lucene.search.TotalHits;
 
 import lombok.AllArgsConstructor;
 import lombok.Getter;
@@ -39,14 +38,14 @@ public class CompoundTopDocs {
     @Setter
     private List<ScoreDoc> scoreDocs;
 
-    public CompoundTopDocs(final TotalHits totalHits, final List<TopDocs> topDocs) {
-        initialize(totalHits, topDocs);
+    public CompoundTopDocs(final TotalHits totalHits, final List<TopDocs> topDocs, final boolean isSortEnabled) {
+        initialize(totalHits, topDocs, isSortEnabled);
     }
 
-    private void initialize(TotalHits totalHits, List<TopDocs> topDocs) {
+    private void initialize(TotalHits totalHits, List<TopDocs> topDocs, boolean isSortEnabled) {
         this.totalHits = totalHits;
         this.topDocs = topDocs;
-        scoreDocs = cloneLargestScoreDocs(topDocs);
+        scoreDocs = cloneLargestScoreDocs(topDocs, isSortEnabled);
     }
 
     /**
@@ -74,9 +73,13 @@ private void initialize(TotalHits totalHits, List<TopDocs> topDocs) {
      * 0, 9549511920.4881596047
      */
     public CompoundTopDocs(final TopDocs topDocs) {
+        boolean isSortEnabled = false;
+        if (topDocs instanceof TopFieldDocs) {
+            isSortEnabled = true;
+        }
         ScoreDoc[] scoreDocs = topDocs.scoreDocs;
         if (Objects.isNull(scoreDocs) || scoreDocs.length < 2) {
-            initialize(topDocs.totalHits, new ArrayList<>());
+            initialize(topDocs.totalHits, new ArrayList<>(), isSortEnabled);
             return;
         }
         // skipping first two elements, it's a start-stop element and delimiter for first series
@@ -88,17 +91,22 @@ public CompoundTopDocs(final TopDocs topDocs) {
             if (isHybridQueryDelimiterElement(scoreDoc) || isHybridQueryStartStopElement(scoreDoc)) {
                 ScoreDoc[] subQueryScores = scoreDocList.toArray(new ScoreDoc[0]);
                 TotalHits totalHits = new TotalHits(subQueryScores.length, TotalHits.Relation.EQUAL_TO);
-                TopDocs subQueryTopDocs = new TopDocs(totalHits, subQueryScores);
+                TopDocs subQueryTopDocs;
+                if (isSortEnabled) {
+                    subQueryTopDocs = new TopFieldDocs(totalHits, subQueryScores, ((TopFieldDocs) topDocs).fields);
+                } else {
+                    subQueryTopDocs = new TopDocs(totalHits, subQueryScores);
+                }
                 topDocsList.add(subQueryTopDocs);
                 scoreDocList.clear();
             } else {
                 scoreDocList.add(scoreDoc);
             }
         }
-        initialize(topDocs.totalHits, topDocsList);
+        initialize(topDocs.totalHits, topDocsList, isSortEnabled);
     }
 
-    private List<ScoreDoc> cloneLargestScoreDocs(final List<TopDocs> docs) {
+    private List<ScoreDoc> cloneLargestScoreDocs(final List<TopDocs> docs, boolean isSortEnabled) {
         if (docs == null) {
             return null;
         }
@@ -113,7 +121,20 @@ private List<ScoreDoc> cloneLargestScoreDocs(final List<TopDocs> docs) {
                 maxScoreDocs = topDoc.scoreDocs;
             }
         }
+
         // do deep copy
-        return Arrays.stream(maxScoreDocs).map(doc -> new ScoreDoc(doc.doc, doc.score, doc.shardIndex)).collect(Collectors.toList());
+        List<ScoreDoc> scoreDocs = new ArrayList<>();
+        for (ScoreDoc scoreDoc : maxScoreDocs) {
+            scoreDocs.add(deepCopyScoreDoc(scoreDoc, isSortEnabled));
+        }
+        return scoreDocs;
+    }
+
+    private ScoreDoc deepCopyScoreDoc(final ScoreDoc scoreDoc, final boolean isSortEnabled) {
+        if (!isSortEnabled) {
+            return new ScoreDoc(scoreDoc.doc, scoreDoc.score, scoreDoc.shardIndex);
+        }
+        FieldDoc fieldDoc = (FieldDoc) scoreDoc;
+        return new FieldDoc(fieldDoc.doc, fieldDoc.score, fieldDoc.fields, fieldDoc.shardIndex);
     }
 }
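A note on the change above: once sorting is enabled, each sub-query's hits are FieldDoc instances inside a TopFieldDocs, and the old ScoreDoc-only deep copy would have dropped the per-hit sort values. The standalone sketch below shows the distinction the new deepCopyScoreDoc makes; only the Lucene types are real, while the class name, the stock sort field, and the values are invented for illustration.

```java
import org.apache.lucene.search.FieldDoc;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopFieldDocs;
import org.apache.lucene.search.TotalHits;

public class SortAwareCopyDemo {

    // Deep copy of a single hit; keeps the sort values only when the query was sorted.
    static ScoreDoc copy(ScoreDoc hit, boolean isSortEnabled) {
        if (!isSortEnabled) {
            return new ScoreDoc(hit.doc, hit.score, hit.shardIndex);
        }
        FieldDoc fieldDoc = (FieldDoc) hit;
        return new FieldDoc(fieldDoc.doc, fieldDoc.score, fieldDoc.fields, fieldDoc.shardIndex);
    }

    public static void main(String[] args) {
        // A sorted shard result: Lucene hands it back as TopFieldDocs whose hits are FieldDoc objects.
        TopDocs result = new TopFieldDocs(
            new TotalHits(1, TotalHits.Relation.EQUAL_TO),
            new ScoreDoc[] { new FieldDoc(42, Float.NaN, new Object[] { 25L }, 0) },
            new SortField[] { new SortField("stock", SortField.Type.LONG) }
        );
        // Same detection the patched constructor uses: sort mode == "is this a TopFieldDocs?".
        boolean isSortEnabled = result instanceof TopFieldDocs;
        FieldDoc copied = (FieldDoc) copy(result.scoreDocs[0], isSortEnabled);
        System.out.println(copied.doc + " -> sort value " + copied.fields[0]); // prints 42 -> sort value 25
    }
}
```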

src/main/java/org/opensearch/neuralsearch/processor/NormalizationProcessorWorkflow.java (+65 -11)

@@ -15,7 +15,11 @@
 
 import org.apache.lucene.search.ScoreDoc;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.TopFieldDocs;
+import org.apache.lucene.search.FieldDoc;
 import org.opensearch.common.lucene.search.TopDocsAndMaxScore;
+import org.opensearch.neuralsearch.processor.combination.CombineScoresDto;
 import org.opensearch.neuralsearch.processor.combination.ScoreCombinationTechnique;
 import org.opensearch.neuralsearch.processor.combination.ScoreCombiner;
 import org.opensearch.neuralsearch.processor.normalization.ScoreNormalizationTechnique;
@@ -27,6 +31,8 @@
 
 import lombok.AllArgsConstructor;
 import lombok.extern.log4j.Log4j2;
+import static org.opensearch.neuralsearch.processor.combination.ScoreCombiner.MAX_SCORE_WHEN_NO_HITS_FOUND;
+import static org.opensearch.neuralsearch.search.util.HybridSearchSortUtil.evaluateSortCriteria;
 
 /**
  * Class abstracts steps required for score normalization and combination, this includes pre-processing of incoming data
@@ -62,13 +68,20 @@ public void execute(
         log.debug("Do score normalization");
         scoreNormalizer.normalizeScores(queryTopDocs, normalizationTechnique);
 
+        CombineScoresDto combineScoresDTO = CombineScoresDto.builder()
+            .queryTopDocs(queryTopDocs)
+            .scoreCombinationTechnique(combinationTechnique)
+            .querySearchResults(querySearchResults)
+            .sort(evaluateSortCriteria(querySearchResults, queryTopDocs))
+            .build();
+
         // combine
         log.debug("Do score combination");
-        scoreCombiner.combineScores(queryTopDocs, combinationTechnique);
+        scoreCombiner.combineScores(combineScoresDTO);
 
         // post-process data
         log.debug("Post-process query results after score normalization and combination");
-        updateOriginalQueryResults(querySearchResults, queryTopDocs);
+        updateOriginalQueryResults(combineScoresDTO);
         updateOriginalFetchResults(querySearchResults, fetchSearchResultOptional, unprocessedDocIds);
     }
 
@@ -96,7 +109,23 @@ private List<CompoundTopDocs> getQueryTopDocs(final List<QuerySearchResult> quer
         return queryTopDocs;
     }
 
-    private void updateOriginalQueryResults(final List<QuerySearchResult> querySearchResults, final List<CompoundTopDocs> queryTopDocs) {
+    private void updateOriginalQueryResults(final CombineScoresDto combineScoresDTO) {
+        final List<QuerySearchResult> querySearchResults = combineScoresDTO.getQuerySearchResults();
+        final List<CompoundTopDocs> queryTopDocs = getCompoundTopDocs(combineScoresDTO, querySearchResults);
+        final Sort sort = combineScoresDTO.getSort();
+        for (int index = 0; index < querySearchResults.size(); index++) {
+            QuerySearchResult querySearchResult = querySearchResults.get(index);
+            CompoundTopDocs updatedTopDocs = queryTopDocs.get(index);
+            TopDocsAndMaxScore updatedTopDocsAndMaxScore = new TopDocsAndMaxScore(
+                buildTopDocs(updatedTopDocs, sort),
+                maxScoreForShard(updatedTopDocs, sort != null)
+            );
+            querySearchResult.topDocs(updatedTopDocsAndMaxScore, querySearchResult.sortValueFormats());
+        }
+    }
+
+    private List<CompoundTopDocs> getCompoundTopDocs(CombineScoresDto combineScoresDTO, List<QuerySearchResult> querySearchResults) {
+        final List<CompoundTopDocs> queryTopDocs = combineScoresDTO.getQueryTopDocs();
         if (querySearchResults.size() != queryTopDocs.size()) {
             throw new IllegalStateException(
                 String.format(
@@ -107,17 +136,42 @@ private void updateOriginalQueryResults(final List<QuerySearchResult> querySearc
                 )
             );
         }
-        for (int index = 0; index < querySearchResults.size(); index++) {
-            QuerySearchResult querySearchResult = querySearchResults.get(index);
-            CompoundTopDocs updatedTopDocs = queryTopDocs.get(index);
-            float maxScore = updatedTopDocs.getTotalHits().value > 0 ? updatedTopDocs.getScoreDocs().get(0).score : 0.0f;
+        return queryTopDocs;
+    }
 
-            // create final version of top docs with all updated values
-            TopDocs topDocs = new TopDocs(updatedTopDocs.getTotalHits(), updatedTopDocs.getScoreDocs().toArray(new ScoreDoc[0]));
+    /**
+     * Get Max score on Shard
+     * @param updatedTopDocs updatedTopDocs compound top docs on a shard
+     * @param isSortEnabled if sort is enabled or disabled
+     * @return max score
+     */
+    private float maxScoreForShard(CompoundTopDocs updatedTopDocs, boolean isSortEnabled) {
+        if (updatedTopDocs.getTotalHits().value == 0 || updatedTopDocs.getScoreDocs().isEmpty()) {
+            return MAX_SCORE_WHEN_NO_HITS_FOUND;
+        }
+        if (isSortEnabled) {
+            float maxScore = MAX_SCORE_WHEN_NO_HITS_FOUND;
+            // In case of sorting iterate over score docs and deduce the max score
+            for (ScoreDoc scoreDoc : updatedTopDocs.getScoreDocs()) {
+                maxScore = Math.max(maxScore, scoreDoc.score);
+            }
+            return maxScore;
+        }
+        // If it is a normal hybrid query then first entry of score doc will have max score
+        return updatedTopDocs.getScoreDocs().get(0).score;
+    }
 
-            TopDocsAndMaxScore updatedTopDocsAndMaxScore = new TopDocsAndMaxScore(topDocs, maxScore);
-            querySearchResult.topDocs(updatedTopDocsAndMaxScore, null);
+    /**
+     * Get Top Docs on Shard
+     * @param updatedTopDocs compound top docs on a shard
+     * @param sort sort criteria
+     * @return TopDocs which will be instance of TopFieldDocs if sort is enabled.
+     */
+    private TopDocs buildTopDocs(CompoundTopDocs updatedTopDocs, Sort sort) {
+        if (sort != null) {
+            return new TopFieldDocs(updatedTopDocs.getTotalHits(), updatedTopDocs.getScoreDocs().toArray(new FieldDoc[0]), sort.getSort());
         }
+        return new TopDocs(updatedTopDocs.getTotalHits(), updatedTopDocs.getScoreDocs().toArray(new ScoreDoc[0]));
     }
 
     /**
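The two new helpers above capture one consequence of sorting that is easy to miss: sorted hits are ordered by the sort fields, not by score, so the per-shard max score has to be scanned instead of read from the first hit, and the rebuilt shard result must be a TopFieldDocs that still carries the sort fields for the coordinator-side merge. Below is a rough, self-contained restatement of that logic; class and method names are illustrative, and the 0.0f sentinel stands in for ScoreCombiner.MAX_SCORE_WHEN_NO_HITS_FOUND, whose actual value lives in the plugin.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.search.FieldDoc;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopFieldDocs;
import org.apache.lucene.search.TotalHits;

public class ShardResultSketch {

    // Illustrative stand-in for ScoreCombiner.MAX_SCORE_WHEN_NO_HITS_FOUND.
    static final float NO_HITS_SCORE = 0.0f;

    // Sorted hits are not score-ordered, so the max score must be scanned.
    static float maxScoreForShard(List<ScoreDoc> hits, boolean isSortEnabled) {
        if (hits.isEmpty()) {
            return NO_HITS_SCORE;
        }
        if (!isSortEnabled) {
            return hits.get(0).score; // score-ordered: the first hit holds the max score
        }
        float max = NO_HITS_SCORE;
        for (ScoreDoc hit : hits) {
            max = Math.max(max, hit.score);
        }
        return max;
    }

    // With a sort present, rebuild a TopFieldDocs so downstream merging still sees the sort fields.
    static TopDocs buildTopDocs(TotalHits totalHits, List<ScoreDoc> hits, Sort sort) {
        if (sort != null) {
            return new TopFieldDocs(totalHits, hits.toArray(new FieldDoc[0]), sort.getSort());
        }
        return new TopDocs(totalHits, hits.toArray(new ScoreDoc[0]));
    }

    public static void main(String[] args) {
        Sort sort = new Sort(new SortField("stock", SortField.Type.LONG));
        List<ScoreDoc> sortedHits = Arrays.<ScoreDoc>asList(
            new FieldDoc(3, 0.2f, new Object[] { 10L }, 0), // lowest sort value, but not the best score
            new FieldDoc(7, 0.9f, new Object[] { 25L }, 0)
        );
        System.out.println(maxScoreForShard(sortedHits, true)); // 0.9
        System.out.println(buildTopDocs(new TotalHits(2, TotalHits.Relation.EQUAL_TO), sortedHits, sort)
            instanceof TopFieldDocs); // true
    }
}
```

The real implementation reads these hits out of CompoundTopDocs and wraps the result in TopDocsAndMaxScore, but the sort-versus-score asymmetry is the same.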
src/main/java/org/opensearch/neuralsearch/processor/combination/CombineScoresDto.java (new file, +32)

@@ -0,0 +1,32 @@
+/*
+ * Copyright OpenSearch Contributors
+ * SPDX-License-Identifier: Apache-2.0
+ */
+package org.opensearch.neuralsearch.processor.combination;
+
+import java.util.List;
+import lombok.AllArgsConstructor;
+import lombok.Builder;
+import lombok.Getter;
+import lombok.NonNull;
+import org.apache.lucene.search.Sort;
+import org.opensearch.common.Nullable;
+import org.opensearch.neuralsearch.processor.CompoundTopDocs;
+import org.opensearch.search.query.QuerySearchResult;
+
+/**
+ * DTO object to hold data required for Score Combination.
+ */
+@AllArgsConstructor
+@Builder
+@Getter
+public class CombineScoresDto {
+    @NonNull
+    private List<CompoundTopDocs> queryTopDocs;
+    @NonNull
+    private ScoreCombinationTechnique scoreCombinationTechnique;
+    @NonNull
+    private List<QuerySearchResult> querySearchResults;
+    @Nullable
+    private Sort sort;
+}
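Usage-wise, the DTO is only ever built through the Lombok @Builder, as the NormalizationProcessorWorkflow change above already shows; the snippet below simply isolates that call pattern with comments. It is an excerpt rather than a standalone program, since its inputs come from the running query phase.

```java
// Excerpt-style illustration; queryTopDocs, combinationTechnique and querySearchResults are
// values the workflow already holds, and evaluateSortCriteria is the statically imported
// HybridSearchSortUtil helper from the diff above.
CombineScoresDto combineScoresDTO = CombineScoresDto.builder()
    .queryTopDocs(queryTopDocs)                                   // @NonNull
    .scoreCombinationTechnique(combinationTechnique)              // @NonNull
    .querySearchResults(querySearchResults)                       // @NonNull
    .sort(evaluateSortCriteria(querySearchResults, queryTopDocs)) // @Nullable: null means "not a sorted query"
    .build();
```

Keeping Sort nullable lets the DTO double as the sort-enabled flag: downstream code checks sort != null instead of threading a separate boolean through the combination path.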
