
Adds a new parameter, max_analyzer_offset, for the highlighter #3893

Merged Jul 22, 2022

Changes from 8 commits (23 commits total):
f0d9805
#3842 adds a new parameter to the highlighter, the max_analyzer_offse…
hauck-jvsh Jul 13, 2022
6be297d
Adds a test for the new parameter
hauck-jvsh Jul 13, 2022
4c6ff70
Fix the test added in the previous commit
hauck-jvsh Jul 13, 2022
595bb04
This was checking against the wrong field
hauck-jvsh Jul 13, 2022
1b63fad
Only runs the test for the correct version
hauck-jvsh Jul 15, 2022
ba381be
Skips the test in Elasticsearch as well;
hauck-jvsh Jul 15, 2022
c63d527
Remove elastic 3.0 to test
hauck-jvsh Jul 15, 2022
ed80f5f
Skips all versions
hauck-jvsh Jul 15, 2022
5a0e1dc
Remove unnecessary fields as pointed by @reta
hauck-jvsh Jul 15, 2022
6b5c529
Compute if fieldMaxAnalyzedIsNotValid in the constructor as suggest b…
hauck-jvsh Jul 15, 2022
f95599a
As discussed, it is better to throws different exceptions for when th…
hauck-jvsh Jul 15, 2022
bf612f0
hint what to do to allow highlight of bigger documents
hauck-jvsh Jul 15, 2022
f8e3f2a
Let the user define the new parameter globally for all fields highlig…
hauck-jvsh Jul 15, 2022
53b3b87
Change the fieldMaxAnalyzedOffset Integer in order to use null when …
hauck-jvsh Jul 16, 2022
70cf4bb
Update javadocs and implements the stream methods for the new fields;
hauck-jvsh Jul 16, 2022
f21a416
builder.field does not accept null, so check before calling the method …
hauck-jvsh Jul 16, 2022
257cb18
Only send and read the new fields if the version supports it
hauck-jvsh Jul 16, 2022
7a03514
Merge branch 'opensearch-project:main' into main
hauck-jvsh Jul 16, 2022
cff3e7a
the previous commit was checking the wrong field
hauck-jvsh Jul 16, 2022
e33b80e
Check for version 3.0.0 instead of current version
hauck-jvsh Jul 16, 2022
24dec06
Update server/src/main/java/org/apache/lucene/search/uhighlight/Custo…
hauck-jvsh Jul 17, 2022
f2ef792
Merge branch 'opensearch-project:main' into main
hauck-jvsh Jul 20, 2022
4b25f5c
Execute the test after version 3.0.0
hauck-jvsh Jul 21, 2022
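Commit 257cb18 ("Only send and read the new fields if the version supports it") gates the new wire-format field on node version. A minimal standalone sketch of that pattern, using plain java.io streams rather than OpenSearch's StreamInput/StreamOutput; the class and the numeric version constant are invented for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class VersionGatedField {
    // Made-up numeric id standing in for a "version 3.0.0" wire-format check.
    static final int V_3_0_0 = 3_000_099;

    // Writes the stream version, then the optional field only when the target
    // version understands it, so older readers never see unknown bytes.
    static byte[] write(int version, Integer maxAnalyzerOffset) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(version);
            if (version >= V_3_0_0) {
                out.writeBoolean(maxAnalyzerOffset != null); // presence flag
                if (maxAnalyzerOffset != null) {
                    out.writeInt(maxAnalyzerOffset);
                }
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mirrors write(): only reads the optional field for new-enough versions.
    static Integer read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int version = in.readInt();
            if (version >= V_3_0_0 && in.readBoolean()) {
                return in.readInt();
            }
            return null; // old stream: fall back to the unset default
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(write(V_3_0_0, 10)));     // prints 10
        System.out.println(read(write(V_3_0_0 - 1, 10))); // prints null
    }
}
```

Both sides must agree on the gate: writing the field to an old-version stream, or reading it from one, would desynchronize the wire format, which is why the PR checks the version in both the write and the read path.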
@@ -41,7 +41,6 @@
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.tests.index.RandomIndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
@@ -56,6 +55,7 @@
import org.apache.lucene.search.uhighlight.Snippet;
import org.apache.lucene.search.uhighlight.SplittingBreakIterator;
import org.apache.lucene.store.Directory;
import org.apache.lucene.tests.index.RandomIndexWriter;
import org.opensearch.common.Strings;
import org.opensearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedHighlighterAnalyzer;
import org.opensearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedText;
@@ -136,7 +136,8 @@ private void assertHighlightOneDoc(
noMatchSize,
expectedPassages.length,
name -> "text".equals(name),
Integer.MAX_VALUE
Integer.MAX_VALUE,
-1
);
highlighter.setFieldMatcher((name) -> "text".equals(name));
final Snippet[] snippets = highlighter.highlightField(getOnlyLeafReader(reader), topDocs.scoreDocs[0].doc, () -> rawValue);
@@ -36,6 +36,17 @@ setup:
body: {"query" : {"match" : {"field1" : "fox"}}, "highlight" : {"type" : "unified", "fields" : {"field1" : {}}}}
- match: { error.root_cause.0.type: "illegal_argument_exception" }

---
"Unified highlighter on a field WITHOUT OFFSETS using max_analyzer_offset should SUCCEED":
- skip:
version: "all"
reason: the parameter max_analyzer_offset is only supported starting with version 3.0
- do:
search:
rest_total_hits_as_int: true
index: test1
body: {"query" : {"match" : {"field1" : "quick"}}, "highlight" : {"type" : "unified", "fields" : {"field1" : {"max_analyzer_offset": 10}}}}
- match: {hits.hits.0.highlight.field1.0: "The <em>quick</em> brown fox went to the forest and saw another fox."}
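The test above caps analysis at offset 10 and still expects "quick" (which starts at offset 4) to be wrapped in <em> tags, while text past the cap is returned unhighlighted. A toy standalone sketch of that behavior, using plain substring search instead of the unified highlighter; the class and method are invented for illustration:

```java
// Toy model of max_analyzer_offset at the level of one stored field value:
// a term is only considered for highlighting if it STARTS before the limit.
public class MaxOffsetHighlight {

    /** Wraps every match of term that starts before maxOffset in <em> tags. */
    static String highlight(String text, String term, int maxOffset) {
        StringBuilder out = new StringBuilder();
        int from = 0;
        while (true) {
            int hit = text.indexOf(term, from);
            if (hit < 0 || hit >= maxOffset) {
                out.append(text.substring(from)); // rest of the text, untouched
                return out.toString();
            }
            out.append(text, from, hit).append("<em>").append(term).append("</em>");
            from = hit + term.length();
        }
    }

    public static void main(String[] args) {
        String doc = "The quick brown fox went to the forest and saw another fox.";
        // "quick" starts at offset 4, inside the limit of 10: highlighted.
        System.out.println(highlight(doc, "quick", 10));
        // "fox" first starts at offset 16, past the limit: left untouched.
        System.out.println(highlight(doc, "fox", 10));
    }
}
```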

---
"Plain highlighter on a field WITHOUT OFFSETS exceeding index.highlight.max_analyzed_offset should FAIL":
@@ -43,7 +43,6 @@
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter.HighlightFlag;
import org.apache.lucene.util.BytesRef;
import org.opensearch.common.CheckedSupplier;
import org.opensearch.common.Nullable;
@@ -79,6 +78,7 @@ public class CustomUnifiedHighlighter extends UnifiedHighlighter {
private final int noMatchSize;
private final FieldHighlighter fieldHighlighter;
private final int maxAnalyzedOffset;
private final int fieldMaxAnalyzedOffset;

/**
* Creates a new instance of {@link CustomUnifiedHighlighter}
@@ -113,7 +113,8 @@ public CustomUnifiedHighlighter(
int noMatchSize,
int maxPassages,
Predicate<String> fieldMatcher,
int maxAnalyzedOffset
int maxAnalyzedOffset,
Review thread:

@reta (Collaborator), Jul 15, 2022:
Question: do we need to keep track of both maxAnalyzedOffset and fieldMaxAnalyzedOffset? I think we could use Math.max(maxAnalyzedOffset, fieldMaxAnalyzedOffset), no? I believe the purpose of fieldMaxAnalyzedOffset is to override maxAnalyzedOffset on a per-query basis: it could be a lower or a higher value, right?

@hauck-jvsh (Contributor, author):
Actually, with this implementation it can only be lower. I don't know whether maxAnalyzedOffset has an impact when indexing, as it is an index option. Because of this, if fieldMaxAnalyzedOffset is higher than maxAnalyzedOffset and the field length is greater than maxAnalyzedOffset, an exception still occurs.

@reta (Collaborator):
Then we could go with min(fieldMaxAnalyzedOffset, maxAnalyzedOffset), right?

@hauck-jvsh (Contributor, author), Jul 15, 2022:
Right, but that would change the current behavior: today, if the text is bigger than maxAnalyzedOffset, an exception is thrown; with the suggested approach the request would pass, but nothing after maxAnalyzedOffset would be highlighted. To me that behavior is better, but it would break some tests. It was my original idea; I changed it so as not to break the tests.

@reta (Collaborator), Jul 15, 2022:
To be fair, it looks confusing (to me at least) when I look at how it is being used:

    if ((fieldMaxAnalyzedOffset < 0 || fieldMaxAnalyzedOffset > maxAnalyzedOffset)
            && (offsetSource == OffsetSource.ANALYSIS)
            && (fieldValueLength > maxAnalyzedOffset)) {

(fieldMaxAnalyzedOffset < 0 || fieldMaxAnalyzedOffset > maxAnalyzedOffset) is static information and could be precomputed to true or false in the constructor ahead of time. Now, if fieldMaxAnalyzedOffset < maxAnalyzedOffset, this compound condition evaluates to false no matter what. It looks to me like the intent should be:

    final boolean fieldMaxAnalyzedIsNotValid = (fieldMaxAnalyzedOffset < 0 || fieldMaxAnalyzedOffset > maxAnalyzedOffset);

    if ((offsetSource == OffsetSource.ANALYSIS)
            && (fieldValueLength > maxAnalyzedOffset || fieldMaxAnalyzedIsNotValid)) {

But in this case the exception is hiding two problems, fieldValueLength > maxAnalyzedOffset and fieldMaxAnalyzedIsNotValid, so we have to split these failures into two, with all contextual information provided.

@hauck-jvsh (Contributor, author):
Yes, this can be done in the constructor, and I think your if statement is exactly what I intended. I will update this.

@hauck-jvsh (Contributor, author):
Actually it should be

    if ((offsetSource == OffsetSource.ANALYSIS) && (fieldValueLength > maxAnalyzedOffset && fieldMaxAnalyzedIsNotValid)) {

as it must only throw the exception when fieldMaxAnalyzedIsNotValid; when the per-field limit is valid, the text is cut to fit maxAnalyzedOffset.

@reta (Collaborator):
But what about the fieldValueLength > maxAnalyzedOffset check?

@hauck-jvsh (Contributor, author), Jul 15, 2022:
We can throw another exception for when fieldMaxAnalyzedOffset > maxAnalyzedOffset; I think that would indeed be better.

@hauck-jvsh (Contributor, author):
Please see if it is clearer now what is going on.

int fieldMaxAnalyzedOffset
) throws IOException {
super(searcher, analyzer);
this.offsetSource = offsetSource;
@@ -126,6 +127,7 @@ public CustomUnifiedHighlighter(
this.setFieldMatcher(fieldMatcher);
this.maxAnalyzedOffset = maxAnalyzedOffset;
fieldHighlighter = getFieldHighlighter(field, query, extractTerms(query), maxPassages);
this.fieldMaxAnalyzedOffset = fieldMaxAnalyzedOffset;
}

/**
@@ -141,7 +143,10 @@ public Snippet[] highlightField(LeafReader reader, int docId, CheckedSupplier<St
return null;
}
int fieldValueLength = fieldValue.length();
if ((offsetSource == OffsetSource.ANALYSIS) && (fieldValueLength > maxAnalyzedOffset)) {

if ((fieldMaxAnalyzedOffset < 0 || fieldMaxAnalyzedOffset > maxAnalyzedOffset)
&& (offsetSource == OffsetSource.ANALYSIS)
&& (fieldValueLength > maxAnalyzedOffset)) {
throw new IllegalArgumentException(
"The length of ["
+ field
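The review thread above converged on precomputing the validity of the per-field limit and reporting the two failure modes separately. A standalone sketch of that outcome, detached from Lucene; the class is invented for illustration, while the field names mirror the PR (maxAnalyzedOffset is the index setting, fieldMaxAnalyzedOffset the per-request max_analyzer_offset):

```java
public class OffsetValidation {
    final int maxAnalyzedOffset;      // index setting index.highlight.max_analyzed_offset
    final int fieldMaxAnalyzedOffset; // per-request max_analyzer_offset, -1 when unset
    final boolean fieldMaxAnalyzedIsNotValid; // precomputed once, per the review thread

    OffsetValidation(int maxAnalyzedOffset, int fieldMaxAnalyzedOffset) {
        this.maxAnalyzedOffset = maxAnalyzedOffset;
        this.fieldMaxAnalyzedOffset = fieldMaxAnalyzedOffset;
        this.fieldMaxAnalyzedIsNotValid =
            fieldMaxAnalyzedOffset < 0 || fieldMaxAnalyzedOffset > maxAnalyzedOffset;
    }

    // Returns normally when highlighting may proceed; otherwise reports one of
    // the two distinct failures the reviewers asked to keep separate.
    void check(String field, int fieldValueLength) {
        if (fieldMaxAnalyzedOffset > maxAnalyzedOffset) {
            // failure 1: the override itself is out of range
            throw new IllegalArgumentException("max_analyzer_offset for [" + field
                + "] must not exceed " + maxAnalyzedOffset);
        }
        if (fieldValueLength > maxAnalyzedOffset && fieldMaxAnalyzedIsNotValid) {
            // failure 2: text too long and no valid per-field cap was supplied
            throw new IllegalArgumentException("The length of [" + field + "] ("
                + fieldValueLength + ") exceeded the allowed maximum of "
                + maxAnalyzedOffset + "; set max_analyzer_offset to highlight anyway");
        }
        // otherwise: the text fits, or a valid max_analyzer_offset caps the analysis
    }

    public static void main(String[] args) {
        new OffsetValidation(1_000_000, 10).check("field1", 5_000_000); // ok: valid cap
        try {
            new OffsetValidation(1_000_000, -1).check("field1", 5_000_000);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```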
@@ -92,6 +92,7 @@ public abstract class AbstractHighlighterBuilder<HB extends AbstractHighlighterB
public static final ParseField OPTIONS_FIELD = new ParseField("options");
public static final ParseField HIGHLIGHT_QUERY_FIELD = new ParseField("highlight_query");
public static final ParseField MATCHED_FIELDS_FIELD = new ParseField("matched_fields");
public static final ParseField MAX_ANALYZER_OFFSET_FIELD = new ParseField("max_analyzer_offset");

protected String[] preTags;

@@ -129,6 +130,8 @@ public abstract class AbstractHighlighterBuilder<HB extends AbstractHighlighterB

protected Boolean requireFieldMatch;

protected int maxAnalyzerOffset;

public AbstractHighlighterBuilder() {}

protected AbstractHighlighterBuilder(AbstractHighlighterBuilder<?> template, QueryBuilder queryBuilder) {
@@ -150,6 +153,7 @@ protected AbstractHighlighterBuilder(AbstractHighlighterBuilder<?> template, Que
phraseLimit = template.phraseLimit;
options = template.options;
requireFieldMatch = template.requireFieldMatch;
maxAnalyzerOffset = template.maxAnalyzerOffset;
}

/**
@@ -542,6 +546,21 @@ public Integer phraseLimit() {
return this.phraseLimit;
}

/**
* Sets the maximum offset for the highlighter
* @param maxAnalyzerOffset the maximum offset that the highlighter will consider
* @return this for chaining
*/
@SuppressWarnings("unchecked")
public HB maxAnalyzerOffset(int maxAnalyzerOffset) {
this.maxAnalyzerOffset = maxAnalyzerOffset;
return (HB) this;
}

public int maxAnalyzerOffset() {
return this.maxAnalyzerOffset;
}
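The setter above casts this to the generic parameter HB so that chained calls keep the concrete builder type. A minimal standalone sketch of this self-typed ("CRTP-style") builder pattern; the class names are invented for illustration:

```java
public class SelfTypedBuilder {

    static abstract class AbstractBuilder<HB extends AbstractBuilder<HB>> {
        int maxAnalyzerOffset = -1; // -1 means "unset", matching the PR's default

        @SuppressWarnings("unchecked")
        public HB maxAnalyzerOffset(int maxAnalyzerOffset) {
            this.maxAnalyzerOffset = maxAnalyzerOffset;
            return (HB) this; // safe as long as each subclass binds HB to itself
        }

        public int maxAnalyzerOffset() {
            return maxAnalyzerOffset;
        }
    }

    static class FieldBuilder extends AbstractBuilder<FieldBuilder> {
        public FieldBuilder boldTags() { return this; } // subtype-specific option
    }

    public static void main(String[] args) {
        // Without the HB trick, maxAnalyzerOffset(...) would return the abstract
        // type and the chained .boldTags() call would not compile.
        FieldBuilder b = new FieldBuilder().maxAnalyzerOffset(10).boldTags();
        System.out.println(b.maxAnalyzerOffset()); // prints 10
    }
}
```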

/**
* Forces the highlighting to highlight fields based on the source even if fields are stored separately.
*/
@@ -623,6 +642,9 @@ void commonOptionsToXContent(XContentBuilder builder) throws IOException {
if (phraseLimit != null) {
builder.field(PHRASE_LIMIT_FIELD.getPreferredName(), phraseLimit);
}
if (maxAnalyzerOffset > 0) {
builder.field(MAX_ANALYZER_OFFSET_FIELD.getPreferredName(), maxAnalyzerOffset);
}
}

static <HB extends AbstractHighlighterBuilder<HB>> BiFunction<XContentParser, HB, HB> setupParser(ObjectParser<HB, Void> parser) {
Expand All @@ -642,6 +664,7 @@ static <HB extends AbstractHighlighterBuilder<HB>> BiFunction<XContentParser, HB
parser.declareInt(HB::noMatchSize, NO_MATCH_SIZE_FIELD);
parser.declareBoolean(HB::forceSource, FORCE_SOURCE_FIELD);
parser.declareInt(HB::phraseLimit, PHRASE_LIMIT_FIELD);
parser.declareInt(HB::maxAnalyzerOffset, MAX_ANALYZER_OFFSET_FIELD);
parser.declareObject(HB::options, (XContentParser p, Void c) -> {
try {
return p.map();
@@ -93,6 +93,7 @@ public class HighlightBuilder extends AbstractHighlighterBuilder<HighlightBuilde
static final String[] DEFAULT_PRE_TAGS = new String[] { "<em>" };
/** the default closing tag */
static final String[] DEFAULT_POST_TAGS = new String[] { "</em>" };
static final int DEFAULT_MAX_ANALYZER_OFFSET = -1;

/** the default opening tags when {@code tag_schema = "styled"} */
public static final String[] DEFAULT_STYLED_PRE_TAG = {
@@ -126,6 +127,7 @@ public class HighlightBuilder extends AbstractHighlighterBuilde
.boundaryScannerLocale(Locale.ROOT)
.noMatchSize(DEFAULT_NO_MATCH_SIZE)
.phraseLimit(DEFAULT_PHRASE_LIMIT)
.maxAnalyzerOffset(DEFAULT_MAX_ANALYZER_OFFSET)
.build();

private final List<Field> fields;
@@ -399,6 +401,9 @@ private static void transferOptions(
if (highlighterBuilder.highlightQuery != null) {
targetOptionsBuilder.highlightQuery(highlighterBuilder.highlightQuery.toQuery(context));
}
if (highlighterBuilder.maxAnalyzerOffset > 0) {
targetOptionsBuilder.maxAnalyzerOffset(highlighterBuilder.maxAnalyzerOffset);
}
}

static Character[] convertCharArray(char[] array) {
@@ -154,6 +154,12 @@ public static class FieldOptions {

private int phraseLimit = -1;

private int maxAnalyzerOffset = -1;

public int maxAnalyzerOffset() {
return maxAnalyzerOffset;
}

public int fragmentCharSize() {
return fragmentCharSize;
}
@@ -333,6 +339,11 @@ Builder phraseLimit(int phraseLimit) {
return this;
}

Builder maxAnalyzerOffset(int maxAnalyzerOffset) {
fieldOptions.maxAnalyzerOffset = maxAnalyzerOffset;
return this;
}

Builder matchedFields(Set<String> matchedFields) {
fieldOptions.matchedFields = matchedFields;
return this;
@@ -32,6 +32,8 @@
package org.opensearch.search.fetch.subphase.highlight;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.miscellaneous.LimitTokenOffsetFilter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.highlight.Encoder;
import org.apache.lucene.search.uhighlight.BoundedBreakIteratorScanner;
@@ -133,13 +135,40 @@ public HighlightField highlight(FieldHighlightContext fieldContext) throws IOExc
return new HighlightField(fieldContext.fieldName, Text.convertFromStringArray(fragments));
}

public AnalyzerWrapper getLimitedOffsetAnalyzer(Analyzer a, int limit) {
return new AnalyzerWrapper(a.getReuseStrategy()) {

private Analyzer old = a;
private int maxOffset = limit;

@Override
protected Analyzer getWrappedAnalyzer(String fieldName) {
return old;
}

@Override
protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
return new TokenStreamComponents(
components.getSource(),
new LimitTokenOffsetFilter(components.getTokenStream(), maxOffset)
);
}

};

}

CustomUnifiedHighlighter buildHighlighter(FieldHighlightContext fieldContext) throws IOException {
Encoder encoder = fieldContext.field.fieldOptions().encoder().equals("html")
? HighlightUtils.Encoders.HTML
: HighlightUtils.Encoders.DEFAULT;
int maxAnalyzedOffset = fieldContext.context.getIndexSettings().getHighlightMaxAnalyzedOffset();
int fieldMaxAnalyzedOffset = fieldContext.field.fieldOptions().maxAnalyzerOffset();
int numberOfFragments = fieldContext.field.fieldOptions().numberOfFragments();
Analyzer analyzer = getAnalyzer(fieldContext.context.mapperService().documentMapper());
if (fieldMaxAnalyzedOffset > 0) {
analyzer = getLimitedOffsetAnalyzer(analyzer, fieldMaxAnalyzedOffset);
}
PassageFormatter passageFormatter = getPassageFormatter(fieldContext.hitContext, fieldContext.field, encoder);
IndexSearcher searcher = fieldContext.context.searcher();
OffsetSource offsetSource = getOffsetSource(fieldContext.fieldType);
@@ -174,7 +203,8 @@ CustomUnifiedHighlighter buildHighlighter(FieldHighlightContext fieldContext) th
fieldContext.field.fieldOptions().noMatchSize(),
higlighterNumberOfFragments,
fieldMatcher(fieldContext),
maxAnalyzedOffset
maxAnalyzedOffset,
fieldMaxAnalyzedOffset
);
}
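getLimitedOffsetAnalyzer above wraps the field analyzer so that Lucene's LimitTokenOffsetFilter drops tokens past max_analyzer_offset before they ever reach the highlighter. A plain-Java toy sketch of the effect, using a whitespace splitter in place of a real analyzer; the class and method are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of an offset-limited token stream: tokens whose start offset has
// passed the limit are cut off, so nothing beyond max_analyzer_offset is analyzed.
public class LimitedOffsetTokens {

    /** Returns the whitespace tokens of text whose start offset is <= maxOffset. */
    static List<String> tokenize(String text, int maxOffset) {
        List<String> tokens = new ArrayList<>();
        int offset = 0;
        for (String token : text.split(" ")) {
            if (offset > maxOffset) break; // the filter cuts the stream here
            tokens.add(token);
            offset += token.length() + 1;  // +1 for the consumed space
        }
        return tokens;
    }

    public static void main(String[] args) {
        // With a limit of 10, tokenization stops once offsets pass 10.
        System.out.println(tokenize("The quick brown fox jumps", 10));
    }
}
```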

@@ -43,7 +43,6 @@
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.tests.index.RandomIndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.CommonTermsQuery;
import org.apache.lucene.search.BooleanClause;
@@ -63,6 +62,7 @@
import org.apache.lucene.search.uhighlight.Snippet;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.tests.index.RandomIndexWriter;
import org.opensearch.common.Strings;
import org.opensearch.common.lucene.search.MultiPhrasePrefixQuery;
import org.opensearch.test.OpenSearchTestCase;
@@ -117,7 +117,8 @@ private void assertHighlightOneDoc(
noMatchSize,
expectedPassages.length,
name -> "text".equals(name),
Integer.MAX_VALUE
Integer.MAX_VALUE,
-1
);
final Snippet[] snippets = highlighter.highlightField(getOnlyLeafReader(reader), topDocs.scoreDocs[0].doc, () -> rawValue);
assertEquals(snippets.length, expectedPassages.length);