Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support neural sentence highlighter #1193

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

junqiu-lei
Copy link
Member

@junqiu-lei junqiu-lei commented Feb 19, 2025

Description

This PR introduces the neural sentences highlighter framework from OpenSearch core Highlighter, enhancing the semantic highlighting capabilities. Key changes include:

  • Extended Highlighter framework from core into neural-search plugin.
  • Updated NeuralSearch and NeuralKNNQuery for integrating neural query with neural highlights.
  • Added sentence highlighting inference QA client function.

PR opensearch-project/ml-commons#3600 adds the functionality in ml-commons to use sentence highlighting QA model.

Related Issues

Part of #1182

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

public static final String NAME = "neural";
private static final String MODEL_ID_FIELD = "model_id";

private static MLCommonsClientAccessor mlCommonsClient;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This static pattern is not ideal for unit test. Can we receive accessor thought constructor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack


@Override
public boolean canHighlight(MappedFieldType fieldType) {
// TODO: Implement actual condition check in subsequent PR
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be better to restrict it to TEXT field type and expand it later in subsequent PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure


Map<String, Object> options = fieldContext.field.fieldOptions().options();
String modelId = getModelId(options);
log.info("Using model ID: {}", modelId); // Will be replaced with actual model loading logic
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log will be flooded and obscure other important logs. Let't move it to debug level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to add a temp placeholder here, since client api function was planed in another PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add javadoc and a comment of all the important steps this method is trying to do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log.info("Using model ID: {}", modelId); // Will be replaced with actual model loading logic
log.info("Using ML client: {}", mlCommonsClient); // Will be replaced with actual model loading logic

// TODO: Implement actual highlighting logic in subsequent PR
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With all these TODO, it seems like this PR should not be merged to main now.
Let me stop here and review it when the code is fully completed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. My original purpose was trying to make the PR more clear by different component(highlighter framework, ml client, text processing logic...). I could also combine all code in one big PR.


private String formatHighlight(String text) {
// TODO: Implement user provided format options in subsequent PR
return "<em>" + text + "</em>";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use String format for safer concatenation

}
}

return query.toString().replaceAll("\\w+:", "").replaceAll("\\s+", " ").trim();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add comment describing what you are trying to extract from string here

// For now, return a basic highlight of the field text
Text[] fragments = new Text[] { new Text(formatHighlight(fieldText)) };
return new HighlightField(fieldContext.fieldName, fragments);
} catch (Exception e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to narrow cases when we can have exceptions? Having global level try/catch is not that great for performance on search path

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, let me try to adjust to narrow exception catch cases.

}

private String getFieldText(FieldHighlightContext fieldContext) {
Object value = fieldContext.hitContext.sourceLookup().extractValue(fieldContext.fieldName, null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this second argument a fallback in case requested key is missing? If yes can't we return empty string here and avoid the null check on a line below?


Map<String, Object> options = fieldContext.field.fieldOptions().options();
String modelId = getModelId(options);
log.info("Using model ID: {}", modelId); // Will be replaced with actual model loading logic
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add javadoc and a comment of all the important steps this method is trying to do.

Map<String, Object> options = fieldContext.field.fieldOptions().options();
String modelId = getModelId(options);
log.info("Using model ID: {}", modelId); // Will be replaced with actual model loading logic
log.info("Using ML client: {}", mlCommonsClient); // Will be replaced with actual model loading logic
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.info("Using ML client: {}", mlCommonsClient); // Will be replaced with actual model loading logic
log.debug("Using ML client: {}", mlCommonsClient); // Will be replaced with actual model loading logic

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why do we need this log this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment along with this line code


private String extractOriginalQuery(Query query) {
if (query instanceof NeuralKNNQuery neuralQuery) {
String originalText = neuralQuery.getOriginalQueryText();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String originalText = neuralQuery.getOriginalQueryText();
String originalQuery = neuralQuery.getOriginalQueryText();

this.knnQuery = knnQuery;
this.originalQueryText = originalQueryText;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add BWC test to validate that it does not break anything in bwc environment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will have a check on bwc side

@@ -107,6 +109,11 @@ public Builder rescoreContext(RescoreContext rescoreContext) {
return this;
}

public Builder originalQueryText(String originalQueryText) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add javadoc. We already have an issue for reducing the warning logs #1095

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible please add the javadoc on all public methods of this class.

@junqiu-lei
Copy link
Member Author

junqiu-lei commented Mar 3, 2025

For reference in this PR, in ml-commons we have PR for sentence highlighting model support :
opensearch-project/ml-commons#3600

@junqiu-lei junqiu-lei force-pushed the highlighter-framework branch from 9704f6c to e690e91 Compare March 6, 2025 23:15
@junqiu-lei junqiu-lei changed the title Introduce neural highlighter framework Support neural sentence highlighter Mar 6, 2025
Signed-off-by: Junqiu Lei <junqiu@amazon.com>
@junqiu-lei junqiu-lei force-pushed the highlighter-framework branch from e690e91 to eeafdff Compare March 7, 2025 22:45
Copy link

codecov bot commented Mar 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.85%. Comparing base (5f25d6c) to head (eeafdff).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1193      +/-   ##
============================================
- Coverage     81.80%   80.85%   -0.96%     
+ Complexity     2606     1321    -1285     
============================================
  Files           190       96      -94     
  Lines          8922     4587    -4335     
  Branches       1520      778     -742     
============================================
- Hits           7299     3709    -3590     
+ Misses         1032      572     -460     
+ Partials        591      306     -285     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Features Introduces a new unit of functionality that satisfies a requirement neural-search v3.0.0 v3.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants