-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DOCS] add details on highlighting #28802
[DOCS] add details on highlighting #28802
Conversation
Hi, I'm not sure exactly about the process of commenting on open PRs, but I created the original ticket for this (#28681). This PR is excellent. It gives a lot of information. I have a few additional requests for info added there if possible.
Again thanks so much for the documentation. It helps immensely and if I had this last year when I was developing highlighting for my company it would have sped things up. |
@cexmmaqsood thanks for your comments, we will try to address them in the update |
fd2fd8c
to
547f9c5
Compare
Add additional information on inner working of highlighters Closes elastic#28681, elastic#28816
547f9c5
to
8fae9f8
Compare
@cexmmaqsood Thanks very much again for your feedback. Addressing your comments:
the highlighting documentation was updated accordingly.
An example of this complex query can be found in this issue: #28626
I am not sure what you mean by subfields here. In case of
The explanation that follows this line is the explanation of the algorithm.
This low-level match information in a simplified form presented in the example at the end of the highlighting documentation: |
This is such a useful explanation: thanks so much for writing it, and I'm very glad I stumbled upon it. One other question raised by this section of the docs:
How can I control this in a query? I am using a boosting query to prefer phrase matches to term matches, but my |
@mdcclv Thanks for the feedback!
For specific questions like this, please ask in https://discuss.elastic.co/ |
Thanks! But this new section says: The new documentation in this PR still seems to say that |
Add additional information on inner working of highlighters Closes elastic#28681, elastic#28816
ed6c860
to
f6e1990
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mayya-sharipova and sorry for the late review.
I left some comments, I think we should focus on the unified
highlighter which is the default in 6.0 and maybe have a separate page to describe the highlighter internals. This page is quite big already.
`minimum_should_match` etc.), parts of documents may be highlighted | ||
that don't correspond to query matches. The work for fixing this is | ||
currenly in progress. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should remove this part or say something like highlighters don't reflect the boolean logic of the query when extracting the terms to highlight...
. I am not sure that we're going to "fix" it and we should not add this statement in the docs. I see pros and cons to do that and the solution that @romseygeek implemented might not be applicable in all cases (term vectors highlighting for instance).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi Thanks, Jim. I will rephrase this note as you suggested, and remove the part about fixing. I was asked to create this note, as there were several SDH and other issues, where highlighted fragments did not match a query, which confused users.
order:: Sorts highlighted fragments by score when set to `score`. By default, | ||
fragments will be output in the order they appear in the field (order: `none`). | ||
Setting this option to `score` will output the most relevant fragments first. | ||
Only `unified` highlighter truly calculates the score in a similar way the score |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each highlighter has its own relevancy "sauce" but I wouldn't say that the unified
scoring is similar to the score of the query. It uses BM25 but that's just an internal detail, I think we should just say that each highlighter applies its own logic to compute the relevancy score and we can describe the details in the example below.
A highlighter uses `pre-tags`, `post-tags` to encode highlighted terms. | ||
|
||
|
||
===== An example of the work of the plain highlighter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you describe how the unified
highlighter works instead ? Maybe just the re-analysis mode (plain highlighting) since it is very similar to the plain
highlighter. We want to deprecate (and remove) the plain
highlighter so I'd prefer if we document something that will last longer.
{"token":"fox","start_offset":164,"end_offset":167,"position":35}, | ||
{"token":"world","start_offset":175,"end_offset":180,"position":38}, | ||
{"token":"you","start_offset":185,"end_offset":188,"position":40} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unified
highlighter does not index all terms but only those that can match the query. This is an issue currently in the plain
highlighter since it caches all these terms in memory so we shouldn't document this and rely on the unified
highlighter instead.
Add more explanation to some highlighting parameters Add a document describing how highlighters work internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a small comment but LGTM otherwise.
Thanks @mayya-sharipova !
Relevant settings: `pre-tags`, `post-tags`. | ||
|
||
The goal is to highlight only those terms that participated in generating the 'hit' on the document. | ||
For some complex boolean queries, this is still work in progress. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you adapt the sentence that explains how highlighters don't reflect the boolean logic of a query and only extracts the leaf (terms, phrases, prefix, ...). We can change the note when we have an highlighter (or adapted the unified
) that is able to handle boolean queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi Thanks for the review, Jim! I will change this sentence as you suggested.
@elasticmachine run sample packaging tests |
- add more explanation to some highlighting parameters - add a document describing how highlighters work internally
Add additional information on inner working of highlighters
Closes #28681