[DOCS] add details on highlighting #28802

Merged


mayya-sharipova
Contributor

Add additional information on the inner workings of highlighters

Closes #28681

@cexmmaqsood

cexmmaqsood commented Feb 28, 2018

Hi, I'm not sure exactly about the process of commenting on open PRs, but I created the original ticket for this (#28681).

This PR is excellent. It gives a lot of information. I have a few requests for additional information, if possible.

  1. Can you please go into detail about how the order parameter works? I created a ticket for it (Highlight Field Order #26612) and a few details are unclear:
    -- What are the options available for the 'order' parameter?
    -- Can you confirm that order:sort works as expected (i.e., fragments get scored, and the highest-scored fragments are returned first)?
  2. Can you please provide an example of a complex query with regard to this passage?
    The goal is to highlight only those terms that participated in generating the 'hit' on the document. For some complex queries, this is still work in progress.
  3. Is it recommended by ES to always highlight on the same field as searched? As you know, a field can have many subfields, and those subfields can be analyzed in different ways, so it is possible right now (without an error being thrown) to search on subfield A, for example, but highlight on subfield B. In my experience, this has resulted in unexpected behaviour, but I had no concrete evidence/documentation to support that theory.
  4. Can you name/explain the algorithm that is used?
    Plain highlighter uses a very simple algorithm to break the token stream into fragments.
  5. It would be really helpful if the low-level information that is used were briefly stated.
    Then this obtained low-level match information is used to score each individual fragment.

Again, thanks so much for the documentation. It helps immensely, and if I had had this last year when I was developing highlighting for my company, it would have sped things up.

@mayya-sharipova
Contributor Author

@cexmmaqsood thanks for your comments, we will try to address them in the update

@mayya-sharipova mayya-sharipova force-pushed the update-highlighting-docs branch from fd2fd8c to 547f9c5 Compare March 7, 2018 17:52
Add additional information on inner working of highlighters

Closes elastic#28681, elastic#28816
@mayya-sharipova mayya-sharipova force-pushed the update-highlighting-docs branch from 547f9c5 to 8fae9f8 Compare March 7, 2018 21:41
@mayya-sharipova
Contributor Author

@cexmmaqsood Thanks very much again for your feedback.

Addressing your comments:

  1. Can you please go into detail about how the order parameter works?

The highlighting documentation was updated accordingly.

  1. Can you please provide an example of a complex query in regards to this?
    The goal is to highlight only those terms that participated in generating the 'hit' on the document. For some complex queries, this is still work in progress.

An example of this complex query can be found in this issue: #28626

  1. Is it recommended by ES to always highlight on the same field as searched? As you know a field can have many subfields and those subfields can be analyzed in different ways, so it is possible right now (without an error thrown) to search on subfield A, for example, but highlight on subfield B. In my experience, this has resulted in unexpected behaviour but I had no concrete evidence/documentation to support that theory.

I am not sure what you mean by subfields here. In the case of nested fields, there is the inner_hits option, which allows you to highlight nested docs. There is also the option to use a highlight_query that differs from the search query. Overall, it is possible to use one field for search and another for highlighting.
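
As a rough illustration of searching on one field while highlighting another, the request body below is built as a Python dictionary. The field names (`comment`, `comment.english`) are hypothetical and only stand in for a field and a differently analyzed variant of it; `highlight_query` is the option mentioned above. Whether the highlighted output matches expectations still depends on how each field is analyzed.

```python
import json


def build_search_body(text):
    """Search on one field and highlight another.

    The field names are hypothetical; only the overall request shape matters.
    """
    return {
        # The search itself runs against the `comment` field.
        "query": {"match": {"comment": text}},
        "highlight": {
            "fields": {
                # Highlighting runs against a different field, driven by
                # its own query instead of the search query.
                "comment.english": {
                    "highlight_query": {"match": {"comment.english": text}}
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("running foxes"), indent=2))
```

Another existing option, `require_field_match: false`, lets a highlighter mark matches in fields the query did not target; which of the two fits better depends on the use case.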

  1. Can you name/explain the algorithm that is used? Plain highlighter uses a very simple algorithm to break the token stream into fragments.

The explanation that follows this line is the explanation of the algorithm.

  1. It would be really helpful if the low-level information that is used were briefly stated.
    Then this obtained low-level match information is used to score each individual fragment.

This low-level match information is presented in a simplified form in the example at the end of the highlighting documentation:
onli -> positions(34, 35) weight:1
fox -> positions(34, 35) weight:1
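
To make the idea of scoring fragments from low-level match information more concrete, here is a toy sketch. It is not the Lucene implementation: the `Match` tuple, the fixed-size character windows, and the summed weights are all assumptions chosen for illustration. The `fox` offsets (164 to 167) come from the token dump quoted in this PR's diff; the other offsets are made up.

```python
from collections import namedtuple

# One low-level match: the analyzed term, its character offsets, and a weight.
Match = namedtuple("Match", ["term", "start", "end", "weight"])


def score_fragments(matches, fragment_size=100):
    """Bucket matches into fixed-size character windows and sum their weights.

    Returns a dict mapping each fragment's start offset to its score.
    """
    scores = {}
    for m in matches:
        fragment_start = (m.start // fragment_size) * fragment_size
        scores[fragment_start] = scores.get(fragment_start, 0.0) + m.weight
    return scores


def order_fragments(scores, order="none"):
    """order="none" keeps document order; order="score" puts the best first."""
    if order == "score":
        return sorted(scores, key=lambda start: (-scores[start], start))
    return sorted(scores)


if __name__ == "__main__":
    matches = [
        Match("fox", 20, 23, 1.0),     # hypothetical earlier occurrence
        Match("onli", 159, 163, 1.0),  # offsets assumed for illustration
        Match("fox", 164, 167, 1.0),   # offsets from the quoted token dump
    ]
    scores = score_fragments(matches)
    print(order_fragments(scores, "none"))
    print(order_fragments(scores, "score"))
```

With these made-up matches, the window starting at offset 100 contains two query terms and therefore outranks the window starting at 0 when ordering by score.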

@mdcclv

mdcclv commented Mar 16, 2018

This is such a useful explanation: thanks so much for writing it, and I'm very glad I stumbled upon it.

One other question raised by this section of the docs:

Fast vector highlighter

Can assign different weights to matches at different positions allowing for things like phrase matches being sorted above term matches when highlighting a Boosting Query that boosts phrase matches over term matches

How can I control this in a query? I am using a boosting query to prefer phrase matches to term matches, but my fvh highlights are coming through in document order.

@mayya-sharipova
Contributor Author

@mdcclv Thanks for the feedback!

How can I control this in a query? I am using a boosting query to prefer phrase matches to term matches, but my fvh highlights are coming through in document order.

For specific questions like this, please ask on https://discuss.elastic.co/
To reply briefly here: you should use "order": "score" to output fragments by score. The score is what incorporates your boost.

@mdcclv

mdcclv commented Mar 19, 2018

Thanks! But this new section says:
Only `unified` highlighter truly calculates the score, other highlighters with order: `sort` setting, will rank fragments by the number of query words found
Does that mean that "order": "score" and "order": "sort" are both possibilities with fvh highlights?

The new documentation in this PR still seems to say that fvh highlighting only takes the "order": "sort" option, and only orders by number of query words, not by boosted values.

Add additional information on inner working of highlighters

Closes elastic#28681, elastic#28816
@mayya-sharipova mayya-sharipova force-pushed the update-highlighting-docs branch from ed6c860 to f6e1990 Compare March 19, 2018 20:18
@mayya-sharipova
Contributor Author

@mdcclv Sorry about that. I see my mistake and have updated the PR accordingly: f6e1990. Thanks for noticing.

In short, order can only be score or none. With order:score, fvh will rank fragments by the number of query words found in them, but it will incorporate boost as well.
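
A minimal sketch of such a request, written as a Python dictionary, might look like the following. The field name `comment` and the boost value are hypothetical. The sketch uses a `bool` query with a boosted phrase clause as one way to prefer phrase matches over term matches, and it assumes the field is mapped with `term_vector` set to `with_positions_offsets`, which the `fvh` highlighter requires.

```python
import json


def build_fvh_body(text):
    """Prefer phrase matches over term matches and order fragments by score.

    The field name and boost value are illustrative assumptions.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Phrase matches are boosted above plain term matches.
                    {"match_phrase": {"comment": {"query": text, "boost": 10.0}}},
                    {"match": {"comment": {"query": text}}},
                ]
            }
        },
        "highlight": {
            # Allowed values are "score" and "none" (the default).
            "order": "score",
            "fields": {"comment": {"type": "fvh"}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_fvh_body("quick brown fox"), indent=2))
```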

Contributor

@jimczi jimczi left a comment

Thanks @mayya-sharipova and sorry for the late review.
I left some comments. I think we should focus on the unified highlighter, which is the default in 6.0, and maybe have a separate page to describe the highlighter internals. This page is quite big already.

`minimum_should_match` etc.), parts of documents may be highlighted
that don't correspond to query matches. The work for fixing this is
currently in progress.

Contributor

I think you should remove this part, or say something like "highlighters don't reflect the boolean logic of the query when extracting the terms to highlight". I am not sure that we're going to "fix" it, and we should not add this statement to the docs. I see pros and cons to doing that, and the solution that @romseygeek implemented might not be applicable in all cases (term vectors highlighting, for instance).

Contributor Author

@jimczi Thanks, Jim. I will rephrase this note as you suggested, and remove the part about fixing. I was asked to create this note, as there were several SDH and other issues, where highlighted fragments did not match a query, which confused users.

order:: Sorts highlighted fragments by score when set to `score`. By default,
fragments will be output in the order they appear in the field (order: `none`).
Setting this option to `score` will output the most relevant fragments first.
Only `unified` highlighter truly calculates the score in a similar way the score
Contributor

Each highlighter has its own relevancy "sauce" but I wouldn't say that the unified scoring is similar to the score of the query. It uses BM25 but that's just an internal detail, I think we should just say that each highlighter applies its own logic to compute the relevancy score and we can describe the details in the example below.
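
The comment mentions BM25 only as an internal detail. For readers unfamiliar with it, a minimal sketch of the textbook BM25 term-frequency component, applied to a passage as if it were a small document, could look like this. The constants `k1` and `b` are commonly used defaults and `avg_len` is an arbitrary value; none of them is taken from the unified highlighter's source, and the real scoring includes further factors.

```python
def bm25_tf(freq, passage_len, k1=1.2, b=0.75, avg_len=100.0):
    """Textbook BM25 term-frequency saturation with length normalization.

    freq        -- occurrences of the term in the passage
    passage_len -- passage length (any consistent unit)
    avg_len     -- assumed average passage length (arbitrary here)
    """
    norm = k1 * ((1.0 - b) + b * (passage_len / avg_len))
    return freq * (k1 + 1.0) / (freq + norm)


if __name__ == "__main__":
    # More occurrences help, but with diminishing returns.
    print(bm25_tf(1, 100), bm25_tf(2, 100), bm25_tf(10, 100))
    # The same frequency counts for more in a shorter passage.
    print(bm25_tf(1, 50), bm25_tf(1, 200))
```

The point of the sketch is only that a relevancy score of this kind rewards repeated terms less and less and favours shorter passages, which is one reason fragment ordering can differ between highlighters.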

A highlighter uses `pre-tags`, `post-tags` to encode highlighted terms.


===== An example of the work of the plain highlighter
Contributor

Can you describe how the unified highlighter works instead? Maybe just the re-analysis mode (plain highlighting), since it is very similar to the plain highlighter. We want to deprecate (and remove) the plain highlighter, so I'd prefer that we document something that will last longer.

{"token":"fox","start_offset":164,"end_offset":167,"position":35},
{"token":"world","start_offset":175,"end_offset":180,"position":38},
{"token":"you","start_offset":185,"end_offset":188,"position":40}

Contributor

The unified highlighter does not index all terms but only those that can match the query. This is an issue currently in the plain highlighter since it caches all these terms in memory so we shouldn't document this and rely on the unified highlighter instead.
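
The difference can be pictured with a toy sketch that keeps only the tokens able to match the query, using the three tokens quoted above. The function name, the shape of the resulting index, and the query terms are assumptions for illustration, not the highlighter's actual data structures.

```python
# Tokens as quoted in the diff above (output of the analyzer).
TOKENS = [
    {"token": "fox", "start_offset": 164, "end_offset": 167, "position": 35},
    {"token": "world", "start_offset": 175, "end_offset": 180, "position": 38},
    {"token": "you", "start_offset": 185, "end_offset": 188, "position": 40},
]


def index_matching_tokens(tokens, query_terms):
    """Keep only tokens that can match the query, grouped by term.

    Each entry is a (position, start_offset, end_offset) tuple.
    """
    index = {}
    for tok in tokens:
        if tok["token"] in query_terms:
            index.setdefault(tok["token"], []).append(
                (tok["position"], tok["start_offset"], tok["end_offset"])
            )
    return index


if __name__ == "__main__":
    # Hypothetical analyzed query terms.
    print(index_matching_tokens(TOKENS, {"onli", "fox"}))
```

Of the three tokens, only `fox` is retained here, so the memory held per document grows with the number of query matches rather than with the length of the text.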

Add more explanation to some highlighting parameters
Add a document describing how highlighters work internally.
Contributor

@jimczi jimczi left a comment

I left a small comment but LGTM otherwise.
Thanks @mayya-sharipova !

Relevant settings: `pre-tags`, `post-tags`.

The goal is to highlight only those terms that participated in generating the 'hit' on the document.
For some complex boolean queries, this is still work in progress.
Contributor

Can you adapt the sentence to explain that highlighters don't reflect the boolean logic of a query and only extract the leaf queries (terms, phrases, prefix, ...)? We can change the note when we have a highlighter (or have adapted the unified one) that is able to handle boolean queries.
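
A toy sketch of this behaviour is shown below. It walks a simplified query dictionary, collects the leaf terms while ignoring how the clauses are combined, and then marks every occurrence in the text. The function names and the query shape are assumptions for illustration; skipping `must_not` clauses is also an assumption of this sketch, and real highlighters work on analyzed tokens rather than regular expressions.

```python
import re


def extract_leaf_terms(query):
    """Collect terms from leaf `term` clauses, ignoring the boolean logic."""
    terms = set()
    if "term" in query:
        terms.update(query["term"].values())
    elif "bool" in query:
        for occur, clauses in query["bool"].items():
            # Skip prohibited clauses and scalar settings such as
            # minimum_should_match (an assumption of this sketch).
            if occur == "must_not" or not isinstance(clauses, list):
                continue
            for clause in clauses:
                terms |= extract_leaf_terms(clause)
    return terms


def highlight(text, terms, pre="<em>", post="</em>"):
    """Wrap every whole-word occurrence of any extracted term."""
    if not terms:
        return text
    pattern = r"\b(" + "|".join(re.escape(t) for t in sorted(terms)) + r")\b"
    return re.sub(pattern, pre + r"\1" + post, text)


# Matches documents containing ("quick" AND "dog") OR "fox".
QUERY = {
    "bool": {
        "should": [
            {"bool": {"must": [{"term": {"body": "quick"}},
                               {"term": {"body": "dog"}}]}},
            {"term": {"body": "fox"}},
        ],
        "minimum_should_match": 1,
    }
}

if __name__ == "__main__":
    # The text matches only through "fox", yet "quick" is marked as well.
    print(highlight("the quick brown fox", extract_leaf_terms(QUERY)))
```

Here the document is a hit only because of `fox` (there is no `dog`, so the `must` pair fails), but `quick` is still highlighted because it is one of the extracted leaf terms.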

Contributor Author

@jimczi Thanks for the review, Jim! I will change this sentence as you suggested.

@mayya-sharipova
Contributor Author

@elasticmachine run sample packaging tests

@mayya-sharipova mayya-sharipova merged commit bf6cfff into elastic:master Apr 18, 2018
@mayya-sharipova mayya-sharipova deleted the update-highlighting-docs branch April 18, 2018 21:41
mayya-sharipova added a commit that referenced this pull request Apr 19, 2018
- add more explanation to some highlighting parameters
- add a document describing how highlighters work internally
5 participants