Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Scorer to expose positions and payloads aka. nuke spans [LUCENE-2878] #3952

Closed
asfimport opened this issue Jan 21, 2011 · 209 comments
Closed

Comments

@asfimport
Copy link

asfimport commented Jan 21, 2011

Currently we have two somewhat separate types of queries, the one which can make use of positions (mainly spans) and payloads (spans). Yet Span*Query doesn't really do scoring comparable to what other queries do and at the end of the day they are duplicating lot of code all over lucene. Span*Queries are also limited to other Span*Query instances such that you can not use a TermQuery or a BooleanQuery with SpanNear or anthing like that.
Beside of the Span*Query limitation other queries lacking a quiet interesting feature since they can not score based on term proximity since scores doesn't expose any positional information. All those problems bugged me for a while now so I stared working on that using the bulkpostings API. I would have done that first cut on trunk but TermScorer is working on BlockReader that do not expose positions while the one in this branch does. I started adding a new Positions class which users can pull from a scorer, to prevent unnecessary positions enums I added ScorerContext#needsPositions and eventually Scorere#needsPayloads to create the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this API and other simply return null instead.
To show that the API really works and our BulkPostings work fine too with positions I cut over TermSpanQuery to use a TermScorer under the hood and nuked TermSpans entirely. A nice sideeffect of this was that the Position BulkReading implementation got some exercise which now :) work all with positions while Payloads for bulkreading are kind of experimental in the patch and those only work with Standard codec.

So all spans now work on top of TermScorer ( I truly hate spans since today ) including the ones that need Payloads (StandardCodec ONLY)!! I didn't bother to implement the other codecs yet since I want to get feedback on the API and on this first cut before I go one with it. I will upload the corresponding patch in a minute.

I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk first but after that pain today I need a break first :).

The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't look into the MemoryIndex BulkPostings API yet)


Migrated from LUCENE-2878 by Simon Willnauer (@s1monw), 11 votes, resolved Apr 11 2018
Attachments: LUCENE-2878_trunk.patch (versions: 2), LUCENE-2878.patch (versions: 30), LUCENE-2878-OR.patch, LUCENE-2878-vs-trunk.patch, PosHighlighter.patch (versions: 2)
Linked issues:

Sub-tasks:

@asfimport
Copy link
Author

Simon Willnauer (@s1monw) (migrated from JIRA)

here is the patch

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

I took a quick glance at the issue (not a proper review though: its too easy to get lost in the spans!)
Here are my thoughts/questions:

  • I like the idea of adding the positions iterator separate, this is one thing i don't like about the Spans API.
  • I don't like the MultiSpansWrapper... it only exists to support PayloadSpanUtil and tests? I don't like PayloadSpanUtil either, i think its a huge trap and it should move to src/test or be removed.
  • any ideas of the performance comparison of this patch as-is (SpanScorerWrapper over the Scorer/bulk enum) versus the existing Spans which doesn't use the bulk API?

Seems like if we could get the API how we like on this issue, and fix all codecs bulkpositions implementations to work with payloads, that this would be a nice step
we could make...

@asfimport
Copy link
Author

Mark Miller (@markrmiller) (migrated from JIRA)

I don't like PayloadSpanUtil either, i think its a huge trap and it should move to src/test or be removed.

It's not that bad of a trap :) I has warnings all over it - some people use it out there. I think it's more useful than trapful - if not just as demo code.

It is src/test quality at best though though. I never would have committed it to core - I expected to put it in contrib at best. But someone else committed it at the time, so I just did a 'meh' and put warnings about experimental on it. I forgot to take it out of the patch on the issue or something. I never even used it or tested it much myself - I just wanted to put in some example/test code for this use case - I know some users on the list have used it for some form of simple highlighting, where the coords to highlight are in the payloads. I wouldn't trust it these days though :)

It will be found less if its in src/tests, but that is probably a fine place for it - it would need some love/support above what it has to really make sense in any module/contrib. I've never really meant to maintain it myself - and I've never really felt it belonged in core (just never had the itch to squash it either).

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

It's not that bad of a trap I has warnings all over it - some people use it out there. I think it's more useful than trapful - if not just as demo code.

In this case maybe contrib?
And if we were careful/re-organized tests a bit, MultiSpansWrapper could be pkg-private.

@asfimport
Copy link
Author

Paul Elschot (migrated from JIRA)

A first impression, no time to read the whole patch yet.

I like Positions being abstract, instead of the current Spans interface.
The name Positions is understated though. It has a begin and an end that can be iterating/iterated in place, so it is actually "a" spans.
I wish I had a better name.
I hope that the addition to TermScorer does not slow it down.

The patch has some duplicates (see Scorer for example), but that is easy to fix.
It also contains a 2005 copyright for a new file. Time goes by.

More later...

@asfimport
Copy link
Author

Mark Miller (@markrmiller) (migrated from JIRA)

In this case maybe contrib?

I dunno - I think I would be for it if I was willing to work on it and maintain it. But honestly, I am not. Some users have figured out how to use it for something, but I never really used it for anything. So my interest is low, and list of other stuff I'd love to tackle high. I'm still pro removing some stuff from contrib, so I don't want to be responsible for more cruft going in.

Perhaps anyone using it should just copy out the code to their project ... thats really the spirit it was written in anyway - here is sample code to get you started - showing what you can do with spans and payloads.

@asfimport
Copy link
Author

Simon Willnauer (@s1monw) (migrated from JIRA)

First of all thanks for the reviews so far... seems like the overall feedback is positive though. I should have made my point more clear that the most of this patch is a PoC more than anything else. The changes to spans are rather to proof correctness and give the bulk api a little exercise.

I don't like the MultiSpansWrapper... it only exists to support PayloadSpanUtil and tests? I don't like PayloadSpanUtil either, i think its a huge trap and it should move to src/test or be removed.

The reason for MultSpansWrapper is pretty simple. MultiBulkEnum is not entirely implemented yet so I had to move the Spans to per-segment using AtomicReaderContext. What I needed was an easy way to cut over all the tests and that seemed to be straight forward. Its really just a testing class and should go away over time.

any ideas of the performance comparison of this patch as-is (SpanScorerWrapper over the Scorer/bulk enum) versus the existing Spans which doesn't use the bulk API?

not yet, once I get back to this I will run a benchmark

A first impression, no time to read the whole patch yet.

my plans are to cut over spans to AtomicREaderContext on trunk so this patch will be way way smaller and easier to read maybe you should just skip all the span stuff for now. those are just ARC based changes

PayloadSpanUtil

<dream> Spans as they exist today should go away anyway so this is not much of a deal for now.</dream>

The name Positions is understated though. It has a begin and an end that can be iterating/iterated in place, so it is actually "a" spans.

I agree we might need to go away from this start end thing and have a wrapper on top of Position(s)

I hope that the addition to TermScorer does not slow it down.

Those positions == null checks are well predictable and should be optimzed away - if hotspot plays tricks on us we can still specialize a scorer....

@asfimport
Copy link
Author

asfimport commented Jan 24, 2011

Simon Willnauer (@s1monw) (migrated from JIRA)

Attaching my current state. I committed #3956 today which contained all the cut over to SpanQuery#getSpans(AtomicReaderContext) that made this patch much smaller and easier to read now. I also fixed MemoryIndex such that now all tests pass with -Dtests.codec=Standard

Not much progress other than that - more tomorrow....

@asfimport
Copy link
Author

Grant Ingersoll (@gsingers) (migrated from JIRA)

I haven't looked at the patch, but one of the biggest issues with Spans is the duality within the spans themselves. The whole point of spans is that you care about position information. However, in order to get both the search results and the positions, you have to, effectively, execute the query twice, once to get the results and once to get the positions. A Collector like interface, IMO, would be ideal because it would allow applications to leverage position information as the queries are being scored and hits being collected. In other words, if we are rethinking how we handle position based queries, let's get it right this time and make it so it is actually useful for people who need the functionality.

As for PayloadSpanUtil, I think that was primarily put in to help w/ highlighting at the time, but if it has outlived it's usefulness, than dump it. If we are consolidating all queries to support positions and payloads, then it shouldn't be needed, right?

@asfimport
Copy link
Author

Simon Willnauer (@s1monw) (migrated from JIRA)

I haven't looked at the patch, but one of the biggest issues with Spans is the duality within the spans themselves. The whole point of spans is that you care about position information. However, in order to get both the search results and the positions, you have to, effectively, execute the query twice, once to get the results and once to get the positions. A Collector like interface, IMO, would be ideal because it would allow applications to leverage position information as the queries are being scored and hits being collected. In other words, if we are rethinking how we handle position based queries, let's get it right this time and make it so it is actually useful for people who need the functionality.

Grant I completely agree! Any help here very much welcome. I am so busy fixing all the BulkEnums and spinnoffs from this issue but I hope I have a first sketch of how I think this should work by the end of the week!

As for PayloadSpanUtil, I think that was primarily put in to help w/ highlighting at the time, but if it has outlived it's usefulness, than dump it. If we are consolidating all queries to support positions and payloads, then it shouldn't be needed, right?

Yeah!

@asfimport
Copy link
Author

Simon Willnauer (@s1monw) (migrated from JIRA)

Attaching my current state - still rough & work in progress though....
In this patch I added a PositionsIntervalIterator returned from Scorer#positions() and implemented some usecases like filtering Near / Within (unordered) with Term & BooleanScorer2. BooleanScorer2 if in conjunction mode now has a PositionIntervalIter implementation that returns the minimal interval of its unordered query terms and can easily be filtered within a range of pos (range, first) or within a relative positions (near) simply by wrapping it in PositionFilterQuery. An example for this is in TestBooleanQuery. The PosIntervalIterator decouples positional operation nicely from Scoring / matching so a Boolean AND query gets the normal query score but can be restricted further based on positions. Even adding positional scoring can simple be plugged on top of it.

Further, with a specialized Collector implementation - positions could be fetched only if the score is within the top N to prevent positions matching for all documents.

One big problem is BooleanScorer which does the bucket based scoring - I will ignore that one for now. I need to run some benchmarks to see if that does any good though but haven't had time to do so. If somebody has time and take a look at this patch - feedback would be very much appreciated.

@asfimport
Copy link
Author

Simon Willnauer (@s1monw) (migrated from JIRA)

just attaching my current state

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

This patch looks awesome (and, enormous)! Finally we are making
progress merging Span* into their corresponding non-positional queries
:)

I like how you added payloads to the BulkPostings API, and how
someone is finally testing the bulk positions code.

So now I can run any Query, and ask it to enumerate its positions
(PositionInterval iterator), but not paying any price if I don't want
positions. And it's finally single source... caller must say
up-front (when pulling the scorer) if it will want positions (and,
separately, also payloads – great).

It's great that you can just emulate spans on the new api with
SpanScorerWrapper/MockSpanQuery, and use PositionFilterQuery to filter
positions from a query, eg to turn a BooleanQuery into whatever
SpanQuery is needed – very nice!

How does/should scoring work? EG do the SpanQueries score
according to the details of which position intervals match?

The part I'm wondering about is what API we should use for
communicating positions of the sub scorers in a BooleanQuery to
consumers like position filters (for matching) or eg Highlighter
(which really should be a core functionality that works w/ any query).
Multiplying out ("denormalizing") all combinations (into a flat stream
of PositionIntervals) is going to be too costly in general, I think?

Maybe, instead of the denormalized stream, we could present a
UnionPositionsIntervalIterator, which has multiple subs, where each
sub is its own PositionIntervalIterator? This way eg a NEAR query
could filter these subs in parallel (like a merge sort) looking for a
match, and (I think) then presenting its own union iterator to whoever
consumes it? Ie it'd only let through those positions of each sub
that satisfied the NEAR constraint.

@asfimport
Copy link
Author

Grant Ingersoll (@gsingers) (migrated from JIRA)

And it's finally single source... caller must say
up-front (when pulling the scorer) if it will want positions (and,
separately, also payloads – great).

Does it make sense that we could just want AttributeSources as we go here?

@asfimport
Copy link
Author

Simon Willnauer (@s1monw) (migrated from JIRA)

How does/should scoring work? EG do the SpanQueries score
according to the details of which position intervals match?

I didn't pay any attention to scoring yet. IMO scoring should be left to the query which is using the positions so a higher level query like a NearQuery could just put a custom scorer on top of a boolean conjunction and apply its own proximity based score. This should be done after we have the infrastructure to do it. I think that opens up some nice scoring improvements. I am not sure if we should add proximity scoring to existing queries, I rather aim towards making it easy to customize.

The part I'm wondering about is what API we should use for
communicating positions of the sub scorers in a BooleanQuery to
consumers like position filters (for matching) or eg Highlighter
(which really should be a core functionality that works w/ any query).
Multiplying out ("denormalizing") all combinations (into a flat stream
of PositionIntervals) is going to be too costly in general, I think?

I thought about that for a while and I think we should enrich the PosIntervalIterator API to enable the caller to pull the actual subintervals instead of an Interval from the next method. Something like this:

 public abstract class PositionIntervalIterator implements Serializable{
  public abstract PositionInterval next() throws IOException;
 /**
  *Returns all sub interval for the next accepted interval.
  **/
  public abstract PositionIntervalIterator nextSubIntervals() throws IOException;
  public abstract PositionIntervalIterator[] subs(boolean inOrder);

so that if you are interested in the plain positions for eventually each term like highlighting you can pull them per match occurence. That way you have positional matching and you can iterate the subs.

Maybe, instead of the denormalized stream, we could present a
UnionPositionsIntervalIterator, which has multiple subs, where each
sub is its own PositionIntervalIterator? This way eg a NEAR query
could filter these subs in parallel (like a merge sort) looking for a
match, and (I think) then presenting its own union iterator to whoever
consumes it? Ie it'd only let through those positions of each sub
that satisfied the NEAR constraint.

I don't get that entirely ;)

Does it make sense that we could just want AttributeSources as we go here?

you mean like we are not extending Scorer but add an AttributeSource to it? I think this is really a core API and should be supported directly

@asfimport
Copy link
Author

Liwei (migrated from JIRA)

What should I do for it?

@asfimport
Copy link
Author

Simon Willnauer (@s1monw) (migrated from JIRA)

it doesn't seem that this issue is worth staying so tight coupled to the bulkpostings branch. I originally did this on bulk postings since it had support for positions in termscorer (new bulk API) where on trunk we don't have positions there at all. Yet, I think bulk postings should be sorted out separately and we should rather move this over to trunk. On trunk we can get rid of all the low level bulk API hacks in the patch. the only thing that is missing here is a TermScorer that can score based on positions / payloads. I think since we have ScoreContext and how this works here in the patch we can simply implement a TermScorer that works on DocsEnumAndPositions and swap it in once positions are requested.

I think I can move this over to trunk soon.

@asfimport
Copy link
Author

Simon Willnauer (@s1monw) (migrated from JIRA)

here is a patch that applies to trunk. I added a simple maybe slowish PositionTermScorer that is used when pos are required. This is really work in progress but I am uploading it just in case somebody is interested.

@asfimport
Copy link
Author

Simon Willnauer (@s1monw) (migrated from JIRA)

some more cleanups, all tests pass now on trunk

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

I've been fiddling with highlighter performance, and this looks a great step towards being able to do hl in an integrated way that doesn't require a lot of post-hoc recalculation etc. I worked up a hackish highlighter that uses it as a POC, partly just as a way of understanding what you've done, but this could eventually become usable.

Here are a few comments:

I found it convenient to add:
boolean Collector.needsPositions() and needsPayloads()
and modified
IndexSearcher.search(AtomicReaderContext[] leaves, Weight weight, Filter filter, Collector collector)
to set up the ScorerContext accordingly

And then I am accessing the scorer.positions() from Collector.collect(), which I think is a very natural use of this API? At least it was intuitive for me, and I am pretty new to all this.

I think that when it comes to traversing the tree of PositionsIntervalIterators, the API you propose above might have some issues. What would the status of the returned iterators be? Would they have to be copies of some sort in order to preserve the state of the iteration (so scoring isn't impacted by some other consumer of position intervals)? The iterators that are currently in flight shouldn't be advanced by the caller usually (ever?), or else the state of the dependent iterator (the parent) won't be updated correctly, I think? I wonder if (1) you could add
PositionInterval PositionIntervalIterator.current() and (2) return from subs() and nextSubIntervals() some unmodifiable wrappers - maybe a superclass of PII that would only provide current() and subs(), but not allow advancing the iterator.

I hope you'll be able to pick it up again soon, Simon!

@asfimport
Copy link
Author

Simon Willnauer (@s1monw) (migrated from JIRA)

Hey Mike,
great to see interest here! :)

boolean Collector.needsPositions() and needsPayloads()

+1 that makes lots of sense.

Let me give you some insight of the current patches state. This whole thing is still a prototype and needs lots of cleanups all over the place. I moved it to trunk lately since I don't want to wait for bulkpostings to move forward. I think there are lots of perf impacts with its current state but eventually I think it will be much better, more powerful and cleaner than spans after all.

And then I am accessing the scorer.positions() from Collector.collect(), which I think is a very natural use of this API? At least it was intuitive for me, and I am pretty new to all this.

this is one way of doing it for sure. The other way would be to wrap the top level scorer and do your work in there with a PositionScoringQueryWrapper or something like that which would set up the ScorerContext for you. The main question is what you want to do with positions. For matching based on positions you have to use some scorer I guess since you need to check every document if it is within your position constraints, something like near(a AND b). If you want to boost based on your positions I think you need to do a 2 phase collection, Phase 1 simply running the query collecting n + X results and Phase 2 re-ranking the results from Phase 1 by pulling the positions.

I think that when it comes to traversing the tree of PositionsIntervalIterators, the API you propose above might have some issues

I agree this is very flaky right now and I only tried to mimic the spans behavior here to show that this is as powerful as spans for now. But eventually we need a better API for this, so its good you are jumping in with a usecase!

What would the status of the returned iterators be?

currently if you pull an iterator you are depending on the state of your scorer. Let me give you an example on TermScorer, if you are on document X you can iterate the positions for this document if you exhaust them or not once the scorer is advanced your PositionInterator points to the documents position you advanced to. The same is true for all other Scorers that expose positions. Yet, some problems arise here with BooleanScorer (in contrast to BooleanScorer2) since it reads documents in blocks which makes it very hard (nearly impossible) to get efficient positions for this scorer (its used for OR queries only with NOT clauses < 32).
So PositionsInterators are never preserve positions for a document you pulled the interval for. You can basically pull the iterator only once and keep it until you scorer is exhausted. Bottom line here is that you are depending on the DocsAndPositionsEnum your TermScorer is using. Once this is advanced your positions are advanced too. We could think of a separate Enum here that advances independently, hmm that could actually work too, lets keep that in mind.

(so scoring isn't impacted by some other consumer of position intervals)

there should be only one consumer really. Which usecase have you in mind where multiple consumers are using the iterator?

PositionInterval PositionIntervalIterator.current()

what is the returned PI here again? In the TermScorer case that is trivial but what would a BooleanSocorer return here?

(2) return from subs() and nextSubIntervals() some unmodifiable wrappers - maybe a superclass of PII that would only provide current() and subs(), but not allow advancing the iterator.

I think that could make sense but let me explain the reason why this is there right now. So currently a socrer has a defined PositionIterator which could be a problem later. for instance I want to have the minimal positions interval (ordered) of all boolean clauses for query X but for query Y I want the same interval unorderd (out of order) I need to replace the logic in the scorer somehow. So to make that more flexible I exposed all subs here so you can run your own alg. I would love to see better solutions since I only hacked this up in a couple of days though.

Currently this patch provides an AND (ordered & un-ordered) and a BLOCK PositionIterator based on this paper http://vigna.dsi.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics while the OR implementation is still missing so if you want to jump on that issue and help there is tons of space for improvements.

Eventually I think we can leave spans as they are right now and concentrate on the API / functionality, making things fast under the hood can be done later but getting things right to be flexible is the most important part here.

Mike, would you be willing to upload a patch for your hacked collector etc to see what you have done?

I hope you'll be able to pick it up again soon, Simon!

I would love to ASAP, currently I have so much DocValues stuff todo so this might take a while until I get back to this.

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

there should be only one consumer really. Which usecase have you in mind where multiple consumers are using the iterator?

I guess I am coming at this from the perspective of a Highlighter; the Highlighter wants to iterate over the top-level Scorer, finding each of its matching positions, and then for each of those, it wants to iterate over all the individual terms' positions. Possibly some clever HL of the future will be interested in intermediate-level nodes in the tree as well, like highlighting a near-span, or coalescing phrases. The problem I see is that with the current API the only way to retrieve the lower-level positions is to advance their iterators, but if that is done directly (without the knowledge of the enclosing scorer and its iterator), the scoring will be messed up. I guess that's what I meant by multiple consumers - of course you are right, there should be only one "writer" consumer that can advance the iteration. My idea is that there could be many readers, though. In any case, I think it is typical for an iterator that you can read the current position as many times as you want, rather than "read once" and expect the caller to cache the value?

what is the returned PI here again? In the TermScorer case that is trivial but what would a BooleanScorer return here?

It has its own PI right? I think it is the minimum interval containing some terms that satisfy the boolean conditions.

I think that could make sense but let me explain the reason why this is there right now. So currently a socrer has a defined PositionIterator which could be a problem later. for instance I want to have the minimal positions interval (ordered) of all boolean clauses for query X but for query Y I want the same interval unorderd (out of order) I need to replace the logic in the scorer somehow. So to make that more flexible I exposed all subs here so you can run your own alg. I would love to see better solutions since I only hacked this up in a couple of days though.

Hmm I haven't yet looked at how BooleanScorer2 and BooleanScorer works, but I understand there is some additional complexity there. Perhaps if the only distinction is order/unordered there might be a special case for that when you create the Scorer, rather than exposing internals to the caller? But I don't know - would have to understand this better. Maybe there are other cases where that could be needed.

Mike, would you be willing to upload a patch for your hacked collector etc to see what you have done?

The PosHiglighter is a bit messy - filled with debugging and testing code and so on, and it's also slow because of the need to match positions->offsets in kind of a gross way.. Robert M had an idea for storing this mapping in the index which would improve things there, but I haven't done that. In any case, I'll be happy to share the patch when I get back home and can clean it up a bit. Maybe if I have a chance I will look into implementing OR-queries - I stumbled on that limitation right away!

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

Attaching a patch with a simple highlighter using the positions() iterators. Includes a PositionTreeIterator for pulling out leaf positions. The names are getting a bit long: I almost wrote PositionIntervalIteratorTree? Maybe a PositionIntervalIterator could just be a Positions? The variables are all called positions...

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

updated patch provides positions() for Boolean OR (disjunction) queries, but only with n should match = 1. I refactored the ConjunctionPositionIterator and its Queue in order to do this without a lot of cut-and-paste.

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

Updated PosHighlighter patch that actually works :) PosCollector now caches positions when collecting docs. This way highlighting can be decoupled, and doesn't have to be performed on docs that will eventually drop off the top n list. This seems to work ok, but it would be nice to come up with a way to make this usable with existing collectors.

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

So PositionsInterators are never preserve positions for a document you pulled the interval for. You can basically pull the iterator only once and keep it until you scorer is exhausted. Bottom line here is that you are depending on the DocsAndPositionsEnum your TermScorer is using. Once this is advanced your positions are advanced too. We could think of a separate Enum here that advances independently, hmm that could actually work too, lets keep that in mind.

So after working with this a bit more (and reading the paper), I see now that it's really not necessary to cache positions in the iterators. So never mind all that! In the end, for some uses like highlighting I think somebody needs to cache positions (I put it in a ScorePosDoc created by the PosCollector), but I agree that doesn't belong in the "lower level" iterator.

Eventually I think we can leave spans as they are right now and concentrate on the API / functionality, making things fast under the hood can be done later but getting things right to be flexible is the most important part here.

As I'm learning more, I am beginning to see this is going to require sweeping updates. Basically everywhere we currently create a DocsEnum, we might now want to create a DocsAndPositionsEnum, and then the options (needs positions/payloads) have to be threaded through all the surrounding APIs. I wonder if it wouldn't make sense to encapsulate those options (needsPositions/needsPayloads) in some kind of EnumConfig object. Just in case, down the line, there is some other information that gets stored in the index, and wants to be made available during scoring, then the required change would be much less painful to implement.

I'm thinking for example (Robert M's idea), that it might be nice to have a positions->offsets map in the index (this would be better for highlighting than term vectors). Maybe this would just be part of payload, but maybe not? And it seems possible there could be other things like that we don't know about yet?

@asfimport
Copy link
Author

Simon Willnauer (@s1monw) (migrated from JIRA)

Mike, its so awesome that you help here. I will be back on wednesday and post comments / suggestions then.

simon

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

Yeah that would be great - I hope I'm not being too irritating – enjoy your vacation! I worked on using the standard Highlighter with a TokenStream built on a PosIterator, and the results are promising in terms of performance, but I think I am still grappling with the right way to retrieve positions.

@asfimport
Copy link
Author

Simon Willnauer (@s1monw) (migrated from JIRA)

hey Mike, I applied all your patches and walked through, this looks great. I mean this entire thing is far from committable but I think we should take this further and open a branch for it. I want to commit both your latest patch and the highlighter prototype and work from there.

So after working with this a bit more (and reading the paper), I see now that it's really not necessary to cache positions in the iterators. So never mind all that! In the end, for some uses like highlighting I think somebody needs to cache positions (I put it in a ScorePosDoc created by the PosCollector), but I agree that doesn't belong in the "lower level" iterator.

after looking into your patch I think I understand now what is needed to enable low level stuff like highlighting. what is missing here is a positions collector interface that you can pass in and that collects positions on the lowest levels like for pharses or simple terms. The PositionIterator itself (btw. i think we should call it Positions or something along those lines - try to not introduce spans in the name :) ) should accept this collector and simply call back each low level position if needed. For highlighting I think we should also go a two stage approach. First stage does the matching (with or without positions) and second stage takes the first stages resutls and does the highlighting. that way we don't slow down the query and the second one can even choose a different rewrite method (for MTQ this is needed as we don't have positions on filters)

As I'm learning more, I am beginning to see this is going to require sweeping updates. Basically everywhere we currently create a DocsEnum, we might now want to create a DocsAndPositionsEnum, and then the options (needs positions/payloads) have to be threaded through all the surrounding APIs. I wonder if it wouldn't make sense to encapsulate those options (needsPositions/needsPayloads) in some kind of EnumConfig object. Just in case, down the line, there is some other information that gets stored in the index, and wants to be made available during scoring, then the required change would be much less painful to implement.

what do you mean by sweeping updates? For the enum config I think we only have 2 or 3 places where we need to make the decision. 1. TermScorer 2. PhraseScorer (maybe 2. goes away anyway) so this is not needed for now I think?

I'm thinking for example (Robert M's idea), that it might be nice to have a positions->offsets map in the index (this would be better for highlighting than term vectors). Maybe this would just be part of payload, but maybe not? And it seems possible there could be other things like that we don't know about yet?

yeah this would be awesome... next step :)

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

For highlighting I think we should also go a two stage approach. First stage does the matching (with or without positions) and second stage takes the first stages resutls and does the highlighting. that way we don't slow down the query and the second one can even choose a different rewrite method (for MTQ this is needed as we don't have positions on filters)

I think this would be a good approach, its the same algorithm really that you generally want for positional scoring: score all the docs the 'fast' way then reorder only the top-N (e.g. first two pages of results), which will require using the positioniterator and doing some calculation that you typically add to the score.

So if we can generalize this in a way where you can do this in your collector, I think it would be reusable for this as well.

@asfimport
Copy link
Author

Adrien Grand (@jpountz) (migrated from JIRA)

+1

@asfimport
Copy link
Author

Simon Willnauer (@s1monw) (migrated from JIRA)

monumental effort thanks everybody involed

This was referenced Aug 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment