Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add IntervalQuery and IntervalsSource to expose minimum interval semantics across term fields [LUCENE-8196] #9243

Closed
asfimport opened this issue Mar 8, 2018 · 37 comments

Comments

@asfimport
Copy link

asfimport commented Mar 8, 2018

This ticket proposes an alternative implementation of the SpanQuery family that uses minimum-interval semantics from http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf to implement positional queries across term-based fields.  Rather than using TermQueries to construct the interval operators, as in #3952 or the current Spans implementation, we instead use a new IntervalsSource object, which will produce IntervalIterators over a particular segment and field.  These are constructed using various static helper methods, and can then be passed to a new IntervalQuery which will return documents that contain one or more intervals so defined.


Migrated from LUCENE-8196 by Alan Woodward (@romseygeek), resolved Apr 03 2018
Attachments: LUCENE-8196.patch (versions: 5), LUCENE-8196-debug.patch
Linked issues:

Pull requests: apache/lucene-solr#334

@asfimport
Copy link
Author

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks Alan. I agree that growing a separate hierarchy of objects might help land this feature. We might even want to put first iterations of this work in sandbox to give time for the API to stabilize before we move it to core or misc.

I have some questions/comments:

  • Do we need IntervalIterator.score()? It seems to be the same value on all implementations.
  • Do we need advanceTo? It seems to me that things would be simpler and as efficient if you documented that nextPosition() may only be called when the approximation is positioned and then advanceTo would be equivalent to checking the return value of nextInterval?
  • Let's make the IntervalFunction API an implementation detail?
  • The documentation of cost() says it is the cost of finding the next interval but given how you use it in the query it looks like it is actually more about the average cost of iterating over all intervals.
  • In terms of testing I would like some form of AssertingIntervalsSource to make sure that intervals are always consumed in legal ways and behave correctly.
  • More docs would help read the code. For instance IntervalsSource.intervals has no docs. By the way we might want to mention there that the same instance might be reused across calls.
  • TermIntervalsSource should check whether positions were indexed.
  • I was a bit annoyed to see the field masking hack but actually those intervals source do not need term statistics which makes the hack less horrible. Could you still document it to make sure users are aware it is a hack and explain it which circumstances it might be ok?

@asfimport
Copy link
Author

Alan Woodward (@romseygeek) (migrated from JIRA)

Do we need IntervalIterator\.score()?

Yes, terms and phrases return 1 instead than taking the overall width into account.  This is so that they score the same as TermQuery and PhraseQuery

Do we need advanceTo?

Unfortunately yes, or at least we need a way of resetting the iterator on each new document. It might be possible to avoid passing the doc down and having a return value though, I'll see what I can do.

I would like some form of AssertingIntervalsSource

This is a bit trickier, as it's not obvious where the wrapping would happen.

+1 to everything else, I'll work on a follow-up.
 

@asfimport
Copy link
Author

Jim Ferenczi (@jimczi) (migrated from JIRA)

I was a bit annoyed to see the field masking hack but actually those intervals source do not need term statistics which makes the hack less horrible. Could you still document it to make sure users are aware it is a hack and explain it which circumstances it might be ok?

 
I think that the proposed API should be more restrictive regarding the targeted field. Could we restrict the IntervalsSource to work on a single field ? Something like:

public abstract class IntervalsSource {
 protected final String field;

 public IntervalsSource(String field) {
   this.field = field;
 }

 public abstract IntervalIterator intervals(LeafReaderContext ctx) throws IOException;
...

... and then we can check in each implementation that the sources are all targeting the same field.
I understand that it might be powerful to mix multiple fields in an interval query but with the current API that seems to be the norm rather than an exception. We can add the field masking hack afterward but for the first iteration I think it's better to focus on the main use case for this new query which is to provide a way to find the minimum intervals in a single field.

Regarding the score of the intervals, it seems that the patch uses the inverse length of the interval rather than the slop within the interval like the sloppy phrase scorer. Could we compute the total slop of the current interval (as the sum of the slop of each interval source that composed this interval) and use its inverse to score each ? This would make different interval query more comparable in terms of score since an interval with few terms and a slop>0 would score less that one with more terms but no slop.

I'll look deeper at the implementation of the different queries but I like the simplicity of the patch and the fact that there is a paper with a proof for each of them.

@asfimport
Copy link
Author

Alan Woodward (@romseygeek) (migrated from JIRA)

I opened a pull request at apache/lucene-solr#334 to make this easier to review.  @jpountz I think I've addressed most of your feedback?

@jimczi I'd rather keep the API as it is, with the field being passed to IntervalQuery and then recursing down the IntervalSource tree.  Otherwise you end up having to declare the field on all the created sources, which seems redundant.  I've removed the cross-field hack entirely for the moment.

I'll see if I can improve the scoring next.

@asfimport
Copy link
Author

Alan Woodward (@romseygeek) (migrated from JIRA)

I discussed scoring with @jimczi and @jpountz offline, and we decided to just use the inverse length of intervals as a sloppy frequency for now, as described in the Vigna paper linked above.  This means that we can't compare scores directly with existing phrase queries, but the query mechanism is quite different (particularly for SloppyPhraseScorer) so it makes sense that scores won't be the same either.

@asfimport
Copy link
Author

Jim Ferenczi (@jimczi) (migrated from JIRA)

I'd rather keep the API as it is, with the field being passed to IntervalQuery and then recursing down the IntervalSource tree. Otherwise you end up having to declare the field on all the created sources, which seems redundant. I've removed the cross-field hack entirely for the moment.

+1 to remove the cross-field hack, thanks. Regarding the API it's ok since IntervalQuery limits all sources to one field so I am fine with that (I misunderstood how the IntervalQuery can be used).

@asfimport
Copy link
Author

Alan Woodward (@romseygeek) (migrated from JIRA)

This patch makes IntervalIterator extend DocIdSetIterator, and makes the per-document reset() function protected and called automatically on nextDoc() and advance(target).

@asfimport
Copy link
Author

Alan Woodward (@romseygeek) (migrated from JIRA)

This patch moves everything into the sandbox for now, and adds a package-info explaining how to use things.  I had to duplicate DisiWrapper, DisiPriorityQueue and DisjunctionDISIApproximation, but I don't think that's too much of a problem.  Having things in sandbox should reduce confusion with Span queries, and give us time to try and switch things over.

I think this is ready to be committed.

@asfimport
Copy link
Author

Adrien Grand (@jpountz) (migrated from JIRA)

Some comments:

  • ConjunctionIntervalIterator does so little, I suspect things would be easier to read if we removed it.
  • I'd like it better if we kept the API definition to a minimum on IntervalIterator and removed the constructor that takes an approximation and the reset() method. To me these should be implementation details?
  • Let's make the utility classes that you copied pkg-private?
  • Maybe let's put this in a sub package of search, ie. org.apache.lucene.search.intervals?

Otherwise +1

@asfimport
Copy link
Author

David Smiley (@dsmiley) (migrated from JIRA)

  • Nice package javadocs!
  • Maybe you will some day add a means of extracting offsets (e.g. for highlighting) or payloads?
  • Just curious, how did you arrive at the conclusion that you needed to specialize the PriorityQueue?
  • what if extractTerms took a Consumer<Term> instead of a Set<Term>? It's easy to invoke with a myset::add for the common case when you have a Set, and I've seen cases where you might want to provide a filter before storing it wherever.

 

@asfimport
Copy link
Author

Jim Ferenczi (@jimczi) (migrated from JIRA)

+1 too, there are some places where you could initialize the current interval with[−∞ . . −∞] in order to avoid the nullity check.

Most of the operators algorithm seem good, though I don't understand why you change the order of the disjunction ? If you don't start with the smallest right interval from the queue you could miss a lot of minimum intervals that could be needed if the disjunction is used inside another operator ?

@asfimport
Copy link
Author

asfimport commented Mar 19, 2018

Alan Woodward (@romseygeek) (migrated from JIRA)

I moved the approximation and reset() from IntervalIterator, and it turned out that ConjunctionIntervalIterator was a convenient place to put them, so I kept that in.  I also cleaned up the utility classes (no need to check for TwoPhaseIterator or BitSetIterator when we know that the leaves are always PostingsEnum) and moved to org.apache.lucene.search.intervals.

Payload matches can be added trivially at a later point.  Payload scoring will be a more interesting one, but I'm not entirely happy with the way scoring works at the moment anyway, I need to read around a bit more on good ways of scoring proximity queries in general.  Offsets may not be trivial to add, as we have the same problem we originally had with Spans here, in that some of the algorithms advance leaf intervals before returning.  Something to consider later on, definitely.

The PQ specialization is just lifted directly from DisjunctionDISIApproximation in the existing core search package, so it wasn't my conclusion :)

re extractTerms() - I don't like this method being here in the first place really, see my comment about scoring above.  I think let's keep it simple for now, you can always filter afterwards.

@jimczi - where can I remove null checks?  I think I'm constrained by having to return [-1..-1] when iterator has moved to a new doc but hasn't been advanced over intervals yet.

The disjunction order is to address #8451.  If a disjunction is appearing within a block, then sorting for minimum intervals can miss valid matches.  The Vigna paper doesn't seem to discuss this case anywhere.  One possible solution that would work in all cases would be to add a boolean to IntervalsSource.getIntervals() that indicates whether or not the source is in a final position - if it is, then it should return minimal intervals, otherwise it should return wider ones.  This would only be applicable to disjunctions.

@asfimport
Copy link
Author

Alan Woodward (@romseygeek) (migrated from JIRA)

I talked over disjunction ordering with @jimczi, and we agreed to revert back to the ordering specified in the original paper.  In future it might be worth contacting the authors and seeing if they've covered the case of prefix disjunctions elsewhere.

I think this is ready?

@asfimport
Copy link
Author

Jim Ferenczi (@jimczi) (migrated from JIRA)

+1

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit 974c03a in lucene-solr's branch refs/heads/branch_7x from @romseygeek
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=974c03a

LUCENE-8196: Add IntervalQuery and IntervalsSource to the sandbox

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit 00eab54 in lucene-solr's branch refs/heads/master from @romseygeek
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=00eab54

LUCENE-8196: Add IntervalQuery and IntervalsSource to the sandbox

@asfimport
Copy link
Author

Alan Woodward (@romseygeek) (migrated from JIRA)

Thanks all!

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit b772b58 in lucene-solr's branch refs/heads/branch_7x from @romseygeek
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b772b58

LUCENE-8196: Check that the TermIntervalsSource is positioned on the correct term

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit 1d6502c in lucene-solr's branch refs/heads/master from @romseygeek
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1d6502c

LUCENE-8196: Check that the TermIntervalsSource is positioned on the correct term

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit 7fcaac8 in lucene-solr's branch refs/heads/branch_7x from @romseygeek
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7fcaac8

LUCENE-8196: Check for a null input in LowpassIntervalsSource

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit 7117b68 in lucene-solr's branch refs/heads/master from @romseygeek
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7117b68

LUCENE-8196: Check for a null input in LowpassIntervalsSource

@asfimport
Copy link
Author

Matt Weber (@mattweber) (migrated from JIRA)

@romseygeek  This is great!  How would we prevent matching at the same interval?  In TestIntervalQuery, I would expect this to pass but it matches every doc with w3.

public void testUnorderedQueryNoSelfMatch() throws IOException {
  Query q = new IntervalQuery(field, Intervals.maxwidth(2, Intervals.unordered(Intervals.term("w3"), Intervals.term("w3"))));
  checkHits(q, new int[]{1});
}

@asfimport
Copy link
Author

Alan Woodward (@romseygeek) (migrated from JIRA)

How would we prevent matching at the same interval?

The original paper doesn't look like it addresses this. I'll try and work out the best way of dealing with things, I guess we'll need to keep track of the positions of internal intervals in the priority queue, and when we advance make sure that they don't collide.

@asfimport
Copy link
Author

Jim Ferenczi (@jimczi) (migrated from JIRA)

I don't think we should prevent anything ;). unordered is a conjunction operator so it should match if all terms match (which is the case in your example) so these results are expected IMO. Maybe we should rename unordered to and in order to avoid confusion ?
If you want to match the same term within a max width the ordered query works fine:

Query q = new IntervalQuery(field, Intervals.maxwidth(2, Intervals.ordered(Intervals.term("w3"), Intervals.term("w3"))));

@romseygeek while I was playing with unordered I realized that we don't protect against sources that match but don't have intervals.
For instance:

Query q = new IntervalQuery(query, Intervals.unordered(Intervals.term("w2"), Intervals.ordered(Intervals.term("w3"),Intervals.term("w3"))));

does not work because the unordered query doesn't check if the sub source has intervals when it adds it in the queue.
I attached a patch that fixes this issue and added some tests that fail without the fix. Can you take a look ?

@asfimport
Copy link
Author

Alan Woodward (@romseygeek) (migrated from JIRA)

Good catch @jimczi, I'll commit that change. I like the idea of changing unordered to and as well - I think that makes sense @mattweber?

@asfimport
Copy link
Author

Matt Weber (@mattweber) (migrated from JIRA)

@jimczi @romseygeek  I think rename to and makes sense, however, I would still like a way to explicitly prevent the scenario I described . Maybe a minwith operator?  The width at the same position/interval should be 0 right?

@asfimport
Copy link
Author

Jim Ferenczi (@jimczi) (migrated from JIRA)

I don't think an operator can prevent anything here, a query for Intervals.ordered(Intervals.term("w3"), Intervals.term("w3")) should always return all intervals of the term "w3" (it will not interleave successive intervals of "w3"). @mattweber why do you think that this "scenario" should be prevented ? When I do "foo AND foo" I don't expect it to match only document that have foo twice ?

@asfimport
Copy link
Author

asfimport commented Apr 24, 2018

Matt Weber (@mattweber) (migrated from JIRA)

I use these queries to build query parsers and I am specifically thinking of an unordered near and how I can prevent it from matching the same term. I can't think of any situation where a user would think NEAR(a, a) would match documents with a single a and if we can't get that by default I would like a way to explicitly prevent it myself. Spans have the same issue as well, see #4193.

@asfimport
Copy link
Author

Matt Weber (@mattweber) (migrated from JIRA)

@jimczi @romseygeek

So given a single document with the value a b. The following queries would both match this document:

Intervals.unordered(Intervals.term("b"), Intervals.term("a")) 
Intervals.unordered(Intervals.term("b"), Intervals.term("b")) 

The first I think would have an interval width of 1 and the 2nd should have a width of 0. So if we have a minwidth operator we could use that to set the minimum width to 1 preventing the 2nd from matching? If both of these queries result in an interval with the same width then that feels wrong to me.

@asfimport
Copy link
Author

Alan Woodward (@romseygeek) (migrated from JIRA)

I think minwidth() would run into problems with documents that have two instances of 'b', because unordered will always find the minimal intervals, so it would always end up with intervals of width 0, which would then be rejected by the filter, and you'd end up with missing matches.

What we really need here I think is a new source, something like 'unordered-non-overlapping', which checks that all of the internal intervals are separated. With a better name, of course :) . And we should rename 'unordered' to 'and' to make the semantics a bit clearer.

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit 345fdff in lucene-solr's branch refs/heads/branch_7x from @romseygeek
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=345fdff

LUCENE-8196: Fix unordered case with non-matching subinterval

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit 1a18acd in lucene-solr's branch refs/heads/master from @romseygeek
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1a18acd

LUCENE-8196: Fix unordered case with non-matching subinterval

@asfimport
Copy link
Author

asfimport commented May 9, 2018

Alan Woodward (@romseygeek) (migrated from JIRA)

I opened #9347 to deal with unordered overlaps.

@asfimport
Copy link
Author

Martin Hermann (migrated from JIRA)

First of all, I really like this implementation and the ideas that went into it. But as I have spent quite some time with the old span queries and their problems, I'd like to comment on some things and maybe offer some fresh view points for old problems:
 
 
Obviously, maxwidth is not completely identical to specifying slop: Let's say we want to do some sort of synonym expansion and query for "("big bad" OR evil) wolf" (this is of course related to the prefix-problem we already know about ("genome editing"), but I think still slightly different).
With span queries, this would have been possible, as we just have to set slop to 0 in all queries, but now we have to do something like

Intervals.maxwidth(3,
        Intervals.ordered(
            Intervals.or(
                Intervals.maxwidth(2,
                    Intervals.ordered(Intervals.term("big"),Intervals.term("bad"))),
                Intervals.term("evil")),
            Intervals.term("wolf")));

which also matches "evil eyes wolf", which should not be a match. It would be possible to rewrite the query so that the disjunction is at the top level, something like

Intervals.or(
    Intervals.maxwidth(2,
        Intervals.ordered(Intervals.term("evil"),Intervals.term("wolf"))),
    Intervals.maxwidth(3,
        Intervals.ordreed(Intervals.term("big"),Intervals.term("bad"),Intervals.term("wolf"))));

which would work as expected, but I think we can agree that this is not really a nice solution (but I will come back to it later).

 

Now, we already know that "(big OR "big bad") wolf" would not match "big bad wolf" (this is exactly the genome editing thing), but I think it is worth to point out exactly why: It actually should not match, according to the definition of "minimum interval": Any match for "big bad" is also a match for big, so the first IntervalsSource only passes matches for "big", and then we get no match for "big wolf". This is a feature of the query semantic of the paper (and maybe the reason for the efficency and simplicity of the algorithms): The problems that spanQueries had are gone, because we define the unexpected behaviour to be correct*. As much as I like the IntervalQueries, I do not really think this is satisfactory.

 
There are actually other, similar cases with containing/containedBy: Let's say our document is "big bad big wolf" and we want "bad wolf" (slop 1) to be contained by "big wolf" (slop 2). We would get no match in this document, as the minimal match for the big interval is just "big wolf" (as the other match, "big bad big wolf" contains this one). At least to me this is counter intuitive and I would expect the document to match.
It really gets strange if we mix in some "OR":

"big wolf" (slop 1) contained in ("big wolf" (slop 1) OR "bad wolf") 

does not match "big bad wolf", in contrast to

"big wolf" (slop 1) contained in ("big wolf" (slop 1))

, which does. So we actually lose a match by adding a OR-clause, and I think we can agree that this is not really good. Of course these are not queries a human would write, but I think one major use case of span queries is some sort of automatic query generation, and that's where I think it is really important to meet at least some basic expectations (such as not losing matches by adding disjunctions).
 
I don't see a way to fix this that still follows minimal interval semantics, as all this is actually how it SHOULD work there, but this would mean we'd lose the correctness proofs. The only thing I can think of is some sort of query rewriting, pushing the disjunction as far top as neccessary, but this may be rather performance heavy and also does not solve the "bad wolf" (slop 1) contained by "big wolf" (slop 2) problem.
 
Any thoughts?

*A short theoretical aside: I think that most of the span query problems came from the fact that we want to have a "next match" function, i.e. some sort of ordering of matches, together with the nature of span query Matches, which are essentially a pair of numbers (start and end of match). This means we have to specify an order on pairs of numbers (which is possible, of course; the solution with span queries was a lexical order, i.e. the start always increases, and if it stays the same, the end increases). But I think it is not really possible to implement completly lazy behaviour with this ordering: Think of some ordered "((a OR b) followed by (c OR d)) with enough slop" and the document "a b c d" which should find "a b c d" before "b c" (as the start increases), but has to cache the match for "c", which (in the sub-query "(c OR d)) occurs before the one for "b". So the combination of ordering and lazy is where we get problems. The vigna paper is now quite clever, in that the very definition of "minimum interval" solves this problem, as the pair of start and end positions gets essentially reduced to a single number (e.g. start position), as it is not possible for one of them to increase with the other staying the same or even decreasing (as otherwise one of the matches would be contained in the other). I actually think that a possible solution would be to give up the "lazy" part instead and have some kind of caching, but that is not what this ticket is about.

@asfimport
Copy link
Author

Alan Woodward (@romseygeek) (migrated from JIRA)

Hi Martin Hermann

Thanks for the detailed feedback - this is very helpful!

  1. As with Spans, one way to fix the issue with OR intervals is to change the precedence rules so that longer intervals sort before their prefixes.  I need to go re-read the paper's proof concerning the OR operator, it would be interesting to see if this ends up causing problems elsewhere .  Another option would be to add a separate IntervalsSource with this behaviour, maybe triggered as a parameter on Intervals.or()

  2. Intervals don't really have the notion of 'slop' that Spans do, but we could add the idea of an 'internal slop' to ordered and unordered spans.  This would be measured as the space within an interval not taken up by the component intervals.  Your ("big bad" OR evil) wolf query I think can already be done using Intervals.phrase()?

  3. Spans have the notion of a 'gap' Span, which could be usefully added here.  This could help with avoiding minimization in your CONTAINS query

@asfimport
Copy link
Author

Martin Hermann (migrated from JIRA)

@romseygeek

  1. I agree that this might be a solution, but as it differs from the setting of the paper should be done very carefully.
     
  2. Internal slop seems like a great idea! You're right, my example wasn't very good and Intervals.phrase() already does that. But still, if you think of a bigger query and e.g. one slop (say, "a ("big bad" OR evil) wolf", one additional token allowed somewhere), the problem remains. I don't really see how 'internal slop' would differ from 'normal slop' (doesn't it measure the exact same thing?), but it seems rather easy to implement and like something that would be desirable and solve this issue.
     
  3. I'm not quite sure if I understand that correctly. Do you mean using a gap in the query and rewrite it to something like
"bad wolf" (slop 1) contained by "big GAP wolf" (slop 2)

or adding the gap automatically somewhere down the way? I think in the first case it'd still be possible to construct some (maybe a little bit more complicated) examples that can't be solved like that and where the minimal intervals behaviour does not match intuition.

Again, while a lot of these queries may seem quite exotic, I think that intervals will get used a lot various programmatically generated queries (as spans do now), and there pretty much anything can happen.

@asfimport
Copy link
Author

asfimport commented Sep 3, 2018

Alan Woodward (@romseygeek) (migrated from JIRA)

Martin Hermann seeing as this ticket is closed, I opened #9523 to continue discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants